[Enhancement] Enable timeout in dist training #877

Merged
merged 10 commits into open-mmlab:main from apacha:enable_timeout_in_dist_training on Feb 3, 2023

Conversation

@apacha (Contributor) commented Jan 13, 2023

This PR fixes #873:

  • Adds an option to specify a runtime timeout for distributed training
  • Safely converts the specified timeout in seconds to a timedelta, with a type check
  • Adds documentation on how to use it (see the usage sketch below)
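
For context, a minimal usage sketch, assuming the script is launched with `torchrun` (so `RANK`, `WORLD_SIZE`, and the `MASTER_*` variables are set in the environment) and that extra keyword arguments to `init_dist` are forwarded to `torch.distributed.init_process_group`; the 1800-second value is illustrative, not a recommended default:

```python
# Sketch only: enabling the new timeout when initializing distributed
# training through MMEngine. Launch with, e.g.:
#   torchrun --nproc_per_node=2 train.py
from mmengine.dist import init_dist

# After this PR, `timeout` can be given as a plain number of seconds;
# it is converted to a datetime.timedelta before being forwarded to
# torch.distributed.init_process_group.
init_dist(launcher='pytorch', backend='nccl', timeout=1800)
```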

@CLAassistant commented Jan 13, 2023

CLA assistant check: all committers have signed the CLA.

@codecov
codecov bot commented Jan 16, 2023

Codecov Report

❗ No coverage uploaded for pull request base (main@9d3f5b2).
Patch has no changes to coverable lines.

❗ Current head b9c239d differs from pull request most recent head cdd1a36. Consider uploading reports for the commit cdd1a36 to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #877   +/-   ##
=======================================
  Coverage        ?   78.15%           
=======================================
  Files           ?      132           
  Lines           ?     9974           
  Branches        ?     1993           
=======================================
  Hits            ?     7795           
  Misses          ?     1842           
  Partials        ?      337           
Flag        Coverage Δ
unittests   78.15% <0.00%> (?)

Flags with carried forward coverage won't be shown.

☔ View full report at Codecov.

@HAOCHENYE (Collaborator) left a comment

Thanks for your contributions.

mmengine/dist/utils.py (outdated; resolved)
mmengine/dist/utils.py (outdated; resolved)
mmengine/dist/utils.py (outdated; resolved)
requirements/tests.txt (outdated; resolved)
apacha and others added 4 commits on January 16, 2023 at 11:46:

  • Adding an explicit `is not None` to the check
    Co-authored-by: Mashiro <57566630+HAOCHENYE@users.noreply.github.com>
  • …of assuming it is the right type and handling the exception if the type doesn't match.
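
The last commit describes the conversion logic; a minimal sketch of that approach, assuming the timeout arrives through the kwargs that are forwarded to the backend (the helper name and error message are illustrative, not the PR's exact code):

```python
from datetime import timedelta


def _convert_timeout(kwargs: dict) -> None:
    """Convert ``kwargs['timeout']`` from seconds to a timedelta in place."""
    timeout = kwargs.get('timeout')
    # Explicit `is not None`, so a timeout of 0 would still be converted.
    if timeout is not None:
        try:
            kwargs['timeout'] = timedelta(seconds=timeout)
        except TypeError as exc:
            # Attempt the conversion and surface a clear error if the type
            # doesn't match, instead of isinstance-checking up front.
            raise TypeError('timeout must be a number of seconds, '
                            f'but got {type(timeout)}') from exc
```

With this in place, `_convert_timeout({'timeout': 1800})` leaves a `timedelta(seconds=1800)` in the dict, while a string such as `'1800'` raises a `TypeError` with a readable message.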
@apacha requested review from HAOCHENYE and removed request for RangiLyu, zhouzaida and C1rN09 on January 16, 2023 at 10:57
HAOCHENYE previously approved these changes Jan 16, 2023

@HAOCHENYE (Collaborator) left a comment

Approved!!!

C1rN09 previously approved these changes Jan 16, 2023
@HAOCHENYE added the `ready` (ready to merge) label on Jan 19, 2023
@HAOCHENYE added this to the 0.5.1 milestone on Jan 19, 2023
@zhouzaida dismissed stale reviews from C1rN09 and HAOCHENYE via cdd1a36 on February 3, 2023
@zhouzaida changed the title from "Enable timeout in dist training" to "[Enhancement] Enable timeout in dist training" on Feb 3, 2023
@zhouzaida merged commit 1aa14b4 into open-mmlab:main on Feb 3, 2023
@apacha deleted the enable_timeout_in_dist_training branch on February 3, 2023
@jason102811 commented:
Hi @apacha! First of all, we want to express our gratitude for your significant PR to MMEngine. Your contribution is highly appreciated, and we are grateful for your efforts in helping improve this open-source project in your personal time. We believe that many developers will benefit from your PR.

We would also like to invite you to join our Special Interest Group (SIG) private channel on Discord, where you can share your experiences and ideas and build connections with like-minded peers. To join the SIG channel, simply message the moderator, OpenMMLab, on Discord, or briefly share your open-source contributions in the #introductions channel and we will assist you. We look forward to seeing you there! Join us: https://discord.gg/UjgXkPWNqA

If you have a WeChat account, you are welcome to join our community on WeChat as well. You can add our assistant: openmmlabwx. Please add "mmsig + GitHub ID" as a remark when adding friends. :)
Thank you again for your contribution! ❤

Labels
ready (ready to merge)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] MMEngine doesn't allow to set a timeout for distributed training
6 participants