[Serve] [Example] Multi-node example #3398
Thanks for adding the multi-node support, @MaoZiming! Left some comments. Also, it would be great to add a smoke test showing that multi-node replicas are gang-scheduled (maybe manually terminate one of the nodes and check the serve controller's behaviour).
cloud: gcp
ports: 8000
accelerators: A100:2
use_spot: true
If we are using spot, maybe we should add a comment saying that multi-node is gang-scheduled? Actually, I'm not sure we should use spot here. cc @Michaelvll for a look.
Both spot and non-spot should be gang-scheduled
Yeah, just want to say that spot is more error-prone
sg, added a comment at the first line
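For readers unfamiliar with the term used in this thread, here is a toy sketch of what "gang-scheduled" means for a multi-node replica: the nodes come up together, and losing any one of them (for example through a spot preemption) makes the whole replica unusable until the full set is relaunched. The `Replica` class and `reconcile` function are hypothetical names invented for illustration. This is not SkyPilot code and does not describe how the serve controller is implemented.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeState(Enum):
    UP = "up"
    PREEMPTED = "preempted"


@dataclass
class Replica:
    """Hypothetical multi-node replica (illustration only, not a SkyPilot class)."""
    num_nodes: int
    nodes: list = field(init=False)

    def __post_init__(self):
        # Gang scheduling: every node of the replica is brought up together.
        self.nodes = [NodeState.UP] * self.num_nodes

    def preempt(self, index: int) -> None:
        # Simulate losing a single node, e.g. a spot preemption.
        self.nodes[index] = NodeState.PREEMPTED

    def is_healthy(self) -> bool:
        # All-or-nothing: one lost node makes the whole replica unusable.
        return all(state is NodeState.UP for state in self.nodes)


def reconcile(replica: Replica) -> Replica:
    """Keep a healthy replica as-is; otherwise relaunch the full gang."""
    if replica.is_healthy():
        return replica
    return Replica(num_nodes=replica.num_nodes)
```

A smoke test along the lines suggested above would terminate one node and then check that the whole replica, not just the lost node, is replaced.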
Force-pushed from fb56291 to 1a798dd.
This PR is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.
This PR was closed because it has been stalled for 10 days with no activity.
Added a YAML example for multi-node serving of Llama-2-70b-hf on 2 nodes with A100:2 each.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh