-
Notifications
You must be signed in to change notification settings - Fork 471
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Serve] Use versions in database for autoscaler and replica_manager (#3301) #3349
[Serve] Use versions in database for autoscaler and replica_manager (#3301) #3349
Conversation
@Michaelvll @MaoZiming Please review this PR for issue #3301 , thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for submitting the PR @SwiftSeal03! Left several comments.
if version <= self.latest_version: | ||
logger.error(f'Invalid version: {version}, ' | ||
f'latest version: {self.latest_version}') | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should still keep this behavior to avoid concurrent sky update
. We should only adopt the latest update.
@@ -120,8 +120,12 @@ async def update_service(request: fastapi.Request): | |||
logger.info( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do we set the version in the database? Should we set the version/spec in the database in the controller, before we call into the update_version
for replica manager and autoscaler, both for readability and correctness in concurrent setting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed that local process (in serve/core.py) would create an entry in the version_spec
table before sending a request to the controller.
sky/serve/replica_managers.py
Outdated
removed_version = info.version | ||
replica_infos = serve_state.get_replica_infos( | ||
self._service_name) | ||
no_replica_of_removed_version = all([ | ||
info.version != removed_version | ||
for info in replica_infos | ||
]) | ||
if (no_replica_of_removed_version and | ||
removed_version != latest_version): | ||
task_yaml = serve_utils.generate_task_yaml_file_name( | ||
self._service_name, removed_version) | ||
# Delete old version metadata. | ||
serve_state.delete_version(self._service_name, | ||
removed_version) | ||
# Delete storage buckets of older versions. | ||
service.cleanup_storage(task_yaml) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not equivalent to the original version handling. We should only remove the versions smaller than the least_recent_version.
Original:
in db: 2 3 4 5
replicas: 3 5
we remove: 2
Now:
in db: 2 3 4 5
replica: 3 5
we remove: 2 4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean version numbers in the example? I assumed that replicas of older versions would always be removed before newer versions, and that only the latest version could have new replicas launched. It seems that this is not true?
sky/serve/controller.py
Outdated
# TODO(Xuanlin Jiang): This assertion is disabled because of | ||
# possibility of race condition. | ||
# assert version == self._replica_manager.get_latest_version( | ||
# self._service_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this mean? Can we elaborate this or is it necessary to have this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is about what your first comment pointed out. I didn't realize that the if
code was to handle concurrent updates. Now it looks appropriate to use a similar if
statement here as in deleted code mentioned in your first comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The commented assertion is incorrect now by the way. It should be serve_state.get_latest_version(...)
on the right hand side.
@Michaelvll I've commit changes according to your advice. Could you please review again? |
This PR is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days. |
This PR was closed because it has been stalled for 10 days with no activity. |
Remove
latest_version
field inAutoScaler
classRemove
latest_version
andleast_recent_version
fields inReplicaManager
classAdd a function in
serve_state.py
to retrieve version information from databaseTested (run the relevant ones):
(Run on AWS)
pytest tests/test_smoke.py::test_skyserve_update
pytest tests/test_smoke.py::test_skyserve_rolling_update
pytest tests/test_smoke.py::test_skyserve_fast_update
pytest tests/test_smoke.py::test_skyserve_update_autoscale
pytest tests/test_smoke.py::test_skyserve_new_autoscaler_update
bash format.sh
pytest tests/test_smoke.py
bash tests/backward_comaptibility_tests.sh