[Serve] Use versions in database for autoscaler and replica_manager (#3301) #3349

Closed


SwiftSeal03

  • Remove latest_version field in AutoScaler class

  • Remove latest_version and least_recent_version fields in ReplicaManager class

  • Add a function in serve_state.py to retrieve version information from database

Tested (run the relevant ones):
(Run on AWS)
  • pytest tests/test_smoke.py::test_skyserve_update
  • pytest tests/test_smoke.py::test_skyserve_rolling_update
  • pytest tests/test_smoke.py::test_skyserve_fast_update
  • pytest tests/test_smoke.py::test_skyserve_update_autoscale
  • pytest tests/test_smoke.py::test_skyserve_new_autoscaler_update

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@SwiftSeal03
Author

@Michaelvll @MaoZiming Please review this PR for issue #3301, thanks!

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for submitting the PR @SwiftSeal03! Left several comments.

Comment on lines -82 to -85
if version <= self.latest_version:
    logger.error(f'Invalid version: {version}, '
                 f'latest version: {self.latest_version}')
    return
Collaborator

We should still keep this check to handle concurrent sky serve update calls: only the latest update should be adopted.
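
The guard under discussion can be illustrated with a small sketch. The class `VersionGuard` and its return value are hypothetical and are not SkyPilot's actual code; the sketch only shows the idea that an update whose version is not strictly newer than the latest known one is rejected, so that when two updates race, the newest one wins.

```python
import logging

logger = logging.getLogger(__name__)


class VersionGuard:
    """Hypothetical stand-in for a component that tracks the latest version."""

    def __init__(self, initial_version: int = 1) -> None:
        self.latest_version = initial_version

    def update_version(self, version: int) -> bool:
        """Adopt `version` only if it is strictly newer than the latest one.

        Returns True if the version was adopted, False if it was rejected.
        """
        if version <= self.latest_version:
            # A stale or duplicate update (e.g. one that lost a race) is ignored.
            logger.error('Invalid version: %s, latest version: %s', version,
                         self.latest_version)
            return False
        self.latest_version = version
        return True
```

If updates for versions 3 and 2 arrive in that order, the second call is rejected and the latest version stays at 3.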

@@ -120,8 +120,12 @@ async def update_service(request: fastapi.Request):
logger.info(
Collaborator

Where do we set the version in the database? Should we set the version/spec in the database in the controller, before calling update_version on the replica manager and autoscaler, for both readability and correctness under concurrent updates?

Author

I noticed that the local process (in serve/core.py) creates an entry in the version_spec table before sending a request to the controller.
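
The ordering described here (record the new version first, then let the controller read the latest version back from the database) might look roughly like the sketch below. It uses an in-memory SQLite database; the table name `version_specs`, its columns, and the helper names are assumptions made for illustration and may not match the real schema or the functions in serve_state.py.

```python
import sqlite3
from typing import Optional


def create_table(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one row per (service, version).
    conn.execute("""CREATE TABLE IF NOT EXISTS version_specs (
        service_name TEXT,
        version INTEGER,
        spec TEXT,
        PRIMARY KEY (service_name, version))""")


def add_version(conn: sqlite3.Connection, service_name: str, version: int,
                spec: str) -> None:
    """Record a version before any update request is sent to the controller."""
    conn.execute('INSERT INTO version_specs VALUES (?, ?, ?)',
                 (service_name, version, spec))
    conn.commit()


def get_latest_version(conn: sqlite3.Connection,
                       service_name: str) -> Optional[int]:
    """Return the highest recorded version, or None if there is none."""
    row = conn.execute(
        'SELECT MAX(version) FROM version_specs WHERE service_name = ?',
        (service_name,)).fetchone()
    return row[0]


conn = sqlite3.connect(':memory:')
create_table(conn)
add_version(conn, 'my-service', 1, 'spec-v1')
add_version(conn, 'my-service', 2, 'spec-v2')
```

With this layout the database is the single source of truth, so the autoscaler and replica manager would not need to cache their own copy of the latest version.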

Comment on lines 886 to 901
removed_version = info.version
replica_infos = serve_state.get_replica_infos(
    self._service_name)
no_replica_of_removed_version = all([
    info.version != removed_version
    for info in replica_infos
])
if (no_replica_of_removed_version and
        removed_version != latest_version):
    task_yaml = serve_utils.generate_task_yaml_file_name(
        self._service_name, removed_version)
    # Delete old version metadata.
    serve_state.delete_version(self._service_name,
                               removed_version)
    # Delete storage buckets of older versions.
    service.cleanup_storage(task_yaml)
Collaborator

This is not equivalent to the original version handling. We should only remove the versions smaller than the least_recent_version.

Original:
in db: 2 3 4 5
replicas: 3 5
we remove: 2

Now:
in db: 2 3 4 5
replicas: 3 5
we remove: 2 4
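
The difference between the two rules can be reproduced with a simplified model of the example above. The function names are hypothetical, and the second function only approximates the new condition (a version with no remaining replicas that is not the latest); it is not the PR's actual code path, which runs when a replica is removed.

```python
from typing import List, Set


def removable_original(db_versions: List[int],
                       replica_versions: List[int]) -> Set[int]:
    """Original rule: remove versions older than the oldest running version.

    Assumes at least one replica exists, so min() is well defined.
    """
    least_recent_version = min(replica_versions)
    return {v for v in db_versions if v < least_recent_version}


def removable_new(db_versions: List[int],
                  replica_versions: List[int]) -> Set[int]:
    """New rule (as modeled here): remove any non-latest version
    that has no replicas left."""
    latest_version = max(db_versions)
    running = set(replica_versions)
    return {
        v for v in db_versions if v not in running and v != latest_version
    }


db_versions = [2, 3, 4, 5]
replica_versions = [3, 5]
print(sorted(removable_original(db_versions, replica_versions)))  # [2]
print(sorted(removable_new(db_versions, replica_versions)))  # [2, 4]
```

Version 4 is the case where the two rules disagree: it has no replicas, but it is newer than the oldest version still running.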

Author

@SwiftSeal03 SwiftSeal03 Mar 21, 2024

Did you mean version numbers in the example? I assumed that replicas of older versions would always be removed before newer versions, and that only the latest version could have new replicas launched. It seems that this is not true?

Comment on lines 123 to 126
# TODO(Xuanlin Jiang): This assertion is disabled because of
# possibility of race condition.
# assert version == self._replica_manager.get_latest_version(
#     self._service_name)
Collaborator

What does this mean? Can we elaborate on this, or is it even necessary to have it?

Author

@SwiftSeal03 SwiftSeal03 Mar 21, 2024

This relates to what your first comment pointed out. I didn't realize that the if check was there to handle concurrent updates. It now seems appropriate to use an if statement here, similar to the one in the deleted code mentioned in your first comment.

Author

@SwiftSeal03 SwiftSeal03 Mar 21, 2024

By the way, the commented-out assertion is now incorrect: the right-hand side should be serve_state.get_latest_version(...).

@SwiftSeal03
Author

@Michaelvll I've committed changes according to your advice. Could you please review again?


This PR is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale label Jul 20, 2024

This PR was closed because it has been stalled for 10 days with no activity.

@github-actions github-actions bot closed this Jul 30, 2024

2 participants