-
Notifications
You must be signed in to change notification settings - Fork 470
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Lambda] Lambda Cloud SkyPilot provisioner #3865
base: master
Are you sure you want to change the base?
[Lambda] Lambda Cloud SkyPilot provisioner #3865
Conversation
3b00b53
to
4048b32
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this amazing PR @kmushegi ! 🚀 It would be really useful to move Lambda to the new provisioner and speed up provisioning a lot. Left some comments to discuss!
Thanks @kmushegi! Trying this out, ran into this error when trying to launch
|
thanks for the reviews folks, will try to address asap |
Fixed this, missed a change to commit initially.
up down works but ray failing to start, i'll keep debugging. error
update: single node works, multi-node still struggling but root-caused update: multi-node fixed as well |
cbbb07c
to
baf0951
Compare
6897ab9
to
2de3d04
Compare
Hi @kmushegi I'm trying this today and encountered the following error. What does sky launch --cloud lambda --num-nodes 3 -c lmd-3node
I 09-11 14:16:56 optimizer.py:719] == Optimizer ==
I 09-11 14:16:56 optimizer.py:730] Target: minimizing cost
I 09-11 14:16:56 optimizer.py:742] Estimated cost: $2.2 / hour
I 09-11 14:16:56 optimizer.py:742]
I 09-11 14:16:56 optimizer.py:867] Considered resources (3 nodes):
I 09-11 14:16:56 optimizer.py:937] ------------------------------------------------------------------------------------------
I 09-11 14:16:56 optimizer.py:937] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
I 09-11 14:16:56 optimizer.py:937] ------------------------------------------------------------------------------------------
I 09-11 14:16:56 optimizer.py:937] Lambda gpu_1x_a10 30 200 A10:1 us-east-1 2.25 ✔
I 09-11 14:16:56 optimizer.py:937] ------------------------------------------------------------------------------------------
I 09-11 14:16:56 optimizer.py:937]
Launching a new cluster 'lmd-3node'. Proceed? [Y/n]:
I 09-11 14:16:57 cloud_vm_ray_backend.py:4397] Creating a new cluster: 'lmd-3node' [3x Lambda(gpu_1x_a10, {'A10': 1})].
I 09-11 14:16:57 cloud_vm_ray_backend.py:4397] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 09-11 14:16:57 cloud_vm_ray_backend.py:1314] To view detailed progress: tail -n100 -f /home/memory/sky_logs/sky-2024-09-11-14-16-56-536636/provision.log
I 09-11 14:16:58 provisioner.py:65] Launching on Lambda us-east-1 (all zones)
W 09-11 14:17:02 instance.py:117] run_instances error: global/invalid-parameters: quantity: Input should be less than or equal to 1
W 09-11 14:17:05 cloud_vm_ray_backend.py:2003] sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in all zones in us-east-1. Try changing resource requirements or use another region.
W 09-11 14:17:05 cloud_vm_ray_backend.py:2012]
W 09-11 14:17:05 cloud_vm_ray_backend.py:2012] Provision failed for 3x Lambda(gpu_1x_a10, {'A10': 1}) in us-east-1. Trying other locations (if any). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix @kmushegi ! I tested this PR and all of launch/terminate/multinode works well. Left some nits and after that it should be ready to go!
|
||
|
||
def _filter_instances(cluster_name_on_cloud: str, | ||
status_filters: Optional[List[str]]) -> Dict[str, Any]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
status_filters: Optional[List[str]]) -> Dict[str, Any]: | |
status_filters: Optional[List[str]]) -> Dict[str, Dict[str, Any]]: |
nit
created_instance_ids = [] | ||
ssh_key_name = _get_ssh_key_name() | ||
|
||
def launch_nodes(node_type: str, quantity: int): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def launch_nodes(node_type: str, quantity: int): | |
def launch_nodes(node_type: str, quantity: int) -> xxx: |
return value type?
if len(instance_ids) != 1: | ||
raise RuntimeError( | ||
f'Expected exactly one instance, got {len(instance_ids)}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if len(instance_ids) != 1: | |
raise RuntimeError( | |
f'Expected exactly one instance, got {len(instance_ids)}') | |
assert len(instance_ids) == 1, instance_ids |
I think it is safe to use an assertion here?
try: | ||
logger.debug( | ||
f'Terminating instances {", ".join(instance_ids_to_terminate)}') | ||
lambda_client.remove_instances(*instance_ids_to_terminate) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lambda_client.remove_instances(*instance_ids_to_terminate) | |
lambda_client.remove_instances(instance_ids_to_terminate) |
nit: How about we make the function to accept a list of str instead of unpack here?
}) | ||
response = _try_request_with_backoff( | ||
'post', | ||
f'{API_ENDPOINT}/instance-operations/launch', | ||
data=data, | ||
headers=self.headers) | ||
headers=self.headers, | ||
) | ||
return response.json().get('data', []).get('instance_ids', []) | ||
|
||
def remove_instances(self, *instance_ids: str) -> Dict[str, Any]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def remove_instances(self, *instance_ids: str) -> Dict[str, Any]: | |
def remove_instances(self, instance_ids: List[str]) -> Dict[str, Any]: |
As mentione previously, can we make this function to take a list of str instead?
custom_ray_options={ | ||
'use_external_ip': True, | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this option for?
'terminating': status_lib.ClusterStatus.STOPPED, | ||
'terminated': status_lib.ClusterStatus.STOPPED, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a little bit strange to see STOPPED
status in a cloud that does not support stop. Why is an instance with terminated
status shown in the instance list? Shouldn't it just disappear from the list? And maybe we can let the terminating
statue mapped to INIT
?
This PR implements the SkyPilot provisioner for Lambda Cloud.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh