[Lambda] Lambda Cloud SkyPilot provisioner #3865

Open
wants to merge 4 commits into base: master


@kmushegi kmushegi commented Aug 22, 2024

This PR implements the SkyPilot provisioner for Lambda Cloud.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • Ran multiple tests against Lambda Cloud.
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@kmushegi kmushegi changed the title feat: Lambda Cloud SkyPilot provisioner [Lambda] Lambda Cloud SkyPilot provisioner Aug 22, 2024
@kmushegi kmushegi marked this pull request as ready for review August 22, 2024 23:18
@kmushegi kmushegi force-pushed the feat/oss-lambda-cloud-new-provisioner branch 2 times, most recently from 3b00b53 to 4048b32 Compare August 22, 2024 23:39
Collaborator

@cblmemo cblmemo left a comment

Thanks for this amazing PR @kmushegi ! 🚀 It would be really useful to move Lambda to the new provisioner and speed up provisioning a lot. Left some comments to discuss!

Resolved review threads: sky/provision/lambda_cloud/instance.py (8), sky/provision/lambda_cloud/lambda_utils.py (2)
@romilbhardwaj
Collaborator

Thanks @kmushegi!

Trying this out, ran into this error when trying to launch

$ sky launch -c lamb --num-nodes 2 --cloud lambda -- echo hi
...
I 08-27 16:16:02 provisioner.py:65] Launching on Lambda us-east-1 (all zones)
E 08-27 16:16:02 provisioner.py:80] Failed to configure 'lamb' on Lambda Region(name='us-east-1') (all zones) with the following error:
E 08-27 16:16:02 provisioner.py:80] AssertionError: Unknown provider: lambda
D 08-27 16:16:02 provisioner.py:171] Failed to provision 'lamb' on Lambda (all zones).
D 08-27 16:16:02 provisioner.py:173] bulk_provision for 'lamb' failed. Stacktrace:
D 08-27 16:16:02 provisioner.py:173] Traceback (most recent call last):
D 08-27 16:16:02 provisioner.py:173]   File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/provisioner.py", line 165, in bulk_provision
D 08-27 16:16:02 provisioner.py:173]     return _bulk_provision(cloud, region, zones, cluster_name,
D 08-27 16:16:02 provisioner.py:173]   File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/provisioner.py", line 76, in _bulk_provision
D 08-27 16:16:02 provisioner.py:173]     config = provision.bootstrap_instances(provider_name, region_name,
D 08-27 16:16:02 provisioner.py:173]   File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/__init__.py", line 44, in _wrapper
D 08-27 16:16:02 provisioner.py:173]     assert module is not None, f'Unknown provider: {module_name}'
D 08-27 16:16:02 provisioner.py:173] AssertionError: Unknown provider: lambda
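
The assertion in this traceback fires when the dispatch wrapper cannot resolve a module for the provider name. A minimal sketch of that kind of name-based routing follows. The `_PROVIDERS` registry, the `route_to_provider` decorator, and the fake `aws` entry are all hypothetical and only illustrate the failure mode; SkyPilot's actual lookup in `sky/provision/__init__.py` may differ.

```python
import functools
from types import SimpleNamespace
from typing import Any, Callable, Dict

# Hypothetical registry standing in for provider modules. 'lambda' is
# deliberately missing so the lookup fails the same way as in the log.
_PROVIDERS: Dict[str, Any] = {
    'aws': SimpleNamespace(
        bootstrap_instances=lambda region, cluster: f'aws:{region}:{cluster}'),
}


def route_to_provider(func: Callable) -> Callable:
    """Dispatch func(provider_name, ...) to the provider's same-named function."""

    @functools.wraps(func)
    def _wrapper(provider_name: str, *args, **kwargs):
        module_name = provider_name.lower()
        module = _PROVIDERS.get(module_name)
        assert module is not None, f'Unknown provider: {module_name}'
        impl = getattr(module, func.__name__)
        return impl(*args, **kwargs)

    return _wrapper


@route_to_provider
def bootstrap_instances(provider_name: str, region: str, cluster: str):
    # Never executed: the wrapper forwards the call to the provider module.
    raise NotImplementedError
```

In this sketch, adding a `'lambda'` entry to the registry is what makes the call succeed, which matches the shape of a fix where a registration change was left out of a commit.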

@kmushegi
Author

thanks for the reviews folks, will try to address asap

@kmushegi
Author

kmushegi commented Aug 30, 2024

Fixed this, missed a change to commit initially.

Moving onto some testing

Up/down works, but Ray is failing to start; I'll keep debugging. Error:

RuntimeError: Failed to start ray on the worker node (exit code 1).
Detailed Error:
===== stdout =====
2024-08-30 19:21:56,699	INFO scripts.py:1163 -- Did not find any active Ray processes.
2024-08-30 19:21:57,524	INFO scripts.py:926 -- Local node IP: 127.0.0.1
Traceback (most recent call last):
  File "/home/ubuntu/skypilot-runtime/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/ray/scripts/scripts.py", line 2498, in main
    return cli()
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/ray/autoscaler/_private/cli_logger.py", line 856, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/ray/scripts/scripts.py", line 928, in start
    node = ray._private.node.Node(
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/ray/_private/node.py", line 153, in __init__
    self._init_gcs_client()
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/ray/_private/node.py", line 730, in _init_gcs_client
    raise RuntimeError(
RuntimeError: Failed to connect to GCS.

update: single node works, multi-node still struggling but root-caused

update: multi-node fixed as well

@kmushegi kmushegi force-pushed the feat/oss-lambda-cloud-new-provisioner branch from cbbb07c to baf0951 Compare August 30, 2024 22:45
@kmushegi kmushegi force-pushed the feat/oss-lambda-cloud-new-provisioner branch from 6897ab9 to 2de3d04 Compare September 11, 2024 20:16
@cblmemo
Collaborator

cblmemo commented Sep 11, 2024

Hi @kmushegi I'm trying this today and encountered the following error. What does quantity: Input should be less than or equal to 1 mean here? Can we have a more informative error message here?

sky launch --cloud lambda --num-nodes 3 -c lmd-3node
I 09-11 14:16:56 optimizer.py:719] == Optimizer ==
I 09-11 14:16:56 optimizer.py:730] Target: minimizing cost
I 09-11 14:16:56 optimizer.py:742] Estimated cost: $2.2 / hour
I 09-11 14:16:56 optimizer.py:742] 
I 09-11 14:16:56 optimizer.py:867] Considered resources (3 nodes):
I 09-11 14:16:56 optimizer.py:937] ------------------------------------------------------------------------------------------
I 09-11 14:16:56 optimizer.py:937]  CLOUD    INSTANCE     vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
I 09-11 14:16:56 optimizer.py:937] ------------------------------------------------------------------------------------------
I 09-11 14:16:56 optimizer.py:937]  Lambda   gpu_1x_a10   30      200       A10:1          us-east-1     2.25          ✔     
I 09-11 14:16:56 optimizer.py:937] ------------------------------------------------------------------------------------------
I 09-11 14:16:56 optimizer.py:937] 
Launching a new cluster 'lmd-3node'. Proceed? [Y/n]: 
I 09-11 14:16:57 cloud_vm_ray_backend.py:4397] Creating a new cluster: 'lmd-3node' [3x Lambda(gpu_1x_a10, {'A10': 1})].
I 09-11 14:16:57 cloud_vm_ray_backend.py:4397] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 09-11 14:16:57 cloud_vm_ray_backend.py:1314] To view detailed progress: tail -n100 -f /home/memory/sky_logs/sky-2024-09-11-14-16-56-536636/provision.log
I 09-11 14:16:58 provisioner.py:65] Launching on Lambda us-east-1 (all zones)
W 09-11 14:17:02 instance.py:117] run_instances error: global/invalid-parameters: quantity: Input should be less than or equal to 1
W 09-11 14:17:05 cloud_vm_ray_backend.py:2003] sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in all zones in us-east-1. Try changing resource requirements or use another region.
W 09-11 14:17:05 cloud_vm_ray_backend.py:2012] 
W 09-11 14:17:05 cloud_vm_ray_backend.py:2012] Provision failed for 3x Lambda(gpu_1x_a10, {'A10': 1}) in us-east-1. Trying other locations (if any).
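
The `quantity: Input should be less than or equal to 1` message suggests the launch request was rejected because it asked for more than one instance at once. One way to handle a multi-node launch under that limit is to issue one request per node. The sketch below assumes this; `FakeLambdaClient` and its `create_instances` signature are hypothetical stand-ins, not the real Lambda Cloud client.

```python
from typing import List


class FakeLambdaClient:
    """Hypothetical stand-in for a Lambda Cloud client; not the real API."""

    def __init__(self) -> None:
        self._next_id = 0

    def create_instances(self, instance_type: str, region: str,
                         quantity: int, name: str) -> List[str]:
        if quantity > 1:
            # Mirrors the message seen in the log above.
            raise ValueError(
                'quantity: Input should be less than or equal to 1')
        ids = [f'i-{self._next_id + k}' for k in range(quantity)]
        self._next_id += quantity
        return ids


def launch_nodes(client: FakeLambdaClient, instance_type: str, region: str,
                 name_prefix: str, count: int) -> List[str]:
    """Launch `count` nodes with one single-instance request per node."""
    created: List[str] = []
    for idx in range(count):
        ids = client.create_instances(instance_type, region,
                                      quantity=1,
                                      name=f'{name_prefix}-{idx}')
        if len(ids) != 1:
            raise RuntimeError(
                f'Expected exactly one instance, got {len(ids)}')
        created.extend(ids)
    return created
```

A caller that still needs to surface the raw API error could wrap it with the requested node count and instance type, so the message explains why the request was invalid.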

Collaborator

@cblmemo cblmemo left a comment

Thanks for the fix @kmushegi ! I tested this PR and all of launch/terminate/multinode works well. Left some nits and after that it should be ready to go!



def _filter_instances(cluster_name_on_cloud: str,
                      status_filters: Optional[List[str]]) -> Dict[str, Any]:
Collaborator

Suggested change
status_filters: Optional[List[str]]) -> Dict[str, Any]:
status_filters: Optional[List[str]]) -> Dict[str, Dict[str, Any]]:

nit

created_instance_ids = []
ssh_key_name = _get_ssh_key_name()

def launch_nodes(node_type: str, quantity: int):
Collaborator

Suggested change
def launch_nodes(node_type: str, quantity: int):
def launch_nodes(node_type: str, quantity: int) -> xxx:

return value type?

Comment on lines +122 to +124
if len(instance_ids) != 1:
    raise RuntimeError(
        f'Expected exactly one instance, got {len(instance_ids)}')
Collaborator

@cblmemo cblmemo Sep 11, 2024

Suggested change
if len(instance_ids) != 1:
    raise RuntimeError(
        f'Expected exactly one instance, got {len(instance_ids)}')
assert len(instance_ids) == 1, instance_ids

I think it is safe to use an assertion here?
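
One consideration for this choice: Python removes `assert` statements when run with `-O`, so an assertion guards the invariant only in non-optimized runs, while an explicit exception always fires. A small sketch of the two variants, with hypothetical function names chosen for illustration:

```python
from typing import List


def single_id_with_assert(instance_ids: List[str]) -> str:
    # Skipped entirely under `python -O`; suited to internal invariants.
    assert len(instance_ids) == 1, instance_ids
    return instance_ids[0]


def single_id_with_exception(instance_ids: List[str]) -> str:
    # Always checked; suited to conditions an external API could violate.
    if len(instance_ids) != 1:
        raise RuntimeError(
            f'Expected exactly one instance, got {len(instance_ids)}')
    return instance_ids[0]
```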

try:
    logger.debug(
        f'Terminating instances {", ".join(instance_ids_to_terminate)}')
    lambda_client.remove_instances(*instance_ids_to_terminate)
Collaborator

Suggested change
lambda_client.remove_instances(*instance_ids_to_terminate)
lambda_client.remove_instances(instance_ids_to_terminate)

nit: How about making the function accept a list of str instead of unpacking here?

})
response = _try_request_with_backoff(
'post',
f'{API_ENDPOINT}/instance-operations/launch',
data=data,
headers=self.headers)
headers=self.headers,
)
return response.json().get('data', []).get('instance_ids', [])

def remove_instances(self, *instance_ids: str) -> Dict[str, Any]:
Collaborator

Suggested change
def remove_instances(self, *instance_ids: str) -> Dict[str, Any]:
def remove_instances(self, instance_ids: List[str]) -> Dict[str, Any]:

As mentioned previously, can we make this function take a list of str instead?
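
A minimal sketch of the list-based signature being suggested. `LambdaClientSketch` is hypothetical, the payload shape is an assumption, and no request is sent; the method only returns the body it would post.

```python
from typing import Any, Dict, List


class LambdaClientSketch:
    """Hypothetical client; builds the request body without sending it."""

    def remove_instances(self, instance_ids: List[str]) -> Dict[str, Any]:
        # A bare string is iterable too, so reject it explicitly; otherwise
        # 'abc' would silently become ['a', 'b', 'c'].
        if isinstance(instance_ids, str):
            raise TypeError('instance_ids must be a list of str, not str')
        # Payload shape is an assumption for illustration only.
        return {'instance_ids': list(instance_ids)}
```

With this signature the call site passes the list directly, as in `client.remove_instances(instance_ids_to_terminate)`, without the `*` unpacking.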

Comment on lines +221 to +223
custom_ray_options={
    'use_external_ip': True,
},
Collaborator

What is this option for?

Comment on lines +240 to +241
'terminating': status_lib.ClusterStatus.STOPPED,
'terminated': status_lib.ClusterStatus.STOPPED,
Collaborator

It is a little strange to see a STOPPED status in a cloud that does not support stop. Why does an instance with terminated status show up in the instance list? Shouldn't it just disappear from the list? And maybe we can map the terminating status to INIT?
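
A sketch of the mapping this comment proposes, under stated assumptions. The `ClusterStatus` enum here is a hypothetical stand-in for `status_lib.ClusterStatus`. Only `'terminating'` and `'terminated'` come from the diff; `'booting'` and `'active'` are assumed state names. Using `None` to mean "instance is gone" is also an assumption about what the caller expects.

```python
import enum
from typing import Dict, Optional


class ClusterStatus(enum.Enum):
    """Hypothetical stand-in for status_lib.ClusterStatus."""
    INIT = 'INIT'
    UP = 'UP'
    STOPPED = 'STOPPED'


# None marks an instance that should be treated as no longer existing.
STATUS_MAP: Dict[str, Optional[ClusterStatus]] = {
    'booting': ClusterStatus.INIT,      # assumed state name
    'active': ClusterStatus.UP,         # assumed state name
    'terminating': ClusterStatus.INIT,  # reviewer's suggestion
    'terminated': None,                 # never reported as STOPPED
}


def map_statuses(raw: Dict[str, str]) -> Dict[str, Optional[ClusterStatus]]:
    """Translate {instance_id: cloud_state}; unknown states raise KeyError."""
    return {
        instance_id: STATUS_MAP[state] for instance_id, state in raw.items()
    }
```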


4 participants