-
Notifications
You must be signed in to change notification settings - Fork 470
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Azure] Support fractional A10 instance types #3877
base: master
Are you sure you want to change the base?
Conversation
# Filter out instance types that only contain a fractional of GPU. | ||
df_filtered = _df.loc[~_df['InstanceType'].isin(_FILTERED_A10_INSTANCE_TYPES | ||
)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of excluding the instances directly, can we print out some hints like the one when we specify sky launch --gpus L4
:
Multiple AWS instances satisfy L4:1. The cheapest AWS(g6.xlarge, {'L4': 1}) is considered among:
I 08-27 06:09:54 optimizer.py:922] ['g6.xlarge', 'g6.2xlarge', 'g6.4xlarge', 'gr6.4xlarge', 'g6.8xlarge', 'gr6.8xlarge', 'g6.16xlarge'].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This hint is used to print instances w/ same accelerator number. I'm thinking if we should do this to fractional GPUs..
…ployment group works for fractional one
instance_type=xxx
I support launching from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update @cblmemo! Mostly looks good to me with some slight issue.
# Manually update the GPU count for fractional A10 instance types. | ||
df_ret['AcceleratorCount'] = df_ret.apply(_upd_a10_gpu_count, | ||
axis='columns') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we say more in the comment for why we need to do it manually?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! Added. PTAL
@@ -67,7 +67,7 @@ def get_default_instance_type( | |||
|
|||
|
|||
def get_accelerators_from_instance_type( | |||
instance_type: str) -> Optional[Dict[str, int]]: | |||
instance_type: str) -> Optional[Dict[str, Union[int, float]]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the clouds that won't return float
, we probably don't need to update it to Union[int, float]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same, here is what I got:
./format.sh
SkyPilot Black:
All done! ✨ 🍰 ✨
4 files left unchanged.
SkyPilot yapf: Done
SkyPilot isort:
Skipped 5 files
SkyPilot mypy:
sky/clouds/service_catalog/aws_catalog.py:248: error: Incompatible return value type (got "Optional[Dict[str, Union[int, float]]]", expected "Optional[Dict[str, int]]") [return-value]
Found 1 error in 1 file (checked 225 source files)
@@ -252,7 +251,7 @@ def zones_provision_loop( | |||
def get_accelerators_from_instance_type( | |||
cls, | |||
instance_type: str, | |||
) -> Optional[Dict[str, int]]: | |||
) -> Optional[Dict[str, Union[int, float]]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the clouds that will not return float
, we can avoid using Union
, as mypy should support a inherited method that only uses a subset of the type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried and this is what I got:
./format.sh
SkyPilot Black:
All done! ✨ 🍰 ✨
4 files left unchanged.
SkyPilot yapf: Done
SkyPilot isort:
Skipped 5 files
SkyPilot mypy:
sky/clouds/aws.py:365: error: Return type "Optional[Dict[str, int]]" of "get_accelerators_from_instance_type" incompatible with return type "Optional[Dict[str, Union[int, float]]]" in supertype "Cloud" [override]
sky/clouds/aws.py:369: error: Incompatible return value type (got "Optional[Dict[str, Union[int, float]]]", expected "Optional[Dict[str, int]]") [return-value]
sky/clouds/aws.py:397: error: Argument 1 to "make_ray_custom_resources_str" has incompatible type "Optional[Dict[str, int]]"; expected "Optional[Dict[str, Union[int, float]]]" [arg-type]
Found 3 errors in 1 file (checked 225 source files)
sky/resources.py
Outdated
if isinstance(self.accelerators[acc], float) or isinstance( | ||
other_accelerators[acc], float): | ||
# If the requested accelerator count is a float, we only | ||
# allow strictly equal counts since all of the float point | ||
# accelerator counts are less than 1 (e.g., 0.1, 0.5), and | ||
# we want to avoid semantic ambiguity (e.g. launching | ||
# with --gpus A10:0.25 on a A10:0.75 cluster). | ||
if not math.isclose(self.accelerators[acc], | ||
other_accelerators[acc]): | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should allow requested resources to be a float, while the existing accelerators to be a int, as long as requested resources is <= existing resources.
That said, the
if isinstance(self.accelerators[acc], float) or isinstance( | |
other_accelerators[acc], float): | |
# If the requested accelerator count is a float, we only | |
# allow strictly equal counts since all of the float point | |
# accelerator counts are less than 1 (e.g., 0.1, 0.5), and | |
# we want to avoid semantic ambiguity (e.g. launching | |
# with --gpus A10:0.25 on a A10:0.75 cluster). | |
if not math.isclose(self.accelerators[acc], | |
other_accelerators[acc]): | |
return False | |
if isinstance(other_accelerators[acc], float) and not other_accelerators[acc].is_integer(): | |
# If the requested accelerator count is a float, we only | |
# allow strictly equal counts since all of the float point | |
# accelerator counts are less than 1 (e.g., 0.1, 0.5), and | |
# we want to avoid semantic ambiguity (e.g. launching | |
# with --gpus A10:0.25 on a A10:0.75 cluster). | |
if not math.isclose(self.accelerators[acc], | |
other_accelerators[acc]): | |
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! Updated. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, after a second thought, I think we should still keep the original isinstance(self.accelerators[acc], float) or isinstance(other_accelerators[acc], float)
condition. Considering the following case: user submit the jobs with --gpus A10:0.5 and the cluster has A10:1. In fact the requirements 0.5 will be translated to 1 and thus the user can only have one A10:0.5 job instead of 2, which is confusing. The original condition capture such case but the updated one (isinstance(other_accelerators[acc], float) and not other_accelerators[acc].is_integer()
) does not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We did allow having two A10:0.5
running on a single cluster with A10:1
. Do you know when did we change the behavior of this? or did we ever change this behavior before this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to fix this before we merge the PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the support @cblmemo! It seems good to me. Please do some tests to make sure the changes do not cause issues with other clouds and other ACC types (considering we have changed significant amount of places)
sky/backends/cloud_vm_ray_backend.py
Outdated
'Task requested resources with fractional ' | ||
'accelerator counts. For fractional ' | ||
'counts, the required count must match the ' | ||
'existing cluster. Got required accelerator' | ||
f' {acc}:{self_count} but the existing ' | ||
f'cluster has {acc}:{existing_count}.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error message is not accurate? Our check is for ACC count of existing cluster instead of the task requested resources?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see the above comments 🤔
sky/resources.py
Outdated
if isinstance(self.accelerators[acc], float) or isinstance( | ||
other_accelerators[acc], float): | ||
# If the requested accelerator count is a float, we only | ||
# allow strictly equal counts since all of the float point | ||
# accelerator counts are less than 1 (e.g., 0.1, 0.5), and | ||
# we want to avoid semantic ambiguity (e.g. launching | ||
# with --gpus A10:0.25 on a A10:0.75 cluster). | ||
if not math.isclose(self.accelerators[acc], | ||
other_accelerators[acc]): | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to fix this before we merge the PR
sky/resources.py
Outdated
# If the existing accelerator count is a float, we only | ||
# allow strictly equal counts since all of the float point | ||
# accelerator counts are less than 1 (e.g., 0.167, 0.5), and | ||
# we want to avoid semantic ambiguity (e.g. launching | ||
# with --gpus A10:0.25 on a A10:0.75 cluster). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we rephrase this comment? Not sure what we are talking about here.
Just identified another bug: for a A10:0.5 cluster, previous implementation would force using |
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
Actually, after a second thought, I think we could even allow requiring |
Another TODO: I just found that there are 1/6 and 1/3 A10 instance types. We need to figure out a precision to display such decimals. |
All todos is done. After smoke test it should be able to merge ;) |
return int(value) | ||
return float(value) | ||
|
||
return {acc_name: _convert(acc_count)} | ||
|
||
|
||
def get_instance_type_for_accelerator_impl( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we update the acc_count
type here? Also, when we are comparing the acc_count
should we make sure every number within math.abs(df['AcceleratorCount'] - acc_count) <= 0.01
should work. Otherwise, a user running sky launch --gpus A10:0.16
or sky launch --gpus A10:0.1666
would fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we update the acc_count type here?
Sry could you elaborate on this...? Are you saying there are a better place to update the type or what..?
Also, when we are comparing the
acc_count
should we make sure every number withinmath.abs(df['AcceleratorCount'] - acc_count) <= 0.01
should work.
For this, I'm slightly concerned about the case when user:
sky launch -c a10-frac --gpus A10:0.16 # detected as 0.167 in catalog, then launch cluster with 0.167 gpu
sky exec a10-frac --gpus A10:0.16 sleep 100000 # user would think the cluster is full
sky exec a10-frac --gpus A10:0.007 sleep 100000 # however, this still running as the cluster is actually launched with 0.167 gpu.
To deal with the failing, we currently shows all valid instance type as fuzzy candidate and user could then modify their acc count:
sky launch --gpus A10:0.16
I 09-12 22:34:50 optimizer.py:1301] No resource satisfying <Cloud>({'A10': 0.16}) on [AWS, GCP, Azure, RunPod].
I 09-12 22:34:50 optimizer.py:1305] Did you mean: ['A100-80GB-SXM:1', 'A100-80GB-SXM:2', 'A100-80GB-SXM:4', 'A100-80GB-SXM:8', 'A100-80GB:1', 'A100-80GB:2', 'A100-80GB:4', 'A100-80GB:8', 'A100:1', 'A100:16', 'A100:2', 'A100:4', 'A100:8', 'A10:0.167', 'A10:0.333', 'A10:0.5', 'A10:1', 'A10:2', 'A10G:1', 'A10G:4', 'A10G:8']
sky.exceptions.ResourcesUnavailableError: Catalog does not contain any instances satisfying the request:
Task<name=sky-cmd>(run=<empty>)
resources: <Cloud>({'A10': 0.16}).
To fix: relax or change the resource requirements.
Try one of these offered accelerators: ['A100-80GB-SXM:1', 'A100-80GB-SXM:2', 'A100-80GB-SXM:4', 'A100-80GB-SXM:8', 'A100-80GB:1', 'A100-80GB:2', 'A100-80GB:4', 'A100-80GB:8', 'A100:1', 'A100:16', 'A100:2', 'A100:4', 'A100:8', 'A10:0.167', 'A10:0.333', 'A10:0.5', 'A10:1', 'A10:2', 'A10G:1', 'A10G:4', 'A10G:8']
Hint: sky show-gpus to list available accelerators.
sky check to check the enabled clouds.
Does that sounds good to you?
Closes #3708
This PR support fractional A10 instance types from
instance_type=xxx
andaccelerators=A10:{0.25,0.5,0.75}
.Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh