Can't launch VMs on Azure #283

Closed
infwinston opened this issue Feb 8, 2022 · 9 comments

@infwinston
Member

Since our Azure subscription is back, I was testing whether Sky can launch Azure VMs.
But it looks like we're still blocked by an incompatibility between the Ray autoscaler and the Azure CLI.
ray-project/ray#19523
Azure/azure-sdk-for-python#22073

Minimal reproducible command:

sky cpunode --cloud azure
  File "/data/weichiang/miniconda3/envs/sky/lib/python3.8/site-packages/azure/common/credentials.py", line 72, in get_token
    _, token, fulltoken = credentials._token_retriever()  # pylint:disable=protected-access
TypeError: _retrieve_token() missing 1 required positional argument: 'token_resource'

The Azure CLI version is 2.29.0. Ray is 1.9.1.
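
For anyone comparing environments, here is a minimal sketch for listing the relevant package versions from Python. It only uses the standard library's `importlib.metadata`; the helper name `installed_versions` is hypothetical and not part of Sky or Ray.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is absent from the active environment.
            found[name] = None
    return found


if __name__ == "__main__":
    # Packages mentioned in this thread.
    for name, ver in installed_versions(
            ["ray", "azure-cli", "azure-cli-core", "setuptools"]).items():
        print(f"{name:<16} {ver or 'not installed'}")
```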

@infwinston
Member Author

@suquark How did you launch VMs on Azure when fixing issue #148?

@concretevitamin
Member

Hotfix: pip install azure-cli-core==2.22.0 (this will make Ray work but at the cost of making the az CLI tool unusable).

This is mentioned in the README. If it resolves the problem for you, maybe we should add it to https://github.com/concretevitamin/sky-experiments/blob/master/prototype/setup.py#L27.

Another issue is that we manually commented out https://github.com/concretevitamin/sky-experiments/blob/master/prototype/sky/registry.py#L13-L14. Those lines should be uncommented.

@infwinston
Member Author

Oops, I missed that hotfix in the README. With azure-cli-core==2.22.0 it works now!
Yes, I think it should be added to Sky's dependencies until Ray fixes this.

@concretevitamin
Member

concretevitamin commented Feb 9, 2022 via email

@infwinston infwinston self-assigned this Feb 9, 2022
@infwinston
Member Author

I just tried Azure again with ray==1.10.0 and azure-cli-core==2.33.0 and was able to launch a cpunode.
However, it looks like a new error popped up.

  File "/data/weichiang/miniconda3/envs/sky/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/node_provider.py", line 52, in __init__                                  
    subscription_id = provider_config["subscription_id"]
KeyError: 'subscription_id'

This leads to another question: how do I select the Azure subscription I want Sky to use if I have several? Which one does Sky choose by default in this case? @franklsf95 @concretevitamin

More complete error message:

...
  [6/7] Running setup commands                                    
    (0/3) pip3 install -U ray[default]==...
    (1/3) pip3 install ~/.sky/sky_wheels...
    (2/3) pip install -U azure-cli-core=...
  [7/7] Starting the Ray runtime
  New status: up-to-date

Useful commands
  Monitor autoscaling with
    ray exec /home/eecs/weichiang/repos/bert-sign/sky-experiments/prototype/config/user/sky-cpunode-weichiang.yml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
  Connect to a terminal on the cluster head:
    ray attach /home/eecs/weichiang/repos/bert-sign/sky-experiments/prototype/config/user/sky-cpunode-weichiang.yml
  Get a remote shell to the cluster manually:
    ssh -o IdentitiesOnly=yes -i ~/.ssh/sky-key azureuser@40.117.97.201
Traceback (most recent call last):
  File "/data/weichiang/miniconda3/envs/sky/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/data/weichiang/miniconda3/envs/sky/lib/python3.8/site-packages/ray/scripts/scripts.py", line 1938, in main
    return cli()
  File "/data/weichiang/miniconda3/envs/sky/lib/python3.8/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/data/weichiang/miniconda3/envs/sky/lib/python3.8/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/data/weichiang/miniconda3/envs/sky/lib/python3.8/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/weichiang/miniconda3/envs/sky/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/data/weichiang/miniconda3/envs/sky/lib/python3.8/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/data/weichiang/miniconda3/envs/sky/lib/python3.8/site-packages/ray/scripts/scripts.py", line 1326, in get_head_ip
    click.echo(get_head_node_ip(cluster_config_file, cluster_name))
  File "/data/weichiang/miniconda3/envs/sky/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 1149, in get_head_node_ip
    provider = _get_node_provider(config["provider"], config["cluster_name"])
  File "/data/weichiang/miniconda3/envs/sky/lib/python3.8/site-packages/ray/autoscaler/_private/providers.py", line 217, in _get_node_provider
    new_provider = provider_cls(provider_config, cluster_name)
  File "/data/weichiang/miniconda3/envs/sky/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/node_provider.py", line 52, in __init__
    subscription_id = provider_config["subscription_id"]
KeyError: 'subscription_id'
Traceback (most recent call last):
  File "/data/weichiang/miniconda3/envs/sky/bin/sky", line 33, in <module>
    sys.exit(load_entry_point('sky', 'console_scripts', 'sky')())
  File "/data/weichiang/miniconda3/envs/sky/lib/python3.8/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/data/weichiang/miniconda3/envs/sky/lib/python3.8/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/data/weichiang/miniconda3/envs/sky/lib/python3.8/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/weichiang/miniconda3/envs/sky/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/data/weichiang/miniconda3/envs/sky/lib/python3.8/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/home/eecs/weichiang/repos/bert-sign/sky-experiments/prototype/sky/cli.py", line 1196, in cpunode
    _create_and_ssh_into_node(
  File "/home/eecs/weichiang/repos/bert-sign/sky-experiments/prototype/sky/cli.py", line 307, in _create_and_ssh_into_node
    handle = backend.provision(task,
  File "/home/eecs/weichiang/repos/bert-sign/sky-experiments/prototype/sky/backends/cloud_vm_ray_backend.py", line 1061, in provision
    config_dict = provisioner.provision_with_retries(
  File "/home/eecs/weichiang/repos/bert-sign/sky-experiments/prototype/sky/backends/cloud_vm_ray_backend.py", line 875, in provision_with_retries
    config_dict = self._retry_region_zones(
  File "/home/eecs/weichiang/repos/bert-sign/sky-experiments/prototype/sky/backends/cloud_vm_ray_backend.py", line 678, in _retry_region_zones
    self._ensure_cluster_ray_started(handle, log_abs_path)
  File "/home/eecs/weichiang/repos/bert-sign/sky-experiments/prototype/sky/backends/cloud_vm_ray_backend.py", line 837, in _ensure_cluster_ray_started
    proc, _, _ = backend.run_on_head(
  File "/home/eecs/weichiang/repos/bert-sign/sky-experiments/prototype/sky/backends/cloud_vm_ray_backend.py", line 1644, in run_on_head
    head_ip = self._get_head_ip(handle, use_cached_head_ip)
  File "/home/eecs/weichiang/repos/bert-sign/sky-experiments/prototype/sky/backends/cloud_vm_ray_backend.py", line 1617, in _get_head_ip
    head_ip = self._get_node_ips(handle.cluster_yaml,
  File "/home/eecs/weichiang/repos/bert-sign/sky-experiments/prototype/sky/backends/cloud_vm_ray_backend.py", line 1544, in _get_node_ips
    out = backend_utils.run(f'ray get-head-ip {yaml_handle}',
  File "/home/eecs/weichiang/repos/bert-sign/sky-experiments/prototype/sky/backends/backend_utils.py", line 743, in run
    return subprocess.run(cmd,
  File "/data/weichiang/miniconda3/envs/sky/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'ray get-head-ip /home/eecs/weichiang/repos/bert-sign/sky-experiments/prototype/config/user/sky-cpunode-weichiang.yml' returned non-zero exit status 1.
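
The `KeyError` above is raised where Ray's Azure node provider reads `provider_config["subscription_id"]`, so the `provider` section of the generated cluster YAML apparently lacks that key. A rough pre-flight check could look like the sketch below. The helper name `missing_provider_keys` and the sample config values are hypothetical, and `subscription_id` is the only required key this traceback confirms; others may exist.

```python
def missing_provider_keys(cluster_config, required=("subscription_id",)):
    """List required keys absent from the config's 'provider' section."""
    provider = cluster_config.get("provider") or {}
    return [key for key in required if key not in provider]


# Hypothetical config, already loaded into a dict (e.g. from the cluster YAML).
config = {
    "cluster_name": "sky-cpunode-example",
    "provider": {"type": "azure", "location": "eastus"},
}

print(missing_provider_keys(config))
```

Running a check like this before `ray get-head-ip` would surface the missing key with a clearer message than the nested tracebacks above.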

@infwinston
Member Author

infwinston commented Feb 10, 2022

I tried the hint in the README, but it looks like it does not correctly switch to the given subscription (the one starting with c7) when using cpunode. I'm looking into this issue.

az account set --subscription c721f523-3577-40bc-846a-e8bf4d139ed6
sky cpunode --cloud azure
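
To confirm which subscription the CLI considers active after `az account set`, something like the sketch below may help. It shells out to `az account show --query id --output tsv` and assumes `az` is on the PATH, logged in, and working in the current environment. The function name and the injectable `run` parameter are hypothetical, added only so the call can be exercised without Azure.

```python
import subprocess


def active_subscription_id(run=subprocess.run):
    """Return the subscription ID that `az` reports as active."""
    result = run(
        ["az", "account", "show", "--query", "id", "--output", "tsv"],
        check=True,           # raise if az exits with a non-zero status
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode output as str
    )
    return result.stdout.strip()


# Usage (requires a working, logged-in az CLI):
#   print(active_subscription_id())
```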

@concretevitamin
Member

sky cpunode --cloud azure works for me without errors, after running the following:

# ray 1.9.2
pip install "setuptools<58"
pip install azure-cli==2.22.0

@infwinston does this issue still exist for you?

@infwinston
Member Author

The issue is gone with the versions below installed. It looks like we can stick with ray==1.9.2 for now. Let's revisit 1.10 once it's more mature.

setuptools                              57.5.0
azure-cli                               2.22.0
ray                                     1.9.2

@Michaelvll
Collaborator

> The issue is gone with the versions below installed. It looks like we can stick with ray==1.9.2 for now. Let's revisit 1.10 once it's more mature.
>
> setuptools                              57.5.0
> azure-cli                               2.22.0
> ray                                     1.9.2

For ray==1.10.0, we need to upgrade to azure-cli>=2.25.
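
The pairings reported in this thread can be summarized in a small lookup, sketched below. The function name is hypothetical, and treating every 1.9.x or 1.10.x release the same way is an assumption: the thread only reports ray 1.9.1 and 1.9.2 with 2.22.0, and ray 1.10.0 with azure-cli>=2.25.

```python
def azure_cli_requirement(ray_version):
    """Map a Ray version string to the azure-cli spec reported in this thread.

    Returns None for Ray versions the thread does not cover.
    """
    major_minor = tuple(int(part) for part in ray_version.split(".")[:2])
    if major_minor == (1, 9):
        # Reported working together with setuptools<58.
        return "azure-cli==2.22.0"
    if major_minor == (1, 10):
        return "azure-cli>=2.25"
    return None


for ray_version in ("1.9.2", "1.10.0"):
    print(f"ray=={ray_version}: {azure_cli_requirement(ray_version)}")
```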
