
Sky init and cloud credentials for storage #102

Merged: 52 commits merged into master on Feb 11, 2022
Conversation

@franklsf95 (Contributor) commented Dec 9, 2021

The sky init command will check credentials for all three clouds, and write the list of available clouds into the global user state. The optimizer will then only generate plans using the enabled clouds.
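The following is a minimal, self-contained sketch of this flow, not the PR's actual code; it assumes each cloud object exposes check_credentials() -> (ok, reason), a shape that matches the `return True, None` in the gcp.py excerpt quoted later in this thread:

```python
# Sketch only. FakeCloud stands in for the real AWS/Azure/GCP classes.
from typing import List, Optional, Tuple


class FakeCloud:
    def __init__(self, name: str, ok: bool, reason: Optional[str] = None):
        self.name = name
        self._result = (ok, reason)

    def check_credentials(self) -> Tuple[bool, Optional[str]]:
        return self._result


def init(clouds: List[FakeCloud]) -> List[str]:
    enabled = []
    for cloud in clouds:
        ok, reason = cloud.check_credentials()
        print(f'  {cloud.name}: {"enabled" if ok else "disabled"}')
        if ok:
            enabled.append(cloud.name)
        else:
            print(f'    Reason: {reason}')
    return enabled  # the real code persists this list to the global user state


init([FakeCloud('AWS', True),
      FakeCloud('Azure', False, 'Azure CLI returned error.'),
      FakeCloud('GCP', True)])
```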

Demo:

$ sky init
Sky will use the following clouds to run jobs. To change this, configure
cloud access credentials, and rerun sky init.

  AWS: enabled
  Azure: enabled
  GCP: enabled

If a cloud is not logged in:

$ sky init
Sky will use the following clouds to run jobs. To change this, configure
cloud access credentials, and rerun sky init.

  AWS: enabled
  Checking Azure...ERROR: Please run 'az login' to setup account.
  Azure: disabled
    Reason: Azure CLI returned error.
  GCP: enabled

When a cloud is "enabled", Sky also ensures that the user's node, as well as any node launched with Sky, has the access credentials for that cloud's storage.
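Mechanically, this amounts to adding each cloud's default credential directory as a file mount on every launched node. A hedged sketch: the helper name get_cloud_credential_file_mounts appears in this PR's diff further down, but the exact entries here are my assumption:

```python
# Sketch only: remote-path -> local-path mounts synced to each node at
# provisioning time. Paths are the standard CLI credential locations;
# the real list may differ.
from typing import Dict


def get_cloud_credential_file_mounts() -> Dict[str, str]:
    return {
        '~/.aws': '~/.aws',                      # AWS CLI / boto3 credentials
        '~/.azure': '~/.azure',                  # az CLI profile
        '~/.config/gcloud': '~/.config/gcloud',  # gcloud / gsutil credentials
    }
```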

When running jobs, the optimizer will skip the clouds you are not logged in with. If you are not logged into GCP but specify Resources(GCP, ...), an error is shown before provisioning, like so:

> sky run examples/resnet_app.yaml
I 12-28 21:08:57 execution.py:92] Optimizer target is set to COST.
Traceback (most recent call last):
  File "/Users/lsf/opt/miniconda3/bin/sky", line 33, in <module>
    sys.exit(load_entry_point('sky', 'console_scripts', 'sky')())
  File "/Users/lsf/opt/miniconda3/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/Users/lsf/opt/miniconda3/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/Users/lsf/opt/miniconda3/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/lsf/opt/miniconda3/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/lsf/opt/miniconda3/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/Users/lsf/projects/sky-experiments/prototype/sky/cli.py", line 196, in run
    sky.execute(dag, dryrun=dryrun, stream_logs=True, cluster_name=cluster)
  File "/Users/lsf/projects/sky-experiments/prototype/sky/execution.py", line 93, in execute
    dag = sky.optimize(dag, minimize=optimize_target)
  File "/Users/lsf/projects/sky-experiments/prototype/sky/optimizer.py", line 77, in optimize
    optimized_dag, unused_best_plan = Optimizer._optimize_cost(
  File "/Users/lsf/projects/sky-experiments/prototype/sky/optimizer.py", line 203, in _optimize_cost
    raise exceptions.ResourcesUnavailableError(
sky.exceptions.ResourcesUnavailableError: No launchable resource found for task resnet-app. Try relaxing its resource requirements, and run sky init to make sure the cloud you specified (if any) is enabled.

Tasks and tests:

  • AWS node can access a private AWS bucket
  • Azure node can access a private AWS bucket (cannot test for now because our Azure account is disabled)
  • GCP node can access a private AWS bucket
  • AWS node can access a private GCS bucket
  • Azure node can access a private GCS bucket
  • GCP node can access a private GCS bucket
  • Come up with a credentials design that should also work once Azure Blob is supported (Azure is supported, Azure Blob not yet)
  • From a node with only AWS credentials: Run all examples in smoke tests. Examples that force-use other clouds should throw a nice error. All other examples should work.
  • Now run sky init, fill in the other 2 clouds’ details, sky init again
  • minimal.yaml
  • using_file_mounts.yaml (Azure is down atm, changed to GCP; something failed during setup step)
  • sky run -c huggingface "$DIR"/huggingface_glue_imdb_app.yaml
  • sky exec -c huggingface "$DIR"/huggingface_glue_imdb_app.yaml
  • sky run -c tpu "$DIR"/tpu_app.yaml
  • sky run -c mh "$DIR"/multi_hostname.yaml
  • sky exec -c mh "$DIR"/multi_hostname.yaml
  • python "$DIR"/resnet_distributed_tf_app.py (Azure does not work; removed Azure requirement and succeeded in AWS)
  • python "$DIR"/resnet_distributed_tf_app.py (again)
  • python "$DIR"/multi_echo.py

Notes for reviewers

  1. ~~I changed CloudStorage.is_directory to CloudStorage.is_file because the latter is easier.~~

    1. In GCS, using gsutil ls -d <url> to check whether <url> is a directory is hard: if <url> is a non-root directory (e.g. gs://some-bucket/some-dir/), it returns a single line with just this dir and a trailing slash; if <url> is a bucket root (e.g. gs://some-bucket), it lists all subdirectories in the bucket, which could be zero, one, or more. I reverted this and special-cased GCS bucket URLs.
    2. On the other hand, gsutil -q stat <url> reliably returns 0 iff <url> is a file, and 1 otherwise. The same goes for head_object in AWS S3 (see the sketch after this list).
    3. This change makes it impossible to check whether <url> exists at all, but the later aws s3 sync / gsutil rsync commands will surface that error, so I think it's okay.
  2. The easiest way to authenticate gsutil is to install it via google-cloud-sdk; authenticating the standalone gsutil is no longer recommended (and is more complicated). So I changed GcsCloudStorage._GET_GSUTIL to install the full google-cloud-sdk.
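For concreteness, here is a hedged sketch of the two file probes described in note 1.2; the helper names are mine, while `gsutil -q stat` and boto3's head_object are the standard tools:

```python
import subprocess

import boto3
from botocore.exceptions import ClientError


def gcs_is_file(url: str) -> bool:
    # `gsutil -q stat` exits 0 iff `url` names an existing object.
    proc = subprocess.run(['gsutil', '-q', 'stat', url],
                          stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL)
    return proc.returncode == 0


def s3_is_file(bucket: str, key: str) -> bool:
    # head_object raises ClientError (e.g. 404) when `key` is not an object.
    try:
        boto3.client('s3').head_object(Bucket=bucket, Key=key)
        return True
    except ClientError:
        return False
```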

@michaelzhiluo self-assigned this Dec 10, 2021

@michaelzhiluo (Collaborator) commented:

Will take a closer look at GCS verification (i.e. gcloud init, gcloud auth login)

@michaelzhiluo changed the title from "Sky init" to "[WIP] Sky init" on Dec 24, 2021
@franklsf95 changed the title from "[WIP] Sky init" to "Sky init and cloud credentials for storage" on Dec 29, 2021
@franklsf95 (Contributor, Author) commented:

Still have smoke tests to run, but the code should be ready for a first-pass review.

Resolved review threads: prototype/sky/execution.py, prototype/tests/test_enabled_clouds.py, prototype/sky/task.py (×2), prototype/sky/clouds/gcp.py (×2), prototype/sky/cli.py
Comment on lines 990 to 994
else:
sync = storage.make_sync_dir_command(source=src,
destination=wrapped_dst)
# It is a directory so make sure it exists.
mkdir_for_wrapped_dst = f'mkdir -p {wrapped_dst}'
Member:

Can you test a src that doesn't exist, which will hit this branch:

<dst>: gs://nonexist-bucket/nonexist

There probably will be some error; is it reasonable?

Contributor (Author):

Before, this would hit an error during gsutil ls; now it hits an error during aws s3 sync / gsutil rsync. I think this is okay.

Member:

@franklsf95 Did you actually run the requested test? I requested it because it does not look trivial.

It is not okay / equivalent to before, because with this change, we would have created zombie dirs from L994 mkdir_for_wrapped_dst = f'mkdir -p {wrapped_dst}' if the src does not exist.
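One way to avoid those zombie dirs, as a hypothetical guard extending the snippet quoted above (not something this PR implements):

```python
# Hypothetical guard: probe the source before emitting the mkdir, so no
# destination directory is created when `src` does not exist.
# `storage`, `src`, and `wrapped_dst` are the names from the snippet above.
if not storage.is_directory(src):
    raise FileNotFoundError(
        f'Storage source {src} does not exist; not creating {wrapped_dst}')
sync = storage.make_sync_dir_command(source=src, destination=wrapped_dst)
mkdir_for_wrapped_dst = f'mkdir -p {wrapped_dst}'
```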

Resolved review threads: prototype/sky/cloud_stores.py, prototype/sky/global_user_state.py
@concretevitamin (Member) commented:

> GCP node can access a private AWS bucket (getting error with get_or_copy_to_gcs, debugging)
> AWS node can access a private GCS bucket (depends on [Sky Data] GCS -> S3 #127)

It's good to actually test Sky Data. However, this PR is not blocked by #127 -- a simple test is to log into the AWS node and run "gsutil ls private_bucket" or similar.

> authenticating the standalone gsutil is no longer recommended (and more complicated). So I changed GcsCloudStorage._GET_GSUTIL to now install the full google-cloud-sdk.

Is there a big difference in speed installing gsutil vs. the full google-cloud-sdk?

Review threads: prototype/sky/clouds/aws.py (×2, one resolved)
from typing import Dict, Iterator, List, Optional, Tuple

from sky import clouds
from sky.clouds.service_catalog import azure_catalog


def _run_output(cmd):
Collaborator:

Same as above, move to backend_util.py

Contributor (Author):

see above

Collaborator:

Any clues on how to fix this circular import? @franklsf95

How did Ray solve this type of issue?

Further review threads: prototype/sky/clouds/azure.py, prototype/sky/clouds/aws.py, prototype/sky/cloud_stores.py (×2), prototype/sky/execution.py, prototype/sky/task.py (×2), prototype/sky/cli.py
@concretevitamin (Member) commented:

Is this ready for review?

I just launched a fresh sky cpunode (AWS), checked out this PR and installed Sky. Ran:

$ python examples/example_app.py
I 01-05 05:47:58 resources.py:52] Missing tf_version in accelerator_args, using default (2.5.0)
I 01-05 05:47:58 resources.py:56] Missing tpu_name in accelerator_args, using default (sky_tpu)
ERROR: Please run 'az login' to setup account.

$ sky run -c min examples/minimal.yaml
I 01-05 05:49:14 execution.py:82] Optimizer target is set to COST.
ERROR: Please run 'az login' to setup account.

This should've worked, according to the test ticked in the PR description:

> From a node with only AWS credentials: Run all examples in smoke tests. Examples that force-use other clouds should throw a nice error. All other examples should work.

@franklsf95 (Contributor, Author) commented:

Cloud credentials will now be synced up to any node launched with Sky. I've also rerun all the tests as ticked.

@@ -208,6 +209,7 @@ def _create_and_ssh_into_node(
dag = sky.optimize(dag)
task = dag.tasks[0]
backend.register_info(dag=dag)
task.update_file_mounts(sky_init.get_cloud_credential_file_mounts())
Collaborator:

Maybe move this to task.py so it is done independently of whether Sky is run via a YAML file or a Python script.

Contributor (Author):

Both the YAML and Python paths go through execution.py, so that part is fine. This is indeed duplicated in execution.py and cli.py. I'm not sure where best to put it in task.py; it might also surprise the user to, for example, automatically add the three file mounts whenever any task is constructed. It feels to me that this belongs more with the execution/backend logic. Open to suggestions.

Collaborator:

Should we just have it once in execution.py in that case? The cli.py one is duplicated, right?

Contributor (Author):

Yeah, but cli.py is currently a separate code path from execution.py. @gmittal, is this deliberately kept separate, or is there a plan to merge it with execution.py?

Collaborator:

Our setup.py/click require an entry script for the CLI, which is what cli.py is doing right now. If we were to merge with execution.py there would be many things we could not support, such as passing dags to launch. I think it makes sense for cli.py to call functions in execution.py.

Contributor (Author):

I understand we need cli.py as a CLI entry point. I think Michael's question is why we have a function here in cli.py that seems to do similar things as in execution.py:_execute(), and whether we can merge these two code paths.


'$ gcloud auth application-default set-quota-project <proj>')
return True, None

def get_credential_file_mounts(self) -> Dict[str, str]:
Collaborator:

Are you sure it will sync to the remote's home directory? Had a small bug last time where the Ray autoscaler created a literal /~/ directory instead, which was very confusing.

Contributor (Author):

About the circular import: the solution would be to create another file, e.g. run_utils.py, and have backend_utils, aws, and azure all import run_output from it (see the sketch below). Since this is only one small function, I fear that might be a bit unnecessary.

About syncing ~: in my tests, the files did sync to the user's home directory on the VMs; I haven't run into the issue you described. using_file_mounts.yaml seems to also use this format. Is there an alternative way of writing this?
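A sketch of that suggested refactor (the file name run_utils.py comes from the comment above; the body is an assumption, not the PR's code):

```python
# sky/run_utils.py: a shared helper with no sky-internal imports, so
# backend_utils and the cloud modules (aws.py, azure.py) can all depend
# on it without creating an import cycle.
import subprocess


def run_output(cmd: str) -> str:
    """Run `cmd` in a shell and return its stdout, raising on failure."""
    proc = subprocess.run(cmd, shell=True, check=True,
                          capture_output=True, text=True)
    return proc.stdout.strip()
```

aws.py and azure.py would then drop their local _run_output and use `from sky.run_utils import run_output` instead.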

Collaborator:

Ok sgtm, let's just leave the circular import for now and fix it in a future PR.

Maybe try $HOME instead of ~ for the aws, gcp, and azure configuration file mounts?

Contributor (Author):

Let me try $HOME.

Review threads: prototype/sky/execution.py, prototype/sky/global_user_state.py, prototype/sky/optimizer.py
@gmittal linked an issue on Feb 1, 2022 that may be closed by this pull request
@Michaelvll (Collaborator) left a comment:

Thank you for adding this! It would be great if the user does not need to worry about their credentials. I just have several small comments.

@@ -56,7 +56,7 @@ ray attach config/gcp.yml
ray down config/gcp.yml
```

**Azure**. Install the Azure CLI (`pip install azure-cli`) then login using `az login`. Set the subscription to use from the command line (`az account set -s <subscription_id>`) or by modifying the provider section of the Azure template (`config/azure.yml.j2`). Ray Autoscaler does not work with the latest version of `azure-cli`. Hotfix: `pip install azure-cli-core==2.22.0` (this will make Ray work but at the cost of making the `az` CLI tool unusable).
**Azure**. Install the Azure CLI (`pip install azure-cli==2.22.0`) then login using `az login`. Set the subscription to use from the command line (`az account set -s <subscription_id>`). Ray Autoscaler does not work with the latest version of `azure-cli` as of 1.9.1, hence the fixed Azure version.
Collaborator:

> Ray Autoscaler does not work with the latest version of azure-cli as of 1.9.1, hence the fixed Azure version.

What is the meaning of this sentence?

'azure': ['azure-cli'],
# ray <= 1.9.1 requires an older version of azure-cli. We can get rid of
# this version requirement once ray 1.10 is released.
'azure': ['azure-cli==2.22.0'],
Collaborator:

#283 @infwinston, this line seems to have fixed the version problem.

click.echo(
click.style(
'No cloud is enabled. Sky will not be able to run any task. '
'Please setup access to a cloud, and rerun `sky init`.',
Collaborator:

Can we add the link to the doc for setting up cloud access here?


def set_enabled_clouds(enabled_clouds: List[str]) -> None:
_CURSOR.execute('INSERT OR REPLACE INTO config VALUES (?, ?)',
(_ENABLED_CLOUDS_KEY, json.dumps(enabled_clouds)))
Collaborator:

Is there a reason why we use the database for the config, as we always only have one row?
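For reference, a hypothetical read-back counterpart to the setter above; this assumes the two config columns are named key and value, which the excerpt does not show:

```python
import json
from typing import List


def get_enabled_clouds() -> List[str]:
    # Assumed schema: config(key, value); _CURSOR and _ENABLED_CLOUDS_KEY
    # are the module-level names from the excerpt above.
    rows = _CURSOR.execute('SELECT value FROM config WHERE key = ?',
                           (_ENABLED_CLOUDS_KEY,))
    for (value,) in rows:
        return json.loads(value)
    return []  # `sky init` has not been run yet
```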

for file in [
'~/.config/gcloud/access_tokens.db',
'~/.config/gcloud/credentials.db'
]:
Collaborator:

Should we check $GOOGLE_APPLICATION_CREDENTIALS and $GCLOUD_PROJECT in the environment instead? It seems to me the Google credentials can be placed anywhere.
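A sketch of that suggestion (hypothetical helper; GOOGLE_APPLICATION_CREDENTIALS is the standard variable honored by Google's client libraries):

```python
import os
from typing import List


def gcp_credential_candidates() -> List[str]:
    # Prefer an explicitly configured service-account key file; otherwise
    # fall back to the default gcloud credential files checked above.
    env_path = os.environ.get('GOOGLE_APPLICATION_CREDENTIALS')
    if env_path:
        return [env_path]
    return [os.path.expanduser(p) for p in (
        '~/.config/gcloud/access_tokens.db',
        '~/.config/gcloud/credentials.db')]
```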

@@ -89,6 +90,8 @@ def _execute(dag: sky.Dag,
backend = backend if backend is not None else backends.CloudVmRayBackend()
backend.register_info(dag=dag, optimize_target=optimize_target)

task.update_file_mounts(init.get_cloud_credential_file_mounts())
Collaborator:

Is the file mount enough? Or should we add the credentials to the environment as well?

@concretevitamin (Member) commented:

I'll test and merge this PR.

@concretevitamin (Member) left a comment:

Made many fixes. Tested:

  • a machine with no credentials (blaze@)

    • sky launch examples/minimal.yaml / sky cpunode
      • No cloud is enabled. Sky will not be able to run any task. Please setup access to a cloud, and rerun sky init.
    • sky launch examples/minimal.yaml --docker
  • a machine with AWS-only credentials (manually launched)

    • rsync -Pvar ~/.aws ubuntu@18.234.187.2:~/
    • pip3 install -e .[aws]
    • ok: sky cpunode --cloud aws -c aws
    • correctly failed: sky cpunode --cloud gcp -c gcp
    • correctly failed: sky cpunode --cloud azure -c azure
  • laptop (all 3 clouds)

    • time bash examples/run_smoke_tests.sh 2>&1 | tee run.log

@concretevitamin merged commit a21505e into master on Feb 11, 2022, and deleted the sky-init branch on February 11, 2022 at 14:08.
gmittal pushed a commit that referenced this pull request Mar 15, 2022
* sky init

* fixes and lint

* Create test_enabled_clouds.py

* Optimizer is now aware of enabled_clouds

* Fix pytest

* Update registry.py

* Support GCS buckets

* Make GCS on GCP work

* yapf behaves differently across versions...

* yapf pls

* Fix Azure

* tweak messages

* tweak

* Apply hotfix from #127

* Simple fixes

* Use monkeypatch

* Address comments

* get rid of Task.enabled_clouds

* fix test

* Always install aws and gcloud utils

* Address comments

* oops

* Revert "Always install aws and gcloud utils"

This reverts commit a4630b1.

* Refactor and trigger `sky init` automatically

* Reverted to `is_directory`

* nits

* better check

* usability

* fix tests

* nits

* Address latest comments

* Update init.py

* Fix CLI messages

* Sync credentials regardless of storage mounts

* Fix cpunode and more docs

* Apply changes to requirements

* Update setup.py

* Fix tests

* Fixes

* Add links, fix test

Co-authored-by: Michael Luo <michael.luo@berkeley.edu>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Successfully merging this pull request may close: Cloud storage file mounts for gs://