Sky init and cloud credentials for storage #102
Conversation
Will take a closer look at GCS verification (i.e.
Still have smoke tests to run, but the code should be ready for a first-pass review.
else:
    sync = storage.make_sync_dir_command(source=src,
                                         destination=wrapped_dst)
    # It is a directory so make sure it exists.
    mkdir_for_wrapped_dst = f'mkdir -p {wrapped_dst}'
Can you test a src that doesn't exist, which will hit this branch:
<dst>: gs://nonexist-bucket/nonexist
There probably will be some error; is it reasonable?
Before, this will hit an error while doing `gsutil ls`; now this will hit an error while doing `aws sync` / `gsutil rsync`. I think this is okay.
@franklsf95 Did you actually run the requested test? I requested it because it does not look trivial.
It is not okay / equivalent to before, because with this change, we would have created zombie dirs from L994 `mkdir_for_wrapped_dst = f'mkdir -p {wrapped_dst}'` if the src does not exist.
It's good to actually test Sky Data. However, this PR is not blocked by #127 -- a simple test is to log into the AWS node and run "gsutil ls private_bucket" or similar.
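One way to avoid the zombie directories described above is to only emit the `mkdir` when the source actually exists. A minimal sketch of that guard, assuming `gsutil` is on `PATH` for GCS sources (helper names like `source_exists` are hypothetical, not part of this PR):

```python
import os
import subprocess


def source_exists(src: str) -> bool:
    """Best-effort check that a sync source exists before creating dirs."""
    if src.startswith('gs://'):
        # `gsutil -q stat` exits 0 iff the object exists.
        proc = subprocess.run(['gsutil', '-q', 'stat', src],
                              capture_output=True)
        return proc.returncode == 0
    return os.path.exists(os.path.expanduser(src))


def make_mkdir_command(src: str, wrapped_dst: str) -> str:
    # Only emit the mkdir when the source exists, so a failed sync does
    # not leave behind an empty "zombie" directory on the remote.
    if source_exists(src):
        return f'mkdir -p {wrapped_dst}'
    return f'echo "skipping mkdir: source {src} not found"'
```

This keeps the error surfacing in `aws sync` / `gsutil rsync` while preventing the empty directory from being created first.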
Is there a big difference in speed installing gsutil vs. the full google-cloud-sdk?
from typing import Dict, Iterator, List, Optional, Tuple

from sky import clouds
from sky.clouds.service_catalog import azure_catalog


def _run_output(cmd):
Same as above, move to backend_util.py
see above
Any clues on how to fix this circular import? @franklsf95
How did Ray solve this type of issue?
Is this ready for review? I just launched a fresh

$ python examples/example_app.py
I 01-05 05:47:58 resources.py:52] Missing tf_version in accelerator_args, using default (2.5.0)
I 01-05 05:47:58 resources.py:56] Missing tpu_name in accelerator_args, using default (sky_tpu)
ERROR: Please run 'az login' to setup account.

$ sky run -c min examples/minimal.yaml
I 01-05 05:49:14 execution.py:82] Optimizer target is set to COST.
ERROR: Please run 'az login' to setup account.

This should've worked, according to the test ticked in the PR description:

> Cloud credentials will not be synced up to any node launched with Sky.

I've also rerun all the tests as ticked.
prototype/sky/cli.py (Outdated)

@@ -208,6 +209,7 @@ def _create_and_ssh_into_node(
    dag = sky.optimize(dag)
    task = dag.tasks[0]
    backend.register_info(dag=dag)
    task.update_file_mounts(sky_init.get_cloud_credential_file_mounts())
Maybe move this to task.py so it will be done independent of Sky being run via yaml or python script.
Both yaml and python will go to execution.py, so that part is fine. This is indeed duplicated in execution.py and cli.py. I'm not sure where best to put it in task.py - it might also be surprising to the user to, for example, automatically add the three file mounts when any task is constructed. It feels to me it belongs more with the execution/backend logic. Open to suggestions.
Should we just have it once in execution.py in that case? The cli.py one is duplicated, right?
Yea, but cli.py is currently a separate code path from execution.py. @gmittal is this deliberately kept separate, or is there a plan to merge this with execution.py?
Our setup.py/click require an entry script for the CLI, which is what cli.py is doing right now. If we were to merge with execution.py there would be many things we could not support, such as passing `dag`s to `launch`. I think it makes sense for cli.py to call functions in execution.py.
I understand we need cli.py as a CLI entry point. I think Michael's question is why we have a function here in cli.py that seems to do similar things as `execution.py:_execute()`, and whether we can merge these two code paths.
'$ gcloud auth application-default set-quota-project <proj>')
return True, None


def get_credential_file_mounts(self) -> Dict[str, str]:
Are you sure it will sync to the remote's home directory? Had a small bug last time where Ray autoscaler made a `/~/` directory instead, which was very confusing.
About the circular import: the solution would be to create another file, e.g. run_utils.py, and have backend_utils, aws, and azure all import run_output from it. Since this is only one small function, I fear that might be a bit unnecessary.

About syncing to `~`: in my tests they did sync to the user home directory on the VMs. I haven't run into the issue you described. using_file_mounts.yaml seems to also use this format. Is there an alternative way of writing this?
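The extraction proposed above could look roughly like this (module name `run_utils.py` comes from the comment; the body is a sketch of what `_run_output` presumably does, not the PR's exact code):

```python
# run_utils.py -- shared helper so that backend_utils, aws, and azure
# can all import it from here without importing each other, which
# breaks the circular-import cycle.
import subprocess


def run_output(cmd: str) -> str:
    """Run a shell command and return its stdout as a stripped string."""
    proc = subprocess.run(cmd,
                          shell=True,
                          check=True,
                          stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE)
    return proc.stdout.decode('ascii').strip()
```

Each cloud module then does `from sky.run_utils import run_output` instead of importing `backend_utils`.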
Ok sgtm, let's just leave the circular import for now and fix it in a future PR.
Maybe try `$HOME` instead of `~` for the aws, gcp, and azure configuration file mounting?
Let me try $HOME.
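A hedged sketch of a `get_credential_file_mounts`-style helper along the lines discussed above (the AWS paths are the standard CLI config locations; the `~`-on-both-sides mapping mirrors the PR, and swapping the key prefix to `$HOME` is the alternative being tried):

```python
import os
from typing import Dict

# Standard AWS CLI credential locations (assumed; adjust per cloud).
_AWS_CREDENTIAL_FILES = ['~/.aws/credentials', '~/.aws/config']


def get_credential_file_mounts() -> Dict[str, str]:
    """Map remote destinations to local credential files that exist.

    Keys are paths on the launched VM; values are local sources.
    Only files that actually exist locally are included, so a machine
    without AWS credentials simply contributes no mounts.
    """
    return {
        path: path
        for path in _AWS_CREDENTIAL_FILES
        if os.path.exists(os.path.expanduser(path))
    }
```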
Thank you for adding this! It would be great if the user does not need to worry about their credentials. Just have several small comments.
@@ -56,7 +56,7 @@ ray attach config/gcp.yml
ray down config/gcp.yml
```

-**Azure**. Install the Azure CLI (`pip install azure-cli`) then login using `az login`. Set the subscription to use from the command line (`az account set -s <subscription_id>`) or by modifying the provider section of the Azure template (`config/azure.yml.j2`). Ray Autoscaler does not work with the latest version of `azure-cli`. Hotfix: `pip install azure-cli-core==2.22.0` (this will make Ray work but at the cost of making the `az` CLI tool unusable).
+**Azure**. Install the Azure CLI (`pip install azure-cli==2.22.0`) then login using `az login`. Set the subscription to use from the command line (`az account set -s <subscription_id>`). Ray Autoscaler does not work with the latest version of `azure-cli` as of 1.9.1, hence the fixed Azure version.
> Ray Autoscaler does not work with the latest version of `azure-cli` as of 1.9.1, hence the fixed Azure version.

What is the meaning of this sentence?
-'azure': ['azure-cli'],
+# ray <= 1.9.1 requires an older version of azure-cli. We can get rid of
+# this version requirement once ray 1.10 is released.
+'azure': ['azure-cli==2.22.0'],
#283 @infwinston, this line seems to have fixed the version problem.
prototype/sky/init.py (Outdated)

click.echo(
    click.style(
        'No cloud is enabled. Sky will not be able to run any task. '
        'Please setup access to a cloud, and rerun `sky init`.',
Can we add the link to the doc for setting up cloud access here?
def set_enabled_clouds(enabled_clouds: List[str]) -> None:
    _CURSOR.execute('INSERT OR REPLACE INTO config VALUES (?, ?)',
                    (_ENABLED_CLOUDS_KEY, json.dumps(enabled_clouds)))
Is there a reason why we use the database for the config, as we always only have one row?
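For context, the single-row key-value pattern above can be exercised end to end like this (a self-contained sketch using an in-memory database, not the actual global user state module):

```python
import json
import sqlite3
from typing import List, Optional

# In-memory stand-in for the global user state database.
_CONN = sqlite3.connect(':memory:')
_CURSOR = _CONN.cursor()
_CURSOR.execute(
    'CREATE TABLE IF NOT EXISTS config (key TEXT PRIMARY KEY, value TEXT)')
_ENABLED_CLOUDS_KEY = 'enabled_clouds'


def set_enabled_clouds(enabled_clouds: List[str]) -> None:
    # INSERT OR REPLACE on the primary key keeps at most one row,
    # which is why "one row" is all this table ever holds.
    _CURSOR.execute('INSERT OR REPLACE INTO config VALUES (?, ?)',
                    (_ENABLED_CLOUDS_KEY, json.dumps(enabled_clouds)))
    _CONN.commit()


def get_enabled_clouds() -> Optional[List[str]]:
    row = _CURSOR.execute('SELECT value FROM config WHERE key = ?',
                          (_ENABLED_CLOUDS_KEY,)).fetchone()
    return json.loads(row[0]) if row else None
```

A plain JSON file would also work for a single row; the database choice presumably reuses the existing global-state machinery.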
for file in [
        '~/.config/gcloud/access_tokens.db',
        '~/.config/gcloud/credentials.db'
]:
Should we check `$GOOGLE_APPLICATION_CREDENTIALS` and `$GCLOUD_PROJECT` in the environment instead? It seems to me the google credential can be placed anywhere?
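A sketch of the env-var-aware lookup suggested above (a hypothetical helper illustrating the reviewer's point, not part of the PR):

```python
import os
from typing import List

# Default gcloud credential stores, as hard-coded in the PR.
_DEFAULT_GCP_CREDENTIAL_FILES = [
    '~/.config/gcloud/access_tokens.db',
    '~/.config/gcloud/credentials.db',
]


def find_gcp_credential_files() -> List[str]:
    """Locate GCP credential files, honoring the environment first.

    If $GOOGLE_APPLICATION_CREDENTIALS points at an existing key file,
    prefer it; otherwise fall back to the default gcloud locations.
    """
    env_path = os.environ.get('GOOGLE_APPLICATION_CREDENTIALS')
    if env_path and os.path.exists(os.path.expanduser(env_path)):
        return [env_path]
    return [
        f for f in _DEFAULT_GCP_CREDENTIAL_FILES
        if os.path.exists(os.path.expanduser(f))
    ]
```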
@@ -89,6 +90,8 @@ def _execute(dag: sky.Dag,
    backend = backend if backend is not None else backends.CloudVmRayBackend()
    backend.register_info(dag=dag, optimize_target=optimize_target)

    task.update_file_mounts(init.get_cloud_credential_file_mounts())
Is the file mount enough? Or should we add them to the environment as well?

I'll test and merge this PR.
Made many fixes. Tested:

- a machine with no credentials (blaze@)
  - `sky launch examples/minimal.yaml` / `sky cpunode`
    - "No cloud is enabled. Sky will not be able to run any task. Please setup access to a cloud, and rerun `sky init`."
  - `sky launch examples/minimal.yaml --docker`
- a machine with AWS only credentials (manually launched)
  - `rsync -Pvar ~/.aws ubuntu@18.234.187.2:~/`
  - `pip3 install -e .[aws]`
  - ok: `sky cpunode --cloud aws -c aws`
  - correctly failed: `sky cpunode --cloud gcp -c gcp`
  - correctly failed: `sky cpunode --cloud azure -c azure`
- laptop (all 3 clouds)
  - `time bash examples/run_smoke_tests.sh 2>&1 | tee run.log`
Should've been addressed
* sky init
* fixes and lint
* Create test_enabled_clouds.py
* Optimizer is now aware of enabled_clouds
* Fix pytest
* Update registry.py
* Support GCS buckets
* Make GCS on GCP work
* yapf behaves differently across versions...
* yapf pls
* Fix Azure
* tweak messages
* tweak
* Apply hotfix from #127
* Simple fixes
* Use monkeypatch
* Address comments
* get rid of Task.enabled_clouds
* fix test
* Always install aws and gcloud utils
* Address comments
* oops
* Revert "Always install aws and gcloud utils" (this reverts commit a4630b1)
* Refactor and trigger `sky init` automatically
* Reverted to `is_directory`
* nits
* better check
* usability
* fix tests
* nits
* Address latest comments
* Update init.py
* Fix CLI messages
* Sync credentials regardless of storage mounts
* Fix cpunode and more docs
* Apply changes to requirements
* Update setup.py
* Fix tests
* Fixes
* Add links, fix test

Co-authored-by: Michael Luo <michael.luo@berkeley.edu>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
The `sky init` command will check credentials for all three clouds, and write the list of available clouds into the global user state. The optimizer will then only generate plans using the enabled clouds.

Demo: if some cloud is not logged in:

When a cloud is considered "enabled", it is also ensured that the user's node, as well as any node launched with Sky, will have the access credentials to that cloud's storage.

When running jobs, the optimizer will skip the clouds that you are not logged in with. If you are not logged into GCP but specify `Resources(GCP, ...)`, it would show an error before provisioning, like so:

Tasks and tests:

Notes for reviewers

~~I changed `CloudStorage.is_directory` to `CloudStorage.is_file` because the latter is easier.~~ Reverted this and special-case handled GCS bucket URLs. In GCS, using `gsutil ls -d <url>` to check if `<url>` is a directory is hard: if `<url>` is a non-root directory (e.g. `gs://some-bucket/some-dir/`), it would return one line with just this dir with a trailing slash; if `<url>` is a bucket root directory (e.g. `gs://some-bucket`), then it would list all subdirectories in the bucket, which could be zero, one or more. `gsutil -q stat <url>` reliably returns 0 iff `<url>` is a file, and 1 otherwise. Same goes for `head_object` in AWS S3. This does not verify that `<url>` exists, but later the `aws sync` / `gsutil rsync` commands will surface this error, so I think it's okay.

The easiest way to authenticate `gsutil` is by installing it with google-cloud-sdk; authenticating the standalone `gsutil` is no longer recommended (and more complicated). So I changed `GcsCloudStorage._GET_GSUTIL` to now install the full google-cloud-sdk.
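The `gsutil -q stat` check described above can be sketched as follows (a sketch assuming `gsutil` is on `PATH`, not the PR's exact code):

```python
import subprocess


def gcs_is_file(url: str) -> bool:
    """Return True iff `url` names an existing GCS object (a file).

    `gsutil -q stat` exits 0 for an existing object and 1 otherwise,
    which sidesteps the ambiguous output of `gsutil ls -d` on bucket
    roots vs. non-root directories.
    """
    proc = subprocess.run(['gsutil', '-q', 'stat', url],
                          capture_output=True)
    return proc.returncode == 0
```

Note that a return value of `False` is ambiguous between "directory" and "does not exist"; per the PR description, the later `gsutil rsync` step surfaces the nonexistent case.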