Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix tpu bug #2350

Merged
merged 3 commits into from
Aug 4, 2023
Merged

fix tpu bug #2350

merged 3 commits into from
Aug 4, 2023

Conversation

infwinston
Copy link
Member

@infwinston infwinston commented Aug 4, 2023

Fix a bug for TPU pod where the actual number of node ips != launched_nodes. (from port PR #2210)
the bug caused failures in setup and workdir sync on worker nodes.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh
  • tested tpuvm_mnist.yaml on TPU pod v2-32

@concretevitamin
Copy link
Member

Thanks @infwinston! Could we add a quick workdir test to an existing TPU smoke test? To ensure no future regressions.

@cblmemo
Copy link
Collaborator

cblmemo commented Aug 4, 2023

The code looks good to me. 🙂 Just to make sure I haven't missed anything, how is this related to the port PR?

@Michaelvll
Copy link
Collaborator

Michaelvll commented Aug 4, 2023

It should be introduced by #2096, instead. Sorry for not identifying that issue during the review. : )

@infwinston
Copy link
Member Author

infwinston commented Aug 4, 2023

oh yes #2096. Ah the reason we didn't find this bug was due to we disable tpu pod smoke test by default. We should probably bring it back to default as people are using it.
looks like it's enabled. maybe we just didn't run it correctly. @romilbhardwaj do you recall any failure when running tpu-pod smoke test?

skypilot/tests/test_smoke.py

Lines 1355 to 1369 in ca2a092

@pytest.mark.gcp
def test_tpu_vm_pod():
name = _get_cluster_name()
test = Test(
'tpu_pod',
[
f'sky launch -y -c {name} examples/tpu/tpuvm_mnist.yaml --gpus tpu-v2-32 --use-spot --zone europe-west4-a',
f'sky logs {name} 1', # Ensure the job finished.
f'sky logs {name} 1 --status', # Ensure the job succeeded.
],
f'sky down -y {name}',
timeout=30 * 60, # can take 30 mins
)
run_one_test(test)

@romilbhardwaj
Copy link
Collaborator

Some of the TPU tests were failing and I don't fully remember the reason for failure, but I remember it looked unrelated to this (IIRC it was related to availability). The same tests had passed during a previous round of testing so I assumed its a transient error. I should have double checked - my bad 🙏

@infwinston
Copy link
Member Author

no worries! just wanted to make sure our smoke test can catch this. thanks

sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
@infwinston infwinston merged commit e9be60a into master Aug 4, 2023
15 checks passed
@infwinston infwinston deleted the fix-tpu-bug branch August 4, 2023 23:03
@infwinston infwinston restored the fix-tpu-bug branch August 4, 2023 23:03
@infwinston infwinston deleted the fix-tpu-bug branch August 4, 2023 23:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants