Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MOD-3251] Stop heartbeats before sending the container snapshot RPC #2004

Merged
merged 28 commits into from
Jul 16, 2024

Conversation

mattnappo
Copy link
Contributor

@mattnappo mattnappo commented Jul 12, 2024

Heartbeats make a connection to the gRPC proxy (modal.sock) which causes the container to open a FD. This open FD causes gVisor to crash during runsc checkpoint. I ran the repro over 300 times with 0 flakes.

@mattnappo mattnappo requested a review from gongy July 12, 2024 13:29
@mattnappo mattnappo changed the title Stop heartbeats before sending the container snapshot RPC [MOD-3251] Stop heartbeats before sending the container snapshot RPC Jul 12, 2024
Copy link
Member

@luiscape luiscape left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any way to verify or test?

Copy link
Contributor

@thundergolfer thundergolfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need a pause-and-restart structure here. This looks like just pause, which means a restored container will be killed for not heartbeating.

modal/_container_io_manager.py Outdated Show resolved Hide resolved
@mattnappo
Copy link
Contributor Author

Update: I have landed on a solution I am happy with, with almost no overhead (@thundergolfer and @luiscape I removed the bottleneck I spoke of). Here is a summary:

  • I introduce a asyncio.Condition variable that guards a boolean called waiting_for_memory_snapshot (same as the old name).
  • Before a ContainerCheckpoint or ContainerHeartbeat RPC, the Condition must be acquired, and the heartbeat will only run if the container is NOT waiting_for_memory_snapshot. This prevents heartbeat requests from executing while the checkpoint-restore process in in progress.
  • If the function is a snapshotting function (function_def.is_checkpointing_fn), then heartbeats will only turn on after the entire checkpoint-restore process
  • If it's not a snapshotting function, then heartbeats are enabled immediately, and will run during the container init phase.
  • Remove the old ContainerIOManager::_waiting_for_memory_snapshot logic, as it was broken and did not prevent heartbeats from happening when they shouldn't

@mattnappo
Copy link
Contributor Author

The following invariants hold:

  • If snapshotting fn, then heartbeats run after the snap-restore phase
  • If snapshotting fn, then heartbeats do not run before the snap-restore phase
  • If not snapshotting fn, then heartbeats are always running

Copy link
Member

@luiscape luiscape left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment re disabling heartbeats.

test/container_app_test.py Outdated Show resolved Hide resolved
test/container_app_test.py Outdated Show resolved Hide resolved
test/container_app_test.py Outdated Show resolved Hide resolved
test/container_app_test.py Outdated Show resolved Hide resolved
modal/_container_io_manager.py Outdated Show resolved Hide resolved
test/container_app_test.py Outdated Show resolved Hide resolved
modal/_container_io_manager.py Outdated Show resolved Hide resolved
modal/_container_io_manager.py Outdated Show resolved Hide resolved
modal/_container_io_manager.py Show resolved Hide resolved
modal/_container_io_manager.py Outdated Show resolved Hide resolved
Copy link
Contributor

@thundergolfer thundergolfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢

@mattnappo mattnappo merged commit aebfc72 into main Jul 16, 2024
22 checks passed
@mattnappo mattnappo deleted the matt/gvisor-flake-fix branch July 16, 2024 17:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants