Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CPU-sampling] Avoid access/init cuda instances at each sampling (child) process #6561

Closed
chang-l opened this issue Nov 14, 2023 · 1 comment
Assignees

Comments

@chang-l
Copy link
Collaborator

chang-l commented Nov 14, 2023

In PyTorch dataloader (cpu sampling), worker processes will never initialize CUDA context, as CUDA runtime does not support fork start method (https://pytorch.org/docs/stable/notes/multiprocessing.html#cuda-in-multiprocessing). I think DGL dataloader should also follow this convention, if possible.

However, here, when creating coo/csr matrix, it would actually call this cudaPointerGetAttributes CUDA runtime API inside IsPinned function. As a result, each worker would call this constructor and start init/access cuda instances.

is_pinned = (aten::IsNullArray(row) || row.IsPinned()) &&
(aten::IsNullArray(col) || col.IsPinned()) &&
(aten::IsNullArray(data) || data.IsPinned());
}

It's not a bug and will not error out as such behavior is guarded here by clearing the cuda error msg (see below):

// We don't want to fail in these particular cases since this function
// can be called when users only want to run on CPU even if CUDA API is
// enabled, or in a forked subprocess where CUDA context cannot be
// initialized. So we just mark the CUDA context to unavailable and
// return.
is_available_ = false;
cudaGetLastError(); // clear error

Nevertheless, I believe it would still be preferable to adhere to PyT's convention by removing the IsPinned function from the constructor of the coo/csr matrix.

I can come up with a PR for the fix later. cc. @nv-dlasalle @yaox12 @frozenbugs @TristonC

@chang-l chang-l self-assigned this Nov 14, 2023
@nv-dlasalle
Copy link
Collaborator

This has caused a lot of trouble in the past, so really glad you've caught this and will fix it, @chang-l. It will also save a lot of developer time as the resulting bugs from this issue take a while to track down.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants