[Bug] 0.5.x taking too much shared memory during multiprocess training #2137
Comments
The 0.5 version copies the feature tensor of the parent graph to shared memory while 0.4.3 does not. It is due to the lazy-feature-copy behavior -- when a subgraph/block is created, it will hold a reference to the feature tensor of the parent graph and slice out subfeatures upon feature access. When the subgraph/block is transmitted to another process, the current behavior will copy the parent graph feature to shared memory, causing an excessive amount of shared memory usage. Three viable solutions:
It looks like No. 3 is a better solution.
How can we tell users to avoid this problem if we adopt No. 3 as the solution? I prefer solution No. 1. Also, I'm wondering what the benefit of the lazy features is, and what the unnecessary data transmission refers to.
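To make the lazy-feature behavior described above concrete, here is a minimal, hypothetical sketch (`ParentGraph` and `LazySubgraph` are made-up names, not DGL's actual classes) of how a subgraph that only holds a reference to its parent's feature tensor ends up dragging the whole tensor into shared memory when it is pickled to another process:

```python
import torch

class ParentGraph:
    def __init__(self, feat):
        # Full node-feature tensor of the parent graph.
        self.feat = feat

class LazySubgraph:
    """Hypothetical illustration of lazy feature slicing."""
    def __init__(self, parent, node_ids):
        self.parent = parent      # keeps a reference to the parent graph
        self.node_ids = node_ids  # rows of the parent features owned by this subgraph

    @property
    def feat(self):
        # Sub-features are only sliced out when accessed.
        return self.parent.feat[self.node_ids]

g = ParentGraph(torch.zeros(1000, 16))
sub = LazySubgraph(g, torch.arange(10))
print(sub.feat.shape)  # torch.Size([10, 16])

# Pickling `sub` (e.g., when a DataLoader worker returns it) also serializes
# `sub.parent.feat`, so the entire parent feature tensor -- not just the
# 10 rows -- is what gets moved into shared memory.
```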
After some testing I confirm that the tensors returned by subprocesses consume shared memory; the original tensor from the main process and the usage of that tensor in subprocesses do not consume shared memory. For instance, consider the following code:

```python
import torch
import torch.utils.data
import tqdm

I = 10
N = 4

class Dataset(torch.utils.data.Dataset):
    def __init__(self, X):
        self.X = X

    def __getitem__(self, i):
        max_ = self.X[:, 0].long()
        return self.X[max_[i]:max_[i] + I]

    def __len__(self):
        return self.X.shape[0]

x = torch.zeros(20000, 50000)  # This takes 3.7G memory
X = Dataset(x)
dataloader = torch.utils.data.DataLoader(X, batch_size=20, num_workers=N)
for _ in tqdm.tqdm(dataloader):
    pass
```

The amount of shared memory consumed vs. the number of workers:
Note that even if the original feature tensor is 3.7G, the shared memory consumed can go below that, meaning that the feature tensor transmitted from the main process to the subprocesses does not consume shared memory. So I think we should be good if we can somehow avoid returning the feature tensors of the original graph altogether. Another insight is that the linear scaling of shared memory consumption w.r.t. the number of subprocesses is still there even if all the subprocesses return the very same tensor.
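Along the lines of the suggestion above, one possible way to avoid sending the parent feature tensor back from the workers is to materialize (copy) only the needed slice before returning it. This is a hedged sketch with a hypothetical `SliceDataset`; it is not necessarily how DGL addressed the issue:

```python
import torch
import torch.utils.data

class SliceDataset(torch.utils.data.Dataset):
    """Hypothetical dataset that returns a materialized copy of each slice."""
    def __init__(self, X, window=10):
        self.X = X
        self.window = window

    def __getitem__(self, i):
        # .clone() gives the slice its own storage, so pickling the result in a
        # worker only moves the small copy into shared memory rather than the
        # storage of the full parent tensor.
        return self.X[i:i + self.window].clone()

    def __len__(self):
        return self.X.shape[0] - self.window

if __name__ == "__main__":
    x = torch.zeros(1000, 64)
    loader = torch.utils.data.DataLoader(SliceDataset(x), batch_size=20, num_workers=2)
    for _ in loader:
        pass
```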
🐛 Bug
After upgrading to 0.5.x, training with multiprocessing requires much more shared memory than it did under 0.4.x.
The shared memory consumption also scales linearly with the number of worker processes, although that problem exists in 0.4.x as well.
To Reproduce
Steps to reproduce the behavior:
Use `examples/pytorch/graphsage/train_sampling.py` as an example. Check `df`; the shared memory consumption can be found as the `/dev/shm` entry.
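For reference, the `/dev/shm` usage can also be polled from Python while the training script runs (equivalent to watching `df -h /dev/shm`); this is just a convenience sketch:

```python
import shutil
import time

# Poll /dev/shm usage once per second for a minute while training runs elsewhere.
for _ in range(60):
    usage = shutil.disk_usage("/dev/shm")
    print(f"/dev/shm used: {usage.used / 1024**3:.2f} GiB of {usage.total / 1024**3:.2f} GiB")
    time.sleep(1)
```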
Here is my result on Reddit with a feature size of 50000:
Expected behavior
Environment
How you installed DGL (`conda`, `pip`, source):
Additional context