
[Bug] 0.5.x taking too much shared memory during multiprocess training #2137

Closed
BarclayII opened this issue Sep 1, 2020 · 3 comments · Fixed by #2139
Labels
bug:confirmed Something isn't working

Comments

@BarclayII
Collaborator

🐛 Bug

After upgrading to 0.5.x, training with multiprocessing requires far more shared memory than it did on 0.4.x.

Also, the shared memory consumption scales linearly with the number of worker processes, but that was already a problem in 0.4.x.

To Reproduce

Steps to reproduce the behavior:

  1. Take examples/pytorch/graphsage/train_sampling.py as an example.
  2. Replace the node features with a much bigger dimension, like 50000.
  3. Train and observe the shared memory consumption via df; shared memory is mounted at /dev/shm. (A Python alternative to df is sketched right after this list.)
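
If it is more convenient to watch the usage from inside Python rather than running df in another shell, a small helper along these lines (the watch_shm name is just for illustration) polls the tmpfs mounted at /dev/shm through shutil.disk_usage:

import shutil
import time

def watch_shm(interval=5):
    # Poll the tmpfs mounted at /dev/shm; this reports the same numbers
    # as `df -h /dev/shm`.
    while True:
        usage = shutil.disk_usage("/dev/shm")
        print(f"/dev/shm used: {usage.used / 2**30:.2f} GiB "
              f"of {usage.total / 2**30:.2f} GiB")
        time.sleep(interval)

watch_shm()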

Here is my result on Reddit with feature size of 50000:

| Version | 2 workers | 4 workers |
| --- | --- | --- |
| 0.4.x branch | 39M | 67M |
| master | 92G | 184G |

Expected behavior

Environment

  • DGL Version (e.g., 1.0):
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3):
  • OS (e.g., Linux):
  • How you installed DGL (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version (if applicable):
  • GPU models and configuration (e.g. V100):
  • Any other relevant information:

Additional context

@yzh119 yzh119 added the bug:confirmed Something isn't working label Sep 1, 2020
@jermainewang
Member

The 0.5 version copies the feature tensor of the parent graph to shared memory while 0.4.3 does not. This is due to the lazy-feature-copy behavior -- when a subgraph/block is created, it holds a reference to the feature tensor of the parent graph and slices out subfeatures on feature access. When the subgraph/block is transmitted to another process, the current behavior copies the parent graph's feature tensor to shared memory, causing excessive shared memory usage. Three viable solutions:

  1. For lazy features, instead of copying the parent graph feature to shared memory, first slice out the subfeatures and then copy only the subfeatures to shared memory. However, this may cause a lot of unnecessary data transmission.
  2. Try to detect that the parent graph feature is being transmitted back to the main process, so there is no need to put it in shared memory. It is unclear whether this is feasible, as the serializer/pickler has no knowledge of who receives the data.
  3. In the high-level interfaces (e.g., NodeDataLoader and EdgeDataLoader), do not pass the parent features to the sampler process, so that the sampled subgraphs/blocks carry no features. Once the main process gets the subgraphs/blocks, restore the lazy features there.

It looks like No. 3 is the better solution.
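
A minimal PyTorch-only sketch of the idea behind No. 3 (the class and variable names are illustrative, not DGL's actual API): the workers return only sampled node IDs, and the main process slices the features after the batch comes back, so the parent feature tensor never has to be pickled into shared memory.

import torch
import torch.utils.data

# Parent graph features live only in the main process.
feat = torch.zeros(20000, 50000)

class IdOnlyDataset(torch.utils.data.Dataset):
    """Stand-in for a sampler: workers return sampled node IDs, never features."""
    def __init__(self, num_nodes, fanout=10):
        self.num_nodes = num_nodes
        self.fanout = fanout

    def __getitem__(self, i):
        # Pretend neighbor sampling: just pick `fanout` random node IDs.
        return torch.randint(0, self.num_nodes, (self.fanout,))

    def __len__(self):
        return self.num_nodes

loader = torch.utils.data.DataLoader(IdOnlyDataset(feat.shape[0]),
                                     batch_size=20, num_workers=4)
for ids in loader:
    # Only the small ID tensors cross process boundaries; the feature lookup
    # ("restoring the lazy features") happens here in the main process.
    batch_feat = feat[ids]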

@VoVAllen
Collaborator

VoVAllen commented Sep 1, 2020

How can we tell users to avoid this problem if we adopt No. 3 as the solution? I prefer solution No. 1. Also, I'm wondering what the benefit of the lazy features is, and what the unnecessary data transmission would be.

@BarclayII
Collaborator Author

BarclayII commented Sep 2, 2020

After some testing, I can confirm that only the tensors returned by the subprocesses consume shared memory; neither the original tensor in the main process nor its use inside the subprocesses does.

For instance, consider the following code:

import torch
import torch.utils.data
import tqdm

I = 10    # number of rows returned per item
N = 4     # number of DataLoader workers

class Dataset(torch.utils.data.Dataset):
    def __init__(self, X):
        self.X = X

    def __getitem__(self, i):
        # Return a small slice (I rows) of X; only this slice is sent back
        # to the main process through shared memory.
        max_ = self.X[:, 0].long()
        return self.X[max_[i]:max_[i]+I]

    def __len__(self):
        return self.X.shape[0]

x = torch.zeros(20000, 50000)    # This takes 3.7G memory
X = Dataset(x)

dataloader = torch.utils.data.DataLoader(X, batch_size=20, num_workers=N)
for _ in tqdm.tqdm(dataloader):
    pass

The amount of shared memory consumed for various N and I is as follows:

| I | N = 2 | N = 4 |
| --- | --- | --- |
| 10 | ~100M | ~202M |
| 100 | ~1.2G | ~2G |
| directly returning self.X | ~100G | >200G |

Note that even though the original feature tensor is 3.7G, the shared memory consumed can stay below that, which means the feature tensor transmitted from the main process to the subprocesses does not consume shared memory.

So I think we should be good if we can somehow avoid returning the feature tensors of the original graph altogether.

Another insight is that the linear scaling of shared memory consumption w.r.t. the number of subprocesses is still there even if all the subprocesses return the very same tensor.
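
For reference, the last row of the table above ("directly returning self.X") presumably corresponds to a __getitem__ along these lines; once every item is the full 3.7G tensor, each copy the workers return lands in /dev/shm and the usage blows well past the size of the original tensor.

    def __getitem__(self, i):
        # Presumed variant for the "directly returning self.X" row: every item
        # is the whole 3.7G tensor, so each batch the workers send back puts
        # full-size copies into shared memory.
        return self.X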
