
DGL DataLoader does not maintain example order with shuffle=False when using multiple workers. #7695

Open
mr-mateusz opened this issue Aug 13, 2024 · 1 comment
Labels
bug:confirmed Something isn't working

Comments

@mr-mateusz

🐛 Bug

DGL DataLoader does not maintain the order of examples with shuffle=False when num_workers > 1 and batch_size * num_workers is smaller than the dataset size.
Elements within a single batch are in order, but the batches produced by different workers are interleaved, so order is not preserved across batches.
This behavior seems inconsistent with the expected operation of a DataLoader when shuffle=False.

To Reproduce

Code:

import dgl
import torch

num_layers = 4

# Example dataset
random_embeddings = torch.randn(10000, 128)
target_values = torch.rand(10000)

# Indices of test nodes
test_start_index = 1000
test_size = 4096
test_mask = torch.zeros(10000, dtype=torch.bool)
test_mask[test_start_index:test_start_index + test_size] = True

# Create graph
dgl_graph = dgl.knn_graph(random_embeddings, 10, exclude_self=True)

dgl_graph.ndata['features'] = random_embeddings
dgl_graph.ndata['target'] = target_values

# Indices of 'test' elements
_nids = torch.where(test_mask)[0]


print("Number of rows:", dgl_graph.ndata['target'][test_mask].shape)


# Example 1
_sampler = dgl.dataloading.MultiLayerFullNeighborSampler(num_layers)
_loader = dgl.dataloading.DataLoader(dgl_graph, _nids, _sampler, batch_size=1024, shuffle=False, drop_last=False,
                                     num_workers=4)

print('Example 1. 4 workers, batch size 1024. -> 4 * 1024 = 4096 (equal to test_size)')

_targets_iterated = []
for in_nodes, out_nodes, blocks in _loader:
    print(out_nodes[:5], out_nodes[-5:])
    _targets_iterated.append(blocks[-1].dstdata['target'])

_targets_iterated = torch.cat(_targets_iterated)

print(torch.equal(dgl_graph.ndata['target'][test_mask], _targets_iterated))

print('---')

# Example 2

print('Example 2. 4 workers, batch size 512. -> 4 * 512 = 2048 (less than test_size)')

_sampler = dgl.dataloading.MultiLayerFullNeighborSampler(num_layers)
_loader = dgl.dataloading.DataLoader(dgl_graph, _nids, _sampler, batch_size=512, shuffle=False, drop_last=False,
                                     num_workers=4)

_targets_iterated = []
for in_nodes, out_nodes, blocks in _loader:
    print(out_nodes[:5], out_nodes[-5:])
    _targets_iterated.append(blocks[-1].dstdata['target'])

_targets_iterated = torch.cat(_targets_iterated)

print(torch.equal(dgl_graph.ndata['target'][test_mask], _targets_iterated))

print('---')

# Example 3

print('Example 3. 1 worker, batch size 512')

_sampler = dgl.dataloading.MultiLayerFullNeighborSampler(num_layers)
_loader = dgl.dataloading.DataLoader(dgl_graph, _nids, _sampler, batch_size=512, shuffle=False, drop_last=False,
                                     num_workers=1)

_targets_iterated = []
for in_nodes, out_nodes, blocks in _loader:
    print(out_nodes[:5], out_nodes[-5:])
    _targets_iterated.append(blocks[-1].dstdata['target'])

_targets_iterated = torch.cat(_targets_iterated)

print(torch.equal(dgl_graph.ndata['target'][test_mask], _targets_iterated))

# Torch Dataloader

print('---')
print('pytorch')



class MyDataset(torch.utils.data.Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]
    
    
dataset = MyDataset(torch.where(test_mask)[0])

dataloader = torch.utils.data.DataLoader(dataset, batch_size=512, shuffle=False, num_workers=4)


for batch in dataloader:
    print(batch[:5], batch[-5:])

Output:

Number of rows: torch.Size([4096])
Example 1. 4 workers, batch size 1024. -> 4 * 1024 = 4096 (equal to test_size)
tensor([1000, 1001, 1002, 1003, 1004]) tensor([2019, 2020, 2021, 2022, 2023])
tensor([2024, 2025, 2026, 2027, 2028]) tensor([3043, 3044, 3045, 3046, 3047])
tensor([3048, 3049, 3050, 3051, 3052]) tensor([4067, 4068, 4069, 4070, 4071])
tensor([4072, 4073, 4074, 4075, 4076]) tensor([5091, 5092, 5093, 5094, 5095])
True
---
Example 2. 4 workers, batch size 512. -> 4 * 512 = 2048 (less than test_size)    <= here elements are not in order
tensor([1000, 1001, 1002, 1003, 1004]) tensor([1507, 1508, 1509, 1510, 1511])
tensor([2024, 2025, 2026, 2027, 2028]) tensor([2531, 2532, 2533, 2534, 2535])
tensor([3048, 3049, 3050, 3051, 3052]) tensor([3555, 3556, 3557, 3558, 3559])
tensor([4072, 4073, 4074, 4075, 4076]) tensor([4579, 4580, 4581, 4582, 4583])
tensor([1512, 1513, 1514, 1515, 1516]) tensor([2019, 2020, 2021, 2022, 2023])
tensor([2536, 2537, 2538, 2539, 2540]) tensor([3043, 3044, 3045, 3046, 3047])
tensor([3560, 3561, 3562, 3563, 3564]) tensor([4067, 4068, 4069, 4070, 4071])
tensor([4584, 4585, 4586, 4587, 4588]) tensor([5091, 5092, 5093, 5094, 5095])
False
---
Example 3. 1 worker, batch size 512
tensor([1000, 1001, 1002, 1003, 1004]) tensor([1507, 1508, 1509, 1510, 1511])
tensor([1512, 1513, 1514, 1515, 1516]) tensor([2019, 2020, 2021, 2022, 2023])
tensor([2024, 2025, 2026, 2027, 2028]) tensor([2531, 2532, 2533, 2534, 2535])
tensor([2536, 2537, 2538, 2539, 2540]) tensor([3043, 3044, 3045, 3046, 3047])
tensor([3048, 3049, 3050, 3051, 3052]) tensor([3555, 3556, 3557, 3558, 3559])
tensor([3560, 3561, 3562, 3563, 3564]) tensor([4067, 4068, 4069, 4070, 4071])
tensor([4072, 4073, 4074, 4075, 4076]) tensor([4579, 4580, 4581, 4582, 4583])
tensor([4584, 4585, 4586, 4587, 4588]) tensor([5091, 5092, 5093, 5094, 5095])
True
---
pytorch
tensor([1000, 1001, 1002, 1003, 1004]) tensor([1507, 1508, 1509, 1510, 1511])
tensor([1512, 1513, 1514, 1515, 1516]) tensor([2019, 2020, 2021, 2022, 2023])
tensor([2024, 2025, 2026, 2027, 2028]) tensor([2531, 2532, 2533, 2534, 2535])
tensor([2536, 2537, 2538, 2539, 2540]) tensor([3043, 3044, 3045, 3046, 3047])
tensor([3048, 3049, 3050, 3051, 3052]) tensor([3555, 3556, 3557, 3558, 3559])
tensor([3560, 3561, 3562, 3563, 3564]) tensor([4067, 4068, 4069, 4070, 4071])
tensor([4072, 4073, 4074, 4075, 4076]) tensor([4579, 4580, 4581, 4582, 4583])
tensor([4584, 4585, 4586, 4587, 4588]) tensor([5091, 5092, 5093, 5094, 5095])

Expected behavior

The DataLoader should produce data in the same order as the input indices when shuffle=False, regardless of the number of workers or batch size.
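
Concretely, a minimal check (a sketch reusing _nids and a DataLoader built as in Example 2 of the reproduction; out_nodes is assumed to carry the seed node IDs of each batch, as it appears to in the output above) would be expected to print True for any num_workers:

_all_out_nodes = torch.cat([out_nodes for _, out_nodes, _ in _loader])
# Expected to be True whenever shuffle=False, independent of num_workers / batch_size.
print(torch.equal(_all_out_nodes, _nids))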

Environment

  • DGL Version (e.g., 1.0): 2.3.0
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): torch 2.3.1+cu121
  • OS (e.g., Linux): "Ubuntu 22.04.3 LTS"
  • How you installed DGL (conda, pip, source): pip install dgl -f https://data.dgl.ai/wheels/torch-2.3/repo.html
  • Build command you used (if compiling from source): -
  • Python version: 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0]
  • CUDA/cuDNN version (if applicable): -
  • GPU models and configuration (e.g. V100): -
  • Any other relevant information:

Additional context
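
A possible workaround (a sketch, not a fix, assuming out_nodes contains the seed node IDs of each batch as shown in the output above) is to collect the output node IDs alongside the per-batch results and restore the original order with a single argsort afterwards:

_targets_iterated = []
_out_ids = []
for in_nodes, out_nodes, blocks in _loader:
    _targets_iterated.append(blocks[-1].dstdata['target'])
    _out_ids.append(out_nodes)

_targets_iterated = torch.cat(_targets_iterated)
_out_ids = torch.cat(_out_ids)

# _nids is sorted ascending, so sorting the collected IDs restores the input order.
_order = torch.argsort(_out_ids)
print(torch.equal(dgl_graph.ndata['target'][test_mask], _targets_iterated[_order]))  # True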

@rudongyu rudongyu added the bug:confirmed Something isn't working label Aug 15, 2024
@frozenbugs
Collaborator

Hi @mr-mateusz, can you try GraphBolt (https://docs.dgl.ai/stochastic_training/index.html), our latest state-of-the-art GNN dataloader? The DGL DataLoader is in unmaintained mode now and will be deprecated in the future.

If you observe the same issue, please reach out.
