H5Dataset with PyTorch DataLoader #4

Dingel321 · 2022-11-22T15:39:18Z

Hello,

I want to use the H5Dataset with PyTorch's DataLoader to load data with multiple workers.
I have a dataset of roughly 15 GB, and in my simple benchmark, iterating over the H5Dataset directly was always faster than wrapping it in a DataLoader.
So my question is: how should I set chunk_size and chunk_step to load data optimally with PyTorch's DataLoader?

francois-rozet (Maintainer) · 2022-11-23T10:54:05Z

Hello 👋

The advantage of a DataLoader with num_workers > 0 is that data processing runs concurrently with the main process. Therefore, if a loop iteration takes longer than fetching a batch and transferring it to the main process, the next iteration does not have to wait for data. However, if iterations are fast, the overhead of transferring data between processes can outweigh the benefits.

Here is an example (train.h5 contains 1M samples) where using a DataLoader is worthwhile. The effect is accentuated by the (very) large batch size.

>>> import lampe
>>> import time
>>> import torch
>>> import torch.utils.data as data
>>> import tqdm
>>>
>>> dataset = lampe.data.H5Dataset('train.h5', batch_size=64 * 1024, chunk_size=1024, chunk_step=64)
>>> for theta, x in tqdm.tqdm(dataset, total=16):
...     time.sleep(1)
...
100%|██████████| 16/16 [00:18<00:00,  1.18s/it]
>>>
>>> dataloader = data.DataLoader(dataset, batch_size=None, num_workers=1)
>>> for theta, x in tqdm.tqdm(dataloader, total=16):
...     time.sleep(1)
...
100%|██████████| 16/16 [00:16<00:00,  1.04s/it]

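You can see that the time per iteration is closer to 1 s in the second case (1.04 s/it versus 1.18 s/it), because the worker fetches the next batch while the main process is busy.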
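To make the trade-off concrete, here is a rough, idealized timing model in plain Python. It is only a sketch, not how DataLoader is implemented: it assumes a single worker that prefetches one batch ahead, and the function names are made up for illustration. The 0.18 s fetch time is inferred from the benchmark above (1.18 s/it minus the 1 s sleep), while the 0.05 s transfer cost is a hypothetical value.

```python
def serial_time(fetch: float, compute: float) -> float:
    """Per-iteration time when the main process loads each batch itself."""
    return fetch + compute


def worker_time(fetch: float, compute: float, transfer: float) -> float:
    """Idealized steady-state per-iteration time with one prefetching worker.

    The worker fetches and transfers the next batch while the main process
    works on the current one, so the slower of the two sets the pace.
    """
    return max(fetch + transfer, compute)


FETCH = 0.18     # inferred from the benchmark: 1.18 s/it minus the 1 s sleep
TRANSFER = 0.05  # hypothetical inter-process transfer cost (assumption)

# Compare a slow loop body (1 s) with a fast one (0.01 s).
for compute in (1.0, 0.01):
    print(
        f"compute={compute:.2f}s  "
        f"serial={serial_time(FETCH, compute):.2f}s  "
        f"worker={worker_time(FETCH, compute, TRANSFER):.2f}s"
    )
```

Under these assumptions, the slow loop drops from 1.18 s to 1.00 s per iteration with a worker, whereas the fast loop gets slower (0.23 s instead of 0.19 s) because the transfer cost is no longer hidden. The measured 1.04 s/it is slightly above the ideal 1.00 s, presumably because part of the transfer overhead is paid in the main process.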
