A Distributed Streaming PyTorch Dataloader for Parquet.

clearhanhui/ParquetLoader

ParquetLoader

This project is inspired by litdata. It implements a PyTorch dataset and dataloader that support streaming and distributed loading of Parquet datasets.

Key features:

  • Streaming loading of large Parquet datasets, e.g., Hive tables stored in Parquet format.
  • Near-zero redundancy loading across ranks & workers during distributed training.
  • Asynchronous preloading to overlap training and loading for better efficiency.
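To illustrate the idea behind low-redundancy loading, here is a minimal sketch that assigns whole row groups to (rank, worker) pairs in round-robin order, so no row group is read twice. This is not ParquetLoader's actual partitioning logic; the function `shard_row_groups` and its parameters are hypothetical names chosen for the example.

```python
def shard_row_groups(num_row_groups, world_size, rank, num_workers, worker_id):
    """Return the row-group indices owned by one (rank, worker) pair.

    Hypothetical helper for illustration only: each pair is treated as one
    consumer, and row groups are dealt out round-robin across consumers.
    """
    num_consumers = world_size * num_workers
    consumer_id = rank * num_workers + worker_id
    return list(range(consumer_id, num_row_groups, num_consumers))


# Example: 10 row groups, 2 ranks, 2 dataloader workers per rank.
assignments = {
    (rank, worker): shard_row_groups(10, 2, rank, 2, worker)
    for rank in range(2)
    for worker in range(2)
}

for key, row_groups in sorted(assignments.items()):
    print(key, row_groups)
```

Under this scheme every row group is read by exactly one consumer. A real implementation may also need to balance the number of samples per rank, which this sketch does not attempt.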

Limitations:

  • Less efficient than loading the full dataset into memory when the dataset is small.
  • Degrades to full loading (or worse) for datasets with only one or a few Parquet files/row groups.
  • Row group size affects efficiency; it's recommended to set it to 1-1000 times the batch size.

Installation

Install from source

git clone https://github.com/clearhanhui/ParquetLoader.git
cd ParquetLoader
pip install .

Usage

from parquet_loader import ParquetDataset, ParquetDataLoader
dataset = ParquetDataset('/path/to/parquet/dataset')
dataloader = ParquetDataLoader(dataset)

See examples in tests.

Benchmark

  • Full loading vs. streaming loading

                           Time (s)   Memory (MB)
    Full loading           3.041      153
    Streaming loading      7.290      610

  • Synchronous loading vs. asynchronous loading

                           Time (s)
    Synchronous loading    39.204
    Asynchronous loading   25.854
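The gain from asynchronous loading comes from overlapping data loading with the training step. A minimal sketch of that pattern, using only the standard library, is a background thread that fills a bounded queue while the consumer works. This is a generic illustration, not the code used by ParquetLoader; `prefetch`, `slow_chunks`, and the sleep durations are hypothetical.

```python
import queue
import threading
import time

_SENTINEL = object()  # marks the end of the stream


def prefetch(iterable, max_prefetch=2):
    """Yield items from `iterable`, loading up to `max_prefetch` ahead.

    Simplification: exceptions raised inside the producer thread are not
    forwarded to the consumer.
    """
    q = queue.Queue(maxsize=max_prefetch)

    def producer():
        for item in iterable:
            q.put(item)      # blocks while the queue is full
        q.put(_SENTINEL)

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = q.get()
        if item is _SENTINEL:
            return
        yield item


def slow_chunks(n, delay=0.01):
    """Stand-in for reading row groups from storage."""
    for i in range(n):
        time.sleep(delay)
        yield i


results = []
for chunk in prefetch(slow_chunks(5)):
    time.sleep(0.01)         # stand-in for a training step
    results.append(chunk)

print(results)
```

The bounded queue caps how much data is held in memory ahead of the consumer; a larger `max_prefetch` hides more loading latency at the cost of more memory.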

See full results in benchmarks.