ParquetLoader

This project is inspired by litdata. It implements a PyTorch dataset and dataloader that support streaming and distributed loading of Parquet datasets.

Key features:

Streaming loading of large Parquet datasets, e.g., Hive tables stored in Parquet format.
Near-zero redundancy loading across ranks & workers during distributed training.
Asynchronous preloading to overlap training and loading for better efficiency.
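
The idea behind near-zero redundancy loading can be illustrated with a small sketch. It assumes the unit of work is a row group and that each (rank, worker) pair takes a disjoint, strided subset. The function `assign_row_groups` and its parameters are hypothetical names for illustration; the actual partitioning inside ParquetLoader may differ, for example by splitting within row groups.

```python
def assign_row_groups(num_row_groups, world_size, num_workers):
    """Map each (rank, worker) pair to a disjoint list of row-group indices.

    Hypothetical helper for illustration only; not part of the ParquetLoader API.
    """
    num_consumers = world_size * num_workers
    assignment = {}
    for rank in range(world_size):
        for worker in range(num_workers):
            consumer_id = rank * num_workers + worker
            # Strided split: consumer k reads row groups k, k + N, k + 2N, ...
            assignment[(rank, worker)] = list(
                range(consumer_id, num_row_groups, num_consumers)
            )
    return assignment


# 10 row groups shared by 2 ranks with 2 dataloader workers each.
shards = assign_row_groups(num_row_groups=10, world_size=2, num_workers=2)
```

Because every row group is assigned to exactly one consumer, no rank or worker reads data that another one already covers.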

Limitations:

Less efficient than full memory loading for small datasets.
Degrades to full loading (or worse) for datasets with only one or a few Parquet files/row groups.
Row group size affects efficiency; it's recommended to set it to 1-1000 times the batch size.
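
As a rough sketch of the row group recommendation, the snippet below picks a row group size as a multiple of the batch size and passes it to `pyarrow.parquet.write_table` through its `row_group_size` argument. The helper names and the default multiplier of 100 are assumptions made for this example, and it presumes the dataset is written with pyarrow.

```python
def pick_row_group_size(batch_size, multiplier=100):
    """Return a row group size within 1-1000 times the batch size.

    Hypothetical helper; the default multiplier is an arbitrary choice.
    """
    if not 1 <= multiplier <= 1000:
        raise ValueError("multiplier should be between 1 and 1000")
    return batch_size * multiplier


def write_with_row_groups(table, path, batch_size, multiplier=100):
    """Write a pyarrow Table so each row group holds at most the chosen number of rows."""
    import pyarrow.parquet as pq  # imported lazily; requires pyarrow

    pq.write_table(
        table, path, row_group_size=pick_row_group_size(batch_size, multiplier)
    )


# With a batch size of 256, each row group would hold up to 25600 rows.
rows_per_group = pick_row_group_size(batch_size=256)
```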

Installation

Install from source

git clone https://github.com/clearhanhui/ParquetLoader.git
cd ParquetLoader
pip install .

Usage

from parquet_loader import ParquetDataset, ParquetDataLoader
dataset = ParquetDataset('/path/to/parquet/dataset')
dataloader = ParquetDataLoader(dataset)

See examples in tests.

Benchmark

Full loading vs. streaming loading

                      Time (s)   Memory (MB)
Full loading          3.041      153
Streaming loading     7.290      610

Synchronous loading vs. asynchronous loading

                      Time (s)
Synchronous loading   39.204
Asynchronous loading  25.854

See full results in benchmarks.
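
The gain from asynchronous loading comes from overlapping data loading with training. A minimal sketch of that pattern, assuming a background thread and a bounded queue, is shown below. The `prefetch` function is a hypothetical illustration and not how ParquetLoader is necessarily implemented.

```python
import queue
import threading

_SENTINEL = object()  # marks the end of the stream


def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable` while a background thread loads ahead.

    Illustrative only: errors raised in the producer thread are not
    propagated to the consumer in this sketch.
    """
    buf = queue.Queue(maxsize=buffer_size)

    def producer():
        for item in iterable:
            buf.put(item)  # blocks when the buffer is full
        buf.put(_SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is _SENTINEL:
            return
        yield item


# The consumer sees the same items in the same order as a plain loop would.
batches = list(prefetch(range(5)))
```

While the consumer processes one batch, the producer is already fetching the next ones, up to `buffer_size` items ahead.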
