
[FEATURE] Support for nvCOMP batch API #248

Open
Alexey-Kamenev opened this issue Jun 27, 2023 · 0 comments
Alexey-Kamenev commented Jun 27, 2023

This feature adds support for the nvCOMP batch (low-level) API, which allows processing multiple chunks in parallel.

The proposed implementation provides an easy way to use this API via the well-known numcodecs Codec API. Using numcodecs also enables seamless integration with libraries, such as zarr, that use numcodecs internally.

Additionally, using the nvCOMP batch API enables interoperability between existing codecs and the nvCOMP batch codec. For example, data can be compressed on the CPU using the default LZ4 codec and then decompressed on the GPU using the proposed nvCOMP batch codec.

To support batch mode, the Codec interface was extended with two functions, encode_batch and decode_batch, which implement batch mode.

Note that the current version of zarr does not support chunk-parallel functionality, but there is a proposal for this feature.

Currently the following compression/decompression algorithms are supported:

  • LZ4
  • Gdeflate
  • zstd
  • Snappy

nvCOMP also supports other algorithms, which can be added to kvikio with relatively little effort.
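Selecting an algorithm is then just a matter of codec configuration. A small sketch; the "id" and "algorithm" keys follow the examples below, and actual availability depends on the installed nvCOMP/kvikio:

```python
# Codec configurations for each currently supported algorithm. On a system
# with kvikio installed, passing one of these to
# numcodecs.registry.get_codec would return the corresponding GPU codec.
configs = [
    dict(id="nvcomp_batch", algorithm=algo)
    for algo in ("lz4", "gdeflate", "zstd", "snappy")
]
```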

Usage examples:

  • Simple use of Codec batch API:
import numcodecs
import numpy as np

# Get the codec from the numcodecs registry.
codec = numcodecs.registry.get_codec(dict(id="nvcomp_batch", algorithm="lz4"))

# Create two chunks. The chunks do not have to be the same size.
shape = (4, 8)
chunk1, chunk2 = np.random.randn(2, *shape).astype(np.float32)

# Compress data.
data_comp = codec.encode_batch([chunk1, chunk2])

# Decompress.
data_decomp = codec.decode_batch(data_comp)

# Verify.
np.testing.assert_equal(data_decomp[0].view(np.float32).reshape(shape), chunk1)
np.testing.assert_equal(data_decomp[1].view(np.float32).reshape(shape), chunk2)
  • Using with zarr (no parallel chunking yet - see the note above).
import numcodecs
import numpy as np
import zarr

# Get the codec from the numcodecs registry.
codec = numcodecs.registry.get_codec(dict(id="nvcomp_batch", algorithm="lz4"))
shape = (16, 16)
chunks = (8, 8)

# Create data and compress.
data = np.random.randn(*shape).astype(np.float32)
z1 = zarr.array(data, chunks=chunks, compressor=codec)

# Store in compressed format.
zarr_store = zarr.MemoryStore()
zarr.save_array(zarr_store, z1, compressor=codec)

# Read back/decompress.
z2 = zarr.open_array(zarr_store)

np.testing.assert_equal(z1[:], z2[:])

If desired, the API can also be used directly, without going through the numcodecs API.

rapids-bot bot pushed a commit that referenced this issue Jul 3, 2023
See #248 for more details.

Authors:
  - Alexey Kamenev (https://github.com/Alexey-Kamenev)

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)

URL: #249