Parquet: Verify 32-bit CRC checksum when decoding pages #6290

Open
wants to merge 9 commits into
base: master

@xmakro xmakro commented Aug 22, 2024

Closes #6289

Please let me know if we should expose this in the reader APIs instead of a crate feature

@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 22, 2024
Contributor

@tustvold tustvold left a comment

Can we get some tests, please?

parquet/Cargo.toml (outdated, resolved)
parquet/src/errors.rs (outdated, resolved)
Contributor

@alamb alamb left a comment

I agree with @tustvold that this code needs to have some tests to ensure we don't break the feature in the future

Also I think the feature flag should be documented here https://crates.io/crates/parquet

Member

mapleFU commented Aug 27, 2024

FYI: https://github.com/apache/parquet-testing/tree/master/data
You can check the files whose names contain "checksum".

Author

xmakro commented Sep 1, 2024

Thanks for the pointers. I added the tests and documented the feature flag. Please take a look

Contributor

alamb commented Sep 18, 2024

I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more

Contributor

@etseidl etseidl left a comment

Thanks for this! Please run cargo +stable fmt --all and check in the result. Have you run any benchmarks to see whether the CRC calculation has a measurable impact?

@@ -215,3 +218,4 @@ harness = false

[lib]
bench = false

Contributor

Suggested change

@@ -82,4 +83,4 @@ The `parquet` crate provides the following features which may be enabled in your

## License

- Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0.
+ Licensed under the Apache License, Version 2.0: <http://www.apache.org/licenses/LICENSE-2.0>.
Contributor

Are the angle brackets necessary?

Comment on lines +399 to +401
return Err(ParquetError::General(
"Page CRC checksum mismatch".to_string(),
));
Contributor

Suggested change
- return Err(ParquetError::General(
-     "Page CRC checksum mismatch".to_string(),
- ));
+ return Err(general_err!("Page CRC checksum mismatch"));
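
For readers following the thread, here is a minimal sketch of the kind of check this hunk belongs to. It is not the PR's code. It assumes the page header carries an optional 32-bit CRC stored as a signed integer, and that the checksum is the standard CRC-32 (the gzip variant) computed over the page bytes as stored. The names `crc32` and `verify_page_crc` are hypothetical, a plain `String` stands in for `ParquetError::General`, and the bitwise CRC loop is used only to keep the example dependency-free; a real reader would more likely call an optimized CRC crate.

```rust
/// Bitwise CRC-32 using the reflected polynomial 0xEDB88320 (gzip variant).
/// Hypothetical helper: slow but dependency-free, for illustration only.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All ones when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Hypothetical check: compares the CRC stored in a page header (if any)
/// against the CRC computed over the page bytes.
/// Pages without a stored CRC are accepted unchanged.
fn verify_page_crc(expected: Option<i32>, page_bytes: &[u8]) -> Result<(), String> {
    match expected {
        // Reinterpret the signed header value as unsigned before comparing.
        Some(stored) if crc32(page_bytes) != stored as u32 => {
            Err("Page CRC checksum mismatch".to_string())
        }
        _ => Ok(()),
    }
}
```

Under these assumptions, `verify_page_crc(None, bytes)` always succeeds, so files written without checksums still decode, while a stored value that disagrees with the computed one produces the same error message as in the hunk above.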

@@ -0,0 +1,55 @@
use std::path::PathBuf;
Contributor

Please add the Apache license header.

Labels
parquet Changes to the parquet crate
Development

Successfully merging this pull request may close these issues.

Verify 32-bit CRC checksum when decoding parquet pages
5 participants