Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stop panic in MetadataLoader on invalid data #6367

Merged
merged 4 commits into from
Sep 16, 2024

Conversation

samuelcolvin
Copy link
Contributor

Which issue does this PR close?

None

Rationale for this change

While messing with our object_store implementation I sent invalid data to MetadataEncoder and got a panic:

thread 'tokio-runtime-worker' panicked at /Users/samuel/.cargo/git/checkouts/arrow-rs-3b86e19e889d5acc/ee2f75a/parquet/src/arrow/async_reader/metadata.rs:74:40:
attempt to subtract with overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
libunwind: stepWithCompactEncoding - invalid compact unwind encoding

Looking at the code I'm pretty sure it's because suffix_len is <8 causing the subtract to fail.

What changes are included in this PR?

return an error instead of panicing. I don't know how to reproduce this in a test, hence just the change. LMK if this requires a test?

Are there any user-facing changes?

no.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Sep 6, 2024
@@ -71,7 +71,10 @@ impl<F: MetadataFetch> MetadataLoader<F> {
let suffix_len = suffix.len();

let mut footer = [0; 8];
footer.copy_from_slice(&suffix[suffix_len - 8..suffix_len]);
let Some(footer_start) = suffix_len.checked_sub(8) else {
Copy link
Member

@Xuanwo Xuanwo Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks good to me initially, but I'm wondering about the logic behind it.

First of all, it should fail at file_size < 8, right? We need to clarify this in our API documentation regarding the validation of file_size.

Additionally, it's possible that users provide an incorrect file size. The result of fetch.fetch(footer_start..file_size).await?; seems incorrect. We should specify that fetch must handle this correctly, returning an error if the range of footer_start..file_size fails to read. We can add a check here: if suffix_len != file_size - footer_start, we can raise an error like EOF.

What do you think? My motivation is that we should return the error at the correct place instead of here, where it is just a side effect.


I believe this error could be reproduced by an invalid file size or a malfunctioning fetch that returns a smaller size than expected.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it is also possible to hit this error if size_hint is smaller than 8? A minimum value for size_hint could make sense, or returning an error if it is too small.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ye I think @jhorstmann was right, the error almost certainly happened because the prefetch value was invalid, not because fetch returned the wrong size of slice.

I've updated to check prefetch size and added a test.

Comment on lines 60 to 65
} else if let Some(size_hint) = prefetch {
if size_hint < 8 {
return Err(ParquetError::EOF(format!(
"prefetch size of {size_hint} is less than footer size"
)));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that it's a "hint", I think it's ok to ignore an invalid size and instead set size_hint to 8 if necessary (although I suppose the docs for fetch_parquet_metadata would need to reflect this). I'd make a similar argument for a prefetch value larger than file_size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can just set it to None in both cases if that's agreed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to not penalize a user for over specifying the prefetch. I think of it as a budget... you can use up to 2MB of prefetch. I can see angry AWS customers complaining that a metadata fetch that should have been a single GET is now 3 😅.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'm lost on what you would suggest. Feel free to take this, or let me know concretely what you'd suggest.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking along the lines of

        // If a size hint is provided, read more than the minimum size
        // to try and avoid a second fetch.
        let footer_start = if let Some(size_hint) = prefetch {
            // check for hint smaller than footer
            let size_hint = std::cmp::max(size_hint, FOOTER_SIZE);
            // check for hint larger than the file
            let size_hint = std::cmp::min(size_hint, file_size);
            file_size.saturating_sub(size_hint)
        } else {
            file_size - FOOTER_SIZE
        };

This guards against the error you're seeing, but allows providing a hint larger than the file without triggering extra I/O (as would happen if the hint were simply set to None).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, except we don't need the let size_hint = std::cmp::min(size_hint, file_size); since we're using file_size.saturating_sub(size_hint) directly afterwards.

Changed.

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last nit. Love the reduction in magic numbers! Thanks @samuelcolvin.

@@ -48,12 +48,14 @@ pub struct MetadataLoader<F> {
remainder: Option<(usize, Bytes)>,
}

const FOOTER_SIZE: usize = 8;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is defined elsewhere in the parquet crate.

use crate::file::FOOTER_SIZE;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@crepererum crepererum merged commit 341ec35 into apache:master Sep 16, 2024
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants