Binary / row helpers #6096

Status: Open · wants to merge 7 commits into master
Conversation

@bkirwi commented on Jul 19, 2024

Which issue does this PR close?

Closes #6063.

(Potentially - still under discussion at the linked issue!)

Rationale for this change

I've also added the optional `from_binary` method discussed in the linked issue.

What changes are included in this PR?

`data`, `into_binary`, and `from_binary` functions, and an extension to the fuzz test that checks that the data survives the round trip.

Are there any user-facing changes?

Yes, though I suspect the rustdoc covers them enough?

Currently these are accessible via `AsRef`, but that trait only gives
you the bytes with the lifetime of the `Row` struct and not the lifetime
of the backing data.
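
Sketched usage of the API this PR proposes (hedged: the method names are taken from the diff below, but signatures and error handling were still being discussed in review, e.g. `into_binary` may become `try_into_binary`):

    use std::sync::Arc;

    use arrow_array::{cast::AsArray, ArrayRef, BinaryArray, StringArray};
    use arrow_row::{RowConverter, SortField};
    use arrow_schema::DataType;

    fn main() {
        // Convert a column into the row format as usual.
        let converter = RowConverter::new(vec![SortField::new(DataType::Utf8)]).unwrap();
        let col: ArrayRef = Arc::new(StringArray::from(vec!["a", "b", "c"]));
        let rows = converter.convert_columns(&[col]).unwrap();

        // into_binary: hand off the row bytes as a BinaryArray without copying,
        // so the data outlives the Rows struct (unlike the AsRef-based access).
        let binary: BinaryArray = rows.into_binary();

        // ... the BinaryArray can now be stored or shipped like any other array ...

        // from_binary: rebuild Rows from those bytes (validated), then convert back.
        let rows = converter.from_binary(binary);
        let cols = converter.convert_rows(rows.iter()).unwrap();
        assert_eq!(cols[0].as_string::<i32>().value(0), "a");
    }
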
    /// Create a [BinaryArray] from the [Rows] data without reallocating the
    /// underlying bytes.
    pub fn into_binary(self) -> BinaryArray {
        assert!(

Contributor:
I wonder if we should return `LargeBinaryArray` here, as the offsets are `Vec<usize>` (not `Vec<u32>`).

Author (bkirwi):
Open to it! Since array indices are signed ints, we'd need this assert in any case, so I went with what seemed like the most common type. I don't feel especially strongly about it though!

Contributor:
I agree Binary is likely the most common type. We could potentially add a `to_large_binary` to support converting to `LargeBinary` 🤔
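
For context on the trade-off being discussed, the two array types differ only in their signed offset width (a quick illustration, not part of the diff):

    use arrow_array::GenericBinaryArray;

    // BinaryArray and LargeBinaryArray are the same structure parameterized by
    // offset width; the signed i32 offsets are what cap a BinaryArray's value
    // buffer at i32::MAX bytes and force the length assert in into_binary.
    type I32Offsets = GenericBinaryArray<i32>; // aka BinaryArray
    type I64Offsets = GenericBinaryArray<i64>; // aka LargeBinaryArray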

@alamb left a comment:
Thanks @bkirwi and @XiangpengHao

I think this PR needs some additional negative tests and error testing, but otherwise it is looking good to me.

cc @tustvold in case you have time to comment on the safety of the design

            self.buffer.len() <= i32::MAX as usize,
            "rows buffer too large"
        );
        let offsets_scalar = ScalarBuffer::from_iter(self.offsets.into_iter().map(i32::usize_as));

Contributor:
I think this needs to check that the offsets don't overflow an `i32`. This should be a `try_into` I think, and the method should be something like `fn try_into_binary(self) -> Result<BinaryArray>`.

Author (bkirwi):
My belief is that this is guaranteed by the assert above (which asserts that the len is not larger than `i32::MAX`) and the existing offset invariant (which guarantees that all offsets are valid indices into the binary data). So a more expensive O(n) check seemed redundant.

I'll go ahead and turn that assert into a `Result::Err`; let me know what you think about the other side of it!
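
For illustration, a sketch of the O(n) checked conversion the two comments are weighing (the helper name is hypothetical; the diff instead relies on the single length check plus the offsets invariant):

    use arrow_schema::ArrowError;

    // Hypothetical per-offset check: fail with an error, rather than wrapping,
    // if any usize offset does not fit in an i32.
    fn offsets_to_i32(offsets: Vec<usize>) -> Result<Vec<i32>, ArrowError> {
        offsets
            .into_iter()
            .map(|o| {
                i32::try_from(o).map_err(|_| {
                    ArrowError::InvalidArgumentError("rows buffer too large".to_string())
                })
            })
            .collect()
    }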

@@ -738,6 +738,23 @@ impl RowConverter {
        }
    }

    /// Create a new [Rows] instance from the given binary data.
    pub fn from_binary(&self, array: BinaryArray) -> Rows {

Contributor:
I wonder if this function needs to be marked `unsafe` -- I am worried that someone could insert invalid data into `Rows` here (e.g. modify the bytes to read in invalid UTF-8).

However, I see that there is already a way to convert between `Rows` and `[u8]` and then from `[u8]` back to `Rows` (e.g. `RowParser::parse`).

Author (bkirwi):
Yeah, that's my understanding! This would definitely be unsafe without the `validate_utf8` below... but with it, I believe this has the same safety properties as the existing public API.

        for (actual, expected) in back.iter().zip(&arrays) {
            actual.to_data().validate_full().unwrap();
            dictionary_eq(actual, expected)
        }

Contributor:
Could you also add some negative tests (e.g. create a `BinaryArray` from some random bytes, try to convert it back to an array, and make sure it panics)?

Author (bkirwi):
Sure! I notice there's just one existing test for the parser, for UTF-8 data; I've matched that and added a couple more tests for interesting cases.

(This seems like great API surface to fuzz... but it's challenging to write a real fuzzer for, since panics are expected and miri is disabled for our existing fuzzer. May be interesting future work!)
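
A sketch of the kind of negative test requested above (the test name, input bytes, and the panic expectation are assumptions for illustration, not the tests actually added in this PR):

    use arrow_array::BinaryArray;
    use arrow_row::{RowConverter, SortField};
    use arrow_schema::DataType;

    #[test]
    #[should_panic]
    fn from_binary_rejects_arbitrary_bytes() {
        // A converter for a single UTF-8 sort key.
        let converter = RowConverter::new(vec![SortField::new(DataType::Utf8)]).unwrap();

        // Bytes that were never produced by this converter.
        let garbage = BinaryArray::from(vec![Some([0xFF_u8, 0x00, 0xFF].as_slice())]);

        // The round trip should fail validation instead of yielding an invalid array.
        let rows = converter.from_binary(garbage);
        let _ = converter.convert_rows(rows.iter()).unwrap();
    }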

@@ -738,6 +738,23 @@ impl RowConverter {
        }
    }

    /// Create a new [Rows] instance from the given binary data.

@alamb commented on Jul 23, 2024:

Can you please add a doc example showing how to do this?

I think trying to give a basic example of converting rows, then to/from binary, will not only serve as good documentation, it will also make sure all the required APIs are `pub` (for example, I think `RowParser` needs to be `pub`).

Author (bkirwi):
Sure, done.

@alamb marked this pull request as draft on July 25, 2024.

@alamb commented on Jul 25, 2024:

Marking as draft so it is clear this PR isn't waiting on feedback anymore (at least I don't think it is). Please mark it as ready for review when it is ready for another look.

@bkirwi left a comment:

Thanks for the review! I think I've addressed all comments, though there were a couple of things I wasn't certain of; addressed inline.

@bkirwi marked this pull request as ready for review on August 9, 2024.

@bkirwi commented on Aug 9, 2024:

(Looks like there was some merge skew in the tests; I've merged the main branch in here which ought to fix it.)

@alamb commented on Sep 18, 2024:

I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more.

Successfully merging this pull request may close issue #6063: Make it easier to treat Rows as bytes.