Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Parquet reader builder supports building multiple ranges to read #3841

Merged
merged 14 commits into from
May 10, 2024

Conversation

evenyag
Copy link
Contributor

@evenyag evenyag commented Apr 30, 2024

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

What's changed and what's your intention?

This PR defines the FileRange struct for parquet files and implements a method build_file_ranges() to build ranges from a parquet file. A FileRange contains a range of rows to read from a parquet file. We can read different ranges in parallel later. Now a FileRange is a row group exactly.

pub struct FileRange {}

impl FileRange {
    pub(crate) async fn reader(&self) -> Result<RowGroupReader> {}
}

To reuse code, this PR

  • Refactors the ParquetReader. Now it adds a RowGroupReader to read Batches from a row group. The ParquetReader invokes the RowGroupReader to read the parquet file.
  • Defines a struct FileRangeContext for all ranges of the same parquet file. This PR also moves the precise_filter() method to the context as the context contains all inputs the method needs.

Now the builder uses a method build_reader_input() to construct the FileRangeContext and row groups to read. Both build() and build_file_ranges can reuse this method.

impl ParquetReaderBuilder {
    pub async fn build(&self) -> Result<ParquetReader> {
        let (context, row_groups) = self.build_reader_input().await?;
        ParquetReader::new(context, row_groups).await
    }

    pub async fn build_file_ranges(&self, file_ranges: &mut Vec<FileRange>) -> Result<()> {}

    async fn build_reader_input(&self) -> Result<(FileRangeContextRef, RowGroupMap)> {}
}

This PR also fixes some remaining issues and improves the ReadFormat helper struct.

  • It passes the projection while constructing the struct so it could
    • compute the projection_indices in advance
    • compute the field_id_to_projected_index in advance. Then convert_record_batch() doesn't require &mut self. This is necessary for the FileRangeContext as we have to share the ReadFormat
  • It wraps the SimpleFilterEvaluator into a SimpleFilterContext. The context gets the column info in advance, from the expected region metadata. This ensures the context can use the correct column id to find the column in the ReadFormat

Checklist

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.
  • This PR does not require documentation updates.

@github-actions github-actions bot added the docs-not-required This change does not impact docs. label Apr 30, 2024
@evenyag evenyag marked this pull request as ready for review April 30, 2024 07:00
Copy link

codecov bot commented Apr 30, 2024

Codecov Report

Attention: Patch coverage is 88.25397% with 37 lines in your changes are missing coverage. Please review.

Project coverage is 85.32%. Comparing base (154f561) to head (53396e1).
Report is 6 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3841      +/-   ##
==========================================
- Coverage   85.60%   85.32%   -0.28%     
==========================================
  Files         952      953       +1     
  Lines      163079   163431     +352     
==========================================
- Hits       139596   139451     -145     
- Misses      23483    23980     +497     

@evenyag evenyag changed the title feat: Parquet reader builder support building multiple partitions feat: Parquet reader builder supports building multiple partitions Apr 30, 2024
@evenyag evenyag marked this pull request as draft April 30, 2024 08:37
@evenyag evenyag changed the title feat: Parquet reader builder supports building multiple partitions feat: Parquet reader builder supports building multiple ranges to read Apr 30, 2024
@evenyag evenyag marked this pull request as ready for review May 7, 2024 07:34
src/mito2/src/sst/parquet/reader.rs Show resolved Hide resolved
src/mito2/src/sst/parquet/reader.rs Show resolved Hide resolved
src/mito2/src/sst/parquet/reader.rs Outdated Show resolved Hide resolved
src/mito2/src/sst/parquet/reader.rs Show resolved Hide resolved
@evenyag evenyag requested a review from v0y4g3r May 7, 2024 12:07
Copy link
Collaborator

@fengjiachun fengjiachun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@v0y4g3r v0y4g3r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@v0y4g3r v0y4g3r added this pull request to the merge queue May 10, 2024
Merged via the queue into GreptimeTeam:main with commit 5a0629e May 10, 2024
21 checks passed
@v0y4g3r v0y4g3r deleted the feat/par-row-group-scan branch May 10, 2024 07:53
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs-not-required This change does not impact docs.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants