Support Nested Schema Evolution in Parquet for Presto #6675
@martint would you please review when you are free?
@nezihyigitbasi @dain @billonahill This PR supports all nested schema evolution in Parquet. Could you please review? We've been using it for a long time.
@zhenxiao I've run the tests from #4939 on this branch and they do not pass. The test added in that PR covers a table whose struct has evolved with a new field that is absent from an old partition. The test creates a schema with a struct, adds a partition with that schema, and then changes the table schema to add a field to the struct. To test I applied the patch of only
Is there anything that I can help with?
@billonahill I've taken the same code, updated it to fit recent code changes, and raised #10158.
Any updates on this? Amazon Athena uses this Parquet parser under the hood, and AWS support refuses to apply the fix on their end until this PR is merged. Long story short: we cannot update our schemas without breaking the ability to query nested structs on older data until this fix is applied.
Schema evolution is supported in the new Parquet reader, which is enabled by default in recent releases. Closing this PR.
Can someone please post which version of Presto supports schema evolution for nested structs?
@gudladona87 I think the new Parquet reader, which supports schema evolution, has been enabled by default since the 0.213 release.
Regarding AWS Athena, since several people here need to use it and experienced The current version of Athena uses Presto v0.172, which does not support nested schema evolution. But the preview version of Athena currently uses Presto v0.217, which contains a fix for this. I tested with the Athena preview version, and nested schema evolution works fine there (additional nested fields in Parquet files are ignored, missing ones default to One can use the Athena preview version by running queries in the context of a workgroup named We can only hope Amazon updates the production version of Athena soon.
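The behaviour described in the comment above (extra nested fields in a file are ignored, missing ones get a default) can be illustrated with a small sketch. This is not the Presto or Athena implementation. The schema representation, the `project` helper, and the sample rows are all hypothetical, and the default for a missing field is assumed to be null (`None`), in line with the null values a later comment in this thread asks for.

```python
from typing import Any, Dict, Optional

# Hypothetical schema representation: field name -> nested schema (dict)
# for a struct, or a type name (str) for a leaf column.
Schema = Dict[str, Any]


def project(record: Optional[Dict[str, Any]], table_schema: Schema) -> Optional[Dict[str, Any]]:
    """Project a record read from a file onto the table schema, matching by field name."""
    if record is None:
        return None
    out: Dict[str, Any] = {}
    for name, field_type in table_schema.items():
        value = record.get(name)  # field missing in the file -> None (assumed default)
        if isinstance(field_type, dict):
            # Recurse into nested structs; anything that is not a struct becomes None.
            value = project(value, field_type) if isinstance(value, dict) else None
        out[name] = value
    # Fields present in the record but absent from table_schema are dropped.
    return out


table_schema = {"id": "bigint", "address": {"city": "varchar", "zip": "varchar"}}

# Written before "zip" was added to the struct.
old_row = {"id": 1, "address": {"city": "Oslo"}}
# Written with an extra nested field the table does not declare.
newer_row = {"id": 2, "address": {"city": "Rome", "zip": "00100", "country": "IT"}}

print(project(old_row, table_schema))
print(project(newer_row, table_schema))
```

With these sample rows, the old row gains `zip` as `None` and the newer row loses `country`, so both come back in the shape of the current table schema.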
Nice, thank you for the note, @hiddenbit.
Trying this on version 0.227 on EMR against Glue.
This to me looks like it should work, especially since I turned on using column names to access fields. If I update the partition to have the full column list, I get this instead:
The desired effect is getting null values for the missing columns.
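To make the difference between positional and name-based column access concrete, here is a minimal sketch of the two strategies for top-level columns. It is an illustration only, not how the Presto Hive connector resolves columns; the function names and sample data are hypothetical.

```python
from typing import Any, Dict, List


def read_by_index(file_values: List[Any], table_columns: List[str]) -> Dict[str, Any]:
    """Positional access: table column i takes whatever the file stores at position i."""
    return {
        name: (file_values[i] if i < len(file_values) else None)
        for i, name in enumerate(table_columns)
    }


def read_by_name(file_columns: List[str], file_values: List[Any],
                 table_columns: List[str]) -> Dict[str, Any]:
    """Name-based access: look each table column up by name, None when the file lacks it."""
    by_name = dict(zip(file_columns, file_values))
    return {name: by_name.get(name) for name in table_columns}


# Hypothetical old file written before "email" was added in the middle of the table.
file_columns = ["id", "name"]
file_values = [7, "alice"]
table_columns = ["id", "email", "name"]

print(read_by_index(file_values, table_columns))
print(read_by_name(file_columns, file_values, table_columns))
```

In this example the positional read puts the file's `name` value into `email`, while the name-based read returns `None` for `email` and keeps `name` intact, which is the outcome the comment above describes as desired.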
Thanks for such a fast reply! I am currently on 0.227 and get the above errors.
230+ should be fine. Could you please share your hive.properties if you see errors again?
Sadly, trying a newer version is not that easy, as 0.227 is the latest offered by EMR.
Keep this PR just for Nested Schema Evolution in Parquet for Presto