
Support Nested Schema Evolution in Parquet for Presto #6675

Closed
wants to merge 1 commit

Conversation

zhenxiao
Collaborator

@zhenxiao zhenxiao commented Nov 18, 2016

Keep this PR just for Nested Schema Evolution in Parquet for Presto

@zhenxiao
Collaborator Author

@martint would you please review when you are free?

@zhenxiao zhenxiao changed the title Nested Column Pruning for Parquet Support Nested Schema Evolution in Parquet for Presto Apr 17, 2017
@zhenxiao
Collaborator Author

@nezihyigitbasi @dain @billonahill

This PR supports all nested schema evolution in Parquet. Could you please review? We've been using it for a long time.

@nezihyigitbasi nezihyigitbasi self-assigned this Apr 17, 2017
@billonahill

@zhenxiao I've run the tests from #4939 on this branch and they do not pass. The added test in that PR is for a table with a struct that has evolved with a new field that is not in an old partition. The test creates a schema with a struct, adds a partition with that schema, and then changes the table schema to add a field to the struct.

To test, I applied the patch of only AbstractTestHiveClient and create-test-hive13.sql from #4939 and reran the tests. Would you please try the same? The patch applies cleanly to this branch.

@Gauravshah

Is there anything I can help with?
I'm facing a similar issue on Athena: Schema mismatch, metastore schema for row column direct_object has 61 fields but parquet schema has 66 fields

@sathiscode

@billonahill I've used the same code, updated it to fit recent code changes, and raised #10158.
The latest code also ran successfully with the changes to AbstractTestHiveClient and create-test-hive13.sql from #4939. Could you please check when you have time?

@sdorazio

Any updates on this? Amazon Athena uses this Parquet parser under the hood, and AWS support refuses to apply the fix on their end until this PR is merged. Long story short: we cannot update our schemas without breaking the ability to query nested structs on older data until this fix is applied.

@zhenxiao
Collaborator Author

zhenxiao commented Mar 2, 2019

Schema evolution is supported in the new Parquet reader, which is enabled by default in recent releases. Closing this PR.

@gudladona

Can someone please post which version of Presto supports schema evolution for nested structs?

@zhenxiao
Collaborator Author

zhenxiao commented Aug 7, 2019

@gudladona87 I think starting from the 0.213 release, the new Parquet reader is enabled by default, which supports schema evolution.
To support schema evolution for Parquet, you also need to set the configuration:
hive.parquet.use-column-names=true
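For reference, that property belongs in the Hive catalog configuration. A minimal sketch (the file path is an assumption based on a typical Presto deployment; your other connector settings will differ):

```properties
# etc/catalog/hive.properties (path is an assumption)
connector.name=hive-hadoop2
# Match Parquet columns to table columns by name rather than by ordinal
# position, so evolved schemas resolve correctly.
hive.parquet.use-column-names=true
```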

@hiddenbit

Regarding AWS Athena, since several people here need to use it and experienced Schema mismatch errors there...

The current version of Athena uses Presto v0.172, which does not support nested schema evolution. But the preview version of Athena currently uses Presto v0.217 which contains a fix for this.

I tested with the Athena preview version and there nested schema evolution is working fine (additional nested fields in Parquet files are ignored, missing ones default to null).

One can use the Athena preview version by running queries in the context of a workgroup named AmazonAthenaPreviewFunctionality; see the Athena FAQ.

We can only hope Amazon updates the production version of Athena soon.

@zhenxiao
Collaborator Author

Nice, thank you for the note, @hiddenbit.

@dracony

dracony commented Feb 5, 2020

Trying this on version 0.227 on EMR against Glue.

There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'b' in table 'test' is declared as type 'struct<d:string,e:string,f:string,g:string>', but partition 'bla=1' declared column 'b' as type 'struct<d:string,e:string,g:string>'

To me it looks like this should work, especially since I turned on using column names to access fields.

If I update the partition to have the full column list, I get this instead:

The column b is declared as type struct<d:string,e:string,g:string,f:string>, but the Parquet file declares the column as type optional group b {
  optional binary d (UTF8);
  optional binary e (UTF8);
  optional binary g (UTF8);
}

The desired effect is getting null values for the missing columns.
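The behavior expected here (extra fields in the file ignored, fields missing from the file read back as null) is essentially name-based struct field resolution, which is what hive.parquet.use-column-names enables. A minimal illustrative sketch in Python; this is not Presto's actual reader code, and `resolve_struct` is a hypothetical helper, assuming the file's struct values arrive as dicts:

```python
def resolve_struct(table_fields, file_row):
    """Name-based struct resolution: for each field the table schema
    declares, take the value from the file if present, else None (null).
    Fields present only in the file are simply ignored."""
    return {name: file_row.get(name) for name in table_fields}

# Table declares struct<d,e,f,g>; an older Parquet file only wrote d, e, g.
table_fields = ["d", "e", "f", "g"]
file_row = {"d": "x", "e": "y", "g": "z"}

resolved = resolve_struct(table_fields, file_row)
# 'f' resolves to None, matching the "missing columns read as null" behavior
```

With index-based resolution, by contrast, the fourth table field would be matched to a nonexistent fourth file field (or to the wrong one if fields were reordered), which is why the error above appears without name-based matching.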

@zhenxiao
Collaborator Author

zhenxiao commented Feb 5, 2020

"Getting null values for the missing columns" should be supported in recent versions.
This is a patch from a long time ago; the expected behavior is supported now.

@dracony

dracony commented Feb 5, 2020

Thanks for such a fast reply!
How do I enable it or which version should I be on?

I am currently on 0.227 and get the above errors.
If this is some misconfiguration on my part, how can I debug this?

@zhenxiao
Collaborator Author

zhenxiao commented Feb 5, 2020

0.230+ should be fine. Could you please share your hive.properties if you see errors again?

@dracony

dracony commented Feb 5, 2020

Sadly, trying a newer version is not that easy, as 0.227 is the latest offered by EMR.
For now, my hive.properties are:

connector.name=hive-hadoop2
hive.metastore-cache-ttl=20m
hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
hive.non-managed-table-writes-enabled = true
hive.s3-file-system-type = EMRFS
hive.hdfs.impersonation.enabled = true
hive.metastore = glue
hive.parquet.use-column-names = true

@zhenxiao zhenxiao deleted the column-prune branch January 22, 2022 15:31