
Support Nested Schema Evolution in Parquet for Presto #6675

Closed
wants to merge 1 commit

Conversation

zhenxiao
Collaborator

@zhenxiao zhenxiao commented Nov 18, 2016

Keep this PR just for Nested Schema Evolution in Parquet for Presto

@zhenxiao
Collaborator Author

@martint would you please review when you are free?

@zhenxiao zhenxiao changed the title Nested Column Pruning for Parquet Support Nested Schema Evolution in Parquet for Presto Apr 17, 2017
@zhenxiao
Collaborator Author

@nezihyigitbasi @dain @billonahill

This PR supports all nested schema evolution in Parquet. Could you please review? We've been using it for a long time.

@nezihyigitbasi nezihyigitbasi self-assigned this Apr 17, 2017
@billonahill

@zhenxiao I've run the tests from #4939 on this branch and they do not pass. The added test in that PR is for a table with a struct that has evolved with a new field that is not in an old partition. The test creates a schema with a struct, adds a partition with that schema, and then changes the table schema to add a field to the struct.

To test, I applied the patch of only AbstractTestHiveClient and create-test-hive13.sql from #4939 and reran the tests. Would you please try the same? The patch applies cleanly to this branch.

@Gauravshah

Is there anything I can help with?
I'm facing a similar issue on Athena: Schema mismatch, metastore schema for row column direct_object has 61 fields but parquet schema has 66 fields

@sathiscode

@billonahill I've used the same code, updated it to fit recent code changes, and raised #10158.
The latest code also ran successfully with the changes to AbstractTestHiveClient and create-test-hive13.sql from #4939. Could you please check when you have time?

@sdorazio

Any updates on this? Amazon Athena uses this Parquet parser under the hood, and AWS support refuses to apply the fix on their end until this PR is merged. Long story short: we cannot update our schemas without breaking the ability to query nested structs on older data until this fix is applied.

@zhenxiao
Collaborator Author

zhenxiao commented Mar 2, 2019

Schema evolution is supported in the new Parquet reader, which is enabled by default in recent releases. Closing this PR.

@gudladona

Can someone please post which version of Presto supports schema evolution for nested structs?

@zhenxiao
Collaborator Author

zhenxiao commented Aug 7, 2019

@gudladona87 I think starting from the 0.213 release, the new Parquet reader is enabled by default, which supports schema evolution.
To support schema evolution for Parquet, you also need to set the configuration:
hive.parquet.use-column-names=true
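For reference, that property belongs in the Hive catalog configuration. A minimal sketch (the file path is an assumption based on a typical Presto deployment; your other connector settings will differ):

```properties
# etc/catalog/hive.properties (path is an assumption)
connector.name=hive-hadoop2
# Match Parquet columns to table columns by name rather than by ordinal
# position, so evolved schemas resolve correctly.
hive.parquet.use-column-names=true
```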

@hiddenbit

Regarding AWS Athena, since several people here need to use it and experienced Schema mismatch errors there...

The current version of Athena uses Presto v0.172, which does not support nested schema evolution. But the preview version of Athena currently uses Presto v0.217 which contains a fix for this.

I tested with the Athena preview version and there nested schema evolution is working fine (additional nested fields in Parquet files are ignored, missing ones default to null).

One can use the Athena preview version by running queries in the context of a workgroup named AmazonAthenaPreviewFunctionality; see the Athena FAQ.

We can only hope Amazon updates the production version of Athena soon.

@zhenxiao
Collaborator Author

Nice, thank you for the note, @hiddenbit.

@dracony

dracony commented Feb 5, 2020

Trying this on version 0.227 on EMR against Glue.

There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'b' in table 'test' is declared as type 'struct<d:string,e:string,f:string,g:string>', but partition 'bla=1' declared column 'b' as type 'struct<d:string,e:string,g:string>'

To me it looks like this should work, especially since I turned on using column names to access fields.

If I update the partition to have the full column list, I get this instead:

The column b is declared as type struct<d:string,e:string,g:string,f:string>, but the Parquet file declares the column as type optional group b {
  optional binary d (UTF8);
  optional binary e (UTF8);
  optional binary g (UTF8);
}

The desired effect is getting null values for the missing columns.
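The behavior expected here (extra fields in the file ignored, fields missing from the file read back as null) is essentially name-based struct field resolution, which is what hive.parquet.use-column-names enables. A minimal illustrative sketch in Python; this is not Presto's actual reader code, and `resolve_struct` is a hypothetical helper, assuming the file's struct values arrive as dicts:

```python
def resolve_struct(table_fields, file_row):
    """Name-based struct resolution: for each field the table schema
    declares, take the value from the file if present, else None (null).
    Fields present only in the file are simply ignored."""
    return {name: file_row.get(name) for name in table_fields}

# Table declares struct<d,e,f,g>; an older Parquet file only wrote d, e, g.
table_fields = ["d", "e", "f", "g"]
file_row = {"d": "x", "e": "y", "g": "z"}

resolved = resolve_struct(table_fields, file_row)
# 'f' resolves to None, matching the "missing columns read as null" behavior
```

With index-based resolution, by contrast, the fourth table field would be matched to a nonexistent fourth file field (or to the wrong one if fields were reordered), which is why the error above appears without name-based matching.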

@zhenxiao
Collaborator Author

zhenxiao commented Feb 5, 2020

"Getting null values for the missing columns" should be supported in recent versions.
This is a patch from a long time ago; the expected behavior is supported now.

@dracony

dracony commented Feb 5, 2020

Thanks for such a fast reply!
How do I enable it or which version should I be on?

I am currently on 0.227 and get the above errors.
If this is some misconfiguration on my part, how can I debug this?

@zhenxiao
Collaborator Author

zhenxiao commented Feb 5, 2020

0.230+ should be fine. Could you please share your hive.properties if you see errors again?

@dracony

dracony commented Feb 5, 2020

Sadly, trying a newer version is not that easy, as 0.227 is the latest offered by EMR.
For now, my hive.properties are:

connector.name=hive-hadoop2
hive.metastore-cache-ttl=20m
hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
hive.non-managed-table-writes-enabled = true
hive.s3-file-system-type = EMRFS
hive.hdfs.impersonation.enabled = true
hive.metastore = glue
hive.parquet.use-column-names = true

@zhenxiao zhenxiao deleted the column-prune branch January 22, 2022 15:31