Suppress failure when reading $partitions system table in get_indexes #426
Conversation
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla
trino/sqlalchemy/dialect.py (outdated)

```diff
@@ -280,7 +280,7 @@ def get_indexes(self, connection: Connection, table_name: str, schema: str = Non
     if not self.has_table(connection, table_name, schema):
        raise exc.NoSuchTableError(f"schema={schema}, table={table_name}")

-    partitioned_columns = self._get_columns(connection, f"{table_name}$partitions", schema, **kw)
+    partitioned_columns = self._get_columns(connection, table_name, schema, **kw)
```
The `$partitions` table depends on the connector's implementation, so we shouldn't have hard-coded it. We could change this method to (a) vary the behavior based on the connector name, or (b) suppress the exception. It would be nice to adopt both eventually, since relying on the name isn't perfect, but we can start with (b) in my opinion.
Removing the `$partitions` table suffix doesn't make sense to me, as it may break existing usages.
cc: @hashhar
Thank you for sharing your opinion! We're using the Hudi connector, so if I choose pattern (a), this should be fixed as follows, right...? (please feel free to make commits to this branch or create another PR 🙇)

```python
if connector_name == "hudi":
    partitioned_columns = self._get_columns(connection, table_name, schema, **kw)
else:
    partitioned_columns = self._get_columns(connection, f"{table_name}$partitions", schema, **kw)
```
The Hudi connector doesn't support the `$partitions` system table for now. We need a different approach, e.g. parse the result of `SHOW CREATE TABLE`, or treat all columns as non-partition columns.
@ebyhr thank you for your comment!

> but we can start with b) in my opinion.

Following this, I made commit 2232541. Could you check if this is what you expect...? 🙏

> e.g. parse result of SHOW CREATE TABLE,

Sorry, I don't understand how to parse this result. 🤦
Here is the result of `SHOW CREATE TABLE` for the target Hudi table:
- `SHOW CREATE TABLE` using Trino:

```sql
CREATE TABLE "<CATALOG>"."<DB>"."<TABLE>" (
   _hoodie_commit_time varchar COMMENT '',
   _hoodie_commit_seqno varchar COMMENT '',
   _hoodie_record_key varchar COMMENT '',
   _hoodie_partition_path varchar COMMENT '',
   _hoodie_file_name varchar COMMENT '',
   global_intensity double COMMENT '',
   global_reactive_power double COMMENT '',
   city varchar COMMENT '',
   voltage double COMMENT '',
   global_active_power double COMMENT '',
   sub_metering_1 double COMMENT '',
   sub_metering_2 double COMMENT '',
   sub_metering_3 double COMMENT '',
   meter_id varchar COMMENT '',
   location array(double) COMMENT '',
   ts varchar COMMENT ''
)
WITH (
   location = 's3a://<my-bucket>',
   partitioned_by = ARRAY['ts']
)
```
- `SHOW CREATE TABLE` using Presto:

```sql
CREATE TABLE hudi."data-platform-demo"."hudi-s3-ingest " (
   "_hoodie_commit_time" varchar,
   "_hoodie_commit_seqno" varchar,
   "_hoodie_record_key" varchar,
   "_hoodie_partition_path" varchar,
   "_hoodie_file_name" varchar,
   "global_intensity" double,
   "global_reactive_power" double,
   "city" varchar,
   "voltage" double,
   "global_active_power" double,
   "sub_metering_1" double,
   "sub_metering_2" double,
   "sub_metering_3" double,
   "meter_id" varchar,
   "location" array(double),
   "ts" varchar
)
```
The difference is that the Trino output includes:

```sql
WITH (
   location = 's3a://<my-bucket>',
   partitioned_by = ARRAY['ts']
)
```

In Superset, fetching the Hudi schema works with Presto but does not work with Trino. How can we use this information...? 🙇
I think Yuya meant that instead of using the `$partitions` table in the case of Hudi, you can fire a `SHOW CREATE TABLE` and use the `partitioned_by` information from that output instead.
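(A rough sketch of that extraction, for illustration only: the function name is hypothetical, and the regex handles only the simple `ARRAY['…']` form shown in the output above.)

```python
import re


def parse_partitioned_by(ddl: str) -> list[str]:
    """Best-effort extraction of partition columns from SHOW CREATE TABLE output."""
    # Matches e.g.  partitioned_by = ARRAY['ts']  inside the WITH (...) clause.
    match = re.search(r"partitioned_by\s*=\s*ARRAY\[([^\]]*)\]", ddl)
    if not match:
        return []  # table is not partitioned, or uses a different property
    # Split the quoted list and strip quotes/whitespace from each entry.
    return [col.strip().strip("'") for col in match.group(1).split(",") if col.strip()]
```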
IMO however the simple fix is to do something like:

```python
partitioned_columns = None
try:
    partitioned_columns = self._get_columns(connection, f"{table_name}$partitions", schema, **kw)
except Exception as e:
    # e.g. the connector doesn't expose a $partitions system table
    logger.debug("Couldn't fetch partition columns for ...")
if not partitioned_columns:
    return []
partition_index = dict(
    name="partition",
    column_names=[col["name"] for col in partitioned_columns],
    unique=False,
)
return [partition_index]
```
This feature shouldn't have been added to begin with, since there's currently no general-purpose way to figure out whether a table is partitioned: while Hudi/Hive use `partitioned_by`, Iceberg uses `partitioning`, for example.
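(Given those per-connector property names, a general-purpose extractor would have to try each known spelling. A sketch under that assumption — the property list below is illustrative, not exhaustive, and the function name is hypothetical.)

```python
import re

# Known per-connector spellings of the partitioning table property.
# This tuple is an assumption and would need to track connector documentation.
_PARTITION_PROPERTY_NAMES = ("partitioned_by", "partitioning")


def extract_partition_columns(ddl: str) -> list[str]:
    """Try each known property spelling against SHOW CREATE TABLE output."""
    for prop in _PARTITION_PROPERTY_NAMES:
        match = re.search(rf"\b{prop}\s*=\s*ARRAY\[([^\]]*)\]", ddl)
        if match:
            return [c.strip().strip("'") for c in match.group(1).split(",") if c.strip()]
    return []  # not partitioned, or an unrecognised property spelling
```

Note that Iceberg entries may be transforms like `'day(ts)'` rather than bare column names, so callers would still need connector-specific handling.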
@hashhar thank you for your comment! I fixed the code to follow your implementation.
I see... Could we use the simple fix for now to make Superset work...? 🙏
When creating a dataset in Superset using Trino+Hudi, this fetching error blocks it. There is a workaround of creating the dataset via SQL Lab, but we want to use the normal way to create the dataset.
This is okay as a first step (but note that the Trino UI will still show failed queries).
@hovaesco Do you plan to follow up on this to improve it?
Could you squash the commits into one and fix the commit title to "Suppress failure when reading $partitions system table in get_indexes"?
https://github.com/trinodb/trino/blob/master/.github/DEVELOPMENT.md#format-git-commit-messages is the guideline for commit messages.
Thank you for your detailed review and for sharing the doc! Fixed following your suggestion 🙇
Not all connectors have a `$partitions` table. This caused `get_indexes` to fail when called on a non-Hive (or non-partitioned Hive) table. Since the Trino engine doesn't have a concept of partitions, there's no single way to fetch partition columns. One option is to parse the output of `SHOW CREATE TABLE` to identify them, but the logic would differ based on which connector is being used. So we simply opt to suppress the failure in the case of a non-Hive or non-partitioned Hive table instead.
Edited the commit message to add some context; merging. Thanks @LittleWat.
Thank you for updating the commit message and merging this! This helps our project!
Description
Presto does support the `$partitions` table suffix per its release notes, but Trino seems not to support it for this connector, so it should be removed (?)
Non-technical explanation
Release notes
(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text: