Suppress failure when reading $partitions system table in get_indexes #426

Merged
merged 1 commit into from
Dec 22, 2023

Conversation

LittleWat
Contributor

@LittleWat LittleWat commented Nov 30, 2023

Description

Presto supports the $partitions table suffix per its release notes, but Trino does not seem to support it for this table, so the suffix should be removed (?)

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/flask_appbuilder/api/__init__.py", line 110, in wraps
    return f(self, *args, **kwargs)
  File "/app/superset/views/base_api.py", line 127, in wraps
    raise ex
  File "/app/superset/views/base_api.py", line 121, in wraps
    duration, response = time_function(f, self, *args, **kwargs)
  File "/app/superset/utils/core.py", line 1454, in time_function
    response = func(*args, **kwargs)
  File "/app/superset/utils/log.py", line 255, in wrapper
    value = f(*args, **kwargs)
  File "/app/superset/databases/api.py", line 794, in table_extra_metadata
    payload = database.db_engine_spec.extra_table_metadata(
  File "/app/superset/db_engine_specs/trino.py", line 66, in extra_table_metadata
    if indexes := database.get_indexes(table_name, schema_name):
  File "/app/superset/models/core.py", line 863, in get_indexes
    return self.db_engine_spec.get_indexes(self, inspector, table_name, schema)
  File "/app/superset/db_engine_specs/base.py", line 1298, in get_indexes
    return inspector.get_indexes(table_name, schema)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/reflection.py", line 605, in get_indexes
    return self.dialect.get_indexes(
  File "/usr/local/lib/python3.9/site-packages/trino/sqlalchemy/dialect.py", line 283, in get_indexes
    partitioned_columns = self._get_columns(connection, f"{table_name}$partitions", schema, **kw)
  File "/usr/local/lib/python3.9/site-packages/trino/sqlalchemy/dialect.py", line 178, in _get_columns
    res = connection.execute(sql.text(query), {"schema": schema, "table": table_name})
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1306, in execute
    return meth(self, multiparams, params, _EMPTY_EXECUTION_OPTS)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/sql/elements.py", line 325, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1498, in _execute_clauseelement
    ret = self._execute_context(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1862, in _execute_context
    self._handle_dbapi_exception(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2043, in _handle_dbapi_exception
    util.raise_(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1819, in _execute_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.9/site-packages/trino/sqlalchemy/dialect.py", line 399, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/local/lib/python3.9/site-packages/trino/dbapi.py", line 587, in execute
    self._iterator = iter(self._query.execute())
  File "/usr/local/lib/python3.9/site-packages/trino/client.py", line 810, in execute
    self._result.rows += self.fetch()
  File "/usr/local/lib/python3.9/site-packages/trino/client.py", line 830, in fetch
    status = self._request.process(response)
  File "/usr/local/lib/python3.9/site-packages/trino/client.py", line 609, in process
    raise self._process_error(response["error"], response.get("id"))
sqlalchemy.exc.ProgrammingError: (trino.exceptions.TrinoUserError) TrinoUserError(type=USER_ERROR, name=NOT_SUPPORTED, message="Invalid Hudi table name (unknown type 'partitions'): my-table$partitions", query_id=20231129_142219_00171_dk2ne)
[SQL: SELECT
    "column_name",
    "data_type",
    "column_default",
    UPPER("is_nullable") AS "is_nullable"
FROM "information_schema"."columns"
WHERE "table_schema" = ?
  AND "table_name" = ?
ORDER BY "ordinal_position" ASC]
[parameters: ('my-schema', 'my-table$partitions')]

Non-technical explanation

Release notes

(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

* Fix error when fetching Hudi tables schema. ({issue}`https://github.com/apache/superset/issues/21945`)

cla-bot bot commented Nov 30, 2023

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@@ -280,7 +280,7 @@ def get_indexes(self, connection: Connection, table_name: str, schema: str = Non
         if not self.has_table(connection, table_name, schema):
             raise exc.NoSuchTableError(f"schema={schema}, table={table_name}")
 
-        partitioned_columns = self._get_columns(connection, f"{table_name}$partitions", schema, **kw)
+        partitioned_columns = self._get_columns(connection, table_name, schema, **kw)
Member

@ebyhr ebyhr Nov 30, 2023

$partitions table depends on the connector's implementation, so we shouldn't have used it by hard-coding.
We could change this method like a) change the behavior based on the connector name or b) suppress exception. It would be nice to adopt both eventually as relying on the name isn't perfect, but we can start with b) in my opinion.

Removing $partitions table suffix doesn't make sense to me as it may break existing usages.
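
Options a) and b) above could be combined roughly as follows. This is only a sketch under stated assumptions, not the dialect's actual code: `partition_indexes`, `fetch_columns`, and `CONNECTORS_WITHOUT_PARTITIONS_TABLE` are hypothetical names, `fetch_columns` stands in for the dialect's `_get_columns` call, and how the connector name is obtained is left to the caller.

```python
import logging
from typing import Callable, Dict, List, Optional

logger = logging.getLogger(__name__)

# Hypothetical list for illustration; this thread only reports Hudi as lacking $partitions.
CONNECTORS_WITHOUT_PARTITIONS_TABLE = {"hudi"}


def partition_indexes(
    fetch_columns: Callable[[str], List[Dict[str, str]]],
    table_name: str,
    connector_name: Optional[str] = None,
) -> List[dict]:
    """Return one 'partition' index, or [] when partition columns are unavailable."""
    # Option a): skip the query when the connector is known not to expose $partitions.
    if connector_name in CONNECTORS_WITHOUT_PARTITIONS_TABLE:
        return []
    # Option b): run the query and suppress any failure.
    try:
        columns = fetch_columns(f"{table_name}$partitions")
    except Exception:
        logger.debug("Couldn't fetch partition columns for %s", table_name, exc_info=True)
        return []
    if not columns:
        return []
    return [
        {
            "name": "partition",
            "column_names": [col["name"] for col in columns],
            "unique": False,
        }
    ]
```

With option a) in place, no failing query is sent for the listed connectors; option b) covers connectors that are not on the list.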

cc: @hashhar

Contributor Author

@LittleWat LittleWat Nov 30, 2023

Thank you for sharing your opinion! We're using the Hudi connector, so if I choose option a), the fix would look like the following, right...? (Please feel free to make commits to this branch or create another PR 🙇 )

if connector_name == "hudi":
    partitioned_columns = self._get_columns(connection, table_name, schema, **kw)
else:
    partitioned_columns = self._get_columns(connection, f"{table_name}$partitions", schema, **kw)

Member

@ebyhr ebyhr Nov 30, 2023

Hudi connector doesn't support $partitions system table for now. We need a different approach, e.g. parse result of SHOW CREATE TABLE, or treat all columns as non-partition columns

Contributor Author

@LittleWat LittleWat Dec 6, 2023

@ebyhr thank you for your comment!

but we can start with b) in my opinion.

Following this, I made a commit 2232541

Could you check if this is what you expect...? 🙏


e.g. parse result of SHOW CREATE TABLE,

sorry, I don't understand how to parse this result. 🤦

Here is the result of SHOW CREATE TABLE for the target Hudi table:

  • SHOW CREATE TABLE using Trino
Create Table
CREATE TABLE "<CATALOG>"."<DB>"."<TABLE>" (
   _hoodie_commit_time varchar COMMENT '',
   _hoodie_commit_seqno varchar COMMENT '',
   _hoodie_record_key varchar COMMENT '',
   _hoodie_partition_path varchar COMMENT '',
   _hoodie_file_name varchar COMMENT '',
   global_intensity double COMMENT '',
   global_reactive_power double COMMENT '',
   city varchar COMMENT '',
   voltage double COMMENT '',
   global_active_power double COMMENT '',
   sub_metering_1 double COMMENT '',
   sub_metering_2 double COMMENT '',
   sub_metering_3 double COMMENT '',
   meter_id varchar COMMENT '',
   location array(double) COMMENT '',
   ts varchar COMMENT ''
)
WITH (
   location = 's3a://<my-bucket>',
   partitioned_by = ARRAY['ts']
)
  • SHOW CREATE TABLE using Presto
Create Table
CREATE TABLE hudi."data-platform-demo"."hudi-s3-ingest " (
   "_hoodie_commit_time" varchar,
   "_hoodie_commit_seqno" varchar,
   "_hoodie_record_key" varchar,
   "_hoodie_partition_path" varchar,
   "_hoodie_file_name" varchar,
   "global_intensity" double,
   "global_reactive_power" double,
   "city" varchar,
   "voltage" double,
   "global_active_power" double,
   "sub_metering_1" double,
   "sub_metering_2" double,
   "sub_metering_3" double,
   "meter_id" varchar,
   "location" array(double),
   "ts" varchar
)

The difference is that Trino has:

WITH (
   location = 's3a://<my-bucket>',
   partitioned_by = ARRAY['ts']
)

In Superset, fetching Hudi schema works in Presto but it does not work in Trino.
How can we use this information...? 🙇

Member

@hashhar hashhar Dec 14, 2023

I think Yuya meant that instead of using $partitions table in case of Hudi you can fire a SHOW CREATE TABLE and use the partitioned_by information from that output instead.

IMO however the simple fix is to do something like:

        partitioned_columns = None
        try:
            partitioned_columns = self._get_columns(connection, f"{table_name}$partitions", schema, **kw)
        except Exception as e:
            logger.debug("Couldn't fetch partition columns for ...")
        if not partitioned_columns:
            return []
        partition_index = dict(
            name="partition",
            column_names=[col["name"] for col in partitioned_columns],
            unique=False
        )
        return [partition_index]

This feature shouldn't have been added to begin with, since there's no general-purpose way to figure out whether a table is partitioned at the moment. For example, Hudi/Hive use partitioned_by while Iceberg uses partitioning.
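
For reference, the SHOW CREATE TABLE approach mentioned above might look roughly like the sketch below. It assumes the DDL text has already been fetched as a string and that the partition property appears as `partitioned_by = ARRAY[...]` or `partitioning = ARRAY[...]`, as in the output quoted earlier in this thread. `partition_entries` is a hypothetical helper, not part of the client, and the regex is deliberately naive (for example, it does not handle a `]` inside a quoted value).

```python
import re
from typing import List

# Property names taken from this thread: Hudi/Hive use partitioned_by, Iceberg uses partitioning.
_PARTITION_PROPERTY = re.compile(
    r"\b(?:partitioned_by|partitioning)\s*=\s*ARRAY\s*\[(.*?)\]",
    re.IGNORECASE | re.DOTALL,
)
# A single-quoted SQL string literal, where '' is an escaped quote.
_QUOTED = re.compile(r"'((?:[^']|'')*)'")


def partition_entries(ddl: str) -> List[str]:
    """Return the quoted entries of the partition property, or [] if it is absent."""
    match = _PARTITION_PROPERTY.search(ddl)
    if match is None:
        return []
    return [item.replace("''", "'") for item in _QUOTED.findall(match.group(1))]


ddl = """CREATE TABLE "<CATALOG>"."<DB>"."<TABLE>" (
   ts varchar COMMENT ''
)
WITH (
   location = 's3a://<my-bucket>',
   partitioned_by = ARRAY['ts']
)"""
print(partition_entries(ddl))
```

The per-connector differences are the reason this was not adopted here: the property name varies, and the entries may not always be plain column names, so each connector would need its own handling.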

Contributor Author

@hashhar thank you for your comment! I fixed the code to follow your implementation.
I see... Could we use the simple fix for now to make Superset work...? 🙏
When creating the dataset in Superset using Trino+Hudi, this fetching error blocks it. There is a workaround to create the dataset via SQL Lab but we want to use the normal way to create the dataset.

@LittleWat LittleWat marked this pull request as ready for review December 6, 2023 07:56
@LittleWat LittleWat changed the title from "remove $partitions table suffix" to "Fix the error when fetching Hudi tables Schema" Dec 6, 2023
@LittleWat LittleWat changed the title from "Fix the error when fetching Hudi tables Schema" to "Fix the error when fetching Hudi tables schema" Dec 6, 2023
@LittleWat LittleWat changed the title from "Fix the error when fetching Hudi tables schema" to "Fix the error when fetching Hudi tables schema of Supserset" Dec 6, 2023
@LittleWat LittleWat changed the title from "Fix the error when fetching Hudi tables schema of Supserset" to "Fix the error when fetching Hudi tables schema of Superset" Dec 6, 2023
@LittleWat LittleWat changed the title from "Fix the error when fetching Hudi tables schema of Superset" to "Fix the error when fetching Hudi tables schema in Superset" Dec 6, 2023
LittleWat added a commit to LittleWat/trino-python-client that referenced this pull request Dec 18, 2023
@cla-bot cla-bot bot added the cla-signed label Dec 18, 2023
@LittleWat LittleWat changed the title from "Fix the error when fetching Hudi tables schema in Superset" to "Handle the error when fetching Hudi tables schema in Superset" Dec 18, 2023
Member

@hashhar hashhar left a comment

This is okay as a first step (but note Trino UI will still show failed queries).

@hovaesco Do you plan to follow up on this to improve it?

@LittleWat
Contributor Author

I can confirm that this patch fixes the issue, thanks!!
(Screenshot 2023-12-20 16 57 12)
I'd be glad if you could merge this so that we don't have to use the custom library. 🙏

Member

@ebyhr ebyhr left a comment

Could you squash commits into one and fix the commit title like "Suppress failure when reading $partitions system table in get_indexes"?
https://github.com/trinodb/trino/blob/master/.github/DEVELOPMENT.md#format-git-commit-messages is the guideline for a commit message.

@LittleWat LittleWat changed the title from "Handle the error when fetching Hudi tables schema in Superset" to "Suppress failure when reading $partitions system table in get_indexes" Dec 20, 2023
@LittleWat
Contributor Author

thank you for your review in detail and sharing the doc! Fixed following your suggestion 🙇

Not all connectors have a `$partitions` table. This caused `get_indexes`
to fail when called on a non-Hive (or non-partitioned Hive) table.

Since the Trino engine doesn't have a concept of partitions, there's no single
way to fetch partition columns. One option is to parse the output of
`SHOW CREATE TABLE` to identify them but the logic would differ based on
what connector is being used. So we just opt to suppress the failure in
case of a non-Hive or non-partitioned Hive table instead.
@hashhar
Member

hashhar commented Dec 22, 2023

edited commit message to add some context, merging. Thanks @LittleWat.

@hashhar hashhar merged commit 9df2cf2 into trinodb:master Dec 22, 2023
12 checks passed
@LittleWat
Contributor Author

Thank you for updating the commit message and merging this! This helps our project!

Development

Successfully merging this pull request may close these issues.

Error when fetching Hudi tables Schema
3 participants