
perf: refactor SIP-68 db migrations with INSERT SELECT FROM #19421

Merged (2 commits) on Apr 20, 2022

Conversation

@ktmud (Member) commented Mar 29, 2022

SUMMARY

This is another take on #19406 and #19416. The goal is to rewrite the bulk of loading + rewriting operations from Python to native SQL statements by utilizing INSERT SELECT FROM.

The whole migration happens in 3 steps:

  1. Copy tables, table_columns, and sql_metrics to the new sl_xxx tables using INSERT SELECT FROM (see the sketch after this list).
  2. Copy model associations to relationship tables.
  3. Run post-processing on sl_datasets and sl_columns to
    • Tuck additional metadata columns (verbose_name, d3format, etc.) from the original tables under extra_json
    • Add conditional quotes to physical table and column expressions
    • Apply dataset-level is_managed_externally and external_url to columns
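
A minimal sketch of the INSERT ... SELECT FROM pattern behind step 1, written with SQLAlchemy Core (1.x style). The table and column lists here are illustrative rather than the exact migration statements, and the connection URL is a placeholder:

import sqlalchemy as sa

engine = sa.create_engine("postgresql://localhost/superset")  # placeholder URL
metadata = sa.MetaData()
tables = sa.Table("tables", metadata, autoload_with=engine)        # old model
sl_tables = sa.Table("sl_tables", metadata, autoload_with=engine)  # new model

# Copy physical tables in a single server-side statement; no rows are ever
# loaded into Python, which is where the bulk of the speedup comes from.
stmt = sl_tables.insert().from_select(
    ["uuid", "database_id", "schema", "name"],
    sa.select([
        tables.c.uuid,
        tables.c.database_id,
        tables.c.schema,
        tables.c.table_name,
    ]).where(tables.c.sql.is_(None)),  # physical tables have no SQL expression
)
engine.execute(stmt)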

Compared to #19406, this PR cuts the migration time for our Superset instance from 7 hours down to 1 hour while retaining full functionality.

While working on the migration, I also noticed a couple of shortcomings in the original implementation, including problems in the shadow-writing logic. Most notably:

  1. Some long text columns should use MediumText for MySQL
  2. Columns are not automatically deleted when they are unlinked from a table or dataset (e.g. when you modify or remove a column from a dataset)
  3. Letting a physical table and a dataset point to the same columns could be problematic, as we wouldn't want a user who updates column properties in a dataset to automatically propagate that change to the related table. Tables and datasets may have different ownerships, and different users may create different physical datasets for different tables. One of the original goals of SIP-68 was to change the mapping between physical datasets and database tables/views to an n:1 relationship. That only makes sense if column definitions in each dataset are independent as well.
  4. A dataset does not know which database it belongs to unless joined with a table. However, a dataset does not always have an associated table---for example, a virtual dataset like "SELECT 1;" does not pull data from any table, and table name extraction for a complex SQL statement could fail (especially when it contains Jinja syntax).

Instead of creating multiple db migrations to address these issues, it'd be much easier for end users if we just skip the original migration (dropping data for those who already migrated) and bundle everything into another migration:

  1. Datasets and columns now reuse uuids from the old tables. This makes filling up the association tables easier (see the sketch after this list). It also makes it possible to import a Superset export from an old Superset instance that hasn't upgraded to the new models, even after the old models have been removed.
  2. created_on, changed_on, created_by, and changed_by are also copied over when appropriate.
  3. In MySQL, some text columns (expression, description, etc) are now MediumText. This is consistent with current column types in the old tables.
  4. Since old and new entities can now be linked via uuid, sqlatable_id on NewDataset is removed. However, NewTable has no 1:1 counterpart in the old models, so NewTable rows acquire new uuids. In order to link NewTable with its associated columns, sqlatable_id is added to NewTable during the db migration, then dropped once the migration is complete.
  5. Refactored shadow-writing hooks to reduce duplicate code and more appropriately handle column deletion.
  6. Removed column fetching for tables. We used to fetch table metadata (column types, etc.) from datasources during db migration, as well as during shadow writing, but this carries a major risk---external datasources can be unreliably slow. Users could fail the migration, or fail to save edits to a datasource, if the datasource is temporarily down. We should not let this unreliable process block the critical path of basic functionality. I added a Table.sync_columns method, but in the future this should be consolidated with get_virtual_table_metadata and get_physical_table_metadata.
  7. dataset.owners are also added (adopting PR "feat(sip-68)(wip): Add owners to Dataset Model" #19487).
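
Because uuids carry over (item 1), association rows can be derived with plain joins instead of Python-side bookkeeping. A sketch of filling sl_dataset_columns this way, again with illustrative names and a placeholder URL rather than the exact migration statements:

import sqlalchemy as sa

engine = sa.create_engine("postgresql://localhost/superset")  # placeholder URL
metadata = sa.MetaData()
old_tables = sa.Table("tables", metadata, autoload_with=engine)
old_columns = sa.Table("table_columns", metadata, autoload_with=engine)
new_datasets = sa.Table("sl_datasets", metadata, autoload_with=engine)
new_columns = sa.Table("sl_columns", metadata, autoload_with=engine)
dataset_columns = sa.Table("sl_dataset_columns", metadata, autoload_with=engine)

stmt = dataset_columns.insert().from_select(
    ["dataset_id", "column_id"],
    sa.select([new_datasets.c.id, new_columns.c.id]).select_from(
        old_columns
        # old column -> new column via the shared uuid
        .join(new_columns, new_columns.c.uuid == old_columns.c.uuid)
        # old column -> its SqlaTable -> new dataset, again via uuid
        .join(old_tables, old_tables.c.id == old_columns.c.table_id)
        .join(new_datasets, new_datasets.c.uuid == old_tables.c.uuid)
    ),
)
engine.execute(stmt)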

Known issues and next steps

  1. The table columns for NewTable are only copied from the dataset when it's being created. We'd need another mechanism to sync table columns, either during re-population after fetch_metadata or in an async script. Since the Table model is not used anywhere yet, I think it's okay to leave this to a followup PR.
  2. I added a sync_columns method for tables. It's not currently in use and does more or less the same as SqlaTable.fetch_metadata. I think it's okay to keep this somewhat-duplicate, not-in-active-use method, as SqlaTable will be removed eventually.
  3. In this PR, NewTable.columns will be updated only when a physical SqlaTable is first synced to the new models and a NewTable is created. Updating columns in a SqlaTable dataset will NOT update NewTable.columns, as we consider table columns and dataset columns to be separate; metadata syncing should only happen in one direction---from tables to datasets, not the other way around.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF


TESTING INSTRUCTIONS

To test the migration

superset db downgrade 2ed890b36b94 && SHOW_PROGRESS=1 superset db upgrade a9422eeaae74

A couple of queries to verify the correctness of the migrated data:

select
  sl_tables.name as table_name,
  sl_datasets.name as dataset_name,
  sl_datasets.expression
from sl_datasets
  join sl_dataset_tables on sl_datasets.id = sl_dataset_tables.dataset_id
  join sl_tables on sl_tables.id = sl_dataset_tables.table_id
limit 5;
select id, name, extra_json, changed_by_fk
from sl_columns
where
  changed_by_fk is not null
  and extra_json like '%verbose_name%'
limit 10;
select
  c.id, c.name, c.expression, c.uuid, c.created_on
from sl_columns as c join sl_dataset_columns a
  on a.column_id = c.id
where a.dataset_id = <your id>
order by created_on desc limit 10;

ADDITIONAL INFORMATION

@github-actions (Contributor)
⚠️ @ktmud Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

1 similar comment

@ktmud force-pushed the new-dataset-model-db-migration-alt branch from 2cb286c to b739ca4 on March 31, 2022 08:03
@ktmud force-pushed the new-dataset-model-db-migration-alt branch from b739ca4 to 1e99938 on March 31, 2022 10:35
@ktmud marked this pull request as ready for review on March 31, 2022 18:08
@ktmud requested a review from a team as a code owner on March 31, 2022 18:08
@ktmud force-pushed the new-dataset-model-db-migration-alt branch 3 times, most recently from 182726c to 8b0cdda on March 31, 2022 19:00
@@ -586,7 +586,7 @@ class BaseColumn(AuditMixinNullable, ImportExportMixin):
     type = Column(Text)
     groupby = Column(Boolean, default=True)
     filterable = Column(Boolean, default=True)
-    description = Column(Text)
+    description = Column(MediumText())
@ktmud (Member, Author):

MediumText is the current type for these fields; they were updated in db migrations at some point. Updating here for consistency.

@@ -130,6 +131,7 @@
     "sum",
     "doubleSum",
 }
+ADDITIVE_METRIC_TYPES_LOWER = {op.lower() for op in ADDITIVE_METRIC_TYPES}
@ktmud (Member, Author):

metric_type.lower() is compared with "doubleSum", which will never match. Not sure if the original casing is used elsewhere, so I added a new variable instead of lowercasing the original set.
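
The mismatch, reduced to a minimal example (the set below is abridged):

ADDITIVE_METRIC_TYPES = {
    "count",
    "sum",
    "doubleSum",
}
ADDITIVE_METRIC_TYPES_LOWER = {op.lower() for op in ADDITIVE_METRIC_TYPES}

# A metric_type of "doubleSum" (or "DOUBLESUM") can never match the
# mixed-case entry once lowercased:
assert "doubleSum".lower() not in ADDITIVE_METRIC_TYPES   # "doublesum" != "doubleSum"
assert "doubleSum".lower() in ADDITIVE_METRIC_TYPES_LOWER  # fixed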

"""
Return all the dependencies from a SQL statement.
"""
dialect = "generic"
for dialect, sqla_dialects in sqloxide_dialects.items():
if sqla_dialect in sqla_dialects:
break
sql_text = RE_JINJA_BLOCK.sub(" ", sql_text)
sql_text = RE_JINJA_VAR.sub("abc", sql_text)
@ktmud (Member, Author):

Interpolate Jinja vars to give sqloxide a higher chance of successfully parsing the SQL text.
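
For illustration, hypothetical approximations of the two patterns (the real RE_JINJA_BLOCK and RE_JINJA_VAR in superset/sql_parse.py may differ):

import re

RE_JINJA_BLOCK = re.compile(r"\{[%#][^{}%#]*[%#]\}")  # {% ... %} and {# ... #}
RE_JINJA_VAR = re.compile(r"\{\{[^{}]+\}\}")          # {{ ... }}

sql_text = "SELECT * FROM {{ tbl }} WHERE ds = '{{ ds }}' {% if x %}LIMIT 10{% endif %}"
sql_text = RE_JINJA_BLOCK.sub(" ", sql_text)  # control blocks carry no table names: blank them
sql_text = RE_JINJA_VAR.sub("abc", sql_text)  # variables become a parseable placeholder
print(sql_text)  # SELECT * FROM abc WHERE ds = 'abc'  LIMIT 10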


description = sa.Column(MediumText())
warning_text = sa.Column(MediumText())
unit = sa.Column(sa.Text)
@ktmud (Member, Author) commented Mar 31, 2022:

@betodealmeida unit is in superset.columns.models but not in the migration script. Should I keep it?



def upgrade():
# Create tables for the new models.
op.create_table(
@ktmud (Member, Author) commented Mar 31, 2022:

The manual specification of these create_table commands is not needed anymore. Tables are now created with Base.metadata.create_all(bind=bind, tables=new_tables).
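
A sketch of the replacement pattern, assuming `Base` is the declarative base that the new sl_* models are registered on (table names illustrative):

from alembic import op

def upgrade():
    bind = op.get_bind()
    # `Base` is assumed here; in Superset the models' shared declarative
    # base plays this role.
    new_tables = [
        Base.metadata.tables[name]
        for name in ("sl_tables", "sl_columns", "sl_datasets")
    ]
    # One call derives CREATE TABLE (plus indexes and constraints) straight
    # from the model definitions, so the migration cannot drift from the models.
    Base.metadata.create_all(bind=bind, tables=new_tables)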

@@ -522,7 +524,7 @@ class SqlaTable(Model, BaseDatasource):  # pylint: disable=too-many-public-methods
         foreign_keys=[database_id],
     )
     schema = Column(String(255))
-    sql = Column(Text)
+    sql = Column(MediumText())
@ktmud (Member, Author) commented Mar 31, 2022:

I found out that some columns need MediumText only after noticing that SQL parsing was failing because some SQL statements were cut off when copied to the new table.
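
On MySQL, TEXT caps at 64 KB while MEDIUMTEXT holds up to 16 MB, which is why long statements were being cut off. A sketch of the corresponding widening in a migration (table/column names illustrative):

import sqlalchemy as sa
from alembic import op
from sqlalchemy.dialects.mysql import MEDIUMTEXT

# Widen the truncation-prone column on MySQL; Superset's MediumText helper
# specializes to MEDIUMTEXT on MySQL and behaves like plain TEXT elsewhere.
op.alter_column("tables", "sql", existing_type=sa.Text(), type_=MEDIUMTEXT())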

@ktmud force-pushed the new-dataset-model-db-migration-alt branch 2 times, most recently from 43ba9bb to f52a737 on March 31, 2022 19:33
@codecov (bot) commented Mar 31, 2022

Codecov Report

Merging #19421 (7be2ad2) into master (3663a33) will decrease coverage by 0.06%.
The diff coverage is 83.57%.

❗ Current head 7be2ad2 differs from pull request most recent head d8ada89. Consider uploading reports for the commit d8ada89 to get more accurate results

@@            Coverage Diff             @@
##           master   #19421      +/-   ##
==========================================
- Coverage   66.51%   66.44%   -0.07%     
==========================================
  Files        1690     1689       -1     
  Lines       64616    64738     +122     
  Branches     6656     6649       -7     
==========================================
+ Hits        42978    43016      +38     
- Misses      19937    20019      +82     
- Partials     1701     1703       +2     
Flag Coverage Δ
hive 52.87% <74.78%> (+0.17%) ⬆️
mysql 81.88% <83.57%> (-0.07%) ⬇️
postgres 81.93% <83.57%> (-0.07%) ⬇️
presto ?
python 82.24% <83.57%> (-0.19%) ⬇️
sqlite 81.70% <83.57%> (-0.07%) ⬇️
unit ?

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
superset/migrations/shared/utils.py 36.17% <32.35%> (-39.39%) ⬇️
superset/tables/models.py 67.53% <57.62%> (-32.47%) ⬇️
superset/connectors/sqla/utils.py 91.25% <88.88%> (+0.23%) ⬆️
superset/columns/models.py 96.15% <92.85%> (-3.85%) ⬇️
superset/datasets/models.py 96.42% <93.33%> (-3.58%) ⬇️
superset/sql_parse.py 97.07% <94.59%> (-0.31%) ⬇️
superset/connectors/sqla/models.py 89.03% <98.66%> (-0.43%) ⬇️
superset/connectors/base/models.py 88.65% <100.00%> (ø)
superset/examples/birth_names.py 71.29% <100.00%> (+0.26%) ⬆️
superset/models/core.py 89.09% <100.00%> (+0.02%) ⬆️
... and 17 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ktmud force-pushed the new-dataset-model-db-migration-alt branch 2 times, most recently from 3a1e9c3 to 05d39a1 on March 31, 2022 22:47
@@ -30,9 +30,7 @@

import sqlalchemy as sa
from alembic import op
from sqlalchemy import and_, func, or_
from sqlalchemy.dialects import postgresql
from sqlalchemy.sql.schema import Table
@ktmud (Member, Author):

Bycatch: clean up unused imports.

)
inspector = inspect(engine)

# add missing tables
@ktmud (Member, Author) commented Mar 31, 2022:

This logic of syncing table schemas from datasources is removed. It should live in a separate offline or async process. Currently users have to hit the "Sync columns" button in the datasource editor to trigger it.

@ktmud force-pushed the new-dataset-model-db-migration-alt branch from 05d39a1 to ad6e167 on March 31, 2022 23:07
@ktmud force-pushed the new-dataset-model-db-migration-alt branch from 194d572 to 38f34de on April 1, 2022 16:04
@eschutho (Member) left a comment:

Per the comment above, I would recommend pulling anything that changes the db structure without affecting performance out into a new migration.

@ktmud (Member, Author) commented Apr 1, 2022

@eschutho I propose to change the current migration to a no-op and move my updated code to a new migration.

I DM'ed @betodealmeida and @hughhhh earlier on Slack. Reposting the messages here for visibility:


Hi, I noticed we are making more adjustments to SIP-68 models and have prepared a couple more db migrations. I'm wondering whether we should bundle all these migrations (including the first one that's already merged) into one new migration and change the original migration to a no-op.

Pros:

  • Reduced total migration time: bundling everything should be faster than running the migrations separately
  • We get a chance to fix a couple more errors, such as using MediumText for MySQL and the incorrect additive_metric_types matching
  • We get a chance to copy over other missing data such as changed_on and last-updated metadata
  • We can re-ID the copied entities to follow the original ones, making it easier to spot-check potential data inconsistency bugs down the road
  • Everyone's db is in a clean and consistent state
  • It's easier to review the db structure in the future

Cons:

  • Those who already ran the migration and bore the slowness may have to experience it again

Happy to incorporate #19487 and #19425 into my PR if they are still needed.

Btw, I think the Dataset model may need a database_id column as well. There is an implicit assumption that a dataset can only run on one database; I cannot imagine a case where we'd need to support a virtual dataset being used on tables in different databases. Having a direct link to the database makes sure existing virtual datasets can be linked to the correct database without relying on the unreliable table name extraction process. Currently, if table name extraction fails, a virtual dataset loses its association with the correct table, and hence its only link to a database; recovering the correct database id would require joining SqlaTable via sqlatable_id.

@ktmud force-pushed the new-dataset-model-db-migration-alt branch from c9a121a to cc7168b on April 1, 2022 18:53
@ktmud force-pushed the new-dataset-model-db-migration-alt branch 3 times, most recently from bc5892e to f9d49dd on April 19, 2022 01:38
@betodealmeida (Member) left a comment:

This is a lot of work, @ktmud! Thanks for bringing this to the next level and making sure Superset works at scale!

I left a few comments; there are a few places where I think you need to wrap the call to NewTable.load_or_create in a list.

Review threads: superset/connectors/sqla/models.py (×5), superset/sql_parse.py, superset/tables/models.py (×2)
@ktmud force-pushed the new-dataset-model-db-migration-alt branch from f9d49dd to d8ada89 on April 20, 2022 00:44
@ktmud force-pushed the new-dataset-model-db-migration-alt branch from d8ada89 to 24b8670 on April 20, 2022 00:56
@ktmud (Member, Author) commented Apr 20, 2022

@betodealmeida I think I've addressed all your comments (and also changed how physical columns for tables are added during migrations). Can you take another look?

@betodealmeida (Member) left a comment:

This looks great, Jesse!

@ktmud ktmud merged commit 231716c into apache:master Apr 20, 2022
@cemremengu (Contributor) commented Apr 20, 2022

I get the following error when applying this migration. Even if I update the column to default to False, it still does not work (which is probably expected). Using Postgres 13.6.

I don't know alembic well, but I opened PR #19786, which I believe will resolve the problem. Feel free to adjust it to your liking.

INFO  [alembic.runtime.migration] Running upgrade ad07e4fdbaba -> a9422eeaae74, new_dataset_models_take_2
>> Copy 12 physical tables to sl_tables...
>> Copy 41 SqlaTable to sl_datasets...
   Copy dataset owners...
   Link physical datasets with tables...
>> Copy 2,326 table columns to sl_columns...
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.NotNullViolation: null value in column "is_temporal" of relation "sl_columns" violates not-null constraint
DETAIL:  Failing row contains (368c1f15-f08d-440b-b21d-b113e1a4e42d, 2021-12-03 08:09:13.776918, 2021-12-03 21:20:46.402427, 27, null, f, f, t, t, t, f, f, t, null, f, default_serial_num, STRING, null, , null, null, null, {"warning_markdown":null}, 1, 1).
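
The failing row carries a NULL is_temporal, and a column default alone cannot help because an explicitly selected NULL bypasses server defaults in INSERT ... SELECT. One way to sidestep this class of failure (a hypothetical sketch, not necessarily what #19786 does) is to coalesce inside the copy:

import sqlalchemy as sa

engine = sa.create_engine("postgresql://localhost/superset")  # placeholder URL
metadata = sa.MetaData()
table_columns = sa.Table("table_columns", metadata, autoload_with=engine)

# NULL-safe source expression for sl_columns.is_temporal: old rows may carry
# NULL is_dttm, and INSERT ... SELECT would insert that NULL verbatim.
is_temporal = sa.func.coalesce(table_columns.c.is_dttm, sa.false()).label("is_temporal")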

@ktmud (Member, Author) commented Apr 20, 2022

@cemremengu thanks for reporting this. I'll work on a fix.

@cemremengu (Contributor) commented:

There is another one, @ktmud; not sure about the fix yet:

>> Run postprocessing on 41 datasets
   Process dataset 1~41...


  File "<string>", line 8, in run_migrations
  File "/usr/local/lib/python3.8/site-packages/alembic/runtime/environment.py", line 813, in run_migrations
    self.get_context().run_migrations(**kw)
  File "/usr/local/lib/python3.8/site-packages/alembic/runtime/migration.py", line 561, in run_migrations
    step.migration_fn(**kw)
  File "/app/superset/migrations/versions/a9422eeaae74_new_dataset_models_take_2.py", line 887, in upgrade
    postprocess_datasets(session)
  File "/app/superset/migrations/versions/a9422eeaae74_new_dataset_models_take_2.py", line 581, in postprocess_datasets
    quoted_expression = get_identifier_quoter(drivername)(expression)
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/sql/compiler.py", line 3700, in quote
    if self._requires_quotes(ident):
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/sql/compiler.py", line 3613, in _requires_quotes
    or value[0] in self.illegal_initial_characters
IndexError: string index out of range

@ktmud (Member, Author) commented Apr 20, 2022

@cemremengu does changing L580 to

            if is_physical and drivername and expression:

fix the error? It should not happen, but it seems some of your physical datasets somehow have empty strings as the table name/expression.
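
The crash can be reproduced without any Superset code (SQLAlchemy 1.x): the identifier preparer inspects value[0], so quoting an empty table name or expression fails.

from sqlalchemy.dialects import postgresql

quote = postgresql.dialect().identifier_preparer.quote
print(quote("my_table"))  # my_table (quoted only when required)
quote("")                 # IndexError: string index out of range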

@cemremengu (Contributor) commented Apr 20, 2022

I think it is because my connection string is in the form postgresql+psycopg2://, so sqlalchemy is failing to find an identifier quoter?

Should it be like .split("://")[0].split('+')[0] ?

Here are the rows in my sl_datasets table (if that is the right one), everything seems to be alright:

(screenshot of sl_datasets rows)

@ktmud (Member, Author) commented Apr 20, 2022

No, if it were a connection string error, you wouldn't even reach line 3613 of sqlalchemy/sql/compiler.py. Are these all the records in your sl_datasets table?

@cemremengu (Contributor) commented Apr 20, 2022

Yes, I removed the + part and still got the error unfortunately.

Are these all the records in your sl_datasets table?

Yes, but it is strange since I currently have 41 datasets. Is there any chance that I did one of the previous migrations incorrectly?

EDIT: It says it copied all 41 though:

>> Copy 12 physical tables to sl_tables...
>> Copy 41 SqlaTable to sl_datasets...
   Copy dataset owners...
   Link physical datasets with tables...
>> Copy 2,326 table columns to sl_columns...
   Link all columns to sl_datasets...
>> Copy 41 metrics to sl_columns...
   Link metric columns to datasets...
>> Run postprocessing on 2,367 columns
   [Column 1 to 2,367] 2,354 may be updated
   Assign table column relations...
>> Run postprocessing on 41 datasets
   Process dataset 1~41...

@ktmud (Member, Author) commented Apr 20, 2022

@cemremengu That's very strange. Can you downgrade and rerun the whole migration?

superset db downgrade ad07e4fdbaba && SHOW_PROGRESS=1 superset db upgrade a9422eeaae74

@cemremengu (Contributor) commented:

Downgrading combined with if is_physical and drivername and expression: fixed the issue. Thanks so much!

PS: I had to manually update nextval for the sl_* table sequences, but that might be due to my previous attempts.

hughhhh pushed a commit to hve-labs/superset that referenced this pull request May 11, 2022
@john-bodley john-bodley deleted the new-dataset-model-db-migration-alt branch June 8, 2022 05:32
philipher29 pushed a commit to ValtechMobility/superset that referenced this pull request Jun 9, 2022
@mistercrunch added the 🏷️ bot and 🚢 2.0.0 labels and removed the 🚢 2.0.1 label on Mar 13, 2024
Labels: 🏷️ bot, risk:db-migration, size/XXL, 🚢 2.0.0