Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: upgrade_catalog_perms and downgrade_catalog_perms implementation #29860

Merged

Conversation

michael-s-molina
Copy link
Member

@michael-s-molina michael-s-molina commented Aug 5, 2024

SUMMARY

This PR fixes the implementation of upgrade_catalog_perms and downgrade_catalog_perms methods to handle tables with millions of rows. To achieve this, I changed the algorithm to execute the migrations using a batched approach invoking the update/delete commands directly using a subquery instead of changing the records one by one. It also commits the transaction on every batch to avoid keeping a long standing transaction which leads to memory and timeout issues. Finally, I added detailed logging to help system administrators to track the migration progress and prepare for downtime.

Fixes #29801

TESTING INSTRUCTIONS

1 - Run any migration that uses the changed commands
2 - Check that both upgrades and downgrades work

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@michael-s-molina michael-s-molina requested a review from a team as a code owner August 5, 2024 17:44
@github-actions github-actions bot added the risk:db-migration PRs that require a DB migration label Aug 5, 2024
@dosubot dosubot bot added logging Creates a UI or API endpoint that could benefit from logging. risk:refactor High risk as it involves large refactoring work labels Aug 5, 2024
Copy link

codecov bot commented Aug 5, 2024

Codecov Report

Attention: Patch coverage is 87.83784% with 9 lines in your changes missing coverage. Please review.

Project coverage is 83.69%. Comparing base (76d897e) to head (adc0996).
Report is 592 commits behind head on master.

Files Patch % Lines
superset/migrations/shared/catalogs.py 87.83% 9 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #29860       +/-   ##
===========================================
+ Coverage   60.48%   83.69%   +23.20%     
===========================================
  Files        1931      527     -1404     
  Lines       76236    38088    -38148     
  Branches     8568        0     -8568     
===========================================
- Hits        46114    31878    -14236     
+ Misses      28017     6210    -21807     
+ Partials     2105        0     -2105     
Flag Coverage Δ
hive 48.97% <0.00%> (-0.19%) ⬇️
javascript ?
mysql 76.71% <0.00%> (?)
postgres 76.78% <0.00%> (?)
presto 53.51% <0.00%> (-0.29%) ⬇️
python 83.69% <87.83%> (+20.20%) ⬆️
sqlite 76.27% <0.00%> (?)
unit 60.37% <87.83%> (+2.75%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

.execution_options(synchronize_session=False)
)
# Commit the transaction after each batch
session.commit()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had also suggested making transactions smaller on migrations, but the limitation that I've been told is that if the second batch fails, we won't be able to roll back the first batch. Do you have an idea how we can do this so that it can roll back gracefully?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, rolling back means setting the column to None, which can be done independently of previous failures.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eschutho Answering your question more generically, there are some possible strategies to revert batch migrations such as logging changes to be reverted, making migrations reversible by storing the original value in a dedicated column (we do that for chart migrations) or using database checkpoints.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, would the other "possible strategies" involve another custom script to get back to the pre-migration state? I'm just concerned that for most people just running an upgrade that fails on a batch other than the first batch would get them in a state where they could have some columns with a catalog value, and some without. @betodealmeida may be able to answer if this would be problematic, knowing how the codebase uses these values.

Copy link
Member Author

@michael-s-molina michael-s-molina Aug 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eschutho I was able to remove the commit statement after verifying that we could hold/rollback a transaction with an appropriate block size.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok great. 🙏

@michael-s-molina
Copy link
Member Author

I'll also need to fix this part which generates a query for each dataset, which in Airbnb's case is more than 25,000 queries.

# update `schema_perm` and `catalog_perm` for tables and charts
for table in session.query(SqlaTable).filter_by(
    database_id=database.id,
    catalog=None,
):
    schema_perm = security_manager.get_schema_perm(
        database.database_name,
        default_catalog,
        table.schema,
    )

    table.catalog = default_catalog
    table.catalog_perm = catalog_perm
    table.schema_perm = schema_perm

    for chart in session.query(Slice).filter_by(
        datasource_id=table.id,
        datasource_type="table",
    ):
        chart.catalog_perm = catalog_perm
        chart.schema_perm = schema_perm

@michael-s-molina michael-s-molina marked this pull request as draft August 5, 2024 20:48
Copy link
Member

@betodealmeida betodealmeida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thanks @michael-s-molina! I was worried about performance when I introduced the feature, but it's hard to test things at the AirBnB scale.

superset/migrations/shared/catalogs.py Outdated Show resolved Hide resolved
Copy link
Member

@betodealmeida betodealmeida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, approved too early.

@betodealmeida
Copy link
Member

I'll also need to fix this part which generates a query for each dataset, which in Airbnb's case is more than 25,000 queries.

# update `schema_perm` and `catalog_perm` for tables and charts
for table in session.query(SqlaTable).filter_by(
    database_id=database.id,
    catalog=None,
):
    schema_perm = security_manager.get_schema_perm(
        database.database_name,
        default_catalog,
        table.schema,
    )

    table.catalog = default_catalog
    table.catalog_perm = catalog_perm
    table.schema_perm = schema_perm

    for chart in session.query(Slice).filter_by(
        datasource_id=table.id,
        datasource_type="table",
    ):
        chart.catalog_perm = catalog_perm
        chart.schema_perm = schema_perm

You might be able to rewrite this as a single query for some dialects:

UPDATE tables
SET
  catalog=${default_catalog},
  catalog_perm=${catalog_perm},
  schema_perm=REGEXP_REPLACE(schema_perm, '\[([^\]]+)\]\.\[([^\]]+)\]', '[\1].[${default_catalog}].[\2]', 'g')
WHERE
  database_id=${database_id} AND
  catalog IS NULL
RETURNING id;

For charts you can then use the returned IDs from the query above to build a similar UPDATE query.

@michael-s-molina
Copy link
Member Author

You might be able to rewrite this as a single query for some dialects:

@betodealmeida Given that the number of rows for tables and charts will hardly pass 1 million, I opted for preserving a lot of the previous logic and eliminating the part where we queried the slices table for each dataset. Now we query for all slices of type table for each database and I use a map to get the appropriate permissions. This will significantly reduce the number of queries to the number of databases.

@michael-s-molina michael-s-molina marked this pull request as ready for review August 6, 2024 15:35
@dosubot dosubot bot added the change:backend Requires changing the backend label Aug 6, 2024
@sadpandajoe
Copy link
Member

@supersetbot label 4.1

@github-actions github-actions bot added the v4.1 Label added by the release manager to track PRs to be included in the 4.1 branch label Aug 6, 2024
@michael-s-molina
Copy link
Member Author

@betodealmeida I found another problem with how we were dealing with catalog and schema permissions that come from the security manager. Looking at the interface definition, they could be None but we were not handling these cases which resulted in errors when inserting to ab_view_menu because of NULL names. I fixed the problem here but I don't have much context about these permissions so please take a look.

@sadpandajoe
Copy link
Member

@betodealmeida @eschutho any other concerns about this pr? I think @michael-s-molina has addressed both your comments.

@betodealmeida
Copy link
Member

@betodealmeida I found another problem with how we were dealing with catalog and schema permissions that come from the security manager. Looking at the interface definition, they could be None but we were not handling these cases which resulted in errors when inserting to ab_view_menu because of NULL names. I fixed the problem here but I don't have much context about these permissions so please take a look.

Ah, good point! This looks correct.

@michael-s-molina
Copy link
Member Author

@betodealmeida Can you approve or remove the request for changes?

Copy link
Member

@eschutho eschutho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Member

@betodealmeida betodealmeida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
change:backend Requires changing the backend logging Creates a UI or API endpoint that could benefit from logging. risk:db-migration PRs that require a DB migration risk:refactor High risk as it involves large refactoring work size/L v4.1 Label added by the release manager to track PRs to be included in the 4.1 branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

upgrade_catalog_perms implementation is problematic for tables with a lot of data
4 participants