feat: Apply row limit transform to query in backend #1461

Open · wants to merge 4 commits into master

Conversation

@baumandm (Contributor) commented Jun 17, 2024

This PR addresses the issue where users create and test a DataDoc manually with automatic row limits applied, but when they schedule the DataDoc, it runs without those limits.

If the Query Engine has the (Experimental) Enable Row Limit feature enabled, all row limit transforms are now performed in the backend, for both adhoc and scheduled queries. If this feature is disabled, nothing changes.

Summary of changes:

  • Creating a new DataDoc cell now initializes it in the database with the default row limit (previously unset)
  • POST /query_execution now accepts an optional row_limit param and applies a limit transform if configured; this is provided for adhoc executions from either the adhoc editor or a DataDoc cell
  • The celery task run_datadoc now looks up the DataDoc cell metadata, and applies a limit transform if configured
  • The Run All feature also uses the run_datadoc task, so it behaves the same as scheduled DataDocs
  • The frontend no longer applies the row limit transform on the client side, but it still does a limit check to enable the "Your SELECT query is unbounded" popup; this feature continues to work as before, allowing the user to confirm or cancel the execution. The original query is sent to /query_execution along with the row_limit
  • Added an admin_logic.get_engine_feature_param() method to make it easier to get a single param

I considered a number of different approaches for this PR, and picked this approach because it required the fewest, simplest changes, but I can refactor if needed. Happy to discuss!
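
To make the adhoc flow concrete, here is a minimal sketch of what the frontend request could look like under this change; the createQueryExecution helper, the exact route prefix, and the payload field names are illustrative assumptions rather than the actual Querybook code.

```typescript
// Minimal sketch (assumed names, not actual Querybook code): the client sends
// the untransformed query plus the configured row limit, and the backend applies
// the LIMIT transform only when the engine's row-limit feature is enabled.
async function createQueryExecution(
    query: string,
    engineId: number,
    rowLimit: number | null
): Promise<number> {
    const response = await fetch('/query_execution/', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
            query, // original query; no client-side limit transform anymore
            engine_id: engineId,
            row_limit: rowLimit, // optional; ignored when the feature is disabled
        }),
    });
    if (!response.ok) {
        throw new Error(`Query execution request failed: ${response.status}`);
    }
    const { data } = await response.json();
    return data.id; // id of the newly created query execution
}
```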

rowLimit,
engine
);
await checkUnlimitedQuery(sampledQuery, rowLimit, engine);
Collaborator:

What about still keeping the old logic here, but only replacing transformLimitedQuery with the backend API, like sampledQuery does? You probably thought about this; what is the concern?

Btw, I think we should keep the logic consistent for both the sampling and the limited query.

Contributor Author:

I had two thoughts:

  1. There would be three API calls required to prep a query before sending it to be executed, which seemed like too much. I was thinking about creating a single unified endpoint that combined templating/sampling/limiting, but that was an even bigger change
  2. Scheduled DataDocs don't go through that flow, so it doesn't make it simpler; it just adds a new endpoint purely for adhoc executions

That said, I'm fine with refactoring to use a new endpoint for this. If it returned a flag for unlimited queries, then we can remove even more frontend code.

Collaborator:

You raised a good point. I was also thinking of merging all the transforms into one single API endpoint; maybe we could try that route?

Contributor Author:

I combined sampling/limiting into POST /query/transform/, but left the individual endpoints—sampling preview uses /query/transform/sampling/, and there may be some use for a standalone limited transform as well.
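
For illustration, a hedged sketch of how the frontend might call the combined endpoint; only the paths /query/transform/ and /query/transform/sampling/ come from the comment above, while the request and response field names (including the unbounded flag floated earlier in this thread) are assumptions.

```typescript
// Hedged sketch of the combined transform endpoint; the response shape is
// assumed. `unbounded` reflects the earlier idea of returning a flag for
// unlimited queries so the remaining frontend check could be removed.
interface TransformQueryResponse {
    query: string; // query with sampling/limit transforms applied
    unbounded?: boolean; // hypothetical flag for the unbounded-SELECT popup
}

async function transformQuery(
    query: string,
    engineId: number,
    rowLimit: number | null
): Promise<TransformQueryResponse> {
    const response = await fetch('/query/transform/', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ query, engine_id: engineId, row_limit: rowLimit }),
    });
    const { data } = await response.json();
    return data as TransformQueryResponse;
}
```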

}

if (rowLimit != null && rowLimit >= 0) {
return getLimitedQuery(query, rowLimit, engine.language);
Collaborator:

We may delete the function getLimitedQuery, which is no longer needed.

@jczhong84 (Collaborator):

@baumandm thanks for the PR and all the updates! lgtm.

Just one more question: with this change, will all the scheduled docs automatically have the limit enabled, since most of the query cells have the default limit? Don't want to surprise the users.

@baumandm (Contributor Author):

Good question; that was something I thought about as well. Previously, DataDoc cells didn't store the limit value in the meta column unless the user manually changed it from the default. So after deploying this change, any scheduled DataDoc cells where the user has manually adjusted the row limit will automatically start being limited. But any existing cells that have never had the limit changed will continue being unlimited (when scheduled).

This PR includes a change so that new DataDoc cells will be initialized with the default row limit from the frontend, to avoid this kind of situation in the future.

If this change is something you are worried about, the only thing I can think of would be to run a SQL command to unset every DataCell's meta.limit field and let users adjust as needed. But that seems pretty dramatic and could have negative consequences in the other direction.

For us, since it only applies to SELECT queries, the impact should be low and the potential improvements are high, so we are just planning to communicate the change rather than try to mitigate it.

@jczhong84 (Collaborator):

How about we add a flag in the doc schedule modal, so people can explicitly choose to enable or disable query limiting or sampling? By default it would be disabled.

@baumandm (Contributor Author):

Does this sound correct:

  1. Schedule modal contains a toggle to apply query transforms (sampling, limiting); see the sketch below
  2. Toggle defaults to disabled for all existing schedules (where it's not configured)
  3. Toggle defaults to disabled for new schedules (or should this be enabled?)
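
Purely as an interpretation of the three points above, a rough sketch of how the schedule config could carry such a toggle; the DataDocScheduleConfig name and the applyTransforms field are hypothetical.

```typescript
// Hypothetical shape of a DataDoc schedule config with the proposed toggle.
// Schedules that lack the field (all existing ones) behave as disabled,
// matching points 2 and 3 above.
interface DataDocScheduleConfig {
    docId: number;
    cron: string;
    // Proposed flag: apply query transforms (sampling, limiting) when the
    // scheduled run executes the doc's cells. Undefined behaves like false.
    applyTransforms?: boolean;
}

function shouldApplyTransforms(config: DataDocScheduleConfig): boolean {
    return config.applyTransforms ?? false;
}
```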

@jczhong84 (Collaborator):

That sounds perfectly right! For the 3rd one, I don't have a strong opinion; we may start with disabled by default.
