Support optimizer hints for select queries #260

Open
KTONQ opened this issue Apr 21, 2023 · 46 comments

@KTONQ (Contributor) commented Apr 21, 2023

Hi, would it be okay to contribute a change to support optimizer hints for select queries?

@mohrezaei (Collaborator)

We've tried optimizer hints before and found that they're just as likely to hurt performance as to help. There are several reasons why optimizer hints are not a good idea.

  • The query path depends on the data distribution. Data distribution is not a constant. It changes both in time and in space (by space, I mean if you use Reladomo's source attribute feature -- same schema, different data). An optimizer hint that's good today might be very bad tomorrow.
  • Only the database can know the data distribution. If the database's data distribution is out of sync with the actual data, it's better to schedule jobs to sync those instead.
  • If you're having trouble with a particular query, it's better to use the tools the db provides to solve the issue: add, change, or remove (yes, remove) indices. Try to arrange columns in an index in selectivity order, etc. This strategy will improve your db schema, instead of ignoring it.
  • Some optimizers are happier when the query is expressed in different forms, e.g. in-clauses are usually much better than or-clauses. Try logical variations on the query to see what works best (see the sketch below).

Even some database vendors consider hints to be an anti-pattern.
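
To make the last point concrete, a minimal sketch (the table, column, and values here are illustrative assumptions, not from the original discussion):

-- An OR-chain over the same column, which some optimizers plan poorly:
SELECT * FROM item t0 WHERE t0.cp_seq = 'a' OR t0.cp_seq = 'b' OR t0.cp_seq = 'c';

-- The logically equivalent IN form, which is usually friendlier to the optimizer:
SELECT * FROM item t0 WHERE t0.cp_seq IN ('a', 'b', 'c');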

@KTONQ (Contributor, Author) commented Apr 25, 2023

Agreed that it is better to trust the database optimizer to produce a good execution plan for most queries, but not for all of them; at least not for the query we are facing.

The query path depends on the data distribution. Data distribution is not a constant. It changes both in time and in space (by space, I mean if you use Reladomo's source attribute feature -- same schema different data). An optimizer hint that's good today might be very bad tomorrow.

I can hardly agree with this point. Whatever hints we humans add to the query are based not only on the data as of now, but also on predictions of how the data will change in the future, which the DB can never know.

Only the database can know the data distribution. If the database's data distribution is out of sync with the actual data, it's better to schedule jobs to sync those instead.

The correct index was chosen after we analyzed the table manually, but sadly the execution plan changed back to the wrong index within a few hours.

If you're having trouble with a particular query, it's better to use the tools the db provides to solve the issue: add, change or remove (yes, remove) indices. Try to arrange columns in an index in selectivility order, etc. This strategy will improve your db schema, instead of ignoring it.

Sadly, the wrong index being used is the primary key, which cannot be removed or changed.
This is not the first time we have faced this issue: we changed the ORDER BY clause half a year ago, after which the correct index was used, but the plan later flipped back again.

Some optimiziers are happier when the query is expressed in different forms, e.g. in-clauses are usually much better than or-clauses. Try logical variations on the query to see what works best.

Here is the query: very simple, one table, without any OR clause or IN clause.
We could not find any solution other than optimizer hints.

The primary key is on (item_seq, processing_date_to);
an index on (cp_seq, processing_date_to, item_seq) is preferred but is not being chosen.

SELECT * FROM item t0 WHERE t0.cp_seq = 'xxx' AND t0.processing_date_to = '9999-12-01 23:59:00.0' AND t0.item_seq >= 'yyy' ORDER BY t0.item_seq LIMIT 1002;
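
For reference, a minimal DDL sketch of the schema as described above (the index name is a hypothetical placeholder; only the column lists come from the description):

-- Primary key as described: (item_seq, processing_date_to)
ALTER TABLE item ADD PRIMARY KEY (item_seq, processing_date_to);

-- Preferred index as described: (cp_seq, processing_date_to, item_seq)
CREATE UNIQUE INDEX item_idx_cp_pd_itemseq ON item (cp_seq, processing_date_to, item_seq);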

@mohrezaei (Collaborator)

Given this query, I would also choose the (item_seq, processing_date_to) index. With a small limit and a range query on item_seq, a small number of reads from that index, followed by the table lookup, will end the query.

The other index is rather bizarre. (item_seq, processing_date_to) is already unique. Throwing in cp_seq at the start, which by its very nature must have worse cardinality than item_seq, creates a classically bad index that serves no purpose other than to confuse the optimizer.

The advice I gave you was very specific: "arrange columns in an index in selectivity order". For this triplet, that's either (item_seq, cp_seq, processing_date_to) or (item_seq, processing_date_to, cp_seq). And it must be marked as unique (it's actually over-qualified to be unique).
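
A minimal DDL sketch of those two orderings (index names are hypothetical; one would create one of these, not both):

CREATE UNIQUE INDEX item_idx_itemseq_cp_pd ON item (item_seq, cp_seq, processing_date_to);
CREATE UNIQUE INDEX item_idx_itemseq_pd_cp ON item (item_seq, processing_date_to, cp_seq);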

@KTONQ (Contributor, Author) commented Apr 25, 2023

Considering the ORDER BY clause, in the worst case (item_seq = 'yyy' located at the first row, and the last record satisfying the condition located at the end), any index starting with item_seq will cause a full index scan.

@mohrezaei (Collaborator)

For this query, an index like (item_seq, cp_seq, processing_date_to) is called a covering index. There is no reason to look at the table data unless a match is found in the index. Have you tried it?

From your description, it sounds like cp_seq has some rare values. From the limit clause, it's also clear that cp_seq has some very common values. For the latter, an index starting with cp_seq is a bad idea. If the above is true, a little bit of application-level logic could help here: keep a count per cp_seq (in memory via a group-by query, in a different table, or just by knowing the magic values, etc.). Then issue the query from the application in either the rare-cp_seq form (no clause on item_seq) or the common-cp_seq form (the same query you have).

P.S. Are you using the database as a queue? There is no fix for that particular anti-pattern, other than to simply stop and use the right tool for that instead.

@KTONQ (Contributor, Author) commented Apr 25, 2023

The purpose of this query is to fetch all item data for a specified cp_seq, with pagination.
cp_seq, which is the parent of item, keeps increasing, and we don't know how many items each will have.
The particular cp_seq that caused the performance issue owns more than 1 million items.

@mohrezaei (Collaborator)

If that's the case, the most appropriate index is: unique (item_seq, processing_date_to, cp_seq).

If you're not willing to try that, there is no point in continuing this conversation.

we don't know how many items they will have.

Yes, you do. It's called a count/group by query.

@KTONQ (Contributor, Author) commented Apr 26, 2023

Please refer to the test results below; a hint was used to force the index selection.

case 1: as you suggested
unique item_seq, processing_date_to, cp_seq
execution time: 36s 485ms

case 2:
primary key: item_seq, processing_date_to
execution time: 15s 482 ms

case 3:
unique cp_seq, processing_date_to, item_seq
execution time: 33 ms

@mohrezaei (Collaborator)

Are you sure you're not running into caching effects here? Did you execute them in that order? What happens if you execute them in a different order, or multiple times?

@KTONQ (Contributor, Author) commented Apr 26, 2023

case 1: as you suggested
unique item_seq, processing_date_to, cp_seq
[2023-04-26 23:36:21] 56 rows retrieved starting from 1 in 39 s 915 ms (execution: 39 s 807 ms, fetching: 108 ms)
[2023-04-26 23:37:07] 56 rows retrieved starting from 1 in 38 s 988 ms (execution: 38 s 926 ms, fetching: 62 ms)
[2023-04-26 23:38:04] 56 rows retrieved starting from 1 in 39 s 921 ms (execution: 39 s 888 ms, fetching: 33 ms)
[2023-04-26 23:38:50] 56 rows retrieved starting from 1 in 38 s 463 ms (execution: 38 s 435 ms, fetching: 28 ms)
[2023-04-26 23:39:40] 56 rows retrieved starting from 1 in 39 s 669 ms (execution: 39 s 632 ms, fetching: 37 ms)

case 3:
unique cp_seq, processing_date_to, item_seq
[2023-04-26 23:40:08] 56 rows retrieved starting from 1 in 76 ms (execution: 46 ms, fetching: 30 ms)
[2023-04-26 23:40:16] 56 rows retrieved starting from 1 in 50 ms (execution: 27 ms, fetching: 23 ms)
[2023-04-26 23:40:21] 56 rows retrieved starting from 1 in 49 ms (execution: 28 ms, fetching: 21 ms)
[2023-04-26 23:40:26] 56 rows retrieved starting from 1 in 44 ms (execution: 22 ms, fetching: 22 ms)
[2023-04-26 23:40:31] 56 rows retrieved starting from 1 in 46 ms (execution: 21 ms, fetching: 25 ms)

case 2:
primary key: item_seq, processing_date_to
[2023-04-26 23:41:06] 56 rows retrieved starting from 1 in 18 s 844 ms (execution: 18 s 800 ms, fetching: 44 ms)
[2023-04-26 23:42:46] 56 rows retrieved starting from 1 in 18 s 205 ms (execution: 18 s 175 ms, fetching: 30 ms)
[2023-04-26 23:43:12] 56 rows retrieved starting from 1 in 18 s 778 ms (execution: 18 s 751 ms, fetching: 27 ms)
[2023-04-26 23:43:39] 56 rows retrieved starting from 1 in 17 s 967 ms (execution: 17 s 942 ms, fetching: 25 ms)
[2023-04-26 23:44:21] 56 rows retrieved starting from 1 in 18 s 603 ms (execution: 18 s 574 ms, fetching: 29 ms)

@mohrezaei (Collaborator)

So you told me the difficult case was when a cp_seq had millions of rows, for which I suggested an index, and then you chose a query with a result of 56 rows?

I already told you what you should do with the rare cp case: issue the query without conditions on item_seq.

@KTONQ (Contributor, Author) commented Apr 26, 2023

This test was executed in a local environment; the data is different from the real environment, where we cannot simply add an index.

@KTONQ (Contributor, Author) commented Apr 26, 2023

On the other hand, I checked the release notes of later versions of MySQL and found a similar issue, which was resolved and released in 8.0.21.

Here is the bug report:
https://bugs.mysql.com/bug.php?id=97001

Here is the release note:
https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-21.html#mysqld-8-0-21-optimizer

Our DBA prepared two testing environments, one with the version we are using now and one with the latest version (with the new system variable prefer_ordering_index set to off), and imported the same data from the real environment.
But sadly, the wrong index was selected as well.

Obviously, the DB's own optimizer has limitations.
Maybe that is the reason Oracle, MySQL, and MSSQL all support hints.

@mohrezaei (Collaborator) commented Apr 26, 2023

Where you're running the query shouldn't matter. What does matter is running the right query. First, run this query:

SELECT count(1), cp_seq FROM item t0 WHERE t0.processing_date_to = '9999-12-01 23:59:00.0' group by t0.cp_seq

Then, based on that, either issue the query with or without the item_seq conditions. If you do that, you shouldn't have to force any indices.

@KTONQ (Contributor, Author) commented Apr 26, 2023

I already told you what you should do with the rare cp case: issue the query without conditions on item_seq.

Data keeps changing every minute, and our system has been running for 3 years.
It does not matter how many items each cp_seq owns; what is required is to ensure that all cases finish within 250 ms.
That is why I am saying:

cp_seq, which is the parent of item, keeps increasing, and we don't know how many items each will have.

Or you can simply assume the item count of each cp_seq will be between 0 and 10% of the total records.

@mohrezaei (Collaborator)

The data from the group-by is an estimate. You don't have to compute it with every query. Just do it once an hour in the background.

If the estimate is low (< 10K or not found), issue the query without item_seq. If it's high (> 10K), issue it with item_seq.
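
Sketched with the queries already shown in this thread (the 10K threshold is from the comment above; where the cached estimate lives in the application is an assumption):

-- Refresh once an hour in the background and cache the counts per cp_seq:
SELECT count(1), cp_seq FROM item t0 WHERE t0.processing_date_to = '9999-12-01 23:59:00.0' GROUP BY t0.cp_seq;

-- If the cached estimate for 'xxx' is low (< 10K or not found), omit the item_seq clause:
SELECT * FROM item t0 WHERE t0.cp_seq = 'xxx' AND t0.processing_date_to = '9999-12-01 23:59:00.0';

-- If it is high (> 10K), keep the item_seq range and the pagination:
SELECT * FROM item t0 WHERE t0.cp_seq = 'xxx' AND t0.processing_date_to = '9999-12-01 23:59:00.0' AND t0.item_seq >= 'yyy' ORDER BY t0.item_seq LIMIT 1002;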

@KTONQ (Contributor, Author) commented Apr 26, 2023

The data from the group by is an estimate. You don't have to do it with every query. Just do it once an hour in the background.

If the estimate is low (< 10K or not found), issue without item_seq. If it's high (> 10K), issue with item_seq.

This is implementing an optimizer ourselves, which is far more costly than using a simple hint clause, if you would kindly support one.

@mohrezaei (Collaborator)

How is it costly? If it takes you more than 50 lines of code to do this, you're doing it wrong. It uses correct data science to issue proper queries, now and forever, instead of an irrational belief that a human can predict data distributions years into the future.

@KTONQ (Contributor, Author) commented Apr 26, 2023

Sorry, is there any evidence showing that the solution you shared uses correct data science and will achieve higher performance?

The fact is, in the real environment the item count is more than 1 million, and in the local environment it is 56.
Both results show that the index below performs better.

unique cp_seq, processing_date_to, item_seq

@mohrezaei (Collaborator)

Did you follow what I said? In this particular case, you have to:

In your production environment, add a unique (item_seq, processing_date_to, cp_seq) index (or perhaps unique (item_seq, cp_seq, processing_date_to)), then run the query with the item_seq clauses.

In your test environment, run the query without the item_seq clauses.

@KTONQ (Contributor, Author) commented Apr 26, 2023

What do you mean by:

then run the query with item_seq clauses.

@mohrezaei (Collaborator)

with:

SELECT * FROM item t0 WHERE t0.cp_seq = 'xxx' AND t0.processing_date_to = '9999-12-01 23:59:00.0' AND t0.item_seq >= 'yyy' ORDER BY t0.item_seq LIMIT 1002;

without:

SELECT * FROM item t0 WHERE t0.cp_seq = 'xxx' AND t0.processing_date_to = '9999-12-01 23:59:00.0' 

@KTONQ (Contributor, Author) commented Apr 26, 2023

For those cp_seqs which own fewer than 10k items, you mean we have to change the specification of our system to improve performance, right?

What about the performance for more than 10k?
How do you know the optimizer will select the index you shared and not others? And how do you know it will perform well?

@KTONQ (Contributor, Author) commented Apr 26, 2023

Fortunately, there are some cp_seqs which own more than 1 million items in the local environment.
(I am using IntelliJ, which retrieves only 500 rows.)

case 1: as you suggested
unique item_seq, processing_date_to, cp_seq
[2023-04-27 01:07:42] 500 rows retrieved starting from 1 in 6 s 43 ms (execution: 2 s 34 ms, fetching: 4 s 9 ms)

case 3:
unique cp_seq, processing_date_to, item_seq
[2023-04-27 01:08:09] 500 rows retrieved starting from 1 in 162 ms (execution: 98 ms, fetching: 64 ms)

case 2:
primary key: item_seq, processing_date_to
[2023-04-27 01:08:23] 500 rows retrieved starting from 1 in 1 s 321 ms (execution: 483 ms, fetching: 838 ms)

@KTONQ (Contributor, Author) commented Apr 26, 2023

And the primary key is selected if the hint forcing the index is removed.

@KTONQ (Contributor, Author) commented Apr 26, 2023

Hence, we still need your help to support hints, in order to force the use of the index you suggested.

@mohrezaei (Collaborator)

Something isn't making sense. What's the output of SHOW INDEX FROM item?

@KTONQ (Contributor, Author) commented Apr 27, 2023

There is no other index available for this query.
To address your concern, I removed all other indices and executed the query again.

case 1: as you suggested
unique item_seq, processing_date_to, cp_seq
[2023-04-27 11:29:14] 500 rows retrieved starting from 1 in 7 s 349 ms (execution: 2 s 387 ms, fetching: 4 s 962 ms)

case 2:
primary key: item_seq, processing_date_to
[2023-04-27 11:28:55] 500 rows retrieved starting from 1 in 1 s 453 ms (execution: 528 ms, fetching: 925 ms)

case 3:
unique cp_seq, processing_date_to, item_seq
[2023-04-27 11:29:21] 500 rows retrieved starting from 1 in 100 ms (execution: 54 ms, fetching: 46 ms)

@mohrezaei (Collaborator) commented Apr 27, 2023

Try the following:
unique item_seq, cp_seq, processing_date_to.

Also, issue this query:

SELECT count(1), cp_seq, min(item_seq) FROM item t0 WHERE t0.processing_date_to = '9999-12-01 23:59:00.0' group by t0.cp_seq

then use the min value from the above query in your original query:

SELECT * FROM item t0 WHERE t0.cp_seq = 'xxx' AND t0.processing_date_to = '9999-12-01 23:59:00.0' AND t0.item_seq >= min_val_from_above ORDER BY t0.item_seq LIMIT 1002;

@KTONQ (Contributor, Author) commented Apr 27, 2023

case 1: as you suggested
unique item_seq, processing_date_to, cp_seq
[2023-04-27 11:29:14] 500 rows retrieved starting from 1 in 7 s 349 ms (execution: 2 s 387 ms, fetching: 4 s 962 ms)

case 2:
primary key: item_seq, processing_date_to
[2023-04-27 11:28:55] 500 rows retrieved starting from 1 in 1 s 453 ms (execution: 528 ms, fetching: 925 ms)

case 3:
unique cp_seq, processing_date_to, item_seq
[2023-04-27 11:29:21] 500 rows retrieved starting from 1 in 100 ms (execution: 54 ms, fetching: 46 ms)

Since the parameters used in the above execution exactly satisfy your condition, the same parameters were executed with the new index you suggested.

case 4: as you suggested
unique item_seq, cp_seq, processing_date_to
[2023-04-27 13:20:14] 500 rows retrieved starting from 1 in 7 s 446 ms (execution: 2 s 458 ms, fetching: 4 s 988 ms)

By the way, the primary key is still being selected by MySQL after the new index was created.

@KTONQ (Contributor, Author) commented May 8, 2023

On the other hand, I reported the issue to MySQL, and it has been verified.
https://bugs.mysql.com/bug.php?id=110848

But I don't think it will be resolved, since there are more than 600 verified bugs on the optimizer, and some of them were reported more than 10 years ago.

Hence, a hint seems to be the optimal solution; we are still looking forward to your support.

@mohrezaei (Collaborator)

I'm waiting on the results of this:

Also, issue this query:

SELECT count(1), cp_seq, min(item_seq) FROM item t0 WHERE t0.processing_date_to = '9999-12-01 23:59:00.0' group by t0.cp_seq

then use the min value from the above query in your original query:

SELECT * FROM item t0 WHERE t0.cp_seq = 'xxx' AND t0.processing_date_to = '9999-12-01 23:59:00.0' AND t0.item_seq >= min_val_from_above ORDER BY t0.item_seq LIMIT 1002;

@KTONQ (Contributor, Author) commented May 8, 2023

I have already tried what you suggested, and here is the result.
#260 (comment)
Is there anything missing?

@mohrezaei (Collaborator)

There are too many things labeled "case x" in this conversation. Can you clarify what exact queries you ran?

@KTONQ (Contributor, Author) commented May 8, 2023

As mentioned earlier, a hint was used to force the index used in the query.
#260 (comment)

Here is the query; the index name is the only difference in the USE INDEX clause.

SELECT * FROM item t0 USE INDEX (item_idx_itemseq_cp2) WHERE t0.cp_seq = 'xxx' AND t0.processing_date_to = '9999-12-01 23:59:00.0' AND t0.item_seq >= 'yyy' ORDER BY t0.item_seq LIMIT 1002;
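
(For completeness: MySQL also offers FORCE INDEX, which is stricter than USE INDEX in that a table scan is used only if there is no way to use the named index. A sketch of the same query with that variant, reusing the index name above:

SELECT * FROM item t0 FORCE INDEX (item_idx_itemseq_cp2) WHERE t0.cp_seq = 'xxx' AND t0.processing_date_to = '9999-12-01 23:59:00.0' AND t0.item_seq >= 'yyy' ORDER BY t0.item_seq LIMIT 1002;)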

@mohrezaei (Collaborator)

What was the value for 'yyy'?

@KTONQ (Contributor, Author) commented May 8, 2023

Can you please explain why the value matters?

@mohrezaei (Collaborator)

The point of #260 (comment) was to use values that are specific to each cp_seq using the two queries listed. It's not clear you did that.

@KTONQ (Contributor, Author) commented May 8, 2023

Any item_seq may be used in the query; the data keeps changing.

@mohrezaei (Collaborator)

So, just so we're clear, you did not follow the instructions provided in #260 (comment)?

@KTONQ (Contributor, Author) commented May 8, 2023

I don't understand. Do you mean that we are facing a performance issue with the parameters, for example 'xxx' and 'yyy', and you are telling us to use parameters 'aaa' and 'bbb' to resolve it?

@mohrezaei (Collaborator)

No, I'm trying to see if there is a better way to construct the query.

Let's take a step back: where does the query value ('yyy') come from?

@KTONQ (Contributor, Author) commented May 8, 2023

As you mentioned, it is the first item_seq of the target cp_seq, from:
SELECT count(1), cp_seq, min(item_seq) FROM item t0 WHERE t0.processing_date_to = '9999-12-01 23:59:00.0' group by t0.cp_seq

@KTONQ (Contributor, Author) commented May 8, 2023

If you have a MySQL environment and would like to tune the query, please refer to the bug details; there is a simple way to generate test data and reproduce the bug.

It will also give you a picture of the data distribution.

https://bugs.mysql.com/bug.php?id=110848

@mohrezaei (Collaborator)

So in your production code, you run the aggregate query before running the normal query? And for the test results here, you've done the same?

@KTONQ (Contributor, Author) commented May 9, 2023

The parameters come from the client; they are not decided by us.
