
[fix](Nereids) lock table when generate distribute plan (#38950) #39037

Merged
merged 2 commits into from
Aug 7, 2024


924060929
Contributor

cherry pick from #38950

We should lock the table while generating the distributed plan, because an
insert overwrite issued by an async materialized view drops partitions
concurrently, and the query thread can then throw this exception:
```
java.lang.RuntimeException: Cannot invoke "org.apache.doris.catalog.Partition.getBaseIndex()" because "partition" is null
    at org.apache.doris.nereids.util.Utils.execWithUncheckedException(Utils.java:76) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.nereids.glue.translator.PhysicalPlanTranslator.translatePlan(PhysicalPlanTranslator.java:278) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.nereids.NereidsPlanner.splitFragments(NereidsPlanner.java:341) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.nereids.NereidsPlanner.distribute(NereidsPlanner.java:400) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.nereids.NereidsPlanner.plan(NereidsPlanner.java:147) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.qe.StmtExecutor.executeByNereids(StmtExecutor.java:796) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.qe.StmtExecutor.execute(StmtExecutor.java:605) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.qe.StmtExecutor.queryRetry(StmtExecutor.java:558) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.qe.StmtExecutor.execute(StmtExecutor.java:548) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.qe.ConnectProcessor.executeQuery(ConnectProcessor.java:385) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.qe.ConnectProcessor.handleQuery(ConnectProcessor.java:237) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.qe.MysqlConnectProcessor.handleQuery(MysqlConnectProcessor.java:260) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.qe.MysqlConnectProcessor.dispatch(MysqlConnectProcessor.java:288) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.qe.MysqlConnectProcessor.processOnce(MysqlConnectProcessor.java:342) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.mysql.ReadListener.lambda$handleEvent$0(ReadListener.java:52) ~[doris-fe.jar:1.2-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
    at java.lang.Thread.run(Thread.java:833) ~[?:?]
Caused by: java.lang.NullPointerException: Cannot invoke "org.apache.doris.catalog.Partition.getBaseIndex()" because "partition" is null
    at org.apache.doris.planner.OlapScanNode.mockRowCountInStatistic(OlapScanNode.java:589) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.planner.OlapScanNode.finalizeForNereids(OlapScanNode.java:1733) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.nereids.util.Utils.execWithUncheckedException(Utils.java:74) ~[doris-fe.jar:1.2-SNAPSHOT]
    ... 17 more
2024-07-29 00:46:17,608 WARN (mysql-nio-pool-114|201) Analyze failed. stmt[210035, 49d3041004ba4b6a-b07fe4491d03c5de]
org.apache.doris.common.NereidsException: errCode = 2, detailMessage = Cannot invoke "org.apache.doris.catalog.Partition.getBaseIndex()" because "partition" is null
    at org.apache.doris.qe.StmtExecutor.executeByNereids(StmtExecutor.java:803) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.qe.StmtExecutor.execute(StmtExecutor.java:605) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.qe.StmtExecutor.queryRetry(StmtExecutor.java:558) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.qe.StmtExecutor.execute(StmtExecutor.java:548) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.qe.ConnectProcessor.executeQuery(ConnectProcessor.java:385) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.qe.ConnectProcessor.handleQuery(ConnectProcessor.java:237) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.qe.MysqlConnectProcessor.handleQuery(MysqlConnectProcessor.java:260) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.qe.MysqlConnectProcessor.dispatch(MysqlConnectProcessor.java:288) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.qe.MysqlConnectProcessor.processOnce(MysqlConnectProcessor.java:342) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.mysql.ReadListener.lambda$handleEvent$0(ReadListener.java:52) ~[doris-fe.jar:1.2-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
    at java.lang.Thread.run(Thread.java:833) ~[?:?]
```

This exception is too hard to reproduce, so I cannot write a test case.
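
For readers following along, the race described above can be illustrated with a minimal, self-contained sketch. It is not the Doris code path: `SketchTable`, `dropPartition`, and `plan` are hypothetical names invented for this illustration, and the 200 ms wait only stands in for the time between partition selection and distributed-plan generation. The assumption is that the partition drop takes the table's write lock, so a planner holding the read lock across both phases never sees a selected partition disappear.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class PlanLockSketch {
    /** Hypothetical stand-in for a catalog table; not the Doris OlapTable API. */
    static class SketchTable {
        final Map<Long, String> partitions = new ConcurrentHashMap<>();
        final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

        /** Models the partition drop done by an insert overwrite (takes the write lock). */
        void dropPartition(long id) {
            lock.writeLock().lock();
            try {
                partitions.remove(id);
            } finally {
                lock.writeLock().unlock();
            }
        }
    }

    static SketchTable newTable() {
        SketchTable t = new SketchTable();
        t.partitions.put(1L, "p1");
        t.partitions.put(2L, "p2");
        return t;
    }

    /** Selects partition ids, lets a concurrent drop run, then resolves the ids. */
    static List<String> plan(SketchTable table, boolean holdReadLock) {
        if (holdReadLock) {
            table.lock.readLock().lock();
        }
        try {
            // Phase 1: remember which partitions the query will scan.
            List<Long> selected = new ArrayList<>(table.partitions.keySet());

            // A concurrent writer tries to drop every selected partition.
            Thread dropper = new Thread(() -> selected.forEach(table::dropPartition));
            dropper.start();
            try {
                dropper.join(200); // returns early if the drop finished, else after 200 ms
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }

            // Phase 2: resolve ids to partitions while building the distributed plan.
            List<String> resolved = new ArrayList<>();
            for (Long id : selected) {
                resolved.add(table.partitions.get(id)); // null if the partition was dropped
            }
            return resolved;
        } finally {
            if (holdReadLock) {
                table.lock.readLock().unlock();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("without lock: " + plan(newTable(), false));
        System.out.println("with read lock: " + plan(newTable(), true));
    }
}
```

Without the read lock, the dropper finishes between the two phases and the lookup returns `null`, which is the same shape of failure as the `NullPointerException` in the trace. With the read lock held, the dropper blocks until planning is done and only then removes the partitions.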

(cherry picked from commit 3eb9501)
@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@924060929
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 50388 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit d328dd9079c357c1bdeed2b368459d29271b5391, data reload: false

------ Round 1 ----------------------------------
q1	18981	4429	4538	4429
q2	2072	160	152	152
q3	10253	1947	1943	1943
q4	10300	1271	1348	1271
q5	8456	3928	3938	3928
q6	236	129	152	129
q7	2100	1722	1698	1698
q8	9598	2791	2774	2774
q9	11484	10692	10337	10337
q10	8747	3529	3511	3511
q11	423	249	253	249
q12	472	315	318	315
q13	18619	3981	4046	3981
q14	361	337	322	322
q15	512	457	461	457
q16	674	573	572	572
q17	1137	977	977	977
q18	7353	6931	6902	6902
q19	1774	1656	1626	1626
q20	564	327	300	300
q21	4481	4118	4078	4078
q22	545	437	457	437
Total cold run time: 119142 ms
Total hot run time: 50388 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4373	4394	4342	4342
q2	321	221	227	221
q3	4169	4126	4153	4126
q4	2772	2743	2740	2740
q5	7164	7110	7098	7098
q6	243	124	121	121
q7	3276	2856	2907	2856
q8	4382	4498	4487	4487
q9	16745	16839	16819	16819
q10	4249	4297	4313	4297
q11	786	683	691	683
q12	1049	865	854	854
q13	6561	3761	3736	3736
q14	455	428	413	413
q15	511	467	445	445
q16	747	690	676	676
q17	3854	3858	3882	3858
q18	8938	8776	8820	8776
q19	1719	1742	1646	1646
q20	2372	2177	2110	2110
q21	8603	8641	8589	8589
q22	1042	933	980	933
Total cold run time: 84331 ms
Total hot run time: 79826 ms

@doris-robot

TPC-DS: Total hot run time: 204278 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit d328dd9079c357c1bdeed2b368459d29271b5391, data reload: false

query1	934	390	420	390
query2	6518	2900	2677	2677
query3	6924	217	204	204
query4	21498	18074	17861	17861
query5	19733	6521	6559	6521
query6	286	220	230	220
query7	4173	307	332	307
query8	433	463	442	442
query9	3113	2654	2593	2593
query10	419	294	301	294
query11	11325	10828	10690	10690
query12	120	77	76	76
query13	5607	717	700	700
query14	17822	13214	13486	13214
query15	368	246	255	246
query16	6453	287	259	259
query17	1729	1429	876	876
query18	2305	420	429	420
query19	215	154	152	152
query20	82	81	78	78
query21	193	106	95	95
query22	5162	5006	5006	5006
query23	32553	31956	32317	31956
query24	6939	6569	6556	6556
query25	524	435	441	435
query26	537	165	166	165
query27	1866	302	304	302
query28	6095	2355	2319	2319
query29	3023	2737	2822	2737
query30	247	170	164	164
query31	917	731	752	731
query32	75	53	60	53
query33	409	276	260	260
query34	851	483	480	480
query35	1117	920	938	920
query36	1289	1135	1055	1055
query37	92	58	64	58
query38	3098	2929	2972	2929
query39	1387	1325	1310	1310
query40	215	100	96	96
query41	47	45	45	45
query42	84	86	84	84
query43	779	662	729	662
query44	1147	729	723	723
query45	245	244	237	237
query46	1237	950	961	950
query47	1952	1731	1789	1731
query48	1015	730	753	730
query49	623	387	383	383
query50	860	635	630	630
query51	4776	4649	4665	4649
query52	93	86	86	86
query53	460	325	324	324
query54	2643	2429	2431	2429
query55	87	81	87	81
query56	240	233	216	216
query57	1288	1192	1095	1095
query58	226	195	193	193
query59	4101	4150	3933	3933
query60	215	210	200	200
query61	101	94	95	94
query62	837	549	463	463
query63	490	352	346	346
query64	2459	1528	1496	1496
query65	3616	3582	3616	3582
query66	839	387	377	377
query67	15736	15705	15783	15705
query68	8856	657	657	657
query69	581	355	352	352
query70	1675	1449	1379	1379
query71	419	318	317	317
query72	6624	3549	3510	3510
query73	734	318	316	316
query74	6278	5918	5819	5819
query75	5276	3598	3720	3598
query76	5408	1167	1204	1167
query77	896	259	261	259
query78	12486	11848	12252	11848
query79	9095	651	640	640
query80	1398	399	417	399
query81	494	234	240	234
query82	1128	97	103	97
query83	169	138	130	130
query84	270	73	72	72
query85	894	345	348	345
query86	328	289	302	289
query87	3234	3008	3028	3008
query88	4824	2311	2326	2311
query89	384	279	276	276
query90	1935	211	205	205
query91	186	144	155	144
query92	65	55	54	54
query93	5525	577	593	577
query94	717	216	211	211
query95	1148	1069	1080	1069
query96	641	331	327	327
query97	6467	6334	6369	6334
query98	193	168	176	168
query99	2913	879	875	875
Total cold run time: 314083 ms
Total hot run time: 204278 ms

@doris-robot

ClickBench: Total hot run time: 30.88 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit d328dd9079c357c1bdeed2b368459d29271b5391, data reload: false

query1	0.02	0.02	0.03
query2	0.07	0.03	0.02
query3	0.24	0.04	0.04
query4	1.80	0.07	0.06
query5	0.53	0.53	0.53
query6	1.24	0.62	0.62
query7	0.01	0.01	0.01
query8	0.03	0.03	0.02
query9	0.53	0.50	0.48
query10	0.55	0.54	0.54
query11	0.12	0.09	0.10
query12	0.11	0.10	0.09
query13	0.62	0.62	0.61
query14	0.78	0.80	0.78
query15	0.78	0.77	0.76
query16	0.37	0.39	0.36
query17	1.02	1.03	1.02
query18	0.23	0.26	0.25
query19	1.92	1.79	1.86
query20	0.02	0.01	0.01
query21	15.47	0.55	0.54
query22	2.01	2.50	1.73
query23	17.30	0.97	0.97
query24	6.32	1.43	0.95
query25	0.38	0.13	0.03
query26	0.67	0.18	0.15
query27	0.04	0.04	0.04
query28	6.49	0.76	0.77
query29	12.64	2.38	2.29
query30	0.59	0.52	0.52
query31	2.82	0.39	0.38
query32	3.37	0.50	0.50
query33	3.15	3.07	3.06
query34	15.25	4.79	4.78
query35	4.86	4.83	4.82
query36	1.03	1.01	1.02
query37	0.06	0.04	0.04
query38	0.03	0.02	0.02
query39	0.02	0.02	0.01
query40	0.16	0.14	0.15
query41	0.07	0.02	0.01
query42	0.02	0.02	0.01
query43	0.02	0.02	0.02
Total cold run time: 103.76 s
Total hot run time: 30.88 s

@doris-robot

Load test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'

Load test result on commit d328dd9079c357c1bdeed2b368459d29271b5391 with default session variables
Stream load json:         20 seconds loaded 2358488459 Bytes, about 112 MB/s
Stream load orc:          58 seconds loaded 1101869774 Bytes, about 18 MB/s
Stream load parquet:      31 seconds loaded 861443392 Bytes, about 26 MB/s
Insert into select:       21.2 seconds inserted 10000000 Rows, about 471K ops/s

@924060929 924060929 merged commit 17b72a6 into apache:branch-2.0 Aug 7, 2024
23 of 25 checks passed
GoGoWen pushed a commit to GoGoWen/incubator-doris that referenced this pull request Aug 27, 2024
apache#39037)
