[fix](load) skip cancel rpc if stub is not initialized #33006

Closed
kaijchen wants to merge 1 commit


kaijchen
Contributor

Proposed changes

Fix coredump in VNodeChannel::cancel

*** Query id: b94d185960dc7990-4cc55561c7aef297 ***
*** tablet id: 0 ***
*** Aborted at 1711443908 (unix time) try "date -d @1711443908" if you are using GNU date ***
*** Current BE git commitID: d75ba6ef4e ***
*** SIGSEGV address not mapped to object (@0x0) received by PID 22808 (TID 25126 OR 0xfff9d72d8010) from PID 0; stack trace: ***
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_release/doris/be/src/common/signal_handler.h:417
 1# os::Linux::chained_handler(int, siginfo_t*, void*) in /data/software/jdk1.8.0_401/jre/lib/aarch64/server/libjvm.so
 2# JVM_handle_linux_signal in /data/software/jdk1.8.0_401/jre/lib/aarch64/server/libjvm.so
 3# signalHandler(int, siginfo_t*, void*) in /data/software/jdk1.8.0_401/jre/lib/aarch64/server/libjvm.so
 4# 0x0000FFFFA4BE066C in 
 5# doris::stream_load::VNodeChannel::cancel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) at /home/zcp/repo_center/doris_release/doris/be/src/vec/sink/vtablet_sink.cpp:871
 6# std::_Function_handler<void (std::shared_ptr<doris::stream_load::VNodeChannel> const&), doris::stream_load::VOlapTableSink::_cancel_all_channel(doris::Status)::$_0>::_M_invoke(std::_Any_data const&, std::shared_ptr<doris::stream_load::VNodeChannel> const&) at /usr/local/bin/ldb-toolchain/bin/../lib/gcc/aarch64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:291
 7# doris::stream_load::VOlapTableSink::_cancel_all_channel(doris::Status) at /home/zcp/repo_center/doris_release/doris/be/src/vec/sink/vtablet_sink.cpp:1467
 8# doris::stream_load::VOlapTableSink::close(doris::RuntimeState*, doris::Status) at /home/zcp/repo_center/doris_release/doris/be/src/vec/sink/vtablet_sink.cpp:1535
 9# doris::PlanFragmentExecutor::close() at /home/zcp/repo_center/doris_release/doris/be/src/runtime/plan_fragment_executor.cpp:522
10# doris::PlanFragmentExecutor::~PlanFragmentExecutor() at /home/zcp/repo_center/doris_release/doris/be/src/runtime/plan_fragment_executor.cpp:95
11# doris::FragmentExecState::~FragmentExecState() at /home/zcp/repo_center/doris_release/doris/be/src/runtime/fragment_mgr.cpp:112
12# std::_Sp_counted_ptr<doris::FragmentExecState*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() at /usr/local/bin/ldb-toolchain/bin/../lib/gcc/aarch64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:348
13# doris::FragmentMgr::exec_plan_fragment(doris::TExecPlanFragmentParams const&, std::function<void (doris::RuntimeState*, doris::Status*)> const&) at /home/zcp/repo_center/doris_release/doris/be/src/runtime/fragment_mgr.cpp:864
14# doris::StreamLoadExecutor::execute_plan_fragment(std::shared_ptr<doris::StreamLoadContext>) at /home/zcp/repo_center/doris_release/doris/be/src/runtime/stream_load/stream_load_executor.cpp:75
15# doris::StreamLoadAction::_process_put(doris::HttpRequest*, std::shared_ptr<doris::StreamLoadContext>) at /home/zcp/repo_center/doris_release/doris/be/src/http/action/stream_load.cpp:620
16# doris::StreamLoadAction::_on_header(doris::HttpRequest*, std::shared_ptr<doris::StreamLoadContext>) at /home/zcp/repo_center/doris_release/doris/be/src/http/action/stream_load.cpp:318
17# doris::StreamLoadAction::on_header(doris::HttpRequest*) at /home/zcp/repo_center/doris_release/doris/be/src/http/action/stream_load.cpp:193
18# doris::EvHttpServer::on_header(evhttp_request*) at /home/zcp/repo_center/doris_release/doris/be/src/http/ev_http_server.cpp:255
19# 0x0000AAAAF5E2D334 in /data/software/doris206/be/lib/doris_be
20# bufferevent_run_readcb_ in /data/software/doris206/be/lib/doris_be
21# 0x0000AAAAF5E31544 in /data/software/doris206/be/lib/doris_be
22# 0x0000AAAAF5E1941C in /data/software/doris206/be/lib/doris_be
23# 0x0000AAAAF5E19B90 in /data/software/doris206/be/lib/doris_be
24# 0x0000AAAAF5E1C0DC in /data/software/doris206/be/lib/doris_be
25# std::_Function_handler<void (), doris::EvHttpServer::start()::$_0>::_M_invoke(std::_Any_data const&) at /usr/local/bin/ldb-toolchain/bin/../lib/gcc/aarch64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:291
26# doris::ThreadPool::dispatch_thread() at /home/zcp/repo_center/doris_release/doris/be/src/util/threadpool.cpp:541
27# doris::Thread::supervise_thread(void*) at /home/zcp/repo_center/doris_release/doris/be/src/util/thread.cpp:499
28# start_thread in /lib64/libpthread.so.0
29# thread_start in /lib64/libc.so.6 

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc.

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@kaijchen
Contributor Author

run buildall

Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TPC-H: Total hot run time: 49553 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 57630578b98b0e5c8467b38d8f481900b481958c, data reload: false

------ Round 1 ----------------------------------
q1	17685	4379	4332	4332
q2	2074	153	145	145
q3	10472	1908	1901	1901
q4	10364	1225	1289	1225
q5	8527	3965	3986	3965
q6	229	121	122	121
q7	2029	1582	1584	1582
q8	9270	2714	2710	2710
q9	10884	10520	10319	10319
q10	8589	3492	3478	3478
q11	418	230	233	230
q12	462	296	297	296
q13	18326	3924	3996	3924
q14	360	327	329	327
q15	507	445	458	445
q16	685	594	586	586
q17	1121	938	949	938
q18	7261	6866	6802	6802
q19	1665	1527	1477	1477
q20	493	295	296	295
q21	4471	4102	4061	4061
q22	495	398	394	394
Total cold run time: 116387 ms
Total hot run time: 49553 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4359	4297	4286	4286
q2	318	223	219	219
q3	4120	4145	4099	4099
q4	2759	2752	2736	2736
q5	7267	7184	7183	7183
q6	238	123	118	118
q7	3249	2904	2850	2850
q8	4368	4477	4479	4477
q9	17093	17017	16986	16986
q10	4241	4224	4265	4224
q11	761	711	683	683
q12	1047	857	858	857
q13	7051	3744	3740	3740
q14	452	420	408	408
q15	499	454	447	447
q16	759	706	702	702
q17	3855	3904	3899	3899
q18	8692	8693	8828	8693
q19	1716	1689	1636	1636
q20	2393	2105	2125	2105
q21	8422	8623	8570	8570
q22	1050	918	926	918
Total cold run time: 84709 ms
Total hot run time: 79836 ms

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 37.81% (8048/21286)
Line Coverage: 29.47% (65730/223020)
Region Coverage: 28.94% (33829/116907)
Branch Coverage: 24.78% (17366/70068)
Coverage Report: http://coverage.selectdb-in.cc/coverage/57630578b98b0e5c8467b38d8f481900b481958c_57630578b98b0e5c8467b38d8f481900b481958c/report/index.html

@doris-robot

TPC-DS: Total hot run time: 200934 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 57630578b98b0e5c8467b38d8f481900b481958c, data reload: false

query1	922	385	389	385
query2	6525	2196	2036	2036
query3	6917	201	201	201
query4	20458	17988	17968	17968
query5	19720	6474	6490	6474
query6	303	228	225	225
query7	4160	300	299	299
query8	260	252	240	240
query9	3184	2757	2673	2673
query10	408	288	330	288
query11	11330	10675	10787	10675
query12	119	81	72	72
query13	5587	650	643	643
query14	17614	13563	13382	13382
query15	359	242	237	237
query16	6456	272	257	257
query17	1749	1437	886	886
query18	2304	407	412	407
query19	214	149	158	149
query20	78	76	75	75
query21	188	104	96	96
query22	5314	5001	4953	4953
query23	32571	32271	31969	31969
query24	6985	6614	6514	6514
query25	529	421	411	411
query26	519	158	157	157
query27	1904	295	299	295
query28	6180	2270	2226	2226
query29	3022	2848	2759	2759
query30	241	165	163	163
query31	895	718	775	718
query32	69	59	60	59
query33	387	253	246	246
query34	852	472	478	472
query35	1122	967	911	911
query36	1227	1086	1395	1086
query37	89	65	58	58
query38	3071	2930	2949	2930
query39	1365	1318	1327	1318
query40	198	92	92	92
query41	34	34	32	32
query42	81	80	87	80
query43	657	588	573	573
query44	1134	723	726	723
query45	244	234	231	231
query46	1227	970	958	958
query47	1777	1706	1751	1706
query48	989	696	677	677
query49	612	375	369	369
query50	883	584	585	584
query51	4730	4677	4699	4677
query52	91	72	76	72
query53	448	320	318	318
query54	2643	2460	2464	2460
query55	81	89	78	78
query56	211	204	212	204
query57	1125	1185	1086	1086
query58	213	209	188	188
query59	3571	3217	3078	3078
query60	213	181	201	181
query61	85	82	85	82
query62	841	473	467	467
query63	469	343	333	333
query64	2552	1501	1454	1454
query65	3626	3508	3544	3508
query66	764	365	366	365
query67	16268	15319	14771	14771
query68	8442	675	665	665
query69	589	342	355	342
query70	1705	1619	1399	1399
query71	399	316	315	315
query72	6541	3454	3414	3414
query73	716	320	309	309
query74	6246	5966	5863	5863
query75	4688	3690	3711	3690
query76	4765	1164	1235	1164
query77	668	258	250	250
query78	12479	12459	11682	11682
query79	11815	666	646	646
query80	816	392	384	384
query81	493	228	240	228
query82	1339	100	97	97
query83	166	138	136	136
query84	260	68	73	68
query85	827	276	277	276
query86	328	328	290	290
query87	3258	3031	2995	2995
query88	5014	2320	2324	2320
query89	452	282	300	282
query90	1967	210	209	209
query91	146	117	117	117
query92	57	50	49	49
query93	6326	615	568	568
query94	786	203	204	203
query95	1092	1053	1077	1053
query96	646	332	328	328
query97	6526	6318	6540	6318
query98	183	175	171	171
query99	2908	965	843	843
Total cold run time: 313607 ms
Total hot run time: 200934 ms

@doris-robot

ClickBench: Total hot run time: 30.78 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 57630578b98b0e5c8467b38d8f481900b481958c, data reload: false

query1	0.02	0.02	0.01
query2	0.08	0.03	0.02
query3	0.25	0.05	0.05
query4	1.81	0.08	0.08
query5	0.54	0.53	0.52
query6	1.32	0.63	0.62
query7	0.02	0.02	0.01
query8	0.03	0.02	0.02
query9	0.51	0.47	0.48
query10	0.54	0.53	0.54
query11	0.11	0.08	0.09
query12	0.11	0.09	0.10
query13	0.63	0.60	0.63
query14	0.78	0.80	0.79
query15	0.77	0.76	0.76
query16	0.36	0.39	0.37
query17	1.02	1.03	0.99
query18	0.21	0.25	0.26
query19	1.88	1.83	1.81
query20	0.02	0.01	0.02
query21	15.49	0.57	0.55
query22	2.29	2.18	1.56
query23	17.14	1.11	0.94
query24	4.07	1.04	1.37
query25	0.34	0.10	0.06
query26	0.52	0.16	0.16
query27	0.05	0.03	0.04
query28	8.72	0.74	0.71
query29	12.67	2.35	2.37
query30	0.51	0.52	0.51
query31	2.80	0.39	0.40
query32	3.38	0.49	0.49
query33	3.05	3.03	3.08
query34	15.24	4.80	4.80
query35	4.84	4.84	4.81
query36	1.06	1.02	1.02
query37	0.06	0.04	0.04
query38	0.04	0.02	0.02
query39	0.02	0.01	0.02
query40	0.16	0.14	0.14
query41	0.07	0.01	0.01
query42	0.02	0.02	0.02
query43	0.03	0.02	0.01
Total cold run time: 103.58 s
Total hot run time: 30.78 s

@doris-robot

Load test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'

Load test result on commit 57630578b98b0e5c8467b38d8f481900b481958c with default session variables
Stream load json:         20 seconds loaded 2358488459 Bytes, about 112 MB/s
Stream load orc:          59 seconds loaded 1101869774 Bytes, about 17 MB/s
Stream load parquet:      32 seconds loaded 861443392 Bytes, about 25 MB/s
Insert into select:       20.2 seconds inserted 10000000 Rows, about 495K ops/s

@@ -852,6 +852,10 @@ void VNodeChannel::cancel(const std::string& cancel_msg) {
// But do we need brpc::StartCancel(call_id)?
_cancel_with_msg(cancel_msg);

if (_stub == nullptr) {
Contributor

@cambyzju commented on Mar 29, 2024

_stub is created in VNodeChannel::init. If _stub is null, init sets _is_closed to true, so cancel should not normally reach this point.

If we cannot find the root cause, please add a comment and a log message here.
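
The invariant described in this comment can be sketched as a small model. This is only an illustration under assumptions, not the Doris code: FakeStub, ToyChannel, and every member name below are hypothetical stand-ins chosen to mirror _stub and _is_closed.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the RPC stub; not a real Doris or brpc type.
struct FakeStub {};

// Toy model of the channel state discussed above. All names are illustrative.
class ToyChannel {
public:
    // Mirrors the described behavior: when no stub can be created,
    // init marks the channel closed and reports failure.
    bool init(bool stub_available) {
        assert(!init_called_ && "init is expected to run at most once");
        init_called_ = true;
        if (stub_available) {
            stub_ = std::make_shared<FakeStub>();
        }
        if (stub_ == nullptr) {
            is_closed_ = true;
            return false;
        }
        return true;
    }

    bool has_stub() const { return stub_ != nullptr; }
    bool is_closed() const { return is_closed_; }

    // The invariant the comment relies on: a null stub implies a closed channel.
    bool null_stub_implies_closed() const { return has_stub() || is_closed(); }

private:
    std::shared_ptr<FakeStub> stub_;
    bool init_called_ = false;
    bool is_closed_ = false;
};
```

In this toy model the invariant holds after init, whether it succeeds or fails. It does not hold for an object on which init was never called, since the stub is null and the closed flag is still false.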

Contributor Author

This was initially fixed in #30915 and later changed in #33006.
There is no additional information in the latest Jira, CIR-8301.

Contributor

I agree with @cambyzju. We should not reach here.

Contributor Author

cancel() can be called while _inited is false; see #34897, which fixes this.
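
A minimal sketch of the guard this thread discusses, assuming a simplified channel: cancel() records the cancellation locally and skips the RPC when the stub was never created. FakeStub, ToyChannel, and the log text are hypothetical and do not match the Doris sources; the real change is the _stub == nullptr check in the diff hunk above.

```cpp
#include <iostream>
#include <memory>
#include <string>

// Hypothetical stand-in for the RPC stub; it only counts cancel calls.
struct FakeStub {
    int cancel_rpcs = 0;
    void send_cancel() { ++cancel_rpcs; }
};

// Toy channel; names are illustrative and do not match the Doris sources.
class ToyChannel {
public:
    void init() {
        stub_ = std::make_shared<FakeStub>();
        inited_ = true;
    }

    // Records the cancellation locally, then sends the RPC only if a stub exists.
    // Returns true when an RPC was actually sent.
    bool cancel(const std::string& msg) {
        cancelled_ = true;
        cancel_msg_ = msg;
        if (stub_ == nullptr) {
            // Reached when cancel() runs before init(); log and skip the RPC.
            std::cerr << "skip cancel rpc: stub not initialized, inited="
                      << std::boolalpha << inited_ << ", msg=" << msg << "\n";
            return false;
        }
        stub_->send_cancel();
        return true;
    }

    bool cancelled() const { return cancelled_; }
    int cancel_rpcs() const { return stub_ ? stub_->cancel_rpcs : 0; }

private:
    std::shared_ptr<FakeStub> stub_;
    bool inited_ = false;
    bool cancelled_ = false;
    std::string cancel_msg_;
};
```

Calling cancel() on a freshly constructed ToyChannel marks it cancelled, writes one log line, and sends nothing. Calling it after init() sends exactly one cancel RPC. Logging in the skipped branch follows the reviewer's request for comments and logs at this point.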

@xiaokang marked this pull request as draft on April 3, 2024 15:21
@kaijchen closed this on Apr 8, 2024
4 participants