Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feature](cloud) Implement hdfs accessor #33059

Merged
merged 1 commit into from
Apr 2, 2024

Conversation

platoneko
Copy link
Contributor

Proposed changes

Implement hdfs accessor

Further comments

If this is a relatively large or complex change, kick off the discussion at [email protected] by explaining why you chose the solution you did and what alternatives you considered, etc...

@doris-robot
Copy link

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

@@ -24,6 +24,7 @@
#include <unordered_map>
#include <unordered_set>

#include "recycler/s3_accessor.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: 'recycler/s3_accessor.h' file not found [clang-diagnostic-error]

#include "recycler/s3_accessor.h"
         ^

// specific language governing permissions and limitations
// under the License.

#include "recycler/hdfs_accessor.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: 'recycler/hdfs_accessor.h' file not found [clang-diagnostic-error]

#include "recycler/hdfs_accessor.h"
         ^


class HDFSBuilder {
public:
~HDFSBuilder() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: use '= default' to define a trivial destructor [modernize-use-equals-default]

    ~HDFSBuilder() {
    ^

HDFSBuilder() = default;

// returns 0 for success otherwise error
int init(const HdfsBuildConf& conf) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: method 'init' can be made static [readability-convert-member-functions-to-static]

Suggested change
int init(const HdfsBuildConf& conf) {
static int init(const HdfsBuildConf& conf) {

}

// returns 0 for success otherwise error
int init_hdfs_builder() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: method 'init_hdfs_builder' can be made static [readability-convert-member-functions-to-static]

Suggested change
int init_hdfs_builder() {
static int init_hdfs_builder() {

}

// returns 0 for success otherwise error
int check_krb_params(std::string_view hdfs_kerberos_principal,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: method 'check_krb_params' can be made static [readability-convert-member-functions-to-static]

Suggested change
int check_krb_params(std::string_view hdfs_kerberos_principal,
static int check_krb_params(std::string_view hdfs_kerberos_principal,

#ifdef USE_HADOOP_HDFS
#include <hadoop_hdfs/hdfs.h> // IWYU pragma: export
#else
#include <hdfs/hdfs.h> // IWYU pragma: export
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: 'hdfs/hdfs.h' file not found [clang-diagnostic-error]

#include <hdfs/hdfs.h> // IWYU pragma: export
         ^

#include <string>
#include <vector>

#include "recycler/obj_store_accessor.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: 'recycler/obj_store_accessor.h' file not found [clang-diagnostic-error]

#include "recycler/obj_store_accessor.h"
         ^

@platoneko platoneko force-pushed the hdfs-accessor branch 2 times, most recently from 5088dc9 to 2e53a91 Compare March 31, 2024 15:21
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

// specific language governing permissions and limitations
// under the License.

#include "recycler/hdfs_accessor.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: 'recycler/hdfs_accessor.h' file not found [clang-diagnostic-error]

#include "recycler/hdfs_accessor.h"
         ^

@platoneko platoneko marked this pull request as ready for review April 1, 2024 08:46
Copy link
Contributor

github-actions bot commented Apr 1, 2024

sh-checker report

To get the full details, please check in the job output.

shellcheck errors

'shellcheck ' returned error 1 finding the following syntactical issues:

----------

In cloud/script/run_all_tests.sh line 97:
echo "CLASSPATH=$CLASSPATH"
                ^--------^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
echo "CLASSPATH=${CLASSPATH}"


In cloud/script/run_all_tests.sh line 100:
    DORIS_JAVA_HOME=${JAVA_HOME}
                    ^----------^ SC2154 (warning): JAVA_HOME is referenced but not assigned.


In cloud/script/run_all_tests.sh line 103:
echo "DORIS_JAVA_HOME=$DORIS_JAVA_HOME"
                      ^--------------^ SC2250 (style): Prefer putting braces around variable references even when not strictly required.

Did you mean: 
echo "DORIS_JAVA_HOME=${DORIS_JAVA_HOME}"

For more information:
  https://www.shellcheck.net/wiki/SC2154 -- JAVA_HOME is referenced but not a...
  https://www.shellcheck.net/wiki/SC2250 -- Prefer putting braces around vari...
----------

You can address the above issues in one of three ways:
1. Manually correct the issue in the offending shell script;
2. Disable specific issues by adding the comment:
  # shellcheck disable=NNNN
above the line that contains the issue, where NNNN is the error code;
3. Add '-e NNNN' to the SHELLCHECK_OPTS setting in your .yml action file.



shfmt errors
'shfmt ' found no issues.

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

namespace doris::cloud {

#define HdfsAccessorTest DISABLED_HdfsAccessorTest
TEST(HdfsAccessorTest, normal) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: function 'TEST' exceeds recommended size/complexity thresholds [readability-function-size]

TEST(HdfsAccessorTest, normal) {
^
Additional context

cloud/test/hdfs_accessor_test.cpp:47: 92 lines including whitespace and comments (threshold 80)

TEST(HdfsAccessorTest, normal) {
^

Copy link
Contributor

github-actions bot commented Apr 1, 2024

sh-checker report

To get the full details, please check in the job output.

shellcheck errors

'shellcheck ' returned error 1 finding the following syntactical issues:

----------

In cloud/script/run_all_tests.sh line 100:
    DORIS_JAVA_HOME=${JAVA_HOME}
                    ^----------^ SC2154 (warning): JAVA_HOME is referenced but not assigned.

For more information:
  https://www.shellcheck.net/wiki/SC2154 -- JAVA_HOME is referenced but not a...
----------

You can address the above issues in one of three ways:
1. Manually correct the issue in the offending shell script;
2. Disable specific issues by adding the comment:
  # shellcheck disable=NNNN
above the line that contains the issue, where NNNN is the error code;
3. Add '-e NNNN' to the SHELLCHECK_OPTS setting in your .yml action file.



shfmt errors
'shfmt ' found no issues.

@platoneko
Copy link
Contributor Author

run buildall

@hello-stephen
Copy link
Contributor

run compile

@dataroaring
Copy link
Contributor

run buildall

Copy link
Contributor

github-actions bot commented Apr 2, 2024

PR approved by anyone and no changes requested.

@doris-robot
Copy link

TPC-H: Total hot run time: 38983 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit c72c61a4bfd57717018ec31a40dbf1d418d54dc1, data reload: false

------ Round 1 ----------------------------------
q1	17671	4122	4079	4079
q2	2025	191	184	184
q3	10471	1306	1423	1306
q4	10204	842	1056	842
q5	7466	3228	2966	2966
q6	218	132	132	132
q7	1117	626	632	626
q8	9404	2066	2056	2056
q9	6627	6217	6171	6171
q10	8420	3524	3516	3516
q11	414	251	262	251
q12	391	223	218	218
q13	17774	2912	2915	2912
q14	271	247	242	242
q15	527	515	486	486
q16	488	401	385	385
q17	974	913	909	909
q18	7363	6568	6420	6420
q19	1590	1546	1544	1544
q20	601	326	305	305
q21	3608	3151	3123	3123
q22	374	310	317	310
Total cold run time: 107998 ms
Total hot run time: 38983 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4058	4054	4082	4054
q2	333	222	215	215
q3	2959	2977	2949	2949
q4	1889	1840	1846	1840
q5	5238	5236	5205	5205
q6	207	124	123	123
q7	2247	1830	1815	1815
q8	3214	3289	3280	3280
q9	8480	8455	8456	8455
q10	3751	4009	4040	4009
q11	555	466	470	466
q12	768	615	626	615
q13	17147	3152	3085	3085
q14	309	287	276	276
q15	524	473	503	473
q16	501	437	439	437
q17	1781	1762	1745	1745
q18	8238	7679	7602	7602
q19	1732	1709	1717	1709
q20	1979	1727	1815	1727
q21	5259	4983	4985	4983
q22	508	433	419	419
Total cold run time: 71677 ms
Total hot run time: 55482 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 181489 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit c72c61a4bfd57717018ec31a40dbf1d418d54dc1, data reload: false

query1	1219	374	1114	374
query2	6225	2035	1985	1985
query3	6665	219	203	203
query4	24867	21394	21265	21265
query5	4206	421	415	415
query6	288	205	190	190
query7	4605	311	304	304
query8	232	180	180	180
query9	8479	2278	2304	2278
query10	439	253	267	253
query11	14755	14599	14440	14440
query12	152	95	97	95
query13	1642	389	396	389
query14	8492	7002	7043	7002
query15	216	172	187	172
query16	6569	283	290	283
query17	957	603	585	585
query18	1804	316	285	285
query19	213	161	164	161
query20	99	96	93	93
query21	202	133	130	130
query22	4943	4773	4737	4737
query23	33486	32704	32800	32704
query24	12979	3291	3219	3219
query25	698	408	403	403
query26	1915	165	161	161
query27	3285	377	373	373
query28	7196	1911	1900	1900
query29	1182	610	593	593
query30	313	158	151	151
query31	1027	754	769	754
query32	100	80	74	74
query33	699	236	243	236
query34	1130	525	516	516
query35	872	722	729	722
query36	1023	879	869	869
query37	240	79	77	77
query38	3708	3583	3616	3583
query39	1664	1576	1597	1576
query40	240	162	142	142
query41	49	48	48	48
query42	110	113	103	103
query43	468	414	418	414
query44	1095	727	742	727
query45	300	257	269	257
query46	1107	864	813	813
query47	2017	1876	1928	1876
query48	389	317	303	303
query49	958	375	364	364
query50	844	417	399	399
query51	7058	6845	6854	6845
query52	116	103	105	103
query53	372	306	295	295
query54	316	251	251	251
query55	86	87	80	80
query56	264	238	247	238
query57	1284	1199	1181	1181
query58	253	235	230	230
query59	2764	2528	2420	2420
query60	268	243	249	243
query61	114	106	109	106
query62	648	471	450	450
query63	314	282	277	277
query64	5950	3155	3052	3052
query65	3076	3019	3023	3019
query66	1310	313	339	313
query67	15666	14845	14820	14820
query68	8764	568	571	568
query69	554	342	332	332
query70	1236	1091	1095	1091
query71	494	268	262	262
query72	6257	2570	2425	2425
query73	875	326	330	326
query74	6625	6424	6268	6268
query75	3347	2328	2313	2313
query76	4979	1087	1191	1087
query77	566	258	242	242
query78	10965	10061	10129	10061
query79	8693	519	530	519
query80	1403	420	428	420
query81	487	227	221	221
query82	709	104	103	103
query83	207	164	161	161
query84	270	88	90	88
query85	1323	305	291	291
query86	413	282	285	282
query87	3685	3502	3482	3482
query88	3940	2348	2367	2348
query89	553	366	370	366
query90	1914	178	188	178
query91	130	106	106	106
query92	66	52	52	52
query93	6728	530	524	524
query94	967	193	195	193
query95	435	330	322	322
query96	600	279	271	271
query97	2676	2470	2466	2466
query98	228	221	209	209
query99	1290	813	829	813
Total cold run time: 297601 ms
Total hot run time: 181489 ms

@doris-robot
Copy link

ClickBench: Total hot run time: 31.79 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit c72c61a4bfd57717018ec31a40dbf1d418d54dc1, data reload: false

query1	0.04	0.03	0.03
query2	0.07	0.04	0.04
query3	0.23	0.05	0.05
query4	1.68	0.06	0.06
query5	0.51	0.48	0.49
query6	1.11	0.65	0.66
query7	0.02	0.01	0.01
query8	0.06	0.04	0.04
query9	0.55	0.53	0.51
query10	0.57	0.58	0.55
query11	0.15	0.12	0.11
query12	0.13	0.11	0.12
query13	0.62	0.59	0.59
query14	0.77	0.78	0.79
query15	0.86	0.84	0.84
query16	0.36	0.35	0.35
query17	0.99	0.99	0.98
query18	0.24	0.24	0.25
query19	1.86	1.72	1.72
query20	0.01	0.01	0.00
query21	15.44	0.78	0.69
query22	2.87	5.58	3.41
query23	17.51	1.33	1.06
query24	1.34	0.22	0.23
query25	0.12	0.09	0.10
query26	0.28	0.18	0.20
query27	0.09	0.10	0.08
query28	13.96	0.96	0.93
query29	12.53	3.59	3.40
query30	0.26	0.06	0.06
query31	2.87	0.38	0.39
query32	3.28	0.47	0.47
query33	2.87	2.88	2.86
query34	15.51	4.36	4.30
query35	4.37	4.37	4.37
query36	0.66	0.47	0.47
query37	0.19	0.18	0.18
query38	0.17	0.16	0.16
query39	0.05	0.04	0.04
query40	0.18	0.16	0.16
query41	0.09	0.06	0.05
query42	0.06	0.05	0.06
query43	0.05	0.04	0.04
Total cold run time: 105.58 s
Total hot run time: 31.79 s

@doris-robot
Copy link

Load test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'

Load test result on commit c72c61a4bfd57717018ec31a40dbf1d418d54dc1 with default session variables
Stream load json:         18 seconds loaded 2358488459 Bytes, about 124 MB/s
Stream load orc:          59 seconds loaded 1101869774 Bytes, about 17 MB/s
Stream load parquet:      31 seconds loaded 861443392 Bytes, about 26 MB/s
Insert into select:       16.7 seconds inserted 10000000 Rows, about 598K ops/s

Copy link
Contributor

@dataroaring dataroaring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dataroaring dataroaring merged commit d946662 into apache:master Apr 2, 2024
30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants