[feature](load) refactor CSV reading process during scanning, and support enclose and escape for stream load #22539

Merged
merged 28 commits into from
Aug 15, 2023

Conversation

TangSiyang2001
Collaborator

@TangSiyang2001 TangSiyang2001 commented Aug 3, 2023

Proposed changes

Refactor thoughts: close #22383
Descriptions about enclose and escape: #22385

Further comments

2023-08-09:
Unfortunately, experiments show that the original approach to parsing plain CSV is faster. Therefore, the refactor is applied only to the enclose-related code; the plain CSV parser keeps the original logic.

Some performance regression is unavoidable anyway. From the CSV reader's perspective, the real weak point may be the column-writing behavior, as the flame graph indicates.

Trimming of escape characters will be enabled after the fix in #22411 is merged.
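
To make the split between the plain path and the enclose path concrete, here is a minimal Python sketch (not the actual Doris C++ code). The function names, the `trim_escape` flag, and the simplification that an enclose character toggles quoting anywhere in a field are all assumptions for illustration. With `trim_escape=False` the escape character is kept in the output, mirroring the note that trimming is deferred to a later fix.

```python
from typing import List, Optional


def split_plain(line: str, sep: str = ",") -> List[str]:
    """Plain path: no enclose configured, so a simple split is enough."""
    return line.split(sep)


def split_enclosed(line: str, sep: str = ",", enclose: str = '"',
                   escape: Optional[str] = "\\",
                   trim_escape: bool = False) -> List[str]:
    """Enclose path: a small state machine (hypothetical, simplified)."""
    fields: List[str] = []
    current: List[str] = []
    in_enclose = False
    i = 0
    while i < len(line):
        ch = line[i]
        if escape is not None and ch == escape and i + 1 < len(line):
            # The character after the escape is always taken literally.
            if not trim_escape:
                current.append(ch)
            current.append(line[i + 1])
            i += 2
            continue
        if ch == enclose:
            # Toggle the "inside enclose" state; the enclose itself is dropped.
            in_enclose = not in_enclose
        elif ch == sep and not in_enclose:
            # A separator outside an enclose ends the current field.
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append("".join(current))
    return fields


def split_line(line: str, sep: str = ",", enclose: Optional[str] = None,
               escape: Optional[str] = None) -> List[str]:
    """Dispatch: only pay for the state machine when enclose is set."""
    if enclose is None:
        return split_plain(line, sep)
    return split_enclosed(line, sep, enclose, escape)
```

Keeping the plain path as a separate function reflects the design choice described above: rows without an enclose never touch the slower character-by-character loop.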

Cases that should be discussed:

  1. When an incomplete enclose appears at the beginning of a large data set, the line delimiter is unreachable until EOF. Will the buffer become extremely large?
  2. What if an infinitely long line occurs in this case? Essentially, case 1 is equivalent to this.
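
One possible answer to both cases is to cap how much a single logical line may buffer. The sketch below illustrates that idea in Python; it is not what this PR implements, and `read_lines`, `LineTooLongError`, and `max_line_chars` are hypothetical names.

```python
from typing import Iterable, Iterator, List


class LineTooLongError(Exception):
    """Raised when a logical line exceeds the configured cap (hypothetical)."""


def read_lines(chunks: Iterable[str], line_delim: str = "\n",
               enclose: str = '"',
               max_line_chars: int = 1 << 20) -> Iterator[str]:
    """Yield logical lines, treating delimiters inside an enclose as data."""
    buf: List[str] = []
    in_enclose = False
    for chunk in chunks:
        for ch in chunk:
            if ch == enclose:
                in_enclose = not in_enclose
            if ch == line_delim and not in_enclose:
                # A delimiter outside an enclose ends the logical line.
                yield "".join(buf)
                buf = []
                continue
            buf.append(ch)
            if len(buf) > max_line_chars:
                # An unterminated enclose (or an endless line) ends up here
                # instead of growing the buffer until EOF.
                raise LineTooLongError(
                    f"line exceeds {max_line_chars} characters")
    if buf:
        # Trailing data without a final delimiter is returned as-is.
        yield "".join(buf)
```

With such a cap, an unterminated enclose fails fast with a clear error. The trade-off is that legitimately huge rows are rejected too, so the limit would need to be configurable.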

Only stream load is supported in this PR, as a trial, to avoid too many unrelated changes. Docs will be added when enclose and escape are available for all kinds of load.
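
For orientation, a stream load request that sets these options might be built as in the sketch below. Several things here are assumptions rather than something this PR states: that the options are passed as HTTP headers named `enclose` and `escape`, the `/api/{db}/{table}/_stream_load` path, and the host, database, and table names, which are placeholders. The request is only constructed, not sent.

```python
import base64
import urllib.request


def build_stream_load_request(host: str, db: str, table: str, data: bytes,
                              user: str, password: str,
                              enclose: str = '"',
                              escape: str = "\\") -> urllib.request.Request:
    """Build (but do not send) a stream load request with enclose/escape."""
    # Assumed endpoint layout; adjust to the actual deployment.
    url = f"http://{host}/api/{db}/{table}/_stream_load"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {
        "Authorization": f"Basic {token}",
        "Expect": "100-continue",
        "format": "csv",
        "column_separator": ",",
        # Assumed header names for the options discussed in this PR.
        "enclose": enclose,
        "escape": escape,
    }
    return urllib.request.Request(url, data=data, headers=headers,
                                  method="PUT")


# Placeholder values for illustration only.
request = build_stream_load_request(
    "127.0.0.1:8030", "demo_db", "demo_tbl", b'1,"a, b"\n', "root", "")
```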

@TangSiyang2001 TangSiyang2001 changed the title [[feature](load) refactor CSV reading process during scanning, and support enclose and escape for stream load [feature](load) refactor CSV reading process during scanning, and support enclose and escape for stream load Aug 3, 2023
@TangSiyang2001
Collaborator Author

run buildall

@TangSiyang2001
Collaborator Author

run buildall

@github-actions
Contributor

github-actions bot commented Aug 3, 2023

clang-tidy review says "All clean, LGTM! 👍"

@TangSiyang2001
Collaborator Author

run buildall

@github-actions
Contributor

github-actions bot commented Aug 3, 2023

clang-tidy review says "All clean, LGTM! 👍"

@TangSiyang2001
Collaborator Author

run clickbench

@hello-stephen
Contributor

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 46.35 seconds
stream load tsv: 550 seconds loaded 74807831229 Bytes, about 129 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 29.3 seconds inserted 10000000 Rows, about 341K ops/s
storage size: 17162156441 Bytes
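
The "about N MB/s" figures in these reports appear consistent with total bytes divided by elapsed seconds, expressed in MiB (1024 × 1024 bytes) and rounded down. That is an inference from the numbers, not a documented property of the pipeline. The short check below uses hypothetical helper names.

```python
MIB = 1024 * 1024


def throughput_mib_per_s(total_bytes: int, seconds: int) -> int:
    """Whole MiB per second, rounded down."""
    return total_bytes // (seconds * MIB)


def k_ops_per_s(rows: int, seconds: float) -> int:
    """Thousands of rows per second, rounded down."""
    return int(rows / seconds / 1000)


# Values taken from the report above.
tsv = throughput_mib_per_s(74807831229, 550)
json_rate = throughput_mib_per_s(2358488459, 20)
orc = throughput_mib_per_s(1101869774, 65)
parquet = throughput_mib_per_s(861443392, 31)
insert_rate = k_ops_per_s(10_000_000, 29.3)
print(tsv, json_rate, orc, parquet, insert_rate)
```

Read this way, the TSV stream load moves roughly 136 million bytes per second, which is useful to keep in mind when comparing against tools that report decimal megabytes.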

@TangSiyang2001
Collaborator Author

run buildall

@github-actions
Contributor

github-actions bot commented Aug 3, 2023

clang-tidy review says "All clean, LGTM! 👍"

@hello-stephen
Contributor

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 46.39 seconds
stream load tsv: 547 seconds loaded 74807831229 Bytes, about 130 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 30 seconds loaded 861443392 Bytes, about 27 MB/s
insert into select: 29.0 seconds inserted 10000000 Rows, about 344K ops/s
storage size: 17162164067 Bytes

@TangSiyang2001
Collaborator Author

run buildall

@github-actions
Contributor

github-actions bot commented Aug 3, 2023

clang-tidy review says "All clean, LGTM! 👍"

@hello-stephen
Contributor

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 46.39 seconds
stream load tsv: 535 seconds loaded 74807831229 Bytes, about 133 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 29.9 seconds inserted 10000000 Rows, about 334K ops/s
storage size: 17162420222 Bytes

@TangSiyang2001
Collaborator Author

run buildall

@TangSiyang2001
Collaborator Author

run buildall

@github-actions
Contributor

github-actions bot commented Aug 4, 2023

clang-tidy review says "All clean, LGTM! 👍"

@github-actions
Contributor

github-actions bot commented Aug 4, 2023

clang-tidy review says "All clean, LGTM! 👍"

@TangSiyang2001
Collaborator Author

run buildall

@github-actions
Contributor

github-actions bot commented Aug 9, 2023

clang-tidy review says "All clean, LGTM! 👍"

@hello-stephen
Contributor

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 47.04 seconds
stream load tsv: 540 seconds loaded 74807831229 Bytes, about 132 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 29.4 seconds inserted 10000000 Rows, about 340K ops/s
storage size: 17161966789 Bytes

@TangSiyang2001
Collaborator Author

TangSiyang2001 commented Aug 9, 2023

Unfortunately, experiments show that the original approach to parsing plain CSV is faster. Therefore, the refactor is applied only to the enclose-related code; the plain CSV parser keeps the original logic.

Some performance regression is unavoidable anyway. From the CSV reader's perspective, the real weak point may be the column-writing behavior, as the flame graph indicates.

@github-actions
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@TangSiyang2001
Collaborator Author

run buildall

@github-actions
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@TangSiyang2001
Collaborator Author

run clickbench

Contributor

@dataroaring dataroaring left a comment

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Aug 11, 2023
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@TangSiyang2001
Collaborator Author

run arm

@TangSiyang2001
Collaborator Author

run clickbench

@TangSiyang2001
Collaborator Author

run buildall

@github-actions
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@hello-stephen
Contributor

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 45.57 seconds
stream load tsv: 544 seconds loaded 74807831229 Bytes, about 131 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 66 seconds loaded 1101869774 Bytes, about 15 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 29.0 seconds inserted 10000000 Rows, about 344K ops/s
storage size: 17162285800 Bytes

@TangSiyang2001
Collaborator Author

run p0

Contributor

@morningman morningman left a comment

LGTM

@morningman morningman merged commit b49dc80 into apache:master Aug 15, 2023
15 of 16 checks passed
xiaokang pushed a commit that referenced this pull request Aug 17, 2023
…port enclose and escape for stream load (#22539)

airborne12 pushed a commit to airborne12/apache-doris that referenced this pull request Aug 21, 2023
…port enclose and escape for stream load (apache#22539)

Labels
approved Indicates a PR has been approved by one committer. dev/2.0.1-merged merge_conflict reviewed
Development

Successfully merging this pull request may close these issues.

[Enhancement] refactor CSV reading process during scanning
5 participants