[FEAT] Add streaming + parallel CSV reader, with decompression support. (#1501)

This PR adds streaming + parallel CSV reading and parsing, along with support for streaming decompression. In particular, this PR:

- Adds support for streaming decompression for brotli, bz, deflate, gzip, lzma, xz, zlib, and zstd.
- Performs chunk-based streaming CSV reads, filling up a small buffer of unparsed records.
- Pipelines chunk-based CSV parsing with reading by spawning Tokio + rayon parsing tasks.
- Performs chunk parsing, as well as column parsing within a chunk, in parallel on the rayon threadpool.
- Changes schema inference to involve an (at most) 1 MiB file peek rather than a full file read.
- Gathers a mean row size (in bytes) estimate during schema inference and propagates this estimate back to the reader.
- Unifies local and cloud reads + schema inference.
- Adds thorough Rust-side local + cloud test coverage.

The streaming + parallel reading + parsing leads to a 4-8x speedup over the pyarrow reader and the previous non-parallel reader when benchmarking large file (~1 GB) reads, while also lowering memory utilization, since the file is read and parsed as a stream.

## TODOs (follow-up PRs)

- [ ] Add snappy decompression support (need to essentially do something like [this](https://github.com/belltoy/tokio-snappy/blob/master/src/lib.rs))
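The schema-inference change described above (a bounded peek of at most 1 MiB that also yields a mean row size estimate) can be illustrated with a minimal, stdlib-only sketch. This is not the code from the PR: `peek_row_stats`, `PeekStats`, and `DEFAULT_PEEK_BYTES` are hypothetical names, and the sketch assumes rows are newline-terminated with no quoted newlines, counting the header line as a row.

```rust
use std::io::{self, Read};

/// Hypothetical default peek size: 1 MiB, matching the bound named in the commit message.
pub const DEFAULT_PEEK_BYTES: u64 = 1 << 20;

/// Hypothetical summary of a bounded peek at the start of a CSV stream.
#[derive(Debug, PartialEq)]
pub struct PeekStats {
    /// Number of bytes actually read (never more than the peek limit).
    pub bytes_peeked: usize,
    /// Number of newline-terminated rows found in the peeked bytes.
    pub complete_rows: usize,
    /// Mean size in bytes of a complete row, or None if no complete row was seen.
    pub mean_row_bytes: Option<f64>,
}

/// Reads at most `max_peek_bytes` from `reader` and estimates the mean row size.
///
/// Assumption: rows end with b'\n' and fields contain no quoted newlines.
/// A real CSV reader would need a proper record parser here.
pub fn peek_row_stats<R: Read>(reader: R, max_peek_bytes: u64) -> io::Result<PeekStats> {
    let mut buf = Vec::new();
    // `take` caps the read so the whole file is never pulled into memory.
    reader.take(max_peek_bytes).read_to_end(&mut buf)?;

    // Ignore a trailing partial row: the peek limit may have cut it short.
    let (complete_rows, row_bytes) = match buf.iter().rposition(|&b| b == b'\n') {
        Some(pos) => {
            let rows = buf[..=pos].iter().filter(|&&b| b == b'\n').count();
            (rows, pos + 1)
        }
        None => (0, 0),
    };

    let mean_row_bytes = if complete_rows > 0 {
        Some(row_bytes as f64 / complete_rows as f64)
    } else {
        None
    };

    Ok(PeekStats {
        bytes_peeked: buf.len(),
        complete_rows,
        mean_row_bytes,
    })
}
```

A reader could use such an estimate to size its chunk buffers, for example by dividing a target chunk size in bytes by `mean_row_bytes` to get an approximate number of rows per chunk. Calling `peek_row_stats(file, DEFAULT_PEEK_BYTES)` works for any `Read` source, so the same function would apply to a local file or a decompressed stream.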
1 parent 76e256a, commit ad829c9

Showing 42 changed files with 2,191 additions and 158 deletions.