forked from apache/systemds
Commit
[SYSTEMDS-3444][SYSTEMDS-2699] Compressed I/O
This commit is a major overhaul of the writing and reading of compressed matrices. The design now writes dictionaries separately, and reading works in both local and Spark execution; a Spark read combines the dictionaries written in a distributed execution.

This PR also updates and refines the schema apply. Fused update and apply can now compress a matrix at around 669MiB/s single-threaded and around 2GiB/s multi-threaded. This is done by first fully materializing the compressed format in memory, so there is potential for further speedup if we relocate this compression onto the IO path. That is left for future work.

One major improvement that also makes our default compression faster is that ACountHashMap.java now generalizes the counting hashmap across co-coded columns and single columns, with optimized increment calls for improved performance.

The co-coding algorithm has also been slightly modified to add a small fraction to the cost of column groups depending on their column indexes. As a result, column groups with the same cost are sorted by their average column index, which in turn improves the compression time of highly compressible data such as binary or ultra-sparse data.

The PR also fixes NaN compression: NaN is no longer treated specially, which allows us to compress matrices containing NaN and afterward replace NaN in the already compressed representation. Previously, all NaN values were replaced with 0.

Future work is to parallelize the reading of compressed matrices, which is currently single-threaded in the CP case.

In the serialization performance benchmark, this commit moves the size calculation outside of the timed part and improves the general code evaluation of individual functions.

Closes apache#1880
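The co-coding tie-break described in the message can be illustrated with a minimal sketch. It is not the SystemDS implementation: the class `CoCodeTieBreak`, the `Group` type, and the epsilon value `1e-6` are all hypothetical, since the message only says "a small fraction" is added to the cost based on column indexes. The sketch adds epsilon times the average column index to each group's cost, so groups with equal base cost come out of a priority queue ordered by average column index.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class CoCodeTieBreak {

    // Hypothetical stand-in for a column group candidate; not a SystemDS class.
    static final class Group {
        final int[] cols;
        final double baseCost;

        Group(int[] cols, double baseCost) {
            this.cols = cols;
            this.baseCost = baseCost;
        }

        // Average of the column indexes contained in this group.
        double avgIndex() {
            double sum = 0;
            for (int c : cols)
                sum += c;
            return sum / cols.length;
        }

        // Base cost plus a small index-dependent fraction (eps is an assumed value).
        double cost(double eps) {
            return baseCost + eps * avgIndex();
        }
    }

    // Returns the groups in the order a cost-based priority queue would emit them.
    static List<Group> order(List<Group> groups, double eps) {
        PriorityQueue<Group> queue =
            new PriorityQueue<>(Comparator.comparingDouble((Group g) -> g.cost(eps)));
        queue.addAll(groups);
        List<Group> out = new ArrayList<>();
        while (!queue.isEmpty())
            out.add(queue.poll());
        return out;
    }

    public static void main(String[] args) {
        List<Group> groups = new ArrayList<>();
        groups.add(new Group(new int[] {4, 5}, 10.0));
        groups.add(new Group(new int[] {0, 1}, 10.0)); // same base cost as {4, 5}
        groups.add(new Group(new int[] {8}, 9.0));
        for (Group g : order(groups, 1e-6))
            System.out.println(java.util.Arrays.toString(g.cols) + " cost=" + g.cost(1e-6));
    }
}
```

Without the index term, the relative order of the two groups with base cost 10.0 would depend on the queue's internal layout. With it, the group covering lower column indexes is consistently processed first, while a sufficiently small epsilon leaves the order of groups with clearly different costs unchanged.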
1 parent
884ad3a
commit a54f513
Showing 136 changed files with 5,308 additions and 3,190 deletions.