Commit
ARROW-1990: [JS] C++ Refactor, Add DataFrame
This PR moves the `Table` class out of the Vector hierarchy and adds optimized dataframe operations to it. It currently implements an optimized `scan()` method, `filter(predicate)`, `count()`, and `countBy(column_name)` (which only works on dictionary-encoded columns).

Some usage examples, based on the file generated by `js/test/data/tables/generate.py`:
``` js
> let table = Table.from(...);
> table.count()
1000000
> table.filter(col('lat').gteq(0)).count()
499718
> table.countBy('origin').toJSON()
{ Charlottesville: 166839,
  'New York': 166251,
  'San Francisco': 166642,
  Seattle: 166659,
  'Terre Haute': 166756,
  'Washington, DC': 166853 }
> table.filter(col('lng').gteq(0)).countBy('origin').toJSON()
{ Charlottesville: 83109,
  'New York': 83221,
  'San Francisco': 83515,
  Seattle: 83362,
  'Terre Haute': 83314,
  'Washington, DC': 83479 }
```

There are performance tests for the dataframe operations; to run them, you must first generate the test data by running `npm run create:perfdata`.

The PR also includes @trxcllnt's refactor of the JS implementation to make it more closely resemble the C++ implementation.
This refactor resolves multiple JIRAs: ARROW-1903, ARROW-1898, ARROW-1502, ARROW-1952 (partially), and ARROW-1985.

Author: Paul Taylor <[email protected]>
Author: Brian Hulette <[email protected]>
Author: Brian Hulette <[email protected]>

Closes apache#1482 from TheNeuralBit/table-scan-perf and squashes the following commits:

52f1e0e [Brian Hulette] <, > are not commutative, misc cleanup
04b1838 [Brian Hulette] even more table tests
16b9ccb [Brian Hulette] Merge pull request #4 from trxcllnt/js-cpp-refactor
fe300df [Paul Taylor] fix closure es5/umd toString() iterator
3d5240a [Paul Taylor] fix more externs
10c48ad [Paul Taylor] Merge branch 'table-scan-perf' of github.com:ccri/arrow into js-cpp-refactor
dbe7f81 [Brian Hulette] Add more Table unit tests
1910962 [Brian Hulette] Add optional bind callback to scan
5bdf17f [Brian Hulette] Fix perf
8cf2473 [Brian Hulette] Merge remote-tracking branch 'origin/master' into table-scan-perf
4a41b18 [Paul Taylor] add src/predicate to the list of exports we should save from uglify
5a91fab [Paul Taylor] add more view, predicate externs
f6adfb3 [Brian Hulette] Create predicate namespace
f7bb0ed [Paul Taylor] Merge branch 'table-scan-perf' of github.com:ccri/arrow into js-cpp-refactor
e148ee4 [Paul Taylor] Merge branch 'extern-woes' into js-cpp-refactor
25cdc4a [Paul Taylor] add src/predicate to the list of exports we should save from uglify
dc7c728 [Paul Taylor] add more view, predicate externs
25e6af7 [Brian Hulette] Create predicate namespace
579ab1f [Brian Hulette] Merge pull request #2 from trxcllnt/js-cpp-refactor
f3cde1a [Paul Taylor] fix lint
9769773 [Paul Taylor] fix vector perf tests
016ba78 [Brian Hulette] Merge pull request #1 from trxcllnt/js-cpp-refactor
272d293 [Paul Taylor] Merge pull request #4 from ccri/empty-table
7bc7363 [Brian Hulette] Fix exception for empty Table
8ddce0a [Paul Taylor] check bounds in getChildAt(i) to avoid NPEs
f1dead0 [Paul Taylor] compute chunked nested childData list correctly
18807c6 [Paul Taylor] rename ChunkData's fields so it's more clear they're not semantically similar to other similarly named fields
7e43b78 [Paul Taylor] add test:integration npm script
a5f200f [Paul Taylor] Merge pull request #3 from ccri/table-from-struct
c8cd286 [Brian Hulette] Add Table.fromStruct
a00415e [Brian Hulette] Fix perf
54d4f5b [Paul Taylor] lazily allocate table and recordbatch columns, support NestedView's getChildAt(i) method in ChunkedView
40b3638 [Paul Taylor] run integration tests with local data for coverage stats
fe31ee0 [Paul Taylor] slice the flat data values before returning an iterator of them
e537789 [Paul Taylor] make it easier to run all integration tests from local data
c0fd2f9 [Paul Taylor] use the dictionary of the last chunked vector list for chunked dictionary vectors
e33c068 [Paul Taylor] Merge pull request #2 from ccri/fixed-size-list
5bb63af [Brian Hulette] Don't read OFFSET vector for FixedSizeList
614b688 [Paul Taylor] add asEpochMs to date and timestamp vectors
87334a5 [Paul Taylor] Merge branch 'table-scan-perf' of github.com:ccri/arrow into js-cpp-refactor
b7f5bfb [Paul Taylor] rename numRows to length, add table.getColumn()
e81082f [Paul Taylor] export vector views, allow cloning data as another type
700a47c [Paul Taylor] export visitors
e859e13 [Paul Taylor] fix package.json bin entry
0620cfd [Brian Hulette] use Math.fround
0126dc4 [Brian Hulette] Don't recompute total length
e761eee [Brian Hulette] Rename asJSON to toJSON
6c91ed4 [Paul Taylor] Merge branch 'master' of github.com:apache/arrow into js-cpp-refactor-merge_with-table-scan-perf
d2b18d5 [Paul Taylor] Merge remote-tracking branch 'ccri/table-scan-perf' into js-cpp-refactor-merge_with-table-scan-perf
f3f3b86 [Paul Taylor] rename table.ts to recordbatch.ts in preparation for merging latest
e3f629d [Paul Taylor] fix rest of the mangling issues
fa7c17a [Paul Taylor] passing all tests except es5 umd mangler ones
e20decd [Brian Hulette] Add license headers
edcbdbe [Brian Hulette] cleanup
20717d5 [Brian Hulette] Fixed countBy(string)
7244887 [Brian Hulette] Add table unit tests...
6719147 [Brian Hulette] Add DataFrame.countBy operation
2f4a349 [Brian Hulette] Minor tweaks
2e118ab [Brian Hulette] linter
a788db3 [Brian Hulette] Cleanup
a9fff89 [Brian Hulette] Move Table out of the Vector hierarchy
1d60aa1 [Brian Hulette] Moved DataFrame ops to Table. DataFrame is now an interface
e8979ba [Brian Hulette] Refactor DataFrame to extend Vector<StructRow>
6a41d68 [Brian Hulette] clean up table benchmarks
2744c63 [Brian Hulette] Remove Chunked/Simple DataFrame distinction
aa999f8 [Brian Hulette] Add DictionaryVector optimization for equals predicate
4d9e8c0 [Brian Hulette] Add concept of predicates for filtering dataframes
796f45d [Brian Hulette] add DataFrame filter and count ops
30f0330 [Brian Hulette] Add basic DataFrame impl ...
a1edac2 [Brian Hulette] Add perf tests for table scans
d18d915 [Paul Taylor] fix struct and map rows
61dc699 [Paul Taylor] WIP -- refactor types to closer match arrow-cpp
62db338 [Paul Taylor] update dependencies and add es6+ umd targets to jest transform ignore patterns to fix ci
6ff18e9 [Paul Taylor] ship es2015 commonJS in main package to avoid confusion
74e828a [Paul Taylor] fix typings issues (ARROW-1903)
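
The description notes that `countBy(column_name)` only works on dictionary-encoded columns. The sketch below is one plausible reading of why that restriction helps. It is not the code in this PR and does not use the Arrow JS API. `makeDictionaryColumn`, the standalone `countBy`, and the row-index predicate are hypothetical names invented for illustration, and the sample data is made up. The idea shown is that a dictionary-encoded column stores one small integer index per row, so counting can tally indices into a typed array and touch the actual values only once per distinct entry.

``` js
// Hypothetical stand-in for a dictionary-encoded column (NOT the Arrow JS
// type): each row stores an integer index into an array of distinct values.
function makeDictionaryColumn(values) {
  const dictionary = [];
  const lookup = new Map();
  const indices = new Int32Array(values.length);
  values.forEach((value, row) => {
    if (!lookup.has(value)) {
      lookup.set(value, dictionary.length);
      dictionary.push(value);
    }
    indices[row] = lookup.get(value);
  });
  return { dictionary, indices };
}

// Count rows per distinct value by tallying dictionary indices. The values
// themselves are read only once per dictionary entry, when the result object
// is built. `predicate` is an illustrative (row) => boolean filter.
function countBy(column, predicate = () => true) {
  const counts = new Uint32Array(column.dictionary.length);
  for (let row = 0; row < column.indices.length; row++) {
    if (predicate(row)) counts[column.indices[row]]++;
  }
  const result = {};
  column.dictionary.forEach((value, i) => { result[value] = counts[i]; });
  return result;
}

// Made-up sample data, loosely shaped like the 'origin' and 'lat' columns.
const origin = makeDictionaryColumn(
  ['Seattle', 'New York', 'Seattle', 'Seattle', 'New York']);
const lat = [47.6, 40.7, -1.0, 47.6, -2.5];

console.log(countBy(origin));                         // { Seattle: 3, 'New York': 2 }
console.log(countBy(origin, (row) => lat[row] >= 0)); // { Seattle: 2, 'New York': 1 }
```

The second call mirrors the shape of `table.filter(...).countBy('origin')` from the usage examples: the filter is applied during the same pass that tallies the indices, so no intermediate filtered copy of the column is built in this sketch.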