-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat: Adds interop with Arrow library using new method Dataset.to_arrow()
#281
Conversation
WalkthroughWalkthroughThe updates introduce native Arrow support by adding Changes
Sequence Diagram(s)
Assessment against linked issues
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few changes requested inline.
Also, I would like request at least one or two tests added. Feel free to replicate another test that validates to_pandas()
capabilities. At minimum, we just want to make sure that the code can execute. It is probably also worth adding a version of the test that forces a small chunk size, ensuring that the ">1 chunk" codepath is also working as expected.
Overall - this looks great. I think we will be ready to merge once this feedback is resolved.
@coderabbitai review |
Actions performedReview triggered.
|
/fix-pr
|
@avirajsingh7 - Auto-fix applied some lint fixing. And the other fixes you applied looked great. Lmk if this is ready for final review or if you are still applying additional changes. |
@aaronsteers it is ready for review |
This comment was marked as outdated.
This comment was marked as outdated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files ignored due to path filters (1)
poetry.lock
is excluded by!**/*.lock
Files selected for processing (5)
- airbyte/caches/base.py (2 hunks)
- airbyte/datasets/_base.py (2 hunks)
- airbyte/datasets/_sql.py (4 hunks)
- pyproject.toml (1 hunks)
- tests/integration_tests/test_all_cache_types.py (1 hunks)
Files skipped from review due to trivial changes (1)
- pyproject.toml
Additional comments not posted (3)
airbyte/datasets/_sql.py (2)
108-109
: Ensure chunk size is appropriate and method is tested.The
to_arrow
method allows specifying a chunk size for processing the dataset. Ensure that the chunk size is appropriate and that the method is tested.
175-186
: Ensure chunk size is appropriate and method is tested.The
to_arrow
method allows specifying a chunk size for processing the dataset. Ensure that the chunk size is appropriate and that the method is tested.Verification successful
Ensure chunk size is appropriate and method is tested.
The
to_arrow
method allows specifying a chunk size for processing the dataset. Based on the search results, theCachedDataset
class is referenced in several integration tests, indicating that the method is likely tested indirectly through these tests.
tests/integration_tests/test_source_test_fixture.py
tests/integration_tests/cloud/test_cloud_sql_reads.py
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify that the to_arrow method in CachedDataset is tested. # Test: Search for tests that call the to_arrow method in CachedDataset. rg --type python $'CachedDataset.to_arrow'Length of output: 72
Script:
#!/bin/bash # Description: Verify that the to_arrow method in CachedDataset is tested. # Test: Search for tests that call the to_arrow method in CachedDataset. rg 'CachedDataset.to_arrow'Length of output: 27
Script:
#!/bin/bash # Description: Verify that the CachedDataset class and its methods are tested. # Search for any references to the CachedDataset class in test files. rg 'CachedDataset' --glob '*test*.py' # Search for any test files that might be testing the functionality of CachedDataset. fd test --type f --extension pyLength of output: 3235
tests/integration_tests/test_all_cache_types.py (1)
166-168
: Verify the correctness of the conversion and assertions.The conversion to Arrow dataset and the assertions for row count and batch iteration look correct. Ensure that the chunk size of 10 is optimal for the dataset size.
Verification successful
Verify the correctness of the conversion and assertions.
The conversion to an Arrow dataset and the assertions for row count and batch iteration look correct. Ensure that the chunk size of 10 is optimal for the dataset size.
- The code snippet is located in
tests/integration_tests/test_all_cache_types.py
around lines 166-168.- The assertions check both the row count (
FAKER_SCALE_A
) and the number of batches (20).Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the correctness of the conversion and assertions. # Test: Run the test suite to ensure the new functionality is covered by tests. pytest tests/integration_tests/test_all_cache_types.pyLength of output: 130
Script:
#!/bin/bash # Search for the function containing the provided code snippet to inspect its correctness. ast-grep --lang python --pattern $'def $_($$$):\n $$$\n arrow_dataset = result["users"].to_arrow(chunksize=10)\n assert arrow_dataset.count_rows() == FAKER_SCALE_A\n assert sum(1 for _ in arrow_dataset.to_batches()) == 20\n $$$'Length of output: 239
Script:
#!/bin/bash # Search for the provided code snippet in the specified file to locate its exact position and inspect its context. rg 'arrow_dataset = result\["users"\].to_arrow\(chunksize=10\)' tests/integration_tests/test_all_cache_types.py -A 5 -B 5Length of output: 644
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes look great. Thanks for the added changes and for the contribution. 🚀
Will merge once all tests are passed.
/poetry-lock
|
Resolving dependencies... changes
This comment was marked as outdated.
This comment was marked as outdated.
/test-pr
|
@avirajsingh7 - On the tests side, I'm seeing this weird error. I'm not sure if it is new, since pulling in version updates from
https://github.com/airbytehq/PyAirbyte/actions/runs/9832269420 Update: I think I have this resolved now. Will push an update shortly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files ignored due to path filters (1)
poetry.lock
is excluded by!**/*.lock
Files selected for processing (1)
- pyproject.toml (1 hunks)
Files skipped from review as they are similar to previous changes (1)
- pyproject.toml
/test-pr
|
BigQuery is failing at the To merge, we may want to declare BigQuery unsupported and simply raise
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Outside diff range and nitpick comments (1)
airbyte/caches/bigquery.py (1)
33-33
: Remove whitespace from blank line.The blank line contains unnecessary whitespace.
- +Tools
Ruff
33-33: Blank line contains whitespace
Remove whitespace from blank line
(W293)
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- airbyte/caches/base.py (3 hunks)
- airbyte/caches/bigquery.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
- airbyte/caches/base.py
Additional context used
Ruff
airbyte/caches/bigquery.py
33-33: Blank line contains whitespace
Remove whitespace from blank line
(W293)
Additional comments not posted (2)
airbyte/caches/bigquery.py (2)
26-26
: LGTM! Import statement is appropriate.The import statement for
DEFAULT_ARROW_MAX_CHUNK_SIZE
is necessary for the new method.
34-45
: LGTM! Method implementation is appropriate.The method correctly raises a
NotImplementedError
with a clear and informative message.
@aaronsteers I think this fix should work, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- tests/integration_tests/test_all_cache_types.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
- tests/integration_tests/test_all_cache_types.py
/fix-pr |
@avirajsingh7 - Thanks for taking the revision for the BigQuery issue. I did some digging and found this issue from a while back, describing a similar problem for |
/fix-pr
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- airbyte/caches/bigquery.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
- airbyte/caches/bigquery.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- airbyte/caches/bigquery.py (1 hunks)
- tests/integration_tests/test_all_cache_types.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
- tests/integration_tests/test_all_cache_types.py
Additional context used
Ruff
airbyte/caches/bigquery.py
41-41: Blank line contains whitespace
Remove whitespace from blank line
(W293)
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- airbyte/caches/bigquery.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
- airbyte/caches/bigquery.py
/test-pr
(Actually successful. Failure is false-positive.) |
@avirajsingh7 - Thanks for all your hard work on this. Tests are passing. Merged!! 🚀 🙌 |
Thanks @aaronsteers for all your efforts and help |
Resolves #204
As discuss in issue added a supoort pyarrow dataset instead of pyarrow table to handle a large datasets.
Supporting docs:
pyarrow dataset
pyarrow.Table.from_pandas
Still need input on optimal chunksize...
Summary by CodeRabbit
New Features
to_arrow
method for converting datasets to Arrow format with a specified chunk size.Bug Fixes
Tests