-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix: Resolve conflicting schema issues by removing dependency on PyArrow #67
Conversation
Performance snapshot, prior to changes (approx 15K records per second):
|
Refactoring is functionally complete - although still some cleanup to do. Performance is boosted by approximately 30% - from 15K records per second to 20K records per second.
|
@bindipankhudi - Tests are passing locally and lint cleanup is complete. I am taking another pass through the code myself but wanted to ping you of the latest status. |
Sounds great! I am taking a look also, but don't wait on it. Feel free to merge.
…On Thu, Mar 7, 2024 at 1:17 PM aaronsteers ***@***.***> wrote:
@bindipankhudi <https://github.com/bindipankhudi> - Tests are passing
locally and lint cleanup is complete.
I am taking another pass through the code myself but wanted to ping you of
the latest status.
—
Reply to this email directly, view it on GitHub
<#67 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/BF2EJSTJIV6LA2G3D4EU2FDYXDKNVAVCNFSM6AAAAABDZAGBPSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSOBUGUYDONRXGA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quite some refactoring. Looks great and cleaner than before!
Resolves: #89
We've discovered through user testing that the PyArrow dependency, and PyArrow itself, are inherently fragile when it comes to JSON schemas with variable typing. PyArrow exceeds at handling large volumes of typed data very quickly, but there's no way to process ragged and potentially conflicting types without reverting json types to strings.
Paqruet, similarly, has problems rendering json structures without child nodes. So, something like "
extra_properties: {}
" would cause a runtime error.We've recently moved our file writer to a JSONL file writer, in order to better handle ragged, non-populated objects, and unpredictable types. However, we still have a dependency on
pyarrow
, which is problematic for some users' workloads.This PR will remove the
pyarrow
dependency and replace it with a flow that simply writes JSONL lines to local files as quickly as possible.