This repository has been archived by the owner on Sep 23, 2024. It is now read-only.

Local type casts #173

Open
wants to merge 2 commits into
base: master

@Tolsto Tolsto commented Apr 20, 2022

Problem

The current implementation of log-based replication asks the Postgres server to cast hstore and array values. That means one network request per hstore or array column for every affected row in the log. Even worse, each request opens a new connection. With TCP and TLS handshakes plus PG authentication, a single type cast costs multiple network round trips. Total madness.

Proposed changes

Parse hstore values and selected array types locally. The same could be done for the other array types, but I didn't have time to test it. The array parsing is based on singer-io/tap-postgres#72.
The results speak for themselves: before the change my pipeline was unusable, because processing 3 hours of WAL data would have taken about 4 months. Speed improvement after the change: more than 1000x.
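
To illustrate what "parse hstore locally" can look like, here is a minimal sketch. It is not the code from this PR: the function name `parse_hstore` and the regex are assumptions made for illustration, and it only targets the text form Postgres emits for hstore output (double-quoted keys and values, unquoted `NULL`, backslash escapes).

```python
import re

# One "key"=>"value" or "key"=>NULL pair, as Postgres prints hstore values.
# Hypothetical helper for illustration, not the implementation in this PR.
_HSTORE_PAIR = re.compile(
    r'"((?:[^"\\]|\\.)*)"\s*=>\s*(?:"((?:[^"\\]|\\.)*)"|(NULL))',
    re.DOTALL,
)
# Backslash followed by any character: keep only the escaped character.
_UNESCAPE = re.compile(r'\\(.)', re.DOTALL)


def parse_hstore(text):
    """Turn an hstore text literal into a dict without touching the server."""
    if text is None:
        return None
    result = {}
    for match in _HSTORE_PAIR.finditer(text):
        key = _UNESCAPE.sub(r'\1', match.group(1))
        if match.group(3) is not None:
            # Unquoted NULL means SQL NULL, unlike the quoted string "NULL".
            result[key] = None
        else:
            result[key] = _UNESCAPE.sub(r'\1', match.group(2))
    return result


print(parse_hstore('"a"=>"1", "b"=>NULL'))
```

Since everything happens in-process, the per-value cost is a regex scan instead of a connection setup plus a query.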

Types of changes

What types of changes does your code introduce to PipelineWise?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)

Checklist

  • Description above provides context of the change
  • I have added tests that prove my fix is effective or that my feature works
  • Unit tests for changes (not needed for documentation changes)
  • CI checks pass with my changes
  • Bumping version in setup.py is an individual PR and not mixed with feature or bugfix PRs
  • Commit message/PR title starts with [AP-NNNN] (if applicable. AP-NNNN = JIRA ID)
  • Branch name starts with AP-NNN (if applicable. AP-NNN = JIRA ID)
  • Commits follow "How to write a good git commit message"
  • Relevant documentation is updated including usage instructions

Saves a lot of network round trips.
Saves a lot of network roundtrips. Could also be done for
the other array types, but I didn't have a need for them and
not enough time for testing.
@Tolsto Tolsto mentioned this pull request Apr 24, 2022
13 tasks
Tolsto commented Apr 24, 2022

I think the CSV approach to parsing can be optimized even further. Can anybody share benchmarks (e.g. entries per second or per minute, plus hardware) for log-based processing of tables without arrays or hstores? I know it also depends on the table schema, but a ballpark figure would be great.
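
For anyone who wants to compare numbers, here is a rough sketch of a CSV-style parser for one-dimensional text arrays together with a tiny throughput measurement. The names `parse_text_array` and `entries_per_second` are hypothetical and this is not the exact code from the PR. It also assumes flat arrays only and cannot tell an unquoted `NULL` from the quoted string `"NULL"`, because the `csv` module drops the quoting information.

```python
import csv
import time


def parse_text_array(literal):
    """Parse a flat Postgres array literal such as {a,"b,c",NULL} locally.

    Illustrative sketch: nested arrays are not handled, and a quoted
    "NULL" string is treated the same as SQL NULL.
    """
    if literal is None:
        return None
    inner = literal[1:-1]  # strip the surrounding braces
    if not inner:
        return []
    reader = csv.reader(
        [inner],
        delimiter=',',
        quotechar='"',
        escapechar='\\',    # Postgres escapes quotes and backslashes with \
        doublequote=False,
    )
    row = next(reader)
    return [None if item == 'NULL' else item for item in row]


def entries_per_second(literal, repeats=100_000):
    """Measure how many array literals per second get parsed on this machine."""
    start = time.perf_counter()
    for _ in range(repeats):
        parse_text_array(literal)
    elapsed = time.perf_counter() - start
    return repeats / elapsed


print(parse_text_array('{a,"b,c",NULL}'))
print(f"{entries_per_second('{a,b,c}', repeats=10_000):.0f} entries/s")
```

The absolute figure depends heavily on hardware and array size, so it is only useful next to a baseline measured the same way.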
