pg_bulkload on PG13 fails to handle large data duplicates #148
Thanks. I could reproduce it.
Hey, I was also looking at the code and this is what I found. The error is raised at line 587 in 8caced4 because the merge variable is set to -1 here (https://github.com/ossc-db/pg_bulkload/blob/8caced46019119e2adf6f3dbdf96b72de9df08e9/lib/pg_btree.c#L489C6-L489C12), and that, in turn, happens because of this piece of code.
Hi, thanks for sharing your idea. In my understanding, the root cause of the error is that pg_bulkload was not written to be aware of pivot tuples (e.g. b-tree root and intermediate nodes). When the b-tree has grown into a multi-level hierarchy, pivot tuples exist, which is why the error occurs only when a large amount of data is loaded. As you said, I assume we need to fix it.
@mikecaat thank you for figuring this out!
In addition to v16 support, we plan to release a fix for the issue by January. IIUC, the workarounds are
Yes. In addition, the problem also occurs with single-column indexes.
In our testing we noticed this doesn't occur if we load the data straight into an empty table (and this lets us eliminate duplicates in the data fed to pg_bulkload, which is also part of our use case).
IIUC, if the table is empty, which means the table's b-tree index holds no data, the issue doesn't occur. That's one of the workarounds.
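As a concrete form of that workaround (the control file layout follows pg_bulkload's sample format, but the file paths, database name, and deduplication step here are assumptions, not from this thread), one can deduplicate the input beforehand, truncate the target table, and load into the now-empty table so no multi-level b-tree with pivot tuples exists at load time:

```
# deduped_load.ctl -- hypothetical pg_bulkload control file
OUTPUT = public.mbaf_exclude        # target table (name taken from the error output)
INPUT = /path/to/deduped_data.csv   # input file, pre-deduplicated outside pg_bulkload
TYPE = CSV

# Run against an emptied table, for example:
#   psql -d mydb -c 'TRUNCATE public.mbaf_exclude;'
#   pg_bulkload deduped_load.ctl -d mydb
```

This sidesteps the bug because the duplicate-merge path over an existing index is never exercised.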
Hello @mikecaat,
BTReaderInit() assumed that the case (itup->t_tid).ip_posid == 0 was anomalous when searching for the leftmost leaf page of a b-tree, but that condition is outdated and has since been removed. With the commit (postgres/postgres@8224de4) that added support for the INCLUDE clause on indexes in PostgreSQL v11, new pivot tuples (e.g. b-tree root and intermediate nodes) were introduced, and for those ip_posid can legitimately be 0. Because pg_bulkload had not been updated for this, errors sometimes occurred when running against b-trees that held enough data to contain pivot tuples.
Hey,
I am trying to test out pg_bulkload on a sample dataset that has a unique constraint across 3 of its columns. When I test a dataset of ~200 lines, pg_bulkload loads the data fine, and when a re-load is attempted with the same data, all rows are recognized as duplicates and discarded. This was on PG 9.6.20.
After upgrading to PG 13.3, the same test with the same table and data works the same way. However, when the data size is increased to ~350-400 lines, issues begin to occur. The first load works fine, but instead of detecting the second run as duplicates, the following error is output:
277 Row(s) loaded so far to table mbaf_exclude. 2 Row(s) bad. 0 rows skipped, duplicates will now be removed
ERROR: query failed: ERROR: could not create unique index "index_name"
DETAIL: Key (account_number, areacode, phone)=(083439221 , 954, 2222222) is duplicated.
DETAIL: query was: SELECT * FROM pg_bulkload($1)
The increased data size works as expected on PG 9.6.20, which makes me believe this has to do with PostgreSQL version compatibility. Any guidance here would be greatly appreciated!
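A minimal schema sketch of the scenario described above (the column types and constraint layout are assumptions; the thread only gives the key column names from the error output and states a three-column unique constraint):

```sql
-- Hypothetical table matching the key columns in the error output
CREATE TABLE mbaf_exclude (
    account_number text,
    areacode       text,
    phone          text,
    CONSTRAINT index_name UNIQUE (account_number, areacode, phone)
);
```

Loading the same ~350-400 line file twice through pg_bulkload against a table like this reproduces the report: PG 9.6 discards the second pass as duplicates, while unfixed builds on PG 13 fail with "could not create unique index".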