
Data lost when exceed batch size. #111

Open · wycwyx opened this issue May 21, 2020 · 10 comments

wycwyx commented May 21, 2020

Describe the bug
Data is lost when the number of rows exceeds the batch size.

To Reproduce

  1. Set "replica_batch_size: 10000" in the configuration file.
  2. Insert 18000 rows into the source database (see the load script sketched below).
  3. chameleon replicates only 10000 rows, and the other 8000 are lost.
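
For step 2, a minimal load script along these lines can be used (a hedged sketch: the table name, credentials and the use of PyMySQL are my own placeholders, not part of the original report):

```python
# Hypothetical load script for step 2: insert 18000 rows into the MySQL
# source in a single transaction. Table name and credentials are made up.
import pymysql

conn = pymysql.connect(host="127.0.0.1", user="usr_replica",
                       password="secret", database="sourcedb")
try:
    with conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO t_batch_test (payload) VALUES (%s)",
            [("row %d" % i,) for i in range(18000)],
        )
    conn.commit()
finally:
    conn.close()

# With replica_batch_size: 10000 the PostgreSQL side should also end up
# with 18000 rows; with the bug it stops at 10000.
```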

After debugging the code (v2.0.12 or master), I found this code snippet in mysql_lib.py:

```
1194    if close_batch:
1195        my_stream.close()
1196        return [master_data, close_batch]
```

According to this code, the rows still held in group_insert will never be stored.

I changed it to:

```
1194    if close_batch:
1195        break
```

and the result was then correct.
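
To make the control flow clearer, here is a minimal, self-contained sketch (hypothetical names and heavily simplified logic, not pg_chameleon's actual code) of why returning inside the event loop loses rows while breaking out of it does not:

```python
# Simplified model of the replica read loop: row events accumulate in
# group_insert, and a "QUERY" event marks the point where the batch closes.
def read_replica_stream(events, store_batch):
    group_insert = []
    close_batch = False
    for event in events:
        if event == "QUERY":
            close_batch = True
        else:
            group_insert.append(event)
        if close_batch:
            # Original behaviour (v2.0.12 lines 1194-1196): close the stream
            # and return here, so anything still held in group_insert is
            # silently dropped.
            # Suggested fix: break instead, so the flush below still runs.
            break
    if group_insert:
        store_batch(group_insert)   # the remaining rows are written here
    return close_batch

stored = []
read_replica_stream(["row1", "row2", "QUERY"], stored.extend)
print(stored)  # ['row1', 'row2'] with the break; [] with the original return
```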

the4thdoctor (Owner) commented:

I wasn't able to reproduce the bug, as the rows you mentioned seem to be already ok.
I've just merged some changes into master/ver2.0. Can you try again, please?


wycwyx commented Aug 14, 2020

In the current version (tag v2.0.14):
Lines 1266 and 1267 should be replaced by a single statement: break.
That lets the code block beginning at line 1375 run, which writes the batch remaining in the group_insert variable.
Otherwise, with the current lines 1266 and 1267, group_insert is never written.
The next time __read_replica_stream is called, group_insert is reassigned to an empty array on line 1144, so the dropped rows cannot be recovered.

rascalDan (Contributor) commented:

I'm trying to avoid saying "me too", because I'm struggling to understand the scope of the problem we're witnessing. It appears that "some" rows in one table "sometimes" aren't updated when the source rows are. The batch sizes in question are far lower than our maximum batch size (10000 in the config; the actual size tends to be less than 200). We haven't been able to identify the criteria for failure yet, and we're not entirely sure it's limited to one table (that wouldn't make sense, but the table does hold a large amount of text, so that might be an issue somehow). I am actively looking into the problem as we speak and hope to have more concrete information later today.

rascalDan (Contributor) commented:

OK... I think I might have found the problem, and I agree with @wycwyx about the fix, but not about the cause.
My understanding is that this flow may occur (see the sketch after this list):

  • Some rows are added to group_insert on line 1398, but are not yet written at 1405-1412.
  • We loop around to the next binlog event at 1219.
  • The next event is a QueryEvent at 1250 and also satisfies the condition at 1257.
  • The function exits at 1308 without ever writing the contents of group_insert.
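
For anyone trying to observe this, here is a rough sketch (my own, not pg_chameleon code; connection settings are placeholders) using the python-mysql-replication library that pg_chameleon builds on, just to show how a QueryEvent can arrive while row events are still pending:

```python
# Count row events seen since the last QueryEvent; if the reader returned
# at the QueryEvent instead of flushing first, those are the rows that
# would go missing.
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.event import QueryEvent
from pymysqlreplication.row_event import (
    DeleteRowsEvent, UpdateRowsEvent, WriteRowsEvent)

stream = BinLogStreamReader(
    connection_settings={"host": "127.0.0.1", "port": 3306,
                         "user": "usr_replica", "passwd": "secret"},
    server_id=100,
    blocking=False,
)

pending_rows = 0
for binlogevent in stream:
    if isinstance(binlogevent, (WriteRowsEvent, UpdateRowsEvent,
                                DeleteRowsEvent)):
        pending_rows += len(binlogevent.rows)
    elif isinstance(binlogevent, QueryEvent):
        print("QueryEvent with %d row events still unflushed" % pending_rows)
        pending_rows = 0

stream.close()
```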

rascalDan (Contributor) commented:

I've managed to get a copy of the binary log for an example of this occurring in our production environment and whilst I think my previous point holds true, the log doesn't support it being the cause. I'd like to share the log, but it's several hundred megabytes and contains personal information which I cannot disclose.

What I see is very strange, though:
there are two updates to the table in question; the first wasn't applied but the second was... then some updates to a schema that isn't replicated to PostgreSQL... then another update to the same table, not applied... and then an insert into a different table which was applied.

What I don't see in the log is anything in between the successful updates and the failing ones that would be classed as a QueryEvent and trigger the above flow. On that note, the condition on line 1257 is only required if close_batch was previously false; if close_batch was already true, a QueryEvent would drop the existing contents of group_insert regardless.

Is it possible that mysqlbinlog is lying to me and hiding QueryEvents? I didn't think these would be logged at all, but maybe it includes things I'm not thinking of and it's hiding these by default.

I dunno... I'm really confused by this, so apologies for just brain-dumping here; I'm just sharing what I'm finding. We have a steady trickle of the problem in our production environment, apparently only in one table, and I can't for the life of me recreate it outside of that... and all the tools, logs, statuses etc. suggest it's not happening.

rascalDan (Contributor) commented:

We've now witnessed evidence of records missing (so an insert being dropped) from other tables, which suggests this isn't a limited-scope problem. That makes sense, as we could never explain why it would be. Unfortunately, we're also no closer to explaining why this is happening at all.


FreCap commented Apr 27, 2022

@rascalDan did you end up finding a solution? e.g. adding some configs such as replica_batch_size = 1000000

rascalDan (Contributor) commented:

@FreCap I'm afraid not... we've never got to the bottom of the problem and currently have a second process comparing/repairing the data that's been missed... which I'd love to get rid of! (but which we're currently extending to cover other tables)


FreCap commented Apr 29, 2022

Could you please share a fully runnable SQL and pg_chameleon example to reproduce this, so we can check on our side as well?

I saw your previous description, but getting everybody on the same page and able to reproduce it would, I think, be instrumental in finding a way to remove the compare/repair process (which scares me quite a bit).

rascalDan (Contributor) commented:

Honestly, not really. Most of our code is just `insert into foo(...) values(?, ?, ?)` or `update foo set bar = ? where bar = ?`... and sometimes they don't get replicated. I appreciate that our problem might not be directly related to the batch-size issue.

JasonLiu1567 added a commit to JasonLiu1567/pg_chameleon that referenced this issue Jun 13, 2024