
Data lost when exceed batch size. #111

Open · wycwyx opened this issue May 21, 2020 · 10 comments

wycwyx commented May 21, 2020

Describe the bug
Data is lost when the number of rows exceeds the batch size.

To Reproduce

  1. Set "replica_batch_size: 10000" in the configuration file.
  2. Insert 18000 rows into the source database (see the load script sketched below).
  3. chameleon replicates only 10000 rows, and the other 8000 are lost.
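
For step 2, a minimal load script along these lines can be used (a hedged sketch: the table name, credentials and the use of PyMySQL are my own placeholders, not part of the original report):

```python
# Hypothetical load script for step 2: insert 18000 rows into the MySQL
# source in a single transaction. Table name and credentials are made up.
import pymysql

conn = pymysql.connect(host="127.0.0.1", user="usr_replica",
                       password="secret", database="sourcedb")
try:
    with conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO t_batch_test (payload) VALUES (%s)",
            [("row %d" % i,) for i in range(18000)],
        )
    conn.commit()
finally:
    conn.close()

# With replica_batch_size: 10000 the PostgreSQL side should also end up
# with 18000 rows; with the bug it stops at 10000.
```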

After debugging the code (v2.0.12 or master), I found this code snippet in mysql_lib.py:

```
1194    if close_batch:
1195        my_stream.close()
1196        return [master_data, close_batch]
```

According to this code, the rows still held in group_insert will never be stored.

I changed it to:

```
1194    if close_batch:
1195        break
```

and the result was then correct.
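
To make the control flow clearer, here is a minimal, self-contained sketch (hypothetical names and heavily simplified logic, not pg_chameleon's actual code) of why returning inside the event loop loses rows while breaking out of it does not:

```python
# Simplified model of the replica read loop: row events accumulate in
# group_insert, and a "QUERY" event marks the point where the batch closes.
def read_replica_stream(events, store_batch):
    group_insert = []
    close_batch = False
    for event in events:
        if event == "QUERY":
            close_batch = True
        else:
            group_insert.append(event)
        if close_batch:
            # Original behaviour (v2.0.12 lines 1194-1196): close the stream
            # and return here, so anything still held in group_insert is
            # silently dropped.
            # Suggested fix: break instead, so the flush below still runs.
            break
    if group_insert:
        store_batch(group_insert)   # the remaining rows are written here
    return close_batch

stored = []
read_replica_stream(["row1", "row2", "QUERY"], stored.extend)
print(stored)  # ['row1', 'row2'] with the break; [] with the original return
```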

the4thdoctor (Owner) commented:

I wasn't able to reproduce the bug, as the rows you mentioned seem to be already ok.
I've just merged some changes into master/ver2.0. Can you try again, please?


wycwyx commented Aug 14, 2020

In the current version (tag v2.0.14):
Lines 1266 and 1267 should be replaced by a single statement: break.
That lets the code block beginning at line 1375 run, which writes the batch remaining in the group_insert variable.
Otherwise, with the current lines 1266 and 1267, group_insert is never written.
The next time __read_replica_stream is called, group_insert is reassigned to an empty array on line 1144, so the dropped rows cannot be recovered.

rascalDan (Contributor) commented:

I'm trying to avoid saying "me too", because I'm struggling to understand the scope of the problem we're witnessing. It appears that "some" rows in one table "sometimes" aren't updated when the source rows are. The batch sizes in question are far lower than our maximum batch size (10000 in the config; the actual size tends to be less than 200). We haven't been able to identify the criteria for failure yet, and we're not entirely sure it's limited to one table (that wouldn't make sense, but the table does hold a large amount of text, so that might be an issue somehow). I am actively looking into the problem as we speak and hope to have more concrete information later today.

rascalDan (Contributor) commented:

OK... I think I might have found the problem, and I agree with @wycwyx about the fix, but not about the cause.
My understanding is that this flow may occur (see the sketch after this list):

  • Some rows are added to group_insert on line 1398, but are not yet written at 1405-1412.
  • We loop around to the next binlog event at 1219.
  • The next event is a QueryEvent at 1250 and also satisfies the condition at 1257.
  • The function exits at 1308 without ever writing the contents of group_insert.
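
For anyone trying to observe this, here is a rough sketch (my own, not pg_chameleon code; connection settings are placeholders) using the python-mysql-replication library that pg_chameleon builds on, just to show how a QueryEvent can arrive while row events are still pending:

```python
# Count row events seen since the last QueryEvent; if the reader returned
# at the QueryEvent instead of flushing first, those are the rows that
# would go missing.
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.event import QueryEvent
from pymysqlreplication.row_event import (
    DeleteRowsEvent, UpdateRowsEvent, WriteRowsEvent)

stream = BinLogStreamReader(
    connection_settings={"host": "127.0.0.1", "port": 3306,
                         "user": "usr_replica", "passwd": "secret"},
    server_id=100,
    blocking=False,
)

pending_rows = 0
for binlogevent in stream:
    if isinstance(binlogevent, (WriteRowsEvent, UpdateRowsEvent,
                                DeleteRowsEvent)):
        pending_rows += len(binlogevent.rows)
    elif isinstance(binlogevent, QueryEvent):
        print("QueryEvent with %d row events still unflushed" % pending_rows)
        pending_rows = 0

stream.close()
```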

rascalDan (Contributor) commented:

I've managed to get a copy of the binary log for an example of this occurring in our production environment and whilst I think my previous point holds true, the log doesn't support it being the cause. I'd like to share the log, but it's several hundred megabytes and contains personal information which I cannot disclose.

What I see is very strange, though:
there are two updates to the table in question; the first wasn't applied but the second was... then some updates to a schema that isn't replicated to PostgreSQL... then another update to the same table, not applied... and then an insert into a different table which was applied.

What I don't see in the log is anything in between the successful updates and the failing ones that would be classed as a QueryEvent and trigger the above flow. On that note, the condition on line 1257 is only required if close_batch was previously false; if close_batch was already true, a QueryEvent would drop the existing contents of group_insert regardless.

Is it possible that mysqlbinlog is lying to me and hiding QueryEvents? I didn't think these would be logged at all, but maybe it includes things I'm not thinking of and it's hiding these by default.

I dunno... I'm really confused by this, so apologies for just brain-dumping here; I'm just sharing what I'm finding. We have a steady trickle of the problem in our production environment, apparently only in one table, and I can't for the life of me recreate it outside of that... and all the tools, logs, statuses etc. suggest it's not happening.

rascalDan (Contributor) commented:

We've now witnessed evidence of records missing (so an insert being dropped) from other tables, which suggests this isn't a limited-scope problem. That makes sense, as we could never explain why it would be. Unfortunately, we're also no closer to explaining why this is happening at all.


FreCap commented Apr 27, 2022

@rascalDan did you end up finding a solution? e.g. adding some configs such as replica_batch_size = 1000000

rascalDan (Contributor) commented:

@FreCap I'm afraid not... we've never got to the bottom of the problem and currently have a second process comparing/repairing the data that's been missed... which I'd love to get rid of! (but which we're currently extending to cover other tables)


FreCap commented Apr 29, 2022

Could you please share a fully runnable SQL and pg_chameleon example to reproduce this, so we can check on our side as well?

I saw your previous description, but getting everybody on the same page and able to reproduce it would, I think, be instrumental in finding a way to remove the compare/repair process (which scares me quite a bit).

rascalDan (Contributor) commented:

Honestly, not really. Most of our code is just `insert into foo(...) values(?, ?, ?)` or `update foo set bar = ? where bar = ?`... and sometimes they don't get replicated. I appreciate that our problem might not be directly related to the batch-size issue.

JasonLiu1567 added a commit to JasonLiu1567/pg_chameleon that referenced this issue Jun 13, 2024