bug: kafka skip msgs... #118

Closed
lispc opened this issue Jul 12, 2021 · 5 comments

@lispc
Member

lispc commented Jul 12, 2021

After adding #117, we found that the messages received by the consumer are not contiguous: a message at offset 419, then one at offset 420, then one at offset 477, and so on.

There must be something wrong with the consumer.

You can use https://github.com/Fluidex/fluidex-backend as the dev environment (bash run.sh).
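To make the symptom easy to check against a consumer log, here is a minimal sketch that lists every place where the received offsets jump. It is only an illustration, not code from the project: the function name `find_offset_gaps` is hypothetical, and it assumes the offsets of a single partition have already been extracted from the log in the order they were received.

```python
from typing import Iterable, List, Tuple


def find_offset_gaps(offsets: Iterable[int]) -> List[Tuple[int, int]]:
    """Return (previous, current) pairs where current is not previous + 1."""
    gaps = []
    previous = None
    for current in offsets:
        # Within one partition, each offset should follow the last one directly.
        if previous is not None and current != previous + 1:
            gaps.append((previous, current))
        previous = current
    return gaps


# Offsets taken from the report above: 419, 420, then 477.
received = [419, 420, 477]
print(find_offset_gaps(received))  # [(420, 477)]
```

An empty result means the sequence was contiguous; each pair marks the last offset seen before a jump and the first one seen after it.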

@lispc
Member Author

lispc commented Jul 12, 2021

[screenshot: telegram-cloud-photo-size-5-6296523447685197248-y]

I think the rollup process updates the offset incorrectly.

The message at offset 0 must be RegisterUser, so I think the offset must have been updated incorrectly.

@noel2004 noel2004 self-assigned this Jul 20, 2021
@lispc
Member Author

lispc commented Jul 21, 2021

Steps to reproduce:

  1. git clone https://github.com/Fluidex/fluidex-backend and check out the submodules. Install the dependencies with scripts/install_deps.sh.
  2. Set persist_every_n_block to 20 in rollup-state-manager/config.yaml.
  3. Run run.sh.
  4. After a minute, kill the rollup state manager.
  5. Modify run.sh and restart only the rollup state manager.
  6. The rollup state manager will crash soon.
  7. Check the log: less rollup-state-manager/rollup_state_manager.$DATE.log

@noel2004
Member

I have reproduced the issue on my side; diving in ...

@noel2004
Member

Well, the rollup_state_manager crash is probably my fault: I put messages triggered by two runs of tick.ts into the message queue, and the conflicting input (duplicated registration, unmatched balances, etc.) causes an assertion failure inside rollup_state_manager.

However, I failed to find any non-contiguous messages or abnormal offsets. For a "valid" message queue (produced by calling tick.ts only once), rollup_state_manager can always replay it correctly.

My 'minimal' playground is built like this:

  1. Only one Kafka instance and one PostgreSQL instance are used. Three databases (prover_cluster, rollup_state_manager and exchange) have been created in that single PostgreSQL instance.

  2. Only matchingengine and rollup_state_manager are built and run.

  3. Test data is generated by tick.ts.

  4. After rollup_state_manager has read and processed all the messages it received, I can erase some of its dump records and restart it. The program always replays the message queue correctly, starting from the offset stored in the latest remaining dump.

    For example, when we set up rollup_state_manager and process about 600 messages in Kafka, rollup_state_manager dumps records at 20, 40, 60 and 80, each in its own directory.

    Then I stop rollup_state_manager, erase some of the dump records, say 60.db and 80.db, and restart. rollup_state_manager correctly starts from the offset in the 40.db record and handles the remaining messages again without any errors.
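The recovery behaviour described above can be sketched roughly as follows. This is not the actual rollup_state_manager logic: the function `resume_point`, the idea that each dump stores the offset of the last message it covers, and the offset numbers are all assumptions made up for illustration.

```python
from typing import Dict, Optional, Tuple


def resume_point(checkpoints: Dict[int, int]) -> Tuple[Optional[int], int]:
    """Pick the latest surviving dump and the offset to resume reading from.

    `checkpoints` maps a dump's block number to the offset of the last
    message already covered by that dump (an assumed layout).
    """
    if not checkpoints:
        # No dump left: replay the topic from the beginning.
        return None, 0
    latest_block = max(checkpoints)
    # Resume with the message right after the last one the dump covers.
    return latest_block, checkpoints[latest_block] + 1


# Hypothetical offsets for dumps at blocks 20, 40, 60 and 80.
dumps = {20: 149, 40: 299, 60: 449, 80: 599}

# Erase 60.db and 80.db, as in the experiment above.
del dumps[60]
del dumps[80]

print(resume_point(dumps))  # (40, 300)
```

With this layout, erasing newer dumps only moves the resume point back, so a longer suffix of the topic is replayed.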

@noel2004
Member

After getting rid of the other factor that also causes the rollup manager to crash (see #133), I can finally reproduce the non-contiguous offsets in message receiving.

I also found that the issue is not deterministic: simply running the rollup manager again lets the program get past the discontinuous offset and then run smoothly. See the attachments: in step2_fail.log the program throws an assertion failure because it receives the message at offset 3865 right after 3863. In step3_pass.log I ran the program again; it received the message at offset 3864 and kept processing.

step3_pass.log
step2_fail.log

It seems the issue arises when there are tons of messages sitting in the Kafka topic waiting to be read and the message-processing thread in the rollup manager runs too fast?
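For reference, the kind of continuity check that fails in step2_fail.log can be modelled as a small stateful guard. This is a sketch under assumptions, not the rollup manager's real code: `OffsetTracker` and `OffsetGapError` are hypothetical names, and the guard only models the "next offset must be last + 1" rule for a single partition.

```python
class OffsetGapError(Exception):
    """Raised when a message arrives with an unexpected offset."""


class OffsetTracker:
    def __init__(self, last_offset=None):
        # Offset of the last accepted message, or None before the first one.
        self.last_offset = last_offset

    def observe(self, offset: int) -> None:
        # Accept the first message unconditionally; afterwards require
        # strictly consecutive offsets.
        if self.last_offset is not None and offset != self.last_offset + 1:
            raise OffsetGapError(
                f"expected offset {self.last_offset + 1}, got {offset}"
            )
        self.last_offset = offset


tracker = OffsetTracker()
tracker.observe(3863)
try:
    tracker.observe(3865)  # 3864 is missing, as in step2_fail.log
except OffsetGapError as err:
    print(err)  # expected offset 3864, got 3865
```

Because the tracker keeps `last_offset` unchanged when it rejects a message, a caller could re-read from the expected offset instead of aborting, which would match the observation that a second run receives 3864 and continues.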

@lispc lispc closed this as completed in 6aaaed9 Jul 31, 2021