Error writing term and vote file no such file or directory - when many consumers are deleted in parallel with high load [v2.10.18] #5824
Comments
So in this case that consumer was deleted but gets stuck trying those operations?
That is correct. The consumer is deleted from the other replicas and from disk on the affected server. I can see it running as a candidate in the stacksz output while the error is being logged. All of the other servers have deleted the consumer and stopped the raft group.
When this happens, it affects one out of ~10-15 consumers on a single server. The others are deleted fine.
Do you think those consumers had existed for some time or were fairly new?
These were R3 consumers that had been around for at least 7 days.
Reproduced this issue in a synthetic environment. TL;DR: It is possible for 2 RAFT node objects for the same RAFT group to exist in the same server with the current locking scheme. This check is there to prevent this situation:
However if the node is NOT found, what happens next is:
If two threads enter this path at the same time, both can miss the lookup and each create a node for the same group. And this is in fact what I am seeing in the repro trace (notice the timestamps):
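For readers following the comment above, the check-then-create race it describes can be sketched in isolation. This is a minimal illustration, not the nats-server code: the `registry` and `node` types, `racyStart`, `safeStart`, and the group name are all hypothetical, and a barrier is used only to force the bad interleaving deterministically. The racy variant releases the lock between the lookup and the registration; the safe variant holds it across both.

```go
package main

import (
	"fmt"
	"sync"
)

// node stands in for a RAFT node object (hypothetical type).
type node struct{ group string }

// registry maps group names to nodes (hypothetical, not the server's struct).
type registry struct {
	mu      sync.Mutex
	nodes   map[string]*node
	created int // number of node objects ever constructed
}

func newRegistry() *registry { return &registry{nodes: map[string]*node{}} }

// lookup returns the registered node for group, or nil if there is none.
func (r *registry) lookup(group string) *node {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.nodes[group]
}

// racyStart checks for an existing node, drops the lock, then creates and
// registers a new one. afterLookup runs inside the unlocked window.
func (r *registry) racyStart(group string, afterLookup func()) *node {
	if n := r.lookup(group); n != nil {
		return n
	}
	afterLookup() // another goroutine can pass the same check here
	r.mu.Lock()
	defer r.mu.Unlock()
	r.created++
	n := &node{group: group}
	r.nodes[group] = n // silently replaces any node registered meanwhile
	return n
}

// safeStart holds the lock across both the lookup and the creation.
func (r *registry) safeStart(group string) *node {
	r.mu.Lock()
	defer r.mu.Unlock()
	if n := r.nodes[group]; n != nil {
		return n
	}
	r.created++
	n := &node{group: group}
	r.nodes[group] = n
	return n
}

// runPair starts the same group from two goroutines and returns how many
// node objects were constructed in total.
func runPair(start func(r *registry)) int {
	r := newRegistry()
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			start(r)
		}()
	}
	wg.Wait()
	return r.created
}

// demoRacy forces both goroutines to finish their lookup before either creates.
func demoRacy() int {
	var barrier sync.WaitGroup
	barrier.Add(2)
	return runPair(func(r *registry) {
		r.racyStart("example-group", func() { barrier.Done(); barrier.Wait() })
	})
}

// demoSafe runs the same two goroutines against the locked variant.
func demoSafe() int {
	return runPair(func(r *registry) { r.safeStart("example-group") })
}

func main() {
	fmt.Println("racy nodes created:", demoRacy())
	fmt.Println("safe nodes created:", demoSafe())
}
```

With the forced interleaving, the racy variant constructs two nodes for one group and the locked variant constructs one. In the racy case only the second node stays registered, so the first is orphaned and nothing that later stops the group by name will reach it, which would be consistent with a node that keeps running after its directory is gone.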
Observed behavior
Occasionally a single server in a cluster will log the following until it is restarted. There is no perceived impact of these logs other than noise until the server is restarted.
When in this state, the parent directory /data/jetstream/a/_js_/C-R3F-jvOdwUEr has been removed, so the server cannot write the tav.idx file and continually fails. Deleting many consumers in parallel when the servers are under heavy load seems to be the trigger, but I have not been able to find a reliable reproduction for this issue. I see it happen in different environments and a restart always clears the issue.
Expected behavior
Writing the tav.idx file should handle the scenario where the parent directory has been removed.
Server and client version
nats-server 2.10.18
Host environment
No response
Steps to reproduce
No response