Error writing term and vote file no such file or directory - when many consumers are deleted in parallel with high load [v2.10.18] #5824

jarretlavallee · 2024-08-23T18:55:17Z

Observed behavior

Occasionally a single server in a cluster will log the following until it is restarted. There is no perceived impact of these logs other than noise until the server is restarted.

[ERR] RAFT [JoABv7BM - C-R3F-jvOdwUEr] Resource not found: open /data/jetstream/a/_js_/C-R3F-jvOdwUEr/tav.idx: no such file or directory
[WRN] RAFT [JoABv7BM - C-R3F-jvOdwUEr] Error writing term and vote file for "C-R3F-jvOdwUEr": open /data/jetstream/a/_js_/C-R3F-jvOdwUEr/tav.idx: no such file or directory

When in this state, the parent directory /data/jetstream/a/_js_/C-R3F-jvOdwUEr has been removed, so the server cannot write the tav.idx and continually fails. Deleting many consumers in parallel when the servers are under heavy load seems to be the trigger, but I have not been able to find a reliable reproduction for this issue. I see it happen in different environments and a restart always clears the issue.

Expected behavior

Writing the tav.idx should handle the scenario where the parent directory has been removed.
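
A minimal sketch of what that handling could look like, not the server's actual code. The names `writeTermVote`, `errStoreDirGone`, and `storeDirGone` are hypothetical and only illustrate the idea: when the write fails because the parent directory is gone, return a distinct error so the caller can stop the node instead of retrying and logging forever.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// errStoreDirGone is a hypothetical sentinel: the group's store directory
// no longer exists, so retrying the write cannot succeed.
var errStoreDirGone = errors.New("store directory removed")

// writeTermVote (hypothetical name) writes tav.idx inside dir and
// distinguishes "parent directory removed" from other write failures.
func writeTermVote(dir string, data []byte) error {
	err := os.WriteFile(filepath.Join(dir, "tav.idx"), data, 0o640)
	if err == nil || !errors.Is(err, fs.ErrNotExist) {
		return err
	}
	// "No such file" while creating a file means a path component is
	// missing. Confirm that it is the store directory itself.
	if _, statErr := os.Stat(dir); errors.Is(statErr, fs.ErrNotExist) {
		return fmt.Errorf("%w: %s", errStoreDirGone, dir)
	}
	return err
}

// storeDirGone reports whether err indicates a removed store directory.
func storeDirGone(err error) bool {
	return errors.Is(err, errStoreDirGone)
}

func main() {
	dir, err := os.MkdirTemp("", "raft-demo")
	if err != nil {
		panic(err)
	}
	fmt.Println("write with directory present:", writeTermVote(dir, []byte("term+vote")))

	// Simulate the consumer's directory being deleted underneath the node.
	os.RemoveAll(dir)
	err = writeTermVote(dir, []byte("term+vote"))
	fmt.Println("directory gone detected:", storeDirGone(err))
}
```

Whether the right reaction is to stop the node or to recreate the directory is a design question; recreating it for a consumer that has been deleted would probably be wrong, which is why the sketch only reports the condition.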

Server and client version

nats-server 2.10.18

Host environment

No response

Steps to reproduce

No response


derekcollison · 2024-08-25T21:31:06Z

So in this case that consumer was deleted but gets stuck trying those operations?

jarretlavallee · 2024-08-26T13:10:28Z

That is correct. The consumer is deleted from the other replicas and from disk on the affected server.

I can see the server is running as a candidate in the stacksz while the error is being logged. All of the other servers have deleted the consumer and stopped the raft group.

         1   runtime.gopark
             runtime.selectgo
             github.com/nats-io/nats-server/v2/server.(*raft).runAsCandidate
             github.com/nats-io/nats-server/v2/server.(*raft).run
             github.com/nats-io/nats-server/v2/server.(*Server).startGoRoutine.func1

When this happens, it happens for one out of ~10-15 consumers on a single server. The others are deleted fine.

derekcollison · 2024-08-27T02:11:56Z

Do you think those consumers had existed for some time or were fairly new?

jarretlavallee · 2024-08-27T02:19:04Z

These were R3 consumers that had been around for at least 7 days.

mprimi · 2024-08-29T19:32:14Z

Reproduced this issue in a synthetic environment.

TL;DR: It is possible for 2 RAFT node objects for the same RAFT group to exist in the same server with the current locking scheme.

This check is there to prevent this situation:

	// Check if we already have this assigned.
	if node := s.lookupRaftNode(rg.Name); node != nil {
		s.Debugf("JetStream cluster already has raft group %q assigned", rg.Name)
		rg.node = node

However if the node is NOT found, what happens next is:

The lock is released:

	s.Debugf("JetStream cluster creating raft group:%+v", rg)
	js.mu.Unlock()

While NOT holding the lock, the store for this node is created.
The node is then created and registered in the map of nodes by startRaftNode.

If two threads enter createRaftGroup (to create the same node), the current locking mechanism would allow it.
The first one registered will be overwritten (leaked?) by the second in the server map s.raftNodes[group].

And this is in fact what I am seeing in the repro trace (notice the timestamps):

[207.052] [DBG] JetStream cluster creating raft group:&{Name:C-R3M-vvSC4DFZ Peers:[cnrtt3eg yrzKKRBu S1Nunr6R] Storage:Memory Cluster:nats-cluster Preferred: node:<nil>}
[207.055] [DBG] JetStream cluster creating raft group:&{Name:C-R3M-vvSC4DFZ Peers:[cnrtt3eg yrzKKRBu S1Nunr6R] Storage:Memory Cluster:nats-cluster Preferred: node:<nil>}
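
To illustrate the race described above and one common way to close it, here is a simplified sketch. It is not the nats-server code and not necessarily how a fix would be implemented there: `registry`, `node`, `getOrCreate`, and `distinctNodes` are hypothetical stand-ins. The lookup happens under the lock, the slow setup happens without it (as in createRaftGroup), and the map is checked again under the lock before registering, so a second caller cannot overwrite the first node.

```go
package main

import (
	"fmt"
	"sync"
)

// node is a hypothetical stand-in for a RAFT node object.
type node struct{ group string }

// registry is a hypothetical stand-in for the server's map of RAFT nodes.
type registry struct {
	mu    sync.Mutex
	nodes map[string]*node
}

func newRegistry() *registry {
	return &registry{nodes: make(map[string]*node)}
}

// getOrCreate returns the node registered for group, creating it if needed.
func (r *registry) getOrCreate(group string) *node {
	r.mu.Lock()
	if n, ok := r.nodes[group]; ok {
		r.mu.Unlock()
		return n
	}
	r.mu.Unlock()

	// Slow setup (store creation, node start) runs without the lock,
	// so several callers can reach this point for the same group.
	n := &node{group: group}

	r.mu.Lock()
	defer r.mu.Unlock()
	// Re-check before registering: without this, the later caller would
	// overwrite the earlier node in the map.
	if existing, ok := r.nodes[group]; ok {
		// Lost the race. A real node would have to be stopped and its
		// store cleaned up here; this sketch simply drops it.
		return existing
	}
	r.nodes[group] = n
	return n
}

// distinctNodes calls getOrCreate concurrently and counts how many
// different node objects the callers ended up with.
func distinctNodes(r *registry, group string, callers int) int {
	var wg sync.WaitGroup
	results := make([]*node, callers)
	for i := range results {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = r.getOrCreate(group)
		}(i)
	}
	wg.Wait()

	seen := make(map[*node]struct{})
	for _, n := range results {
		seen[n] = struct{}{}
	}
	return len(seen)
}

func main() {
	r := newRegistry()
	fmt.Println("distinct nodes seen by callers:", distinctNodes(r, "C-R3M-example", 16))
}
```

With the re-check in place every caller should end up with the same node object. The trade-off is that the losing caller has already done the setup work and must undo it; an alternative would be to mark the group as "being created" under the lock so that a second caller never starts.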

jarretlavallee added the defect Suspected defect such as a bug or regression label Aug 23, 2024

jarretlavallee changed the title ~~Intermittent Error writing term and vote file - no such file or directory~~ Error writing term and vote file no such file or directory - when many consumers are deleted in parallel with high load Aug 27, 2024

wallyqs changed the title ~~Error writing term and vote file no such file or directory - when many consumers are deleted in parallel with high load~~ Error writing term and vote file no such file or directory - when many consumers are deleted in parallel with high load [v2.10.18] Sep 4, 2024

github-actions bot added the stale This issue has had no activity in a while label Oct 31, 2024
