
Error writing term and vote file no such file or directory - when many consumers are deleted in parallel with high load [v2.10.18] #5824

Open
jarretlavallee opened this issue Aug 23, 2024 · 5 comments
Labels: defect (Suspected defect such as a bug or regression), stale (This issue has had no activity in a while)

Comments

@jarretlavallee
Contributor

Observed behavior

Occasionally a single server in a cluster will log the following repeatedly until it is restarted. There is no perceived impact from these messages other than the log noise.

[ERR] RAFT [JoABv7BM - C-R3F-jvOdwUEr] Resource not found: open /data/jetstream/a/_js_/C-R3F-jvOdwUEr/tav.idx: no such file or directory
[WRN] RAFT [JoABv7BM - C-R3F-jvOdwUEr] Error writing term and vote file for "C-R3F-jvOdwUEr": open /data/jetstream/a/_js_/C-R3F-jvOdwUEr/tav.idx: no such file or directory

When in this state, the parent directory /data/jetstream/a/_js_/C-R3F-jvOdwUEr has been removed, so the server cannot write the tav.idx and continually fails. Deleting many consumers in parallel when the servers are under heavy load seems to be the trigger, but I have not been able to find a reliable reproduction for this issue. I see it happen in different environments and a restart always clears the issue.

Expected behavior

Writing the tav.idx should handle the scenario where the parent directory has been removed.
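
For illustration, here is a minimal sketch (hypothetical helper and error names, not the actual nats-server code) of how a term-and-vote write could distinguish "parent directory removed" from other failures, so the caller can stop the raft node instead of retrying and logging forever:

```go
// Hypothetical sketch, not the actual nats-server implementation.
// If writing tav.idx fails because the group directory was removed
// (i.e. the group was deleted underneath us), report that distinctly
// so the caller can shut the raft node down rather than retry forever.
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

var errGroupDeleted = errors.New("raft group directory removed")

func writeTermVote(groupDir string, data []byte) error {
	path := filepath.Join(groupDir, "tav.idx")
	err := os.WriteFile(path, data, 0640)
	if err == nil {
		return nil
	}
	// ENOENT on the file usually means the parent directory is gone.
	if errors.Is(err, os.ErrNotExist) {
		if _, statErr := os.Stat(groupDir); os.IsNotExist(statErr) {
			return errGroupDeleted
		}
	}
	return err
}

func main() {
	err := writeTermVote("/data/jetstream/a/_js_/C-R3F-jvOdwUEr", []byte("term/vote state"))
	if errors.Is(err, errGroupDeleted) {
		fmt.Println("group was deleted; stop the raft node rather than retrying")
		return
	}
	if err != nil {
		fmt.Println("write failed:", err)
	}
}
```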

Server and client version

nats-server 2.10.18

Host environment

No response

Steps to reproduce

No response

@jarretlavallee jarretlavallee added the defect Suspected defect such as a bug or regression label Aug 23, 2024
@derekcollison
Member

So in this case that consumer was deleted but gets stuck trying those operations?

@jarretlavallee
Contributor Author

That is correct. The consumer is deleted from the other replicas and from disk on the affected server.

I can see in the stacksz output that the affected server is still running as a candidate while the error is being logged. All of the other servers have deleted the consumer and stopped the raft group.

         1   runtime.gopark
             runtime.selectgo
             github.com/nats-io/nats-server/v2/server.(*raft).runAsCandidate
             github.com/nats-io/nats-server/v2/server.(*raft).run
             github.com/nats-io/nats-server/v2/server.(*Server).startGoRoutine.func1

When this happens, it affects one out of ~10-15 consumers on a single server. The others are deleted fine.

@derekcollison
Member

Do you think those consumers had existed for some time or were fairly new?

@jarretlavallee
Contributor Author

These were R3 consumers that had been around for at least 7 days.

@jarretlavallee jarretlavallee changed the title Intermittent Error writing term and vote file - no such file or directory Error writing term and vote file no such file or directory - when many consumers are deleted in parallel with high load Aug 27, 2024
@mprimi
Contributor

mprimi commented Aug 29, 2024

Reproduced this issue in a synthetic environment.

TL;DR: It is possible for 2 RAFT node objects for the same RAFT group to exist in the same server with the current locking scheme.

This check is there to prevent this situation:

	// Check if we already have this assigned.
	if node := s.lookupRaftNode(rg.Name); node != nil {
		s.Debugf("JetStream cluster already has raft group %q assigned", rg.Name)
		rg.node = node

However, if the node is NOT found, what happens next is:

  1. The lock is released:
	s.Debugf("JetStream cluster creating raft group:%+v", rg)
	js.mu.Unlock()
  2. While NOT holding the lock, the store for this node is created.
  3. The node is created and registered in the map of nodes by startRaftNode.

If two threads enter createRaftGroup to create the same node, the current locking mechanism allows both to proceed.
The node registered first will be overwritten (leaked?) by the second in the server map s.raftNodes[group].

And this is in fact what I am seeing in the repro trace (notice the timestamps):

[207.052] [DBG] JetStream cluster creating raft group:&{Name:C-R3M-vvSC4DFZ Peers:[cnrtt3eg yrzKKRBu S1Nunr6R] Storage:Memory Cluster:nats-cluster Preferred: node:<nil>}
[207.055] [DBG] JetStream cluster creating raft group:&{Name:C-R3M-vvSC4DFZ Peers:[cnrtt3eg yrzKKRBu S1Nunr6R] Storage:Memory Cluster:nats-cluster Preferred: node:<nil>}
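
To make the race concrete, here is a simplified sketch of the check-then-act pattern described above (the type and function names are illustrative stand-ins, not the real createRaftGroup code):

```go
// Illustrative sketch of the locking gap described above; names are
// simplified stand-ins, not the real nats-server types.
package main

import (
	"fmt"
	"sync"
)

type raftNode struct{ id int }

type server struct {
	mu        sync.Mutex
	raftNodes map[string]*raftNode
}

func (s *server) createRaftGroup(name string, id int) *raftNode {
	s.mu.Lock()
	if n, ok := s.raftNodes[name]; ok {
		s.mu.Unlock()
		return n // group already assigned, reuse it
	}
	s.mu.Unlock() // lock released here, before the node exists

	// Both goroutines can reach this point for the same group name:
	// the store and node are created without holding the lock.
	n := &raftNode{id: id}

	s.mu.Lock()
	s.raftNodes[name] = n // the second registration overwrites the first
	s.mu.Unlock()
	return n
}

func main() {
	s := &server{raftNodes: make(map[string]*raftNode)}
	var wg sync.WaitGroup
	for i := 1; i <= 2; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			s.createRaftGroup("C-R3M-vvSC4DFZ", id)
		}(i)
	}
	wg.Wait()
	fmt.Println("node left in the map:", s.raftNodes["C-R3M-vvSC4DFZ"].id)
}
```

Holding the lock across both the lookup and the registration, or re-checking the map before registering, would close this window; which fix is appropriate in the actual server is for the maintainers to judge.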

@wallyqs wallyqs changed the title Error writing term and vote file no such file or directory - when many consumers are deleted in parallel with high load Error writing term and vote file no such file or directory - when many consumers are deleted in parallel with high load [v2.10.18] Sep 4, 2024
@github-actions github-actions bot added the stale This issue has had no activity in a while label Oct 31, 2024