Node terminated with panic #4481

Open
wdbaruni opened this issue Sep 19, 2024 · 2 comments · May be fixed by #4482
Labels: th/game-day (Issues reported during game day testing)

Comments

@wdbaruni
Member

The orchestrator node has died with a panic multiple times, even though no job was being processed at the time.

Sample logs
panic.md

@wdbaruni wdbaruni added the th/game-day Issues reported during game day testing label Sep 19, 2024
@frrist
Member

frrist commented Sep 19, 2024

Based on the panic:

fatal error: concurrent map read and map write

goroutine 146 [running]:
github.com/bacalhau-project/bacalhau/pkg/lib/collections.(*HashedPriorityQueue[...]).Contains(...)
github.com/bacalhau-project/bacalhau/pkg/lib/collections/hashed_priority_queue.go:30
github.com/bacalhau-project/bacalhau/pkg/node/heartbeat.(*HeartbeatServer).Handle(0x14000816440, {0x103957070, 0x104e405e0}, {{0x14000232e10?, 0x1035243c0?}, 0x1400089ca80?})
github.com/bacalhau-project/bacalhau/pkg/node/heartbeat/server.go:180 +0x140
github.com/bacalhau-project/bacalhau/pkg/node/heartbeat.(*HeartbeatServer).HandleMessage(0x1400000cf18?, {0x103957070?, 0x104e405e0?}, 0xdf?)
github.com/bacalhau-project/bacalhau/pkg/node/heartbeat/server.go:216 +0x4c
github.com/bacalhau-project/bacalhau/pkg/lib/ncl.(*subscriber).processMessage(0x14000b61c00, 0x14000bec970?)
github.com/bacalhau-project/bacalhau/pkg/lib/ncl/subscriber.go:148 +0x194
github.com/nats-io/nats%2ego.(*Conn).waitForMsgs(0x140007f3c08, 0x14000b28700)
github.com/nats-io/[email protected]/nats.go:3106 +0x440
created by github.com/nats-io/nats%2ego.(*Conn).subscribeLocked in goroutine 7
github.com/nats-io/[email protected]/nats.go:4320 +0x30c

It looks like we are missing a lock here: https://github.com/bacalhau-project/bacalhau/blob/main/pkg/lib/collections/hashed_priority_queue.go#L30
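
As a rough illustration of the kind of fix this implies, here is a minimal sketch of a map-indexed priority queue whose lookup and mutation paths share a `sync.RWMutex`. It is not the actual `HashedPriorityQueue` code and not the change in the linked PR: `SafeQueue`, `item`, `itemHeap`, and the method bodies are hypothetical names written for this example, and the sketch assumes the crash comes from `Contains` reading the internal map while another goroutine writes to it.

```go
package main

import (
	"container/heap"
	"fmt"
	"sync"
)

// item pairs a key with its priority (hypothetical type for this sketch).
type item[K comparable] struct {
	key      K
	priority int64
}

// itemHeap implements heap.Interface as a max-heap on priority.
type itemHeap[K comparable] []item[K]

func (h itemHeap[K]) Len() int           { return len(h) }
func (h itemHeap[K]) Less(i, j int) bool { return h[i].priority > h[j].priority }
func (h itemHeap[K]) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *itemHeap[K]) Push(x any)        { *h = append(*h, x.(item[K])) }
func (h *itemHeap[K]) Pop() any {
	old := *h
	n := len(old)
	it := old[n-1]
	*h = old[:n-1]
	return it
}

// SafeQueue is a hypothetical stand-in for a hashed priority queue.
// Every access to index and items goes through mu.
type SafeQueue[K comparable] struct {
	mu    sync.RWMutex
	index map[K]struct{}
	items itemHeap[K]
}

func NewSafeQueue[K comparable]() *SafeQueue[K] {
	return &SafeQueue[K]{index: make(map[K]struct{})}
}

// Contains takes the read lock, so it cannot overlap with a map write.
func (q *SafeQueue[K]) Contains(key K) bool {
	q.mu.RLock()
	defer q.mu.RUnlock()
	_, ok := q.index[key]
	return ok
}

// Enqueue takes the write lock. This sketch does not deduplicate keys.
func (q *SafeQueue[K]) Enqueue(key K, priority int64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.index[key] = struct{}{}
	heap.Push(&q.items, item[K]{key: key, priority: priority})
}

// Dequeue removes the highest-priority key under the write lock.
func (q *SafeQueue[K]) Dequeue() (K, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	var zero K
	if q.items.Len() == 0 {
		return zero, false
	}
	it := heap.Pop(&q.items).(item[K])
	delete(q.index, it.key)
	return it.key, true
}

func main() {
	q := NewSafeQueue[string]()
	var wg sync.WaitGroup
	// Several goroutines read and write the queue at the same time.
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			key := fmt.Sprintf("node-%d", n)
			for j := 0; j < 1000; j++ {
				q.Enqueue(key, int64(j))
				_ = q.Contains(key)
				q.Dequeue()
			}
		}(i)
	}
	wg.Wait()
	fmt.Println("finished without a concurrent map access fault")
}
```

Running a variant of this with the lock removed from `Contains` under `go run -race` should make the Go race detector report the unsynchronized map access, which can be a quicker way to confirm the diagnosis than waiting for the runtime's fatal error.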

@frrist frrist self-assigned this Sep 19, 2024
frrist pushed a commit that referenced this issue Sep 19, 2024
@frrist frrist linked a pull request Sep 19, 2024 that will close this issue
@wdbaruni
Member Author

Thanks for looking into this. What still needs investigation is why this is happening now. The queue has been unlocked since v1.4.0, if not earlier. What did we change to trigger this?

Projects
Status: In Review