Client v6 occasionally crashes with concurrent map read and map write #399

Open
Gaudeamus opened this issue Mar 26, 2023 · 7 comments


Gaudeamus commented Mar 26, 2023

Hello,
working with a remote cluster over a slow network, I regularly hit a problem that crashes the entire app.
client version: v6.12.0

fatal error: concurrent map read and map write

goroutine 87917046 [running]:
github.com/aerospike/aerospike-client-go/v6.PartitionForRead(0xc0123d3cb8?, 0xc0999c7a90, 0xc059c39680)
        /src/vendor/github.com/aerospike/aerospike-client-go/v6/partition.go:66 +0xa5
github.com/aerospike/aerospike-client-go/v6.newReadCommand(0xc317e6f200, 0xc0999c7a90, 0xc059c39680, {0xc25e780d20?, 0x5, 0x5}, 0x10?)
        /src/vendor/github.com/aerospike/aerospike-client-go/v6/read_command.go:53 +0x90
github.com/aerospike/aerospike-client-go/v6.(*Client).Get(0xc318983da0, 0x50?, 0xc012580000?, {0xc25e780d20, 0x5, 0x5})
        /src/vendor/github.com/aerospike/aerospike-client-go/v6/client.go:355 +0x137

The same issue occurred with v5:

github.com/aerospike/aerospike-client-go/v5.PartitionForWrite(0x0?, 0xc0a44b3830, 0xc4d432d800)
        /src/dsp/vendor/github.com/aerospike/aerospike-client-go/v5/partition.go:52 +0x96
github.com/aerospike/aerospike-client-go/v5.newExecuteCommand(0x7?, 0xc0a44b3830, 0x3?, {0x146b1ff, _}, {_, _}, _)
        /src/dsp/vendor/github.com/aerospike/aerospike-client-go/v5/execute_command.go:35 +0x9a
github.com/aerospike/aerospike-client-go/v5.(*Client).Execute(0xc3ec36ab40, 0xc01bc20970?, 0xc3bc5f?, {0x146b1ff, 0x9}, {0x146dec9, 0xb}, {0xc4d69cecb0, 0x1, 0x1})
        /src/dsp/vendor/github.com/aerospike/aerospike-client-go/v5/client.go:738 +0x25c
Collaborator

khaf commented Mar 27, 2023

Thanks for your feedback; I've been working on it since yesterday. Can you provide a bit more information about your cluster? How many nodes do you have, and what usually prompts the issue (nodes departing the cluster, etc.)?


khaf commented Mar 27, 2023

Also, could you please paste the whole panic message so that I can see where the concurrent reads and writes are occurring? The current panic message does not include the part where the write is happening.

@khaf khaf added the bug label Mar 27, 2023
@Gaudeamus
Author

Hi @khaf!

Cluster:
4 nodes with Ubuntu 20.04
Aerospike Enterprise Edition build 6.2.0.3

Here are all 4 aerospike-related goroutines from the panic log:

        goroutine 87917046 [running]:
        github.com/aerospike/aerospike-client-go/v6.PartitionForRead(0xc0123d3cb8?, 0xc0999c7a90, 0xc059c39680)
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/partition.go:66 +0xa5
        github.com/aerospike/aerospike-client-go/v6.newReadCommand(0xc317e6f200, 0xc0999c7a90, 0xc059c39680, {0xc25e780d20?, 0x5, 0x5}, 0x10?)
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/read_command.go:53 +0x90
        github.com/aerospike/aerospike-client-go/v6.(*Client).Get(0xc318983da0, 0x50?, 0xc012580000?, {0xc25e780d20, 0x5, 0x5})
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/client.go:355 +0x137
        .../aerospike.(*Client).Get(0x14dced9?, 0xc?, 0xc11530fa54?, {0xc25e780d20?, 0x12424a0?, 0xc7f41bb2d0?})
                /src/aerospike/client.go:126 +0x98
        .../aerospike.(*Session).GetDataByClientUserId(0xc0999c7a40, {0xc11530fa54, 0x6}, {0xc380a96618?, 0x12?})
                /src/aerospike/data.go:58 +0x14c
        .../db.aerospikeService.GetClientData({{0x7f6016179038?, 0xc0999c7a40?}, 0xc0a1322230?}, {0xc11530fa54?, 0x0?}, {0xc380a96618?, 0x1?})
                /src/services/.../db/aerospike.go:64 +0xea


        goroutine 88112780 [runnable]:
        github.com/aerospike/aerospike-client-go/v6.(*Cluster).waitTillStabilized.func1()
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/cluster.go:489
        created by github.com/aerospike/aerospike-client-go/v6.(*Cluster).waitTillStabilized
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/cluster.go:489 +0xcd

        goroutine 3062 [select]:
        github.com/aerospike/aerospike-client-go/v6.(*Cluster).waitTillStabilized(0xc29bc9efc0)
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/cluster.go:515 +0x13a
        github.com/aerospike/aerospike-client-go/v6.NewCluster(0xc0a23a6d00, {0xc6095e6468, 0x3, 0x3})
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/cluster.go:144 +0x72a
        github.com/aerospike/aerospike-client-go/v6.NewClientWithPolicyAndHost(0x12ecce0?, {0xc6095e6468, 0x3, 0x3})
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/client.go:88 +0xf9
        .../aerospike.(*Session).connect.func1()
                /src/aerospike/aerospike.go:198 +0xbc
        .../aerospike.(*Session).connect(0xc0999c7a40, {0xc6095e6468, 0x3, 0x3})
                /src/aerospike/aerospike.go:209 +0x119
        .../aerospike.(*Session).watch.func1()
                /src/aerospike/aerospike.go:271 +0x229
        .../aerospike.(*Session).watch
                /src/aerospike/aerospike.go:241 +0x16a

        goroutine 65471388 [sleep]:
        time.Sleep(0xf4240)
                /usr/local/go/src/runtime/time.go:195 +0x135
        github.com/aerospike/aerospike-client-go/v6.(*baseCommand).executeAt(0xc056e81e40, {0x17aadb0, 0xc056e81e40}, 0xc0999c7a90, 0x0?, {0x0?, 0x0?, 0x217c960?}, 0x0?, 0x0)
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/command.go:2472 +0x239
        github.com/aerospike/aerospike-client-go/v6.(*baseCommand).execute(0x0?, {0x17aadb0, 0xc056e81e40}, 0x0?)
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/command.go:2440 +0x8a
        github.com/aerospike/aerospike-client-go/v6.(*readCommand).Execute(...)
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/read_command.go:264
        github.com/aerospike/aerospike-client-go/v6.(*Client).Get(0xc318983da0, 0x50?, 0xc012400400?, {0xc2aecac730, 0x5, 0x5})
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/client.go:360 +0x259
        .../aerospike.(*Client).Get(0x14dced9?, 0xc?, 0xc2ee1a42b4?, {0xc2aecac730?, 0x12424a0?, 0xc29303ee70?})
                /src/aerospike/client.go:126 +0x98
        .../aerospike.(*Session).GetDataByClientUserId(0xc0999c7a40, {0xc2ee1a42b4, 0x6}, {0xc65be927e0?, 0x2084400?})
                /src/aerospike/data.go:58 +0x14c
        .../db.aerospikeService.GetClientData({{0x7f6016179038?, 0xc0999c7a40?}, 0xc0a1322230?}, {0xc2ee1a42b4?, 0x0?}, {0xc65be927e0?, 0x18?})
                /src/services/.../db/aerospike.go:64 +0xea

Below are two logs with some events and timestamps surrounding the crash.
main log:

        34311  2023/03/24 22:48:50 last logger message
        ...crash...
        41762  2023/03/24 22:49:09 New logger started

aerospike.log

        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:32 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: command execution timed out on client: See `Policy.Timeout`
        ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: Timeout
        read tcp 10.152.136.52:18548->10.152.129.108:3000: i/o timeout
        2023/03/24 22:41:54 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: command execution timed out on client: See `Policy.Timeout`
        ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: Timeout
        read tcp 10.152.136.52:18734->10.152.129.108:3000: i/o timeout
        2023/03/24 22:45:49 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB9B1C44D4B43E4 10.152.129.148:3000: command execution timed out on client: See `Policy.Timeout`
        ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB9B1C44D4B43E4 10.152.129.148:3000: network error. Checked the wrapped error for detail
        ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: <nil>: Timeout
        2023/03/24 22:45:49 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB9BDFE2A4B43E4 10.152.152.92:3000: command execution timed out on client: See `Policy.Timeout`
        ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB9BDFE2A4B43E4 10.152.152.92:3000: network error. Checked the wrapped error for detail
        ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: <nil>: Timeout
        2023/03/24 22:45:49 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB915433E4B43E4 10.152.131.236:3000: command execution timed out on client: See `Policy.Timeout`
        ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB915433E4B43E4 10.152.131.236:3000: network error. Checked the wrapped error for detail
        ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: <nil>: Timeout
        2023/03/24 22:46:50 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: command execution timed out on client: See `Policy.Timeout`
        ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: network error. Checked the wrapped error for detail
        ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: <nil>: Timeout
        2023/03/24 22:52:20 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB9B1C44D4B43E4 10.152.129.148:3000: command execution timed out on client: See `Policy.Timeout`
        ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: BB9B1C44D4B43E4 10.152.129.148:3000: Timeout


khaf commented Mar 30, 2023

Just a heads up that I think I have identified the root cause of this issue; the potential fix is coming in the next release early next week.

@Gaudeamus

Hi @khaf! Could you give an update, please?


khaf commented May 31, 2023

@Gaudeamus Sorry to have been unresponsive; I thought I had replied to you and don't know how I managed to miss your message. We were dealing with a few other issues, and releasing the Go client fell through the cracks. I will release the fix this week.


khaf commented Oct 30, 2024

Sorry that this issue has fallen through the cracks. I just went through the client code, and it is not immediately clear to me how this issue can happen, other than by constantly closing and opening new client objects.

@Gaudeamus Are you creating a new Client object on each request? Or are you creating one and reusing that same object for all your requests?
