Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vernemq pods give the error after restart the statefulset. and the application is affected by the temporary unavailability of one VerneMQ pod at a time. #392

Open
NirmalSenthil opened this issue Aug 16, 2024 · 0 comments

Comments

@NirmalSenthil
Copy link

NirmalSenthil commented Aug 16, 2024

i am running vernemq in AWS EKS setup. We build own vernemq image. Below are the setup details
Vernemq version: 2.0.1
No of pods: 2
OS: Amazon linux 2
EKS version: 1.30

Below are, i am faccing issue.

  1. i updated eks cluster to v1.30 from v1.29. After upgrade that vernemq pods going to restart infreaquntly. and also that use more cpu usage

2)when i Rollout restart the vernemq statefulset. its give some error in logs and that affect my application. my application going to crashloop when vernemq restarts. when i restart vernemq pods 1 pod is going to restart and other pod is still in running but that pods logs give some message that is restarted pods is gone in vernemq cluster. after both pods is restarted our application is still crashloop. and i restart our application pods. it is stable after that apllication pods restart.
***main isssue is when vernemq pods restarts application going to crashloop. below is the vernemq pods given logs

i) vernemq-0 pod error:

Permissions ok: Our pod vernemq-0 belongs to StatefulSet vernemq with 2 replicas
Will join an existing Kubernetes cluster with discovery node at vernemq-1.vernemq-headless.vernemq.svc.cluster.local
Did I previously leave the cluster? If so, purging old state.
Node '[email protected]' not responding to pings.
Node '[email protected]' not responding to pings.
config is OK
-config /vernemq/data/generated.configs/app.2024.09.10.06.18.09.config -args_file /vernemq/bin/../etc/vm.args -vm_args /vernemq/bin/../etc/vm.args
Exec:  /vernemq/bin/../erts-14.2.5/bin/erlexec -boot /vernemq/bin/../releases/2.0.1/vernemq               -config /vernemq/data/generated.configs/app.2024.09.10.06.18.09.config -args_file /vernemq/bin/../etc/vm.args -vm_args /vernemq/bin/../etc/vm.args                             -pa /vernemq/bin/../lib/erlio-patches -- console -noshell -noinput
Root: /vernemq/bin/..
RPC to '[email protected]' failed: {'EXIT',
                                                                               {badarg,
                                                                                [{ets,
                                                                                  lookup,
                                                                                  [clique_commands,
                                                                                   ["vmq-admin",
                                                                                    "cluster",
                                                                                    "show"]],
                                                                                  [{error_info,
                                                                                    #{cause =>
                                                                                       id,
                                                                                      module =>
                                                                                       erl_stdlib_errors}}]},
                                                                                 {clique_command,
                                                                                  match_lookup,
                                                                                  1,
                                                                                  [{file,
                                                                                    "/source/vernemq/_build/default/lib/clique/src/clique_command.erl"},
                                                                                   {line,
                                                                                    124}]},
                                                                                 {clique_command,
                                                                                  match,
                                                                                  1,
                                                                                  [{file,
                                                                                    "/source/vernemq/_build/default/lib/clique/src/clique_command.erl"},
                                                                                   {line,
                                                                                    111}]},
                                                                                 {clique,
                                                                                  run,
                                                                                  1,
                                                                                  [{file,
                                                                                    "/source/vernemq/_build/default/lib/clique/src/clique.erl"},
                                                                                   {line,
                                                                                    126}]},
                                                                                 {vmq_server_cli,
                                                                                  command,
                                                                                  1,
                                                                                  []}]}}
2024-09-10T06:18:13.156996+00:00 [notice] <0.101.0>: alarm_handler: {set,{system_memory_high_watermark,[]}}
2024-09-10T06:18:13.426210+00:00 [info] <0.217.0> vmq_swc_peer_service_manager:write_old_actor_to_disk/1:182: writing (updated) old actor <<35,213,175,33,69,193,41,33,194,242,10,232,110,164,47,73,153,209,164,162>> to disk
2024-09-10T06:18:13.441341+00:00 [info] <0.217.0> vmq_swc_peer_service_manager:write_state_to_disk/1:165: writing state {[{[{actor,<<35,213,175,33,69,193,41,33,194,242,10,232,110,164,47,73,153,209,164,162>>}],1},{[{actor,<<172,0,188,120,134,223,42,14,230,243,49,185,154,124,29,179,143,190,98,1>>}],1}],{dict,2,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[['[email protected]',{[{actor,<<35,213,175,33,69,193,41,33,194,242,10,232,110,164,47,73,153,209,164,162>>}],1}],['[email protected]',{[{actor,<<172,0,188,120,134,223,42,14,230,243,49,185,154,124,29,179,143,190,98,1>>}],1}]],[],[],[]}}},{dict,0,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}}} to disk <<75,2,131,80,0,0,1,185,120,1,203,96,206,97,96,96,96,202,96,2,81,140,25,76,229,172,137,201,37,249,69,185,64,158,136,242,213,245,138,174,7,53,21,15,125,226,122,145,183,68,223,115,230,197,37,139,178,18,129,138,176,168,93,195,176,167,162,237,190,22,223,179,207,134,59,103,213,200,110,238,223,151,196,8,84,155,149,193,89,206,146,146,153,92,146,200,148,40,0,132,28,137,1,137,6,25,2,89,104,32,131,17,85,12,236,38,48,81,110,19,150,90,148,151,234,27,232,80,6,162,115,11,117,13,244,96,172,140,212,196,148,156,212,226,98,152,128,94,113,89,178,94,114,78,105,113,73,106,145,94,78,126,114,98,14,86,167,226,240,86,22,14,251,12,97,198,235,146,105,31,174,160,1,133,1,60,120,24,72,9,30,144,78,0,121,113,150,245>>
2024-09-10T06:18:13.506100+00:00 [info] <0.222.0> vmq_swc_db_leveldb:open_db/4:272: Opening LevelDB SWC database at "./data/swc_meta/meta1"
Loaded NodeClock fromDisk #{}
2024-09-10T06:18:13.563577+00:00 [info] <0.232.0> vmq_swc_db_leveldb:open_db/4:272: Opening LevelDB SWC database at "./data/swc_meta/meta2"
Loaded NodeClock fromDisk #{}
2024-09-10T06:18:13.587267+00:00 [info] <0.241.0> vmq_swc_db_leveldb:open_db/4:272: Opening LevelDB SWC database at "./data/swc_meta/meta3"
Loaded NodeClock fromDisk #{}
2024-09-10T06:18:13.610449+00:00 [info] <0.250.0> vmq_swc_db_leveldb:open_db/4:272: Opening LevelDB SWC database at "./data/swc_meta/meta4"
Loaded NodeClock fromDisk #{}
2024-09-10T06:18:13.632969+00:00 [info] <0.259.0> vmq_swc_db_leveldb:open_db/4:272: Opening LevelDB SWC database at "./data/swc_meta/meta5"
Loaded NodeClock fromDisk #{}
2024-09-10T06:18:13.649222+00:00 [info] <0.268.0> vmq_swc_db_leveldb:open_db/4:272: Opening LevelDB SWC database at "./data/swc_meta/meta6"
Loaded NodeClock fromDisk #{}
2024-09-10T06:18:13.665340+00:00 [info] <0.277.0> vmq_swc_db_leveldb:open_db/4:272: Opening LevelDB SWC database at "./data/swc_meta/meta7"
Loaded NodeClock fromDisk #{}
Loaded NodeClock fromDisk #{}
2024-09-10T06:18:13.685113+00:00 [info] <0.286.0> vmq_swc_db_leveldb:open_db/4:272: Opening LevelDB SWC database at "./data/swc_meta/meta8"
2024-09-10T06:18:13.707109+00:00 [info] <0.295.0> vmq_swc_db_leveldb:open_db/4:272: Opening LevelDB SWC database at "./data/swc_meta/meta9"
Loaded NodeClock fromDisk #{}
2024-09-10T06:18:13.734278+00:00 [info] <0.304.0> vmq_swc_db_leveldb:open_db/4:272: Opening LevelDB SWC database at "./data/swc_meta/meta10"
Loaded NodeClock fromDisk #{}
2024-09-10T06:18:14.067392+00:00 [info] <0.210.0> vmq_metadata:start/0:31: Trying to start vmq_swc: ok
2024-09-10T06:18:14.115731+00:00 [info] <0.322.0> vmq_storage_engine_leveldb:open_db/4:146: Opening LevelDB database at "./data/msgstore/1"
2024-09-10T06:18:14.124711+00:00 [info] <0.323.0> vmq_storage_engine_leveldb:open_db/4:146: Opening LevelDB database at "./data/msgstore/2"
2024-09-10T06:18:14.130765+00:00 [info] <0.324.0> vmq_storage_engine_leveldb:open_db/4:146: Opening LevelDB database at "./data/msgstore/3"
2024-09-10T06:18:14.136354+00:00 [info] <0.325.0> vmq_storage_engine_leveldb:open_db/4:146: Opening LevelDB database at "./data/msgstore/4"
2024-09-10T06:18:14.142815+00:00 [info] <0.326.0> vmq_storage_engine_leveldb:open_db/4:146: Opening LevelDB database at "./data/msgstore/5"
2024-09-10T06:18:14.149499+00:00 [info] <0.327.0> vmq_storage_engine_leveldb:open_db/4:146: Opening LevelDB database at "./data/msgstore/6"
2024-09-10T06:18:14.155732+00:00 [info] <0.328.0> vmq_storage_engine_leveldb:open_db/4:146: Opening LevelDB database at "./data/msgstore/7"
2024-09-10T06:18:14.162145+00:00 [info] <0.329.0> vmq_storage_engine_leveldb:open_db/4:146: Opening LevelDB database at "./data/msgstore/8"
2024-09-10T06:18:14.168198+00:00 [info] <0.330.0> vmq_storage_engine_leveldb:open_db/4:146: Opening LevelDB database at "./data/msgstore/9"
2024-09-10T06:18:14.173999+00:00 [info] <0.331.0> vmq_storage_engine_leveldb:open_db/4:146: Opening LevelDB database at "./data/msgstore/10"
2024-09-10T06:18:14.179647+00:00 [info] <0.332.0> vmq_storage_engine_leveldb:open_db/4:146: Opening LevelDB database at "./data/msgstore/11"
2024-09-10T06:18:14.186558+00:00 [info] <0.333.0> vmq_storage_engine_leveldb:open_db/4:146: Opening LevelDB database at "./data/msgstore/12"
2024-09-10T06:18:14.239263+00:00 [info] <0.210.0> vmq_message_store:start/0:32: Trying to start vmq_generic_msg_store: ok
2024-09-10T06:18:14.373694+00:00 [info] <0.422.0> vmq_reg_trie:handle_info/2:256: loaded 0 subscriptions into vmq_reg_trie
2024-09-10T06:18:14.379949+00:00 [info] <0.220.0> vmq_cluster:init/0:165: cluster event handler 'vmq_cluster' registered
2024-09-10T06:18:14.407781+00:00 [error] <0.439.0> vmq_ranch_config:reconfigure_listeners_for_type/4:302: can't reconfigure mqtt listener({127,0,0,1}, 1883) with Options [{max_connections,10000},{max_connection_lifetime,0},{nr_of_acceptors,10},{mountpoint,[]},{proxy_protocol,false},{allowed_protocol_versions,[3,4,5,131]},{allow_anonymous_override,false}] due to {already_started,<0.463.0>}
2024-09-10T06:18:15.380698+00:00 [info] <0.437.0> vmq_cluster_node:handle_message/2:220: successfully connected to cluster node '[email protected]'
2024-09-10T06:18:15.402832+00:00 [info] <0.9.0> vmq_swc_peer_service:attempt_join/1:61: Cannot join a cluster, as local node '[email protected]' is non-empty. 

ii) vernemq-1 pod error:

Permissions ok: Our pod vernemq-1 belongs to StatefulSet vernemq with 2 replicas
Will join an existing Kubernetes cluster with discovery node at vernemq-0.vernemq-headless.vernemq.svc.cluster.local
Did I previously leave the cluster? If so, purging old state.
Node '[email protected]' not responding to pings.
Node '[email protected]' not responding to pings.
config is OK
-config /vernemq/data/generated.configs/app.2024.09.10.06.16.28.config -args_file /vernemq/bin/../etc/vm.args -vm_args /vernemq/bin/../etc/vm.args
Exec:  /vernemq/bin/../erts-14.2.5/bin/erlexec -boot /vernemq/bin/../releases/2.0.1/vernemq               -config /vernemq/data/generated.configs/app.2024.09.10.06.16.28.config -args_file /vernemq/bin/../etc/vm.args -vm_args /vernemq/bin/../etc/vm.args                             -pa /vernemq/bin/../lib/erlio-patches -- console -noshell -noinput
Root: /vernemq/bin/..
Node '[email protected]' not responding to pings.
2024-09-10T06:16:32.063758+00:00 [notice] <0.101.0>: alarm_handler: {set,{system_memory_high_watermark,[]}}
2024-09-10T06:16:32.351008+00:00 [info] <0.218.0> vmq_swc_peer_service_manager:write_old_actor_to_disk/1:182: writing (updated) old actor <<172,0,188,120,134,223,42,14,230,243,49,185,154,124,29,179,143,190,98,1>> to disk
2024-09-10T06:16:32.357330+00:00 [info] <0.218.0> vmq_swc_peer_service_manager:write_state_to_disk/1:165: writing state {[{[{actor,<<35,213,175,33,69,193,41,33,194,242,10,232,110,164,47,73,153,209,164,162>>}],1},{[{actor,<<172,0,188,120,134,223,42,14,230,243,49,185,154,124,29,179,143,190,98,1>>}],1}],{dict,2,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[['[email protected]',{[{actor,<<35,213,175,33,69,193,41,33,194,242,10,232,110,164,47,73,153,209,164,162>>}],1}],['[email protected]',{[{actor,<<172,0,188,120,134,223,42,14,230,243,49,185,154,124,29,179,143,190,98,1>>}],1}]],[],[],[]}}},{dict,0,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}}} to disk <<75,2,131,80,0,0,1,185,120,1,203,96,206,97,96,96,96,202,96,2,81,140,25,76,229,172,137,201,37,249,69,185,64,158,136,242,213,245,138,174,7,53,21,15,125,226,122,145,183,68,223,115,230,197,37,139,178,18,129,138,176,168,93,195,176,167,162,237,190,22,223,179,207,134,59,103,213,200,110,238,223,151,196,8,84,155,149,193,89,206,146,146,153,92,146,200,148,40,0,132,28,137,1,137,6,25,2,89,104,32,131,17,85,12,236,38,48,81,110,19,150,90,148,151,234,27,232,80,6,162,115,11,117,13,244,96,172,140,212,196,148,156,212,226,98,152,128,94,113,89,178,94,114,78,105,113,73,106,145,94,78,126,114,98,14,86,167,226,240,86,22,14,251,12,97,198,235,146,105,31,174,160,1,133,1,60,120,24,72,9,30,144,78,0,121,113,150,245>>
2024-09-10T06:16:32.391374+00:00 [info] <0.223.0> vmq_swc_db_leveldb:open_db/4:272: Opening LevelDB SWC database at "./data/swc_meta/meta1"
Loaded NodeClock fromDisk #{}
2024-09-10T06:16:32.431164+00:00 [info] <0.233.0> vmq_swc_db_leveldb:open_db/4:272: Opening LevelDB SWC database at "./data/swc_meta/meta2"
Loaded NodeClock fromDisk #{}
2024-09-10T06:16:32.447394+00:00 [info] <0.242.0> vmq_swc_db_leveldb:open_db/4:272: Opening LevelDB SWC database at "./data/swc_meta/meta3"
Loaded NodeClock fromDisk #{}
2024-09-10T06:16:32.464135+00:00 [info] <0.251.0> vmq_swc_db_leveldb:open_db/4:272: Opening LevelDB SWC database at "./data/swc_meta/meta4"
Loaded NodeClock fromDisk #{}
2024-09-10T06:16:32.479454+00:00 [info] <0.260.0> vmq_swc_db_leveldb:open_db/4:272: Opening LevelDB SWC database at "./data/swc_meta/meta5"
Loaded NodeClock fromDisk #{}
2024-09-10T06:16:32.494330+00:00 [info] <0.269.0> vmq_swc_db_leveldb:open_db/4:272: Opening LevelDB SWC database at "./data/swc_meta/meta6"
Loaded NodeClock fromDisk #{}
2024-09-10T06:16:32.514450+00:00 [info] <0.278.0> vmq_swc_db_leveldb:open_db/4:272: Opening LevelDB SWC database at "./data/swc_meta/meta7"
Loaded NodeClock fromDisk #{}
2024-09-10T06:16:32.531313+00:00 [info] <0.287.0> vmq_swc_db_leveldb:open_db/4:272: Opening LevelDB SWC database at "./data/swc_meta/meta8"
Loaded NodeClock fromDisk #{}
2024-09-10T06:16:32.544786+00:00 [info] <0.296.0> vmq_swc_db_leveldb:open_db/4:272: Opening LevelDB SWC database at "./data/swc_meta/meta9"
Loaded NodeClock fromDisk #{}
2024-09-10T06:16:32.564937+00:00 [info] <0.305.0> vmq_swc_db_leveldb:open_db/4:272: Opening LevelDB SWC database at "./data/swc_meta/meta10"
Loaded NodeClock fromDisk #{}
2024-09-10T06:16:32.816986+00:00 [info] <0.208.0> vmq_metadata:start/0:31: Trying to start vmq_swc: ok
2024-09-10T06:16:32.872506+00:00 [info] <0.318.0> vmq_storage_engine_leveldb:open_db/4:146: Opening LevelDB database at "./data/msgstore/1"
2024-09-10T06:16:32.887528+00:00 [info] <0.319.0> vmq_storage_engine_leveldb:open_db/4:146: Opening LevelDB database at "./data/msgstore/2"
2024-09-10T06:16:32.894868+00:00 [info] <0.320.0> vmq_storage_engine_leveldb:open_db/4:146: Opening LevelDB database at "./data/msgstore/3"
2024-09-10T06:16:32.902230+00:00 [info] <0.321.0> vmq_storage_engine_leveldb:open_db/4:146: Opening LevelDB database at "./data/msgstore/4"
2024-09-10T06:16:32.910468+00:00 [info] <0.322.0> vmq_storage_engine_leveldb:open_db/4:146: Opening LevelDB database at "./data/msgstore/5"
2024-09-10T06:16:32.918080+00:00 [info] <0.324.0> vmq_storage_engine_leveldb:open_db/4:146: Opening LevelDB database at "./data/msgstore/6"
2024-09-10T06:16:32.925370+00:00 [info] <0.325.0> vmq_storage_engine_leveldb:open_db/4:146: Opening LevelDB database at "./data/msgstore/7"
2024-09-10T06:16:32.932416+00:00 [info] <0.326.0> vmq_storage_engine_leveldb:open_db/4:146: Opening LevelDB database at "./data/msgstore/8"
2024-09-10T06:16:32.940337+00:00 [info] <0.327.0> vmq_storage_engine_leveldb:open_db/4:146: Opening LevelDB database at "./data/msgstore/9"
2024-09-10T06:16:32.947690+00:00 [info] <0.328.0> vmq_storage_engine_leveldb:open_db/4:146: Opening LevelDB database at "./data/msgstore/10"
2024-09-10T06:16:32.954967+00:00 [info] <0.329.0> vmq_storage_engine_leveldb:open_db/4:146: Opening LevelDB database at "./data/msgstore/11"
2024-09-10T06:16:32.969857+00:00 [info] <0.330.0> vmq_storage_engine_leveldb:open_db/4:146: Opening LevelDB database at "./data/msgstore/12"
2024-09-10T06:16:33.047262+00:00 [info] <0.208.0> vmq_message_store:start/0:32: Trying to start vmq_generic_msg_store: ok
2024-09-10T06:16:33.215180+00:00 [info] <0.419.0> vmq_reg_trie:handle_info/2:256: loaded 0 subscriptions into vmq_reg_trie
2024-09-10T06:16:33.232147+00:00 [info] <0.221.0> vmq_cluster:init/0:165: cluster event handler 'vmq_cluster' registered
2024-09-10T06:16:33.298288+00:00 [error] <0.436.0> vmq_ranch_config:reconfigure_listeners_for_type/4:302: can't reconfigure mqtt listener({127,0,0,1}, 1883) with Options [{max_connections,10000},{max_connection_lifetime,0},{nr_of_acceptors,10},{mountpoint,[]},{proxy_protocol,false},{allowed_protocol_versions,[3,4,5,131]},{allow_anonymous_override,false}] due to {already_started,<0.460.0>}
2024-09-10T06:16:34.226634+00:00 [info] <0.434.0> vmq_cluster_node:handle_message/2:220: successfully connected to cluster node '[email protected]'
2024-09-10T06:16:34.304852+00:00 [info] <0.9.0> vmq_swc_peer_service:attempt_join/1:61: Cannot join a cluster, as local node '[email protected]' is non-empty.
2024-09-10T06:17:54.845282+00:00 [warning] <0.434.0> vmq_cluster_node:handle_message/2:257: connection to node '[email protected]' has been closed, reconnect in 1000ms
2024-09-10T06:17:55.462614+00:00 [error] <0.253.0> supervisor:do_restart/3:759: Supervisor: {local,vmq_swc_exchange_sup_meta4}. Context: child_terminated. Reason: {noproc,{gen_server,call,[{vmq_swc_edist_meta4,'[email protected]'},{apply,vmq_swc_store,rpc_node_clock,[]},infinity]}}. Offender: id={vmq_swc_exchange_fsm,'[email protected]'},pid=<0.1724.0>.
2024-09-10T06:17:55.615771+00:00 [error] <0.262.0> supervisor:do_restart/3:759: Supervisor: {local,vmq_swc_exchange_sup_meta5}. Context: child_terminated. Reason: {noproc,{gen_server,call,[{vmq_swc_edist_meta5,'[email protected]'},{apply,vmq_swc_store,rpc_node_clock,[]},infinity]}}. Offender: id={vmq_swc_exchange_fsm,'[email protected]'},pid=<0.1726.0>.
2024-09-10T06:17:55.846490+00:00 [warning] <0.1728.0> vmq_cluster_node:connect_async/2:377: can't connect to cluster node '[email protected]' due to {error,not_ready}
2024-09-10T06:17:56.062391+00:00 [warning] <0.432.0> vmq_cluster_mon:handle_info/2:127: cluster node '[email protected]' DOWN
2024-09-10T06:18:12.983361+00:00 [info] <0.432.0> vmq_cluster_mon:handle_info/2:131: cluster node '[email protected]' UP
2024-09-10T06:18:12.998854+00:00 [warning] <0.1767.0> vmq_cluster_node:connect_async/2:377: can't connect to cluster node '[email protected]' due to {error,not_ready}
2024-09-10T06:18:14.002783+00:00 [warning] <0.1771.0> vmq_cluster_node:connect_async/2:377: can't connect to cluster node '[email protected]' due to {error,not_ready}
2024-09-10T06:18:15.008512+00:00 [info] <0.434.0> vmq_cluster_node:handle_message/2:220: successfully connected to cluster node '[email protected]'

what should i do to resolve this problem. i tested with official vernemq image with helm. that offical vernemq image pods also gave that above error log above.

please review my issue and kindly response.
@ioolkos

@NirmalSenthil NirmalSenthil changed the title vernemq pod restarts after upgrading eks version v1.30 vernemq pods give the error after restart the statefulset. and the application is affected by the temporary unavailability of one VerneMQ pod at a time. Sep 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant