Api-rest grpc failed #1715

Open

veenadong opened this issue Aug 9, 2024 · 8 comments

Comments
@veenadong

Mayastor 2.5.1

The following command fails to return a response:

kubectl mayastor get volume bb1da7e5-ae9c-4af4-9835-f506abedf1e2

The api-rest pod is logging:

2024-08-08T15:56:34.164032Z ERROR grpc::operations::volume::client: error: Aborted: status: Unknown, message: "transport error", details: [], metadata: MetadataMap { headers: {} }
    at control-plane/grpc/src/operations/volume/client.rs:75

  2024-08-08T15:56:34.164138Z  INFO actix_web::middleware::logger: 172.23.46.64 "GET /v0/volumes/bb1da7e5-ae9c-4af4-9835-f506abedf1e2 HTTP/1.1" 503 141 "-" "-" 0.001829

What is causing this error, and how can it be avoided?

mayastor-2024-08-08--23-06-57-UTC.tar.gz

@tiagolobocastro
Contributor

From the logs it seems the agent-core pod either crashed or was force-deleted?

After a crash or force delete, the restarted agent-core can fail while waiting for the previous lease to time out and then crash again. This is unfortunate because we lose the previous-previous logs, since only the previous container's logs are kept.
Perhaps we need to increase the timeout here, so we don't crash in this scenario, allowing us to see the proper previous logs.
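
For reference, a quick way to check whether that happened (the namespace, label, and deployment names below are assumptions for a default install; adjust to yours):

  # did agent-core restart, and why (previous container logs)?
  kubectl -n mayastor get pods -l app=agent-core
  kubectl -n mayastor logs deploy/agent-core --previous

  # the lease agent-core takes for exclusive ownership
  kubectl -n mayastor get lease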

@tiagolobocastro added the "Need more info" and "waiting on user input" labels on Aug 12, 2024
@veenadong
Author

It seems that when agent-core crashed, the api-rest pod somehow held on to the old connection and did not release it.
Would increasing the number of api-rest pods or agent-core pods help in this case?

@tiagolobocastro
Contributor

What makes you say it doesn't release the connection? From the logs, the very next request succeeded:

  2024-08-08T15:56:34.164032Z ERROR grpc::operations::volume::client: error: Aborted: status: Unknown, message: "transport error", details: [], metadata: MetadataMap { headers: {} }
    at control-plane/grpc/src/operations/volume/client.rs:75

  2024-08-08T15:56:34.164138Z  INFO actix_web::middleware::logger: 172.23.46.64 "GET /v0/volumes/bb1da7e5-ae9c-4af4-9835-f506abedf1e2 HTTP/1.1" 503 141 "-" "-" 0.001829
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420

  2024-08-08T15:57:17.565574Z  INFO actix_web::middleware::logger: 172.23.46.208 "GET /v0/nodes/onprem-node20.hstlabs.glcp.hpecorp.net/pools/onprem-node20.hstlabs.glcp.hpecorp.net-disk HTTP/1.1" 200 600 "-" "-" 0.002888
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420

@veenadong
Author

veenadong commented Oct 8, 2024

We tried upgrading to 2.7.1. After a single node failure, trying to get volume/pool/node information fails:

core@sc-os-72-node1:~$ kubectl mayastor get volumes
Failed to list volumes. Error error in request: request timed out

The api-rest pods are showing:


  2024-10-08T18:02:44.031323Z ERROR grpc::operations::volume::client: error: Aborted: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
    at control-plane/grpc/src/operations/volume/client.rs:76

  2024-10-08T18:02:45.528493Z  INFO actix_web::middleware::logger: 10.245.250.73 "GET /v0/volumes?max_entries=200&starting_token=0 HTTP/1.1" 503 143 "-" "-" 29.292117
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420

  2024-10-08T18:08:42.727556Z ERROR grpc::operations::pool::client: error: Aborted: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
    at control-plane/grpc/src/operations/pool/client.rs:74

  2024-10-08T18:08:42.730715Z ERROR grpc::operations::pool::client: error: Aborted: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
    at control-plane/grpc/src/operations/pool/client.rs:74

  2024-10-08T18:08:45.331381Z  INFO actix_web::middleware::logger: 172.23.156.168 "GET /v0/nodes/sc-os-72-node3.glcpdev.cloud.hpe.com/pools/sc-os-72-node3.glcpdev.cloud.hpe.com-disk HTTP/1.1" 503 143 "-" "-" 27.097990
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420

  2024-10-08T18:08:45.427465Z  INFO actix_web::middleware::logger: 172.23.156.168 "GET /v0/nodes/sc-os-72-node1.glcpdev.cloud.hpe.com/pools/sc-os-72-node1.glcpdev.cloud.hpe.com-disk HTTP/1.1" 503 143 "-" "-" 39.386821
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420

  2024-10-08T18:22:07.927418Z  INFO actix_web::middleware::logger: 172.23.127.192 "PUT /v0/app-nodes/sc-os-72-node1.glcpdev.cloud.hpe.com HTTP/1.1" 503 211 "-" "-" 46.008041
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420

  2024-10-08T18:24:08.525963Z ERROR grpc::operations::pool::client: error: Aborted: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
    at control-plane/grpc/src/operations/pool/client.rs:74

  2024-10-08T18:24:44.626883Z  INFO actix_web::middleware::logger: 172.23.156.168 "GET /v0/nodes/sc-os-72-node2.glcpdev.cloud.hpe.com/pools/sc-os-72-node2.glcpdev.cloud.hpe.com-disk HTTP/1.1" 503 143 "-" "-" 142.793763
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420

  2024-10-08T18:25:23.124634Z ERROR grpc::operations::pool::client: error: Aborted: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
    at control-plane/grpc/src/operations/pool/client.rs:74

mayastor-2024-10-08--17-59-12-UTC.tar.gz

We tried restarting the api-rest pods (we have replicas set to 3, one running on each node); rough command equivalents are sketched below:

Restarted the pod that had the error
Restarted all 3 api-rest pods, still failing
Rebooted all 3 nodes, still failing
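
Something like (deployment and namespace names are assumptions for a default install):

  # restart just the api-rest pods
  kubectl -n mayastor rollout restart deploy/api-rest

  # or delete a single failing pod and let the deployment recreate it
  kubectl -n mayastor delete pod <api-rest-pod-name>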

@tiagolobocastro
Contributor

Btw, I'm not sure a replica count of 3 on the rest-api is helping; I suggest leaving it at 1: https://github.com/openebs/mayastor/issues/1728#issuecomment-2333822227
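
A minimal way to do that on a running install (the deployment name is an assumption; check with kubectl -n mayastor get deploy):

  # scale the rest API down to a single replica
  kubectl -n mayastor scale deploy/api-rest --replicas=1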

I think the first timeout is expected:

DNS error: failed to lookup address information: Temporary failure in name resolution

The agent-core was restarting (it was probably on the failed node?), and it seems DNS was not ready yet either.

For the other errors, it seems there were connection issues with etcd, possibly caused by IO issues on the etcd storage:

{"level":"warn","ts":"2024-10-08T17:08:11.955Z","caller":"etcdserver/v3_server.go:840","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":870763100169690875,"retry-timeout":"500ms"}
...
{"level":"warn","ts":"2024-10-08T17:08:24.510Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"946609bde8186189","rtt":"560.600277ms","error":"dial tcp: i/o timeout"}

Interestingly, at the same time the io-engine also had issues:

[2024-10-08T17:08:26.487185768+00:00 ERROR mayastor::spdk:nvme_fabric.c:596] Connect command failed, rc -5, trtype:TCP adrfam:IPv4 traddr:10.245.250.73 trsvcid:8420 subnqn:nqn.2019-05.io.openebs:724b592a-8954-43c3-a93b-947503273787   
[2024-10-08T17:08:26.487228003+00:00 ERROR mayastor::spdk:nvme_fabric.c:607] Connect command completed with error: sct 1, sc 1   
[2024-10-08T17:08:26.487238003+00:00 ERROR mayastor::spdk:nvme_tcp.c:2017] Failed to poll NVMe-oF Fabric CONNECT command   
[2024-10-08T17:08:26.487242617+00:00 ERROR mayastor::spdk:nvme_tcp.c:1806] Failed to connect tqpair=0x561e4b0c2440   

I wonder, was there any issue in the cluster around this time? In general etcd seems to be logging lots of warnings, as you can see, perhaps indicating some issue with the infrastructure?

[2024-10-08T17:01:22.750465391+00:00  INFO io_engine::core::reactor:reactor.rs:189] Scheduled SPDK thread 'mayastor_nvmf_tcp_pg_core_1' (0x561e49cc1a80) on core #1
[2024-10-08T17:01:22.750524701+00:00  INFO io_engine::core::reactor:reactor.rs:189] Scheduled SPDK thread 'mayastor_nvmf_tcp_pg_core_2' (0x561e49cc20d0) on core #2

Are you running on isolated cores? https://openebs.io/docs/user-guides/replicated-storage-user-guide/replicated-pv-mayastor/additional-information/performance-tips#cpu-isolation
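
A quick way to check whether isolation is in place on a node (generic Linux commands, not Mayastor-specific):

  # cores the kernel has isolated (empty output means none)
  cat /sys/devices/system/cpu/isolated

  # the kernel cmdline should contain something like isolcpus=... when isolation is configured
  cat /proc/cmdline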

@veenadong
Author

"2024-10-08T17:08:26.487185768" is around when we had to restart all the nodes hoping to resolve this issue.

Scaled the api-rest to replicas=1 -- still getting timed out
Restarted agent-core -- still timed out
Restarted agent-ha pods -- still timed out
Restarted api-rest pod -- the commands started to work

But restarting all nodes did not fix the issue.

@veenadong
Author

Further observations after more testing: restarting the api-rest pod does work. However, after a certain amount of time (40-60 minutes), the pod hangs (i.e. we cannot exec into the pod). If we are already exec'd into the pod, commands such as "netstat" fail to respond.

The operator_diskpool log shows API calls starting to fail intermittently, then all requests failing.

  2024-10-09T23:16:25.951383Z  INFO operator_diskpool::context: new resource_version inserted, name: "sc-os-72-node1.glcpdev.cloud.hpe.com-disk"
    at k8s/operators/src/pool/context.rs:131
    in operator_diskpool::reconcile with name: sc-os-72-node1.glcpdev.cloud.hpe.com, status: Some(DiskPoolStatus { cr_state: Created, pool_status: Some(Online), capacity: 2196871577600, used: 164953587712, available: 2031917989888 })
    in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1beta2.openebs.io/sc-os-72-node1.glcpdev.cloud.hpe.com-disk.mayastor, object.reason: object updated

  2024-10-09T23:16:30.909512Z  INFO operator_diskpool::context: new resource_version inserted, name: "sc-os-72-node2.glcpdev.cloud.hpe.com-disk"
    at k8s/operators/src/pool/context.rs:131
    in operator_diskpool::reconcile with name: sc-os-72-node2.glcpdev.cloud.hpe.com, status: Some(DiskPoolStatus { cr_state: Created, pool_status: Some(Online), capacity: 2196871577600, used: 194192080896, available: 2002679496704 })
    in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1beta2.openebs.io/sc-os-72-node2.glcpdev.cloud.hpe.com-disk.mayastor, object.reason: object updated

  2024-10-09T23:17:36.227506Z  INFO operator_diskpool::context: new resource_version inserted, name: "sc-os-72-node2.glcpdev.cloud.hpe.com-disk"
    at k8s/operators/src/pool/context.rs:131
    in operator_diskpool::reconcile with name: sc-os-72-node2.glcpdev.cloud.hpe.com, status: Some(DiskPoolStatus { cr_state: Created, pool_status: Some(Online), capacity: 2196871577600, used: 194196275200, available: 2002675302400 })
    in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1beta2.openebs.io/sc-os-72-node2.glcpdev.cloud.hpe.com-disk.mayastor, object.reason: object updated

  2024-10-09T23:18:09.029991Z ERROR operator_diskpool::context: Pool not found, clearing status, name: "sc-os-72-node1.glcpdev.cloud.hpe.com-disk"
    at k8s/operators/src/pool/context.rs:230
    in operator_diskpool::context::pool_check with name: "sc-os-72-node1.glcpdev.cloud.hpe.com-disk", status: Some(DiskPoolStatus { cr_state: Created, pool_status: Some(Online), capacity: 2196871577600, used: 164970364928, available: 2031901212672 })
    in operator_diskpool::reconcile with name: sc-os-72-node1.glcpdev.cloud.hpe.com, status: Some(DiskPoolStatus { cr_state: Created, pool_status: Some(Online), capacity: 2196871577600, used: 164970364928, available: 2031901212672 })
    in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1beta2.openebs.io/sc-os-72-node1.glcpdev.cloud.hpe.com-disk.mayastor, object.reason: reconciler requested retry

  2024-10-09T23:18:09.037230Z  INFO operator_diskpool::context: new resource_version inserted, name: "sc-os-72-node1.glcpdev.cloud.hpe.com-disk"
    at k8s/operators/src/pool/context.rs:131
    in operator_diskpool::reconcile with name: sc-os-72-node1.glcpdev.cloud.hpe.com, status: Some(DiskPoolStatus { cr_state: Created, pool_status: None, capacity: 0, used: 0, available: 0 })
    in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1beta2.openebs.io/sc-os-72-node1.glcpdev.cloud.hpe.com-disk.mayastor, object.reason: object updated

  2024-10-09T23:18:12.709461Z ERROR operator_diskpool::context: Pool not found, clearing status, name: "sc-os-72-node2.glcpdev.cloud.hpe.com-disk"
    at k8s/operators/src/pool/context.rs:230
    in operator_diskpool::context::pool_check with name: "sc-os-72-node2.glcpdev.cloud.hpe.com-disk", status: Some(DiskPoolStatus { cr_state: Created, pool_status: Some(Online), capacity: 2196871577600, used: 194196275200, available: 2002675302400 })
    in operator_diskpool::reconcile with name: sc-os-72-node2.glcpdev.cloud.hpe.com, status: Some(DiskPoolStatus { cr_state: Created, pool_status: Some(Online), capacity: 2196871577600, used: 194196275200, available: 2002675302400 })
    in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1beta2.openebs.io/sc-os-72-node2.glcpdev.cloud.hpe.com-disk.mayastor, object.reason: reconciler requested retry

mayastor-2024-10-09--23-23-04-UTC_api_rest.tar.gz

@tiagolobocastro removed the "waiting on user input" label on Oct 10, 2024
@tiagolobocastro
Contributor

Further observations after more testing: restarting the api-rest pod does work. However, after a certain amount of time (40-60 minutes), the pod hangs (i.e. we cannot exec into the pod). If we are already exec'd into the pod, commands such as "netstat" fail to respond.

That's interesting. If you ssh into the node, can you check what the process is doing? I.e. is it sleeping or churning CPU?
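
Something along these lines on the node would tell (the process/binary name is an assumption; find the right PID with ps aux first):

  # S = sleeping, R = running, D = uninterruptible IO wait
  PID=$(pidof rest)
  grep -E 'State|Threads' /proc/$PID/status

  # per-thread CPU usage for that process
  top -H -p $PID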

Also interesting: rest requests in general are taking a very long time, most likely close to the timeout:

  2024-10-09T23:24:25.876020Z  INFO actix_web::middleware::logger: 172.23.127.192 "GET /v0/nodes/sc-os-72-node1.glcpdev.cloud.hpe.com HTTP/1.1" 200 448 "-" "-" 29.490918
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420

  2024-10-09T23:24:26.375766Z  INFO actix_web::middleware::logger: 172.23.127.198 "GET /v0/nodes/sc-os-72-node1.glcpdev.cloud.hpe.com/pools/sc-os-72-node1.glcpdev.cloud.hpe.com-disk HTTP/1.1" 200 590 "-" "-" 27.197195
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420

  2024-10-09T23:24:26.475765Z  INFO actix_web::middleware::logger: 172.23.127.198 "GET /v0/nodes/sc-os-72-node3.glcpdev.cloud.hpe.com/pools/sc-os-72-node3.glcpdev.cloud.hpe.com-disk HTTP/1.1" 200 590 "-" "-" 29.196916
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420

27s, 29s...

Would you be able to enable jaeger tracing? https://github.com/openebs/mayastor-extensions/blob/44f76e6520f51ed16d8586bff85b99e109e30ea1/chart/values.yaml#L102
You'll need to have cert-manager installed.
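
A rough sketch of what that looks like (release/chart names and the exact values key are assumptions; verify them against the linked values.yaml for your chart version):

  # cert-manager is required first
  helm repo add jetstack https://charts.jetstack.io
  helm upgrade --install cert-manager jetstack/cert-manager -n cert-manager --create-namespace --set installCRDs=true

  # then turn tracing on in the mayastor chart
  helm upgrade mayastor mayastor/mayastor -n mayastor --reuse-values --set base.jaeger.enabled=true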
