Api-rest grpc failed #1715

Open

veenadong opened this issue Aug 9, 2024 · 8 comments

Comments
@veenadong

Mayastor 2.5.1

The following command fails to return a response:

kubectl mayastor get volume bb1da7e5-ae9c-4af4-9835-f506abedf1e2

The api-rest pod is logging:

2024-08-08T15:56:34.164032Z ERROR grpc::operations::volume::client: error: Aborted: status: Unknown, message: "transport error", details: [], metadata: MetadataMap { headers: {} }
    at control-plane/grpc/src/operations/volume/client.rs:75

  2024-08-08T15:56:34.164138Z  INFO actix_web::middleware::logger: 172.23.46.64 "GET /v0/volumes/bb1da7e5-ae9c-4af4-9835-f506abedf1e2 HTTP/1.1" 503 141 "-" "-" 0.001829

What is causing this error, and how can it be avoided?

mayastor-2024-08-08--23-06-57-UTC.tar.gz

@tiagolobocastro
Contributor

From the logs it seems the agent-core pod either crashed or was force-deleted?

After a crash or force delete, the restarted agent-core can fail while waiting for the previous lease to time out and then crash again. This is unfortunate because we lose the previous-previous logs, since only the previous container's logs are kept.
Perhaps we need to increase the timeout here, so we don't crash in this scenario, allowing us to see the proper previous logs.
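
For reference, a quick way to check whether that happened (the namespace, label, and deployment names below are assumptions for a default install; adjust to yours):

  # did agent-core restart, and why (previous container logs)?
  kubectl -n mayastor get pods -l app=agent-core
  kubectl -n mayastor logs deploy/agent-core --previous

  # the lease agent-core takes for exclusive ownership
  kubectl -n mayastor get lease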

@tiagolobocastro added the "Need more info" and "waiting on user input" labels on Aug 12, 2024
@veenadong
Author

It seems that when agent-core crashed, the api-rest pod somehow held on to the old connection and did not release it.
Would increasing the number of api-rest pods or agent-core pods help in this case?

@tiagolobocastro
Contributor

What makes you say it doesn't release the connection? From the logs, the very next request succeeded:

  2024-08-08T15:56:34.164032Z ERROR grpc::operations::volume::client: error: Aborted: status: Unknown, message: "transport error", details: [], metadata: MetadataMap { headers: {} }
    at control-plane/grpc/src/operations/volume/client.rs:75

  2024-08-08T15:56:34.164138Z  INFO actix_web::middleware::logger: 172.23.46.64 "GET /v0/volumes/bb1da7e5-ae9c-4af4-9835-f506abedf1e2 HTTP/1.1" 503 141 "-" "-" 0.001829
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420

  2024-08-08T15:57:17.565574Z  INFO actix_web::middleware::logger: 172.23.46.208 "GET /v0/nodes/onprem-node20.hstlabs.glcp.hpecorp.net/pools/onprem-node20.hstlabs.glcp.hpecorp.net-disk HTTP/1.1" 200 600 "-" "-" 0.002888
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420

@veenadong
Author

veenadong commented Oct 8, 2024

We tried upgrading to 2.7.1. After a single node failure, trying to get volume/pool/node information fails:

core@sc-os-72-node1:~$ kubectl mayastor get volumes
Failed to list volumes. Error error in request: request timed out

The api-rest pods are showing:


  2024-10-08T18:02:44.031323Z ERROR grpc::operations::volume::client: error: Aborted: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
    at control-plane/grpc/src/operations/volume/client.rs:76

  2024-10-08T18:02:45.528493Z  INFO actix_web::middleware::logger: 10.245.250.73 "GET /v0/volumes?max_entries=200&starting_token=0 HTTP/1.1" 503 143 "-" "-" 29.292117
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420

  2024-10-08T18:08:42.727556Z ERROR grpc::operations::pool::client: error: Aborted: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
    at control-plane/grpc/src/operations/pool/client.rs:74

  2024-10-08T18:08:42.730715Z ERROR grpc::operations::pool::client: error: Aborted: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
    at control-plane/grpc/src/operations/pool/client.rs:74

  2024-10-08T18:08:45.331381Z  INFO actix_web::middleware::logger: 172.23.156.168 "GET /v0/nodes/sc-os-72-node3.glcpdev.cloud.hpe.com/pools/sc-os-72-node3.glcpdev.cloud.hpe.com-disk HTTP/1.1" 503 143 "-" "-" 27.097990
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420

  2024-10-08T18:08:45.427465Z  INFO actix_web::middleware::logger: 172.23.156.168 "GET /v0/nodes/sc-os-72-node1.glcpdev.cloud.hpe.com/pools/sc-os-72-node1.glcpdev.cloud.hpe.com-disk HTTP/1.1" 503 143 "-" "-" 39.386821
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420

  2024-10-08T18:22:07.927418Z  INFO actix_web::middleware::logger: 172.23.127.192 "PUT /v0/app-nodes/sc-os-72-node1.glcpdev.cloud.hpe.com HTTP/1.1" 503 211 "-" "-" 46.008041
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420

  2024-10-08T18:24:08.525963Z ERROR grpc::operations::pool::client: error: Aborted: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
    at control-plane/grpc/src/operations/pool/client.rs:74

  2024-10-08T18:24:44.626883Z  INFO actix_web::middleware::logger: 172.23.156.168 "GET /v0/nodes/sc-os-72-node2.glcpdev.cloud.hpe.com/pools/sc-os-72-node2.glcpdev.cloud.hpe.com-disk HTTP/1.1" 503 143 "-" "-" 142.793763
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420

  2024-10-08T18:25:23.124634Z ERROR grpc::operations::pool::client: error: Aborted: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
    at control-plane/grpc/src/operations/pool/client.rs:74

mayastor-2024-10-08--17-59-12-UTC.tar.gz

We tried restarting the api-rest pods (we have replicas set to 3, one running on each node); rough command equivalents are sketched below:

Restarted the pod that had the error
Restarted all 3 api-rest pods, still failing
Rebooted all 3 nodes, still failing
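
Something like (deployment and namespace names are assumptions for a default install):

  # restart just the api-rest pods
  kubectl -n mayastor rollout restart deploy/api-rest

  # or delete a single failing pod and let the deployment recreate it
  kubectl -n mayastor delete pod <api-rest-pod-name>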

@tiagolobocastro
Contributor

Btw, I'm not sure a replica count of 3 on the rest-api is helping; I suggest leaving it at 1: https://github.com/openebs/mayastor/issues/1728#issuecomment-2333822227
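
A minimal way to do that on a running install (the deployment name is an assumption; check with kubectl -n mayastor get deploy):

  # scale the rest API down to a single replica
  kubectl -n mayastor scale deploy/api-rest --replicas=1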

I think the first timeout is expected:

DNS error: failed to lookup address information: Temporary failure in name resolution

The agent-core was restarting (it was probably on the failed node?), and it seems DNS was not ready yet either.

For the other errors, it seems there were connection issues with etcd, possibly caused by IO issues on the etcd storage:

{"level":"warn","ts":"2024-10-08T17:08:11.955Z","caller":"etcdserver/v3_server.go:840","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":870763100169690875,"retry-timeout":"500ms"}
...
{"level":"warn","ts":"2024-10-08T17:08:24.510Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"946609bde8186189","rtt":"560.600277ms","error":"dial tcp: i/o timeout"}

Interestingly, at the same time the io-engine also had issues:

[2024-10-08T17:08:26.487185768+00:00 ERROR mayastor::spdk:nvme_fabric.c:596] Connect command failed, rc -5, trtype:TCP adrfam:IPv4 traddr:10.245.250.73 trsvcid:8420 subnqn:nqn.2019-05.io.openebs:724b592a-8954-43c3-a93b-947503273787   
[2024-10-08T17:08:26.487228003+00:00 ERROR mayastor::spdk:nvme_fabric.c:607] Connect command completed with error: sct 1, sc 1   
[2024-10-08T17:08:26.487238003+00:00 ERROR mayastor::spdk:nvme_tcp.c:2017] Failed to poll NVMe-oF Fabric CONNECT command   
[2024-10-08T17:08:26.487242617+00:00 ERROR mayastor::spdk:nvme_tcp.c:1806] Failed to connect tqpair=0x561e4b0c2440   

I wonder, was there any issue in the cluster around this time? In general etcd seems to be logging lots of warnings, as you can see, perhaps indicating some issue with the infrastructure?

[2024-10-08T17:01:22.750465391+00:00  INFO io_engine::core::reactor:reactor.rs:189] Scheduled SPDK thread 'mayastor_nvmf_tcp_pg_core_1' (0x561e49cc1a80) on core #1
[2024-10-08T17:01:22.750524701+00:00  INFO io_engine::core::reactor:reactor.rs:189] Scheduled SPDK thread 'mayastor_nvmf_tcp_pg_core_2' (0x561e49cc20d0) on core #2

Are you running on isolated cores? https://openebs.io/docs/user-guides/replicated-storage-user-guide/replicated-pv-mayastor/additional-information/performance-tips#cpu-isolation
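
A quick way to check whether isolation is in place on a node (generic Linux commands, not Mayastor-specific):

  # cores the kernel has isolated (empty output means none)
  cat /sys/devices/system/cpu/isolated

  # the kernel cmdline should contain something like isolcpus=... when isolation is configured
  cat /proc/cmdline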

@veenadong
Author

"2024-10-08T17:08:26.487185768" is around when we had to restart all the nodes hoping to resolve this issue.

Scaled the api-rest to replicas=1 -- still getting timed out
Restarted agent-core -- still timed out
Restarted agent-ha pods -- still timed out
Restarted api-rest pod -- the commands started to work

But restarting all nodes did not fix the issue.

@veenadong
Author

Further observations after more testing: restarting the api-rest pod does work. However, after a certain amount of time (40-60 minutes), the pod hangs (i.e. we cannot exec into the pod). If we are already exec'd into the pod, commands such as "netstat" fail to respond.

The operator_diskpool log shows API calls starting to fail intermittently, then all requests failing.

  2024-10-09T23:16:25.951383Z  INFO operator_diskpool::context: new resource_version inserted, name: "sc-os-72-node1.glcpdev.cloud.hpe.com-disk"
    at k8s/operators/src/pool/context.rs:131
    in operator_diskpool::reconcile with name: sc-os-72-node1.glcpdev.cloud.hpe.com, status: Some(DiskPoolStatus { cr_state: Created, pool_status: Some(Online), capacity: 2196871577600, used: 164953587712, available: 2031917989888 })
    in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1beta2.openebs.io/sc-os-72-node1.glcpdev.cloud.hpe.com-disk.mayastor, object.reason: object updated

  2024-10-09T23:16:30.909512Z  INFO operator_diskpool::context: new resource_version inserted, name: "sc-os-72-node2.glcpdev.cloud.hpe.com-disk"
    at k8s/operators/src/pool/context.rs:131
    in operator_diskpool::reconcile with name: sc-os-72-node2.glcpdev.cloud.hpe.com, status: Some(DiskPoolStatus { cr_state: Created, pool_status: Some(Online), capacity: 2196871577600, used: 194192080896, available: 2002679496704 })
    in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1beta2.openebs.io/sc-os-72-node2.glcpdev.cloud.hpe.com-disk.mayastor, object.reason: object updated

  2024-10-09T23:17:36.227506Z  INFO operator_diskpool::context: new resource_version inserted, name: "sc-os-72-node2.glcpdev.cloud.hpe.com-disk"
    at k8s/operators/src/pool/context.rs:131
    in operator_diskpool::reconcile with name: sc-os-72-node2.glcpdev.cloud.hpe.com, status: Some(DiskPoolStatus { cr_state: Created, pool_status: Some(Online), capacity: 2196871577600, used: 194196275200, available: 2002675302400 })
    in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1beta2.openebs.io/sc-os-72-node2.glcpdev.cloud.hpe.com-disk.mayastor, object.reason: object updated

  2024-10-09T23:18:09.029991Z ERROR operator_diskpool::context: Pool not found, clearing status, name: "sc-os-72-node1.glcpdev.cloud.hpe.com-disk"
    at k8s/operators/src/pool/context.rs:230
    in operator_diskpool::context::pool_check with name: "sc-os-72-node1.glcpdev.cloud.hpe.com-disk", status: Some(DiskPoolStatus { cr_state: Created, pool_status: Some(Online), capacity: 2196871577600, used: 164970364928, available: 2031901212672 })
    in operator_diskpool::reconcile with name: sc-os-72-node1.glcpdev.cloud.hpe.com, status: Some(DiskPoolStatus { cr_state: Created, pool_status: Some(Online), capacity: 2196871577600, used: 164970364928, available: 2031901212672 })
    in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1beta2.openebs.io/sc-os-72-node1.glcpdev.cloud.hpe.com-disk.mayastor, object.reason: reconciler requested retry

  2024-10-09T23:18:09.037230Z  INFO operator_diskpool::context: new resource_version inserted, name: "sc-os-72-node1.glcpdev.cloud.hpe.com-disk"
    at k8s/operators/src/pool/context.rs:131
    in operator_diskpool::reconcile with name: sc-os-72-node1.glcpdev.cloud.hpe.com, status: Some(DiskPoolStatus { cr_state: Created, pool_status: None, capacity: 0, used: 0, available: 0 })
    in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1beta2.openebs.io/sc-os-72-node1.glcpdev.cloud.hpe.com-disk.mayastor, object.reason: object updated

  2024-10-09T23:18:12.709461Z ERROR operator_diskpool::context: Pool not found, clearing status, name: "sc-os-72-node2.glcpdev.cloud.hpe.com-disk"
    at k8s/operators/src/pool/context.rs:230
    in operator_diskpool::context::pool_check with name: "sc-os-72-node2.glcpdev.cloud.hpe.com-disk", status: Some(DiskPoolStatus { cr_state: Created, pool_status: Some(Online), capacity: 2196871577600, used: 194196275200, available: 2002675302400 })
    in operator_diskpool::reconcile with name: sc-os-72-node2.glcpdev.cloud.hpe.com, status: Some(DiskPoolStatus { cr_state: Created, pool_status: Some(Online), capacity: 2196871577600, used: 194196275200, available: 2002675302400 })
    in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1beta2.openebs.io/sc-os-72-node2.glcpdev.cloud.hpe.com-disk.mayastor, object.reason: reconciler requested retry

mayastor-2024-10-09--23-23-04-UTC_api_rest.tar.gz

@tiagolobocastro removed the "waiting on user input" label on Oct 10, 2024
@tiagolobocastro
Contributor

Further observations after more testing: restarting the api-rest pod does work. However, after a certain amount of time (40-60 minutes), the pod hangs (i.e. we cannot exec into the pod). If we are already exec'd into the pod, commands such as "netstat" fail to respond.

That's interesting. If you ssh into the node, can you check what the process is doing? I.e. is it sleeping or churning CPU?
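
Something along these lines on the node would tell (the process/binary name is an assumption; find the right PID with ps aux first):

  # S = sleeping, R = running, D = uninterruptible IO wait
  PID=$(pidof rest)
  grep -E 'State|Threads' /proc/$PID/status

  # per-thread CPU usage for that process
  top -H -p $PID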

Also interesting: rest requests in general are taking a very long time, most likely close to the timeout:

  2024-10-09T23:24:25.876020Z  INFO actix_web::middleware::logger: 172.23.127.192 "GET /v0/nodes/sc-os-72-node1.glcpdev.cloud.hpe.com HTTP/1.1" 200 448 "-" "-" 29.490918
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420

  2024-10-09T23:24:26.375766Z  INFO actix_web::middleware::logger: 172.23.127.198 "GET /v0/nodes/sc-os-72-node1.glcpdev.cloud.hpe.com/pools/sc-os-72-node1.glcpdev.cloud.hpe.com-disk HTTP/1.1" 200 590 "-" "-" 27.197195
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420

  2024-10-09T23:24:26.475765Z  INFO actix_web::middleware::logger: 172.23.127.198 "GET /v0/nodes/sc-os-72-node3.glcpdev.cloud.hpe.com/pools/sc-os-72-node3.glcpdev.cloud.hpe.com-disk HTTP/1.1" 200 590 "-" "-" 29.196916
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420

27s, 29s...

Would you be able to enable jaeger tracing? https://github.com/openebs/mayastor-extensions/blob/44f76e6520f51ed16d8586bff85b99e109e30ea1/chart/values.yaml#L102
You'll need to have cert-manager installed.
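
A rough sketch of what that looks like (release/chart names and the exact values key are assumptions; verify them against the linked values.yaml for your chart version):

  # cert-manager is required first
  helm repo add jetstack https://charts.jetstack.io
  helm upgrade --install cert-manager jetstack/cert-manager -n cert-manager --create-namespace --set installCRDs=true

  # then turn tracing on in the mayastor chart
  helm upgrade mayastor mayastor/mayastor -n mayastor --reuse-values --set base.jaeger.enabled=true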
