
Can't delete disk pools and failed replicas do not rebuild. #1744

Open
evilezh opened this issue Sep 25, 2024 · 2 comments
Labels
BUG Something isn't working

Comments


evilezh commented Sep 25, 2024

I have pools stuck like:
k8s-node-6-nvme4n1 aio:///dev/nvme4n1?uuid=5f08abc6-6924-446e-997b-24685c967cc2 true k8s-node-6 Online 1.7TiB 1.6TiB 171.7GiB 1.6TiB
k8s-node-4-nvme3n1 /dev/nvme3n1 true k8s-node-4 Unknown 0 B 0 B 0 B
k8s-node-7-nvme2n1 /dev/nvme2n1 true k8s-node-7 Unknown 0 B 0 B 0 B
k8s-node-7-nvme3n1 /dev/nvme3n1 true k8s-node-7 Unknown 0 B 0 B 0 B

The last three entries are stuck.

I deleted the CRDs, but that does not seem to help.
Then I noticed that there are bad replicas which reference those pools (maybe that is why the pools are not removed):

ec945ba5-b62f-43c9-8bd8-bbbadc81c7a7 861d034a-2ede-4a68-a925-837659541710 k8s-node-7 k8s-node-7-nvme3n1 Unknown
└─ 1a25f0a1-40f8-44c5-9f20-3f2949924b7c k8s-node-8 k8s-node-8-nvme3n1 Online 10GiB 10GiB 0 B

mayastor-2024-09-25--17-51-42-UTC.tar.gz

So, here are two questions: how do I get rid of the bad pools, and how do I force the replicas to be re-allocated/rebuilt?

@tiagolobocastro tiagolobocastro added the BUG Something isn't working label Oct 2, 2024
@tiagolobocastro (Contributor) commented:

For node 4, the pool seems to be deadlocked. Do you have any older logs so we could identify the cause?
For node 7, it seems the disk must have been swapped?

[2024-09-25T16:49:47.359131616+00:00  INFO io_engine::grpc::v1::pool:pool.rs:325] ImportPoolRequest { name: "k8s-node-7-nvme2n1", uuid: None, disks: ["/dev/nvme2n1"], pooltype: Lvs }
[2024-09-25T16:49:47.359598164+00:00 ERROR io_engine::lvs::lvs_store:lvs_store.rs:329] error=EILSEQ: Illegal byte sequence, failed to import pool k8s-node-7-nvme2n1

There's no pool on /dev/nvme2n1. I suggest you use stable device links, e.g. /dev/disk/by-id/: https://openebs.io/docs/user-guides/replicated-storage-user-guide/replicated-pv-mayastor/rs-configuration
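For reference, a DiskPool spec using a stable device link might look like the sketch below. This is an illustration only: the `apiVersion` may differ between Mayastor releases, and the by-id path is a placeholder you would look up with `ls -l /dev/disk/by-id/` on the node.

```yaml
# Hypothetical DiskPool spec using a stable /dev/disk/by-id/ link
# (the by-id path is a placeholder, not a real device id).
apiVersion: "openebs.io/v1beta1"
kind: DiskPool
metadata:
  name: k8s-node-7-nvme2n1
  namespace: mayastor
spec:
  node: k8s-node-7
  disks: ["/dev/disk/by-id/nvme-<serial>"]
```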

Then on /dev/nvme3n1 we seem to get stuck, but here I think I now get what's going on, see:
You have loaded pool k8s-node-4-nvme3n1-1 on node-7:

[2024-09-25T16:49:28.178582755+00:00  INFO io_engine::grpc::v1::pool:pool.rs:325] ImportPoolRequest { name: "k8s-node-4-nvme3n1-1", uuid: None, disks: ["/dev/nvme3n1"], pooltype: Lvs }

And then the node-7 pool (k8s-node-7-nvme3n1) tries to use the same device, /dev/nvme3n1:

[2024-09-25T16:49:28.393171631+00:00  INFO io_engine::grpc::v1::pool:pool.rs:325] ImportPoolRequest { name: "k8s-node-7-nvme3n1", uuid: None, disks: ["/dev/nvme3n1"], pooltype: Lvs }
[2024-09-25T16:49:28.465014989+00:00 ERROR io_engine::lvs::lvs_store:lvs_store.rs:329] error=EBUSY: Device or resource busy, failed to import pool /dev/nvme3n1

Although I'm not sure why we get EBUSY here; it should have returned an error saying another pool already exists on the same device. This is likely a bug.
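As an aside, before re-creating or re-importing a pool it can help to confirm what is actually on the device. A minimal sketch, printed as a dry run since these commands must be run on the storage node itself:

```shell
# Sketch: inspect a device before (re)creating a pool on it.
# Printed as a dry run; execute the echoed commands on the storage node.
DEV=/dev/nvme3n1
echo "lsblk -o NAME,SIZE,FSTYPE,MOUNTPOINT $DEV"  # what is on the disk now
echo "blkid $DEV"                                 # on-disk signatures, if any
```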


evilezh commented Oct 9, 2024

Unfortunately I do not have older logs.
There was a bit of a mess with multiple reboot cycles across all servers.
I agree about stable device links; the question is: can I change that on the fly, e.g. update spec.disks?

The story about node7 on /dev/nvme3n1: that disk errored out with "Illegal byte sequence". As it was not usable, I deleted the CRD and then created a new CRD, but due to a copy+paste mistake it got the wrong resource name. As you see, the device is online, but the previous entry (k8s-node-7-nvme3n1) was not removed after I deleted the CRD. It is stuck there.

Node7 /dev/nvme2n1: that disk was repurposed and reformatted by accident. Once we figured that out, we wanted to remove it from the pool. We expected that as soon as it was removed from the pool, Mayastor would rebalance and the data that was there would be re-distributed across the active pools.

k8s-node-4-nvme3n1-1 vs. k8s-node-4-nvme3n1: k8s-node-4-nvme3n1 was failing and we did an nvme wipe. I was not able to bring it back into the array with the same name (the CRD with that id was no longer there, as I had deleted it, but when I tried to re-use the name it errored out), so we chose a new name.

Now I have those 3 entries stuck and I can't get rid of them, and the replicas associated with those entries are still there. Is there an API call, or a direct database edit I can do, to make them go away and have the replicas assigned to those disks re-distributed across live pools?
Also, is deleting the CRD the right way to remove pools?
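One pattern sometimes used for Kubernetes resources stuck in deletion is clearing the object's finalizers so it can actually be removed. A hedged sketch using the names from this thread, printed as a dry run for review first; note that clearing finalizers bypasses the operator's own cleanup, so this is only appropriate when the control plane cannot reconcile the pool any other way:

```shell
# Hypothetical cleanup of a stuck DiskPool CR (names taken from this thread).
# Printed as a dry run; review, then run the echoed commands against the cluster.
NS=mayastor
POOL=k8s-node-7-nvme3n1
echo "kubectl -n $NS get diskpool $POOL -o yaml"   # inspect state and finalizers first
echo "kubectl -n $NS patch diskpool $POOL --type merge -p '{\"metadata\":{\"finalizers\":null}}'"
echo "kubectl -n $NS delete diskpool $POOL"
```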

Maybe it is worth mentioning that all of this happened on Mayastor v2.2.0. As we couldn't resolve the issue, I was hoping an upgrade to 2.6.1 would help.
