
Can't delete disk pools and failed replicas do not rebuild. #1744

Open
evilezh opened this issue Sep 25, 2024 · 2 comments
Labels
BUG Something isn't working

Comments


evilezh commented Sep 25, 2024

I have pools stuck like:
k8s-node-6-nvme4n1 aio:///dev/nvme4n1?uuid=5f08abc6-6924-446e-997b-24685c967cc2 true k8s-node-6 Online 1.7TiB 1.6TiB 171.7GiB 1.6TiB
k8s-node-4-nvme3n1 /dev/nvme3n1 true k8s-node-4 Unknown 0 B 0 B 0 B
k8s-node-7-nvme2n1 /dev/nvme2n1 true k8s-node-7 Unknown 0 B 0 B 0 B
k8s-node-7-nvme3n1 /dev/nvme3n1 true k8s-node-7 Unknown 0 B 0 B 0 B

The last three entries are stuck.

I deleted the CRDs, but that does not seem to help.
Then I noticed that there are bad replicas which reference those pools (maybe that is why the pools are not removed):

ec945ba5-b62f-43c9-8bd8-bbbadc81c7a7 861d034a-2ede-4a68-a925-837659541710 k8s-node-7 k8s-node-7-nvme3n1 Unknown
└─ 1a25f0a1-40f8-44c5-9f20-3f2949924b7c k8s-node-8 k8s-node-8-nvme3n1 Online 10GiB 10GiB 0 B

mayastor-2024-09-25--17-51-42-UTC.tar.gz

So, here are two questions: how do I get rid of the bad pools, and how do I force the replicas to be re-allocated/rebuilt?

@tiagolobocastro tiagolobocastro added the BUG Something isn't working label Oct 2, 2024
@tiagolobocastro (Contributor) commented:

For node 4, the pool seems to be deadlocked. Do you have any older logs so we could identify the cause?
For node 7, it seems the disk must have been swapped?

[2024-09-25T16:49:47.359131616+00:00  INFO io_engine::grpc::v1::pool:pool.rs:325] ImportPoolRequest { name: "k8s-node-7-nvme2n1", uuid: None, disks: ["/dev/nvme2n1"], pooltype: Lvs }
[2024-09-25T16:49:47.359598164+00:00 ERROR io_engine::lvs::lvs_store:lvs_store.rs:329] error=EILSEQ: Illegal byte sequence, failed to import pool k8s-node-7-nvme2n1

There's no pool on /dev/nvme2n1. I suggest you use stable device links, e.g. /dev/disk/by-id/: https://openebs.io/docs/user-guides/replicated-storage-user-guide/replicated-pv-mayastor/rs-configuration
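For reference, a DiskPool spec using a stable device link might look like the sketch below. This is an illustration only: the `apiVersion` may differ between Mayastor releases, and the by-id path is a placeholder you would look up with `ls -l /dev/disk/by-id/` on the node.

```yaml
# Hypothetical DiskPool spec using a stable /dev/disk/by-id/ link
# (the by-id path is a placeholder, not a real device id).
apiVersion: "openebs.io/v1beta1"
kind: DiskPool
metadata:
  name: k8s-node-7-nvme2n1
  namespace: mayastor
spec:
  node: k8s-node-7
  disks: ["/dev/disk/by-id/nvme-<serial>"]
```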

Then on /dev/nvme3n1 we seem to get stuck, but here I think I now get what's going on, see:
You have loaded pool k8s-node-4-nvme3n1-1 on node-7:

[2024-09-25T16:49:28.178582755+00:00  INFO io_engine::grpc::v1::pool:pool.rs:325] ImportPoolRequest { name: "k8s-node-4-nvme3n1-1", uuid: None, disks: ["/dev/nvme3n1"], pooltype: Lvs }

And then the node-7 pool (k8s-node-7-nvme3n1) tries to use the same device, /dev/nvme3n1:

[2024-09-25T16:49:28.393171631+00:00  INFO io_engine::grpc::v1::pool:pool.rs:325] ImportPoolRequest { name: "k8s-node-7-nvme3n1", uuid: None, disks: ["/dev/nvme3n1"], pooltype: Lvs }
[2024-09-25T16:49:28.465014989+00:00 ERROR io_engine::lvs::lvs_store:lvs_store.rs:329] error=EBUSY: Device or resource busy, failed to import pool /dev/nvme3n1

Although I'm not sure why we get EBUSY here; it should have returned an error saying another pool already exists on the same device. This is likely a bug.
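As an aside, before re-creating or re-importing a pool it can help to confirm what is actually on the device. A minimal sketch, printed as a dry run since these commands must be run on the storage node itself:

```shell
# Sketch: inspect a device before (re)creating a pool on it.
# Printed as a dry run; execute the echoed commands on the storage node.
DEV=/dev/nvme3n1
echo "lsblk -o NAME,SIZE,FSTYPE,MOUNTPOINT $DEV"  # what is on the disk now
echo "blkid $DEV"                                 # on-disk signatures, if any
```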


evilezh commented Oct 9, 2024

Unfortunately I do not have older logs.
There was a bit of a mess with multiple reboot cycles across all servers.
I agree about stable device links; the question is: can I change that on the fly, e.g. update spec.disks?

The story about node7 on /dev/nvme3n1: that disk errored out with "Illegal byte sequence". As it was not usable, I deleted the CRD and then created a new CRD, but due to a copy+paste mistake it got the wrong resource name. As you see, the device is online, but the previous entry (k8s-node-7-nvme3n1) was not removed after I deleted the CRD. It is stuck there.

Node7 /dev/nvme2n1: that disk was repurposed and reformatted by accident. Once we figured that out, we wanted to remove it from the pool. We expected that as soon as it was removed from the pool, Mayastor would rebalance and the data that was there would be re-distributed across the active pools.

k8s-node-4-nvme3n1-1 vs. k8s-node-4-nvme3n1: k8s-node-4-nvme3n1 was failing and we did an nvme wipe. I was not able to bring it back into the array with the same name (the CRD with that id was no longer there, as I had deleted it, but when I tried to re-use the name it errored out), so we chose a new name.

Now I have those 3 entries stuck and I can't get rid of them, and the replicas associated with those entries are still there. Is there an API call, or a direct database edit I can do, to make them go away and have the replicas assigned to those disks re-distributed across live pools?
Also, is deleting the CRD the right way to remove pools?
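One pattern sometimes used for Kubernetes resources stuck in deletion is clearing the object's finalizers so it can actually be removed. A hedged sketch using the names from this thread, printed as a dry run for review first; note that clearing finalizers bypasses the operator's own cleanup, so this is only appropriate when the control plane cannot reconcile the pool any other way:

```shell
# Hypothetical cleanup of a stuck DiskPool CR (names taken from this thread).
# Printed as a dry run; review, then run the echoed commands against the cluster.
NS=mayastor
POOL=k8s-node-7-nvme3n1
echo "kubectl -n $NS get diskpool $POOL -o yaml"   # inspect state and finalizers first
echo "kubectl -n $NS patch diskpool $POOL --type merge -p '{\"metadata\":{\"finalizers\":null}}'"
echo "kubectl -n $NS delete diskpool $POOL"
```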

Maybe it is worth mentioning that all of this happened on Mayastor v2.2.0. As we couldn't resolve the issue, I was hoping an upgrade to 2.6.1 would help.
