
restore fails with "waiting for another restore to finish" #887

Open
Lobo75 opened this issue Sep 3, 2024 · 2 comments

Lobo75 commented Sep 3, 2024

Report

I found this community post describing the same issue: https://forums.percona.com/t/manually-restoring-multiple-times-is-not-working/27301, but the link to the Jira ticket in that thread is invalid.

I ran into this same error on Percona Operator 2.4.1, where I had to restore a database after a cluster failure. The first restore failed because the restore time I selected was invalid; the operator could not find a valid restore point for it. Fixing the restore time was not possible: kubectl delete on the restore YAML reported that the restore was deleted, but the operator did not seem to notice. All further restore attempts, even under different names, also failed, exactly as described in the community forum.
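
For context, a point-in-time restore with this operator is requested through a PerconaPGRestore object roughly like the one below (the cluster, repo, restore name, and timestamp are placeholders, not the exact values I used):

apiVersion: pgv2.percona.com/v2
kind: PerconaPGRestore
metadata:
  name: restore1
spec:
  pgCluster: cluster1
  repoName: repo1
  options:
  - --type=time
  - --target="2024-08-30 10:00:00+00"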

I could not find any way to even list which restores the operator thought were still running.

As a further test I deleted the cluster and re-created it with the same name. The Percona operator saw the new cluster and tried to re-start the failed restore again and again, finally giving up after 5 more attempts. Even deleting the cluster does not signal the operator to drop any failed or in-progress restores.

There needs to be a way to list the restores and completely delete them so a new one can be started.
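
The PerconaPGRestore objects themselves can be listed and deleted with kubectl, for example (namespace and restore name here are illustrative):

kubectl get perconapgrestore -n <namespace>
kubectl delete perconapgrestore restore1 -n <namespace>

but once they are gone there seems to be no way to see, let alone clear, the restore the operator is still tracking internally.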

More about the problem

I would have expected that running kubectl delete on the restore YAML would have stopped any further restore attempts.

Steps to reproduce

See the Community message posted above.

Versions

  1. Kubernetes 1.27.11
  2. Operator 2.4.1
  3. Database postgres 15

Anything else?

This is a very serious problem: there is no way to fix a failed restore, and the database stays down because of the ongoing restore attempts.

@Lobo75 Lobo75 added the bug Something isn't working label Sep 3, 2024

hors commented Sep 6, 2024

@Lobo75 I have reproduced it and we will fix it in the next PG release, 2.6.0.

Steps to reproduce:

  1. Run a restore with some wrong option, e.g. an invalid --target time.
  2. Do not wait until the restore fails completely (only until the first try) and remove the restore object manually.

As a result, the restore section was not removed from the pg object:

spec:
  backup:      
    restore:
      enabled: true
      options:
      - --target="2022-11-30 15:12:11+03"
      - --type=time
      repoName: repo1
      resources: {} 

The workaround is to remove this restore section manually:

kubectl edit pg cluster1

    restore:
      enabled: true
      options:
      - --target="2022-11-30 15:12:11+03"
      - --type=time
      repoName: repo1
      resources: {} 

or just set

spec:
  backup:      
    restore:
      enabled: false

and only after that run a new restore.
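
A non-interactive equivalent of the kubectl edit step above would be a merge patch like this (a sketch only, using the same example cluster name):

kubectl patch pg cluster1 --type=merge -p '{"spec":{"backup":{"restore":{"enabled":false}}}}'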

hors commented Sep 6, 2024

I have created a task for this bug https://perconadev.atlassian.net/browse/K8SPG-637
