
Performance degradation in large deployments after upgrade from PSMDB operator 1.14.0 to 1.16.1 #1635

Open
MatzeScandio opened this issue Aug 30, 2024 · 4 comments


Report

Performance degradation in large deployments after upgrade from PSMDB operator 1.14.0 to 1.16.1

  • we tested the upgrade in our DEV environment and did not see any performance issues
  • after upgrading the operator in our PROD environment we noticed a significant slowdown
  • the PROD environment is significantly larger, with ~90 PSMDBs and a retention period of 30 days, resulting in 2700 psmdb-backup objects
  • while the creation of a new psmdb database took about 5 minutes with operator version 1.14.0, it took ~6 hours with version 1.16.1

More about the problem

Analysis

  1. we identified an unusually high number of calls to the backup API (/apis/psmdb.percona.com/v1/namespaces/mongodb/perconaservermongodbbackups) as the main contributor to this behaviour

    • we have a lot of psmdb-backup resources (about 2700)
    • the response time of one request is ~1s and the API is called roughly 55 times a minute
    • the API request is performed in pkg/controller/perconaservermongodb/backup.go#L211
    • we also noticed that, instead of making 1 request to the API per reconcile, the reconcile function calls it 90 times (equal to the number of deployed PSMDBs)
  2. the actual bug seems to be in pkg/controller/perconaservermongodb/backup.go#L145

    • the for loop iterates over all cronjobs and compares their names to the custom resource backup tasks
    • as each db has a backup task with the same name ('daily'), this condition matches for all 90 cron backup jobs, and so the subsequent call to oldScheduledBackups() also happens 90 times (see the illustration after this list)
  3. we are unsure why it didn't happen before version 1.16.1 as the backup code is mostly unchanged

    • we suspect the caching behaviour changed and therefore this bug is now more visible
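
To illustrate the name collision from point 2, here are two abbreviated backup task blocks for two hypothetical clusters (cluster names and values are examples, not our actual configuration); because both tasks are named 'daily', a comparison based on the task name alone matches the cron jobs of both clusters during a single reconcile:

    # cluster "abc"
    tasks:
    - name: daily
      keep: 30
      schedule: 51 3 * * *
      storageName: backup-mongodb

    # cluster "xyz"
    tasks:
    - name: daily
      keep: 30
      schedule: 51 3 * * *
      storageName: backup-mongodb

With 90 clusters configured this way, a single reconcile can trigger up to 90 list requests at ~1s each.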

Workaround

  1. rename each backup task to have a unique identifier

    • e.g. for a database xyz -> 'daily-xyz' instead of 'daily'
    • this ensures only 1 call is made to the API per reconcile request
    • caveat: the info log line pkg/controller/perconaservermongodb/backup.go#L163 will now be printed 89 times per reconcile call
  2. disable the cleanup for each backup task by setting keep=0 and write a custom k8s cronjob that deletes any psmdb-backup older than 30 days (a sketch of such a cronjob is shown after this list)

    • works if there is no need for individual retention periods per db

    • eliminates API requests altogether, speeding up the reconcile calls significantly
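
A minimal sketch of such a cleanup CronJob, assuming an image that ships kubectl and GNU date and a service account (both placeholders here) with RBAC permissions to list and delete psmdb-backup objects:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: psmdb-backup-cleanup            # placeholder name
  namespace: mongodb
spec:
  schedule: "30 4 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: psmdb-backup-cleanup  # placeholder; needs list/delete on perconaservermongodbbackups
          restartPolicy: OnFailure
          containers:
          - name: cleanup
            image: bitnami/kubectl:1.24             # placeholder; any image with kubectl and GNU date works
            command:
            - /bin/sh
            - -c
            - |
              # delete every psmdb-backup whose creation timestamp is older than 30 days
              cutoff=$(date -d "30 days ago" +%s)
              kubectl get psmdb-backup -n mongodb \
                -o jsonpath='{range .items[*]}{.metadata.name} {.metadata.creationTimestamp}{"\n"}{end}' \
              | while read name ts; do
                  if [ "$(date -d "$ts" +%s)" -lt "$cutoff" ]; then
                    kubectl delete psmdb-backup -n mongodb "$name"
                  fi
                done

The retention period is hardcoded to 30 days here; per-database retention would require extending the script or going back to per-task keep values.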

Steps to reproduce

  1. create 5 databases, enable backups for each, create a backup task named 'daily', and set the keep attribute to a value above 0
  2. monitor the kubernetes API calls for psmdb-backup resources
  3. for each reconcile call of the psmdb object there should be 5 requests to the API: /apis/psmdb.percona.com/v1/namespaces/mongodb/perconaservermongodbbackups?labelSelector=ancestor%3Ddaily%2Ccluster%3D<db-name>

With just 5 databases and a limited number of backups this will of course not result in a slowdown, but you will be able to see the repeated calls to the API endpoint.

Alternatively

  1. create 5 databases, enable backups for each, and use unique task names this time; set the keep attribute to a value above 0
  2. in the logs you should see a lot of 'deleting outdated backup job' events (4 log lines per psmdb reconcile call)

Versions

  • Kubernetes: AWS-EKS 1.24
  • Operator: 1.16.1
  • Database: 5.0.23-20

Anything else?

  • feel free to ask if anything is unclear
spron-in (Collaborator) commented Sep 3, 2024

Hey @MatzeScandio - thanks for sharing.

We will have a look.
Just out of curiosity, a few questions:

  1. seems your EKS version is quite old - any reason for that?
  2. we usually recommend running around 10-15 clusters per operator pod, mostly to have some blast-radius control. We usually see users doing that through tenant management, having clusters spread across multiple namespaces. Have you thought about it? I'm curious to learn more about your use case.

hors (Collaborator) commented Sep 3, 2024

Hi @MatzeScandio, I need to know more about your clusters. For example, do you use PITR? Can you provide one of your CRs?

MatzeScandio (Author) commented:

Hi @spron-in - thanks for your reply. To answer your questions:

  1. It was a management decision to prioritize other tasks first. AWS offers extended support for 1.24 until the beginning of next year.
  2. It is a self-service where users can request new databases via a service broker. It was not initially planned to handle the number of DBs we have now, but the service became increasingly popular and we now have two big operator deployments, one with 90 DBs and a second with 200 DBs. We are aware we need to redesign our deployment, but like the EKS upgrade it is currently not prioritized. Thanks for the recommended number of clusters per operator; I will bring it up in our discussion regarding the redesign.

MatzeScandio (Author) commented Sep 4, 2024

Hi @hors - thanks for your reply.

We currently have PITR disabled.

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: xyz
  namespace: mongodb
spec:
  backup:
    enabled: true
    image: <REDACTED>.dkr.ecr.eu-central-1.amazonaws.com/percona/percona-backup-mongodb:2.0.5
    pitr:
      enabled: false
    serviceAccountName: psmdb-operator
    storages:
      backup-mongodb:
        s3:
          bucket: <REDACTED>
          credentialsSecret: <REDACTED>
          prefix: xyz
          region: eu-central-1
        type: s3
    tasks:
    - compressionType: gzip
      enabled: true
      keep: 0
      name: daily-xyz
      schedule: 51 3 * * *
      storageName: backup-mongodb
  clusterServiceDNSMode: External
  crVersion: 1.16.1
  image: <REDACTED>.dkr.ecr.eu-central-1.amazonaws.com/percona/percona-server-mongodb:5.0.23-20
  imagePullPolicy: Always
  platform: kubernetes
  pmm:
    enabled: false
    image: <REDACTED>.dkr.ecr.eu-central-1.amazonaws.com/percona/pmm-client:2.35.0
    serverHost: monitoring-service
  replsets:
  - affinity:
      antiAffinityTopologyKey: failure-domain.beta.kubernetes.io/zone
    arbiter:
      enabled: false
      size: 0
    expose:
      enabled: true
    name: rs0
    podDisruptionBudget:
      maxUnavailable: 1
    resources:
      requests:
        cpu: 150m
        memory: 640Mi
    size: 3
    volumeSpec:
      persistentVolumeClaim:
        resources:
          requests:
            storage: 1Gi
        storageClassName: gp3-xfs

This is after applying the workaround and disabling the cleanup of old backups. Before the workaround the backup task block was:

    tasks:
    - compressionType: gzip
      enabled: true
      keep: 30
      name: daily
      schedule: 51 3 * * *
      storageName: backup-mongodb
