
Performance degradation in large deployments after upgrade from PSMDB operator 1.14.0 to 1.16.1 #1635

Open
MatzeScandio opened this issue Aug 30, 2024 · 4 comments


Report

Performance degradation in large deployments after upgrade from PSMDB operator 1.14.0 to 1.16.1

  • we tested the upgrade in our DEV environment and did not see any performance issues
  • after upgrading the operator in our PROD environment we noticed a significant slowdown
  • the PROD environment is significantly larger, with ~90 PSMDBs and a retention period of 30 days, resulting in 2700 psmdb-backup objects
  • while the creation of a new psmdb database took about 5 minutes with operator version 1.14.0, it took ~6 hours with version 1.16.1

More about the problem

Analysis

  1. we identified an unusually high number of calls to the backup API (/apis/psmdb.percona.com/v1/namespaces/mongodb/perconaservermongodbbackups) as the main contributor to this behaviour

    • we have a lot of psmdb-backup resources (about 2700)
    • the response time of one request is ~1s and the API is called roughly 55 times a minute
    • the API request is performed in pkg/controller/perconaservermongodb/backup.go#L211
    • we also noticed that, instead of making 1 request to the API per reconcile, the reconcile function calls it 90 times (equal to the number of deployed PSMDBs)
  2. the actual bug seems to be in pkg/controller/perconaservermongodb/backup.go#L145

    • the for loop iterates over all cronjobs and compares their names to the custom resource backup tasks
    • as each db has a backup task with the same name ('daily'), this condition matches for all 90 cron backup jobs, and so the subsequent call to oldScheduledBackups() also happens 90 times (see the illustration after this list)
  3. we are unsure why it didn't happen before version 1.16.1 as the backup code is mostly unchanged

    • we suspect the caching behaviour changed and therefore this bug is now more visible
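
To illustrate the name collision from point 2, here are two abbreviated backup task blocks for two hypothetical clusters (cluster names and values are examples, not our actual configuration); because both tasks are named 'daily', a comparison based on the task name alone matches the cron jobs of both clusters during a single reconcile:

    # cluster "abc"
    tasks:
    - name: daily
      keep: 30
      schedule: 51 3 * * *
      storageName: backup-mongodb

    # cluster "xyz"
    tasks:
    - name: daily
      keep: 30
      schedule: 51 3 * * *
      storageName: backup-mongodb

With 90 clusters configured this way, a single reconcile can trigger up to 90 list requests at ~1s each.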

Workaround

  1. rename each backup task to have a unique identifier

    • e.g. for a database xyz -> 'daily-xyz' instead of 'daily'
    • this ensures only 1 call is made to the API per reconcile request
    • caveat: the info log line pkg/controller/perconaservermongodb/backup.go#L163 will now be printed 89 times per reconcile call
  2. disable the cleanup for each backup task by setting keep=0 and write a custom k8s cronjob that deletes any psmdb-backup older than 30 days (a sketch of such a cronjob is shown after this list)

    • works if there is no need for individual retention periods per db

    • eliminates API requests altogether, speeding up the reconcile calls significantly
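
A minimal sketch of such a cleanup CronJob, assuming an image that ships kubectl and GNU date and a service account (both placeholders here) with RBAC permissions to list and delete psmdb-backup objects:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: psmdb-backup-cleanup            # placeholder name
  namespace: mongodb
spec:
  schedule: "30 4 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: psmdb-backup-cleanup  # placeholder; needs list/delete on perconaservermongodbbackups
          restartPolicy: OnFailure
          containers:
          - name: cleanup
            image: bitnami/kubectl:1.24             # placeholder; any image with kubectl and GNU date works
            command:
            - /bin/sh
            - -c
            - |
              # delete every psmdb-backup whose creation timestamp is older than 30 days
              cutoff=$(date -d "30 days ago" +%s)
              kubectl get psmdb-backup -n mongodb \
                -o jsonpath='{range .items[*]}{.metadata.name} {.metadata.creationTimestamp}{"\n"}{end}' \
              | while read name ts; do
                  if [ "$(date -d "$ts" +%s)" -lt "$cutoff" ]; then
                    kubectl delete psmdb-backup -n mongodb "$name"
                  fi
                done

The retention period is hardcoded to 30 days here; per-database retention would require extending the script or going back to per-task keep values.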

Steps to reproduce

  1. create 5 databases, enable backups for each, create a backup task named 'daily', and set the keep attribute to a value above 0
  2. monitor the kubernetes API calls for psmdb-backup resources
  3. for each reconcile call of the psmdb object there should be 5 requests to the API: /apis/psmdb.percona.com/v1/namespaces/mongodb/perconaservermongodbbackups?labelSelector=ancestor%3Ddaily%2Ccluster%3D<db-name>

With just 5 databases and a limited number of backups this will of course not result in a slowdown, but you will be able to see the repeated calls to the API endpoint.

Alternatively

  1. create 5 databases, enable backups for each, and use unique task names this time; set the keep attribute to a value above 0
  2. in the logs you should see a lot of 'deleting outdated backup job' events (4 log lines per psmdb reconcile call)

Versions

  • Kubernetes: AWS-EKS 1.24
  • Operator: 1.16.1
  • Database: 5.0.23-20

Anything else?

  • feel free to ask if anything is unclear
spron-in (Collaborator) commented Sep 3, 2024

Hey @MatzeScandio - thanks for sharing.

We will have a look.
Just out of curiosity, a few questions:

  1. seems your EKS version is quite old - any reason for that?
  2. we usually recommend running around 10-15 clusters per operator pod, mostly to have some blast-radius control. We usually see users doing that through tenant management, having clusters spread across multiple namespaces. Have you thought about it? I'm curious to learn more about your use case.

hors (Collaborator) commented Sep 3, 2024

Hi @MatzeScandio, I need to know more about your clusters. For example, do you use PITR? Can you provide one of your CRs?

MatzeScandio (Author) commented:

Hi @spron-in - thanks for your reply. To answer your questions:

  1. It was a management decision to prioritize other tasks first. AWS offers extended support for 1.24 until the beginning of next year.
  2. It is a self-service where users can request new databases via a service broker. It was not initially planned to handle the number of DBs we have now, but the service became increasingly popular and we now have two big operator deployments, one with 90 DBs and a second with 200 DBs. We are aware we need to redesign our deployment, but like the EKS upgrade it is currently not prioritized. Thanks for the recommended number of clusters per operator; I will bring it up in our discussion regarding the redesign.

MatzeScandio (Author) commented Sep 4, 2024

Hi @hors - thanks for your reply.

We currently have PITR disabled.

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: xyz
  namespace: mongodb
spec:
  backup:
    enabled: true
    image: <REDACTED>.dkr.ecr.eu-central-1.amazonaws.com/percona/percona-backup-mongodb:2.0.5
    pitr:
      enabled: false
    serviceAccountName: psmdb-operator
    storages:
      backup-mongodb:
        s3:
          bucket: <REDACTED>
          credentialsSecret: <REDACTED>
          prefix: xyz
          region: eu-central-1
        type: s3
    tasks:
    - compressionType: gzip
      enabled: true
      keep: 0
      name: daily-xyz
      schedule: 51 3 * * *
      storageName: backup-mongodb
  clusterServiceDNSMode: External
  crVersion: 1.16.1
  image: <REDACTED>.dkr.ecr.eu-central-1.amazonaws.com/percona/percona-server-mongodb:5.0.23-20
  imagePullPolicy: Always
  platform: kubernetes
  pmm:
    enabled: false
    image: <REDACTED>.dkr.ecr.eu-central-1.amazonaws.com/percona/pmm-client:2.35.0
    serverHost: monitoring-service
  replsets:
  - affinity:
      antiAffinityTopologyKey: failure-domain.beta.kubernetes.io/zone
    arbiter:
      enabled: false
      size: 0
    expose:
      enabled: true
    name: rs0
    podDisruptionBudget:
      maxUnavailable: 1
    resources:
      requests:
        cpu: 150m
        memory: 640Mi
    size: 3
    volumeSpec:
      persistentVolumeClaim:
        resources:
          requests:
            storage: 1Gi
        storageClassName: gp3-xfs

This is after applying the workaround and disabling the cleanup of old backups. Before the workaround the backup task block was:

    tasks:
    - compressionType: gzip
      enabled: true
      keep: 30
      name: daily
      schedule: 51 3 * * *
      storageName: backup-mongodb
