
Add Volume Replacement feature with support for pd, tidb, tikv #5150

Merged: 34 commits, Aug 30, 2023

Conversation

@anish-db (Contributor) commented Jul 6, 2023

What problem does this PR solve?

Implement a generic feature to make volume changes to TiDB clusters via replacement.
It detects volume changes (storage class change, size change, or a change in the number of volumes).
Notably, a change in the number of volumes, such as adding a separate TiKV raft disk, is handled poorly by the operator today.

The feature deletes replicas one by one and replaces them with new disks that use the new volume parameters, in a graceful way similar to how scale-in is implemented for components today (e.g. for TiKV, it deletes stores and waits for the transition).
It adds one spare replica (for PD & TiKV) to handle load while the current one is being deleted.

This PR implements support for the TiDB, PD, and TiKV components.

More details of the problem are described in https://docs.google.com/document/d/1WpAYVcF1fXpcuVH6mFlz9v0fbLG3EA9r2gsDgFdqW8E/edit#heading=h.sv3296iop7bh

What is changed and how does it work?

Detailed List of changes:

  • Refactor some volume-comparison utils from pod_vol_modifier.go & pvc_modifier.go into vol_compare_utils.go for reuse in this feature, plus some new functions to compare volume changes in pods, etc.
  • Explicitly detect extra volumes (or missing volumes) when comparing against the StatefulSet and return an error as unsupported for pvc_modifier's use case (this was previously ignored silently).
  • Add a tidb.pingcap.com/replace-disk annotation that can be used to trigger a pvc replace implemented in pkg/controller/tidbcluster/pod_control.go.
    • TiKV
      1. Get the Store ID for the pod via status/label. Ensure the label is set with the store ID; this is important to avoid races during scale-down later.
      2. Ensure all TiKVs are currently ready and stable.
      3. If the store is Up, delete it via PD and wait for it to become Tombstone.
      4. Delete all PVCs associated with the pod, then delete the pod itself.
      5. The StatefulSet will automatically recreate a replacement pod, which will come up on its own.
    • TiDB: If the other dependent components & TiDB instances are stable, go ahead and delete the pod's PVCs & the pod itself.
    • PD:
      1. Get the Member ID for this PD pod via label.
      2. If this pod is the leader, transfer leadership to the next available member.
      3. If this pod is still part of the PD membership & all PDs are ready, delete it as a member.
      4. Delete the PVCs & the pod, and let the StatefulSet recreate it.
  • Add a VolReplaceInProgress status field to the TidbCluster definition for the TiKV, TiDB & PD components, along with supporting functions in component_status.go.
  • Add a new PvcReplacer component that orchestrates the new feature with two functions, UpdateStatus() & Sync():
    • In tidb_cluster_control.go, UpdateStatus() runs before the component member managers sync, and Sync() runs after.
    • UpdateStatus() is responsible for detecting storage volume changes and sets the VolReplaceInProgress status field for the respective component.
    • Each supported component's member manager Sync() does the following when VolReplaceInProgress is set:
      • For TiKV and PD, an extra desired replica is configured, which triggers ScaleOut.
      • The StatefulSet's update strategy is set to OnDelete, so that config+volume changes can be rolled out atomically (by deleting the pod carrying the annotation).
      • Any actual config changes to the pod spec are suppressed from being updated by the member manager itself (bypassing its upgrade routines).
    • PvcReplacer's Sync() waits until scaling is complete, then deletes the StatefulSet (orphaning its pods) & waits for the member manager to recreate it (with the new config), then applies the tidb.pingcap.com/replace-disk annotation to each pod one by one.
  • Add a tidb.pingcap.com/tikv-no-active-store-since annotation for TiKV pods (set by pkg/controller/pod_control.go) when a TiKV pod has a valid store ID in its label but no active store in PD (via the status TiKV stores fields). Used to detect Tombstone stores that have been GC'ed too soon (due to a potential PD early-GC race when stores are deleted immediately after creation).
  • Handle an edge case in TiKV scale-in where a pod's store ID can be missing from PD and also not available as a tombstone (due to the PD GC race). Use the no-active-store-since annotation above to assume the store is tombstoned after some waiting time and scale it in.
  • Move MetaManager.Sync() in tidb_cluster_control.go, which sets TiKV store labels, to run before the component member managers' Sync() so that the label is updated even if the member managers error.
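The TiKV replace-disk sequence above can be sketched as a small decision function. This is a minimal illustration only; the type and function names (StoreState, nextReplaceAction) are hypothetical and not the actual code in pkg/controller/tidbcluster/pod_control.go:

```go
package main

import "fmt"

// StoreState mirrors the PD store states relevant to replacement.
type StoreState string

const (
	StoreUp        StoreState = "Up"
	StoreOffline   StoreState = "Offline"
	StoreTombstone StoreState = "Tombstone"
)

// nextReplaceAction decides the next step for a TiKV pod carrying the
// tidb.pingcap.com/replace-disk annotation, mirroring steps 1-5 above.
func nextReplaceAction(storeIDLabelSet, allTiKVsReady bool, state StoreState) string {
	switch {
	case !storeIDLabelSet:
		// Step 1: the store-id label guards against scale-down races.
		return "set store-id label and requeue"
	case !allTiKVsReady:
		// Step 2: only proceed when all TiKVs are ready and stable.
		return "wait: cluster not stable"
	case state == StoreUp:
		// Step 3: delete the store via PD.
		return "delete store via PD"
	case state != StoreTombstone:
		// Step 3 (continued): wait for the Offline -> Tombstone transition.
		return "wait for Tombstone"
	default:
		// Steps 4-5: remove PVCs and pod; the StatefulSet recreates the pod.
		return "delete PVCs then pod"
	}
}

func main() {
	fmt.Println(nextReplaceAction(true, true, StoreUp)) // prints "delete store via PD"
}
```

Modeling each reconcile pass as a pure decision like this matches the operator's level-triggered style: the controller requeues until the observed state reaches the next step.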

Code changes

  • Has Go code change
  • Has CI related scripts change

Tests

  • Unit test
  • E2E test
  • Manual test
  • No code

Side effects

  • Breaking backward compatibility
  • Other side effects:

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation

Release Notes

Please refer to Release Notes Language Style Guide before writing the release note.

Support a generic disk replacement mechanism for making PVC changes, behind a new feature gate (off by default), for the TiDB, TiKV & PD components.

Manual Test

Here's a procedure to manually test this feature on a test cluster:

Enable the feature by setting features: [VolumeReplacing=true] in charts/tidb-operator/values.yaml.
Start a local operator using ./hack/local-up-operator.sh.

Apply the basic tidb cluster but with 3 tikv replicas:

cp examples/basic/tidb-cluster.yaml /tmp/basic.yaml
patch /tmp/basic.yaml << EOF
33c33
<     replicas: 1
---
>     replicas: 3
EOF
kubectl --context kind-kind apply -f /tmp/basic.yaml

Once the cluster is up, apply an update adding a second volume with a config change to be rolled out atomically:

cp /tmp/basic.yaml /tmp/twodisk.yaml
patch /tmp/twodisk.yaml << EOF
47a48,53
>       raft-engine:
>         dir: /var/lib/raftlog/raftdb
>     storageVolumes:
>       - name: raftdb-volume
>         storageSize: 1Gi
>         mountPath: /var/lib/raftlog
EOF
kubectl --context kind-kind apply -f /tmp/twodisk.yaml

Verify that a new temporary TiKV replica is created, that the new volumes are created by delete-and-replace, and that the spare replica is cleaned up.
This takes about 10 minutes on a test cluster.
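The two-disk change applied above is picked up by the detection step (UpdateStatus() in the change list): desired volumes from the spec are compared against the volumes observed on existing PVCs. The sketch below is a hypothetical simplification; Volume and volumesNeedReplacing are illustrative names, not the operator's real types:

```go
package main

import "fmt"

// Volume captures the three properties whose change triggers replacement.
type Volume struct {
	Name         string
	StorageClass string // "" means no explicit preference (cluster default)
	Size         string
}

// volumesNeedReplacing reports whether the desired volume set differs from
// the actual one in a way that requires replacement: a change in volume
// count, size, or an explicitly requested storage class.
func volumesNeedReplacing(desired, actual []Volume) bool {
	if len(desired) != len(actual) {
		return true // e.g. a new raft-engine disk was added
	}
	byName := make(map[string]Volume, len(actual))
	for _, v := range actual {
		byName[v.Name] = v
	}
	for _, d := range desired {
		a, ok := byName[d.Name]
		if !ok || d.Size != a.Size {
			return true
		}
		// A blank desired storage class accepts whatever the PVC uses.
		if d.StorageClass != "" && d.StorageClass != a.StorageClass {
			return true
		}
	}
	return false
}

func main() {
	actual := []Volume{{Name: "tikv", StorageClass: "standard", Size: "1Gi"}}
	desired := []Volume{
		{Name: "tikv", StorageClass: "standard", Size: "1Gi"},
		{Name: "raftdb-volume", Size: "1Gi"},
	}
	fmt.Println(volumesNeedReplacing(desired, actual)) // prints "true": count changed
}
```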

@sre-bot (Contributor) commented Jul 6, 2023:

CLA assistant check: all committers have signed the CLA.

@ti-chi-bot (bot) commented Jul 6, 2023:

Welcome @anish-db! It looks like this is your first PR to pingcap/tidb-operator 🎉

@ti-chi-bot ti-chi-bot bot added the size/XXL label Jul 6, 2023
@anish-db anish-db changed the title Draft PR for Volume Replacement feature Add Volume Replacement feature with support for pd, tidb, tikv Jul 15, 2023
@ti-chi-bot ti-chi-bot bot removed the needs-rebase label Jul 15, 2023
@anish-db anish-db force-pushed the diskchange branch 3 times, most recently from a27ac67 to b86f219 Compare July 18, 2023 00:08
@anish-db anish-db marked this pull request as ready for review July 18, 2023 00:10
@ti-chi-bot ti-chi-bot bot requested a review from lichunzhu July 18, 2023 00:10
@ti-chi-bot ti-chi-bot bot removed the needs-rebase label Jul 25, 2023
@hanlins (Contributor) left a comment:

Hi @anish-db, thanks for your patience! I had an initial round of review. I'm still reading into the details, but I don't want to keep you waiting for so long. Maybe we can start from these initial comments before I finish the review entirely? And don't hesitate to let me know if you have different ideas on these comments, thanks!

Review comments were left on the following files:

  • charts/tidb-operator/values.yaml
  • pkg/apis/pingcap/v1alpha1/types.go
  • pkg/controller/pod_control.go
  • pkg/features/features.go
  • pkg/manager/volumes/pvc_replacer.go
  • pkg/manager/volumes/vol_compare_utils.go
Comment on lines +212 to +198

// If desired sc name is unspecified, any sc used by pvc is okay. (Default storage class).
if desiredScName != "" && ignoreNil(pvc.Spec.StorageClassName) != desiredScName {

@hanlins (Contributor): Should we also return true if desiredScName is set to the empty string? If, say, an sc name is changed from a non-empty name to the empty string, I think the user's intention is to switch from the original sc to the default sc, right? And maybe we should propagate the sc name to the PVC?

@anish-db (Contributor, Author): If desiredScName is blank, the PVC storage class will still not be blank; it will use whatever default storage class is configured in the k8s cluster. It is a little tricky to find out which class is the default and use it for comparison, because the default might change later and trigger a change in disks via the operator.

The existing PVC modifier also uses this behavior (ignore changes if the desired sc is blank), e.g.:

if cur != "" && pre != cur {

if scName != "" && oldSc != scName {

So I think the safer behavior is to not change the storage class if it is not explicitly specified.
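The semantics agreed on in this thread can be sketched as follows. The function name storageClassMismatch is illustrative, not the actual helper in vol_compare_utils.go; an empty desired storage class name is treated as "no explicit preference", so whatever class the PVC currently uses (typically the cluster default) is accepted:

```go
package main

import "fmt"

// storageClassMismatch reports whether a PVC's storage class differs from an
// explicitly requested one. A blank desired name never mismatches, because
// the operator cannot reliably track which class is the cluster default.
func storageClassMismatch(desiredScName string, pvcScName *string) bool {
	cur := ""
	if pvcScName != nil {
		cur = *pvcScName
	}
	// If the desired sc name is unspecified, any sc used by the pvc is okay.
	return desiredScName != "" && cur != desiredScName
}

func main() {
	std := "standard"
	fmt.Println(storageClassMismatch("", &std))     // prints "false": no preference
	fmt.Println(storageClassMismatch("fast", &std)) // prints "true": explicit change
}
```

This matches the PVC field's own semantics: a nil spec.storageClassName is filled in from the cluster's default StorageClass at provisioning time, so comparing against the literal empty string would spuriously flag every defaulted PVC.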

@csuzhangxc (Member) commented:

/test pull-e2e-kind
/test pull-e2e-kind-across-kubernetes
/test pull-e2e-kind-basic

@ti-chi-bot (bot) commented Aug 29, 2023:

@csuzhangxc: The label(s) /pull-e2e-kind, /pull-e2e-kind-across-kubernetes, /pull-e2e-kind-basic cannot be applied, because the repository doesn't have them.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot (bot) commented Aug 29, 2023:

@csuzhangxc: No presubmit jobs available for pingcap/tidb-operator@master

@csuzhangxc (Member) commented:

/test pull-e2e-kind
/test pull-e2e-kind-tikv-scale-simultaneously
/test pull-e2e-kind-tngm

@hanlins (Contributor) commented Aug 30, 2023:

/test pull-e2e-kind


@hanlins (Contributor) commented Aug 30, 2023:

/approve
/lgtm

@ti-chi-bot ti-chi-bot bot added the lgtm label Aug 30, 2023
@ti-chi-bot (bot) commented Aug 30, 2023:

[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: hanlins

@ti-chi-bot (bot) commented Aug 30, 2023:

[LGTM Timeline notifier]

Timeline:

  • 2023-08-03 00:54:52.833889036 +0000 UTC m=+319576.776237567: ☑️ agreed by hanlins.
  • 2023-08-04 22:19:29.488779231 +0000 UTC m=+483053.431127757: ✖️🔁 reset by anish-db.
  • 2023-08-24 17:58:14.917759209 +0000 UTC m=+1431459.466775195: ☑️ agreed by hanlins.
  • 2023-08-25 10:15:22.310145562 +0000 UTC m=+1490086.859161549: ✖️🔁 reset by ti-chi-bot[bot].
  • 2023-08-30 21:30:18.834890047 +0000 UTC m=+1962583.383906018: ☑️ agreed by hanlins.

@ti-chi-bot ti-chi-bot bot merged commit b62ad59 into pingcap:master Aug 30, 2023
4 checks passed
@csuzhangxc (Member) commented:

/cherry-pick release-1.5

@ti-chi-bot (bot) commented Aug 31, 2023:

@csuzhangxc: The label(s) /release-1.5 cannot be applied, because the repository doesn't have them.

@ti-chi-bot (Member) commented:

@csuzhangxc: new pull request created to branch release-1.5: #5274.

ti-chi-bot pushed a commit to ti-chi-bot/tidb-operator that referenced this pull request Aug 31, 2023
csuzhangxc added a commit that referenced this pull request Sep 4, 2023
7 participants