Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Using Segment Replication for Cross Cluster Replication #550

Closed
ankitkala opened this issue Sep 9, 2022 · 0 comments
Closed

[RFC] Using Segment Replication for Cross Cluster Replication #550

ankitkala opened this issue Sep 9, 2022 · 0 comments
Labels
enhancement New feature or request roadmap

Comments

@ankitkala
Copy link
Member

ankitkala commented Sep 9, 2022

What are you proposing?

Current implementation of CCR uses logical replication where we replay all the leader shard’s operations on the follower's primary shard. With the ongoing effort for Segment Replication, local replica will simply sync the segments stored on the disk from primary shard offering significantly better throughput (documented here (opensearch-project/OpenSearch#2229)). This documents proposes the design for Cross Cluster Replication using the Segment Replication.

What problems are you trying to solve?

This change should enable the support for Segment Replication for CCR use cases.
Pros:

  • Lower CPU utilisation on follower cluster as follower doesn't need to execute the same operations again.
  • Pause and resume:
    • Ability to pause & resume even after 12 hours: With logical, we only support resume till 12 hours(retention lease ttl). After 12 hours, since the translog operations on leader shard won't be available, user has to to delete all the data on follower and restart the replication. This issue will become obsolete with Segment replication.
    • Easy pause/resume on ingestion heavy cluster: With logical replication, after resume, follower cluster can have a huge backlog of translog operations to be replayed and can struggle to cope up with leader. It can also make the cluster unstable as the operations are fetched on follower and executed. With segment replication, we can resume anytime and follower cluster will only request diff in segments.
  • Lower overhead on Leader cluster: With Segment replication and remote storage integration, CCR can work with almost zero overhead on leader cluster where follower directly communicates with follower for fetching the operations.

Cons:

  • Higher network utilisation due to Segment merges.
  • Might not be able to do Active-Active replication with Segment Replication. We can still keep supporting logical replication for this though.
  • Reliance on refresh interval can affect the overall performance.

What is the user experience going to be?

There should be no change in the user experience. Customer would still be using existing CCR APIs for configuring the replication. Depending on the replication model used between primary and replica on the leader cluster, CCR will rely on the same.

Why should it be built? Any reason not to?

Reasons to build:

  • All the "Pros" mentioned above.
  • Aligned with the OpenSearch's strategy for local replication.

Reasons not to build:

  • We already have logical replication for CCR and can also explore investing the efforts into improving that instead.

What will it take to execute?

  • [OpenSearch] Refactor the existing Segment Replication implementation to use pull model(reference).
  • [OpenSearch] Add support for Segment copy from remote clusters.
  • [Cross-cluster replication]Introduce support for Segment Replication in CCR plugin()
@ankitkala ankitkala added enhancement New feature or request roadmap labels Sep 9, 2022
@ankitkala ankitkala changed the title [PLACEHOLDER][RFC] Using Segment Replication for Cross Cluster Replication [RFC] Using Segment Replication for Cross Cluster Replication Sep 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request roadmap
Projects
None yet
Development

No branches or pull requests

1 participant