[Segment Replication] Add background checkpoint sync making SegRep more resilient to network failure #10652
Labels: enhancement, Indexing:Replication
Is your feature request related to a problem? Please describe.
Today SegRep relies on a transport layer call from primary shards to replicas, alerting them that there are new segments to sync. When a replica finishes a sync, it reports its completion back to the primary shard. This ensures that primaries track the state of all of their replicas and can use that state to enforce SegRep backpressure if replicas fall too far behind.
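The tracking and backpressure behavior described above can be sketched as follows. This is a minimal, illustrative model, not the actual OpenSearch classes; the names `ReplicaTrackingSketch`, `onSyncCompleted`, and `MAX_ALLOWED_LAG` are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of primary-side replica tracking: the primary records
// each replica's last acknowledged checkpoint and rejects writes when any
// replica lags too far behind (SegRep backpressure).
class ReplicaTrackingSketch {
    static final long MAX_ALLOWED_LAG = 4;  // illustrative limit, not a real setting
    long primaryCheckpoint;                 // latest checkpoint on the primary
    final Map<String, Long> replicaCheckpoints = new HashMap<>();

    // Called when a replica reports completion of a sync.
    void onSyncCompleted(String replicaId, long checkpoint) {
        replicaCheckpoints.put(replicaId, checkpoint);
    }

    // Backpressure decision: block writes if any replica is too far behind.
    // If a publish is lost in transit, the replica's tracked checkpoint never
    // advances, and this check can block writes indefinitely.
    boolean shouldRejectWrites() {
        for (long cp : replicaCheckpoints.values()) {
            if (primaryCheckpoint - cp > MAX_ALLOWED_LAG) {
                return true;
            }
        }
        return false;
    }
}
```

Note how the primary's view only advances when a replica's completion report arrives, which is why a lost transport call leaves the replica permanently "stale" from the primary's perspective.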
These transport layer calls are made using a RetryableTransportClient, but if they fail outright, the replica would not fully sync until a future write triggers another primary refresh. Until then, replicas may never catch up, or even know that they need to sync.
SegRep backpressure was implemented to help mitigate lagging replicas by blocking writes, giving replicas time to catch up. However, in this scenario the replicas would never catch up, and writes on the affected shard could be blocked indefinitely until a flush is triggered.
Describe the solution you'd like
As a safety mechanism, add a background sync, similar to the RetentionLease sync, in which each primary periodically sends its latest replication checkpoint to its known stale replicas; each replica returns its current checkpoint so the primary can update its tracking state.
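The proposed background sync could look roughly like the sketch below. All names here (`CheckpointSyncSketch`, `publishCheckpoint`, `backgroundSync`) are illustrative assumptions, not the actual implementation; in particular, `publishCheckpoint` stands in for the transport round trip:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed background checkpoint sync: a periodic
// task on the primary re-publishes the latest checkpoint to replicas whose
// tracked checkpoint lags behind it, and updates tracking state from the
// replica's response.
class CheckpointSyncSketch {
    long primaryCheckpoint;                                   // latest checkpoint on the primary
    final Map<String, Long> trackedReplicas = new HashMap<>(); // replicaId -> last known checkpoint

    // Simulates the transport round trip: send the latest checkpoint to the
    // replica; the replica syncs and responds with its current checkpoint.
    // In the real system this would be a RetryableTransportClient call that
    // may fail; the background task exists to recover from lost publishes.
    long publishCheckpoint(String replicaId) {
        return primaryCheckpoint; // here, the replica reports it has caught up
    }

    // The periodic safety task: only contact replicas known to be stale,
    // and fold their responses back into the primary's tracking state.
    void backgroundSync() {
        for (Map.Entry<String, Long> e : trackedReplicas.entrySet()) {
            if (e.getValue() < primaryCheckpoint) {
                long replicaCheckpoint = publishCheckpoint(e.getKey());
                trackedReplicas.put(e.getKey(), replicaCheckpoint);
            }
        }
    }
}
```

Because the task targets only replicas whose tracked checkpoint is behind, a healthy cluster pays essentially no extra cost; the sync only fires when a publish has been lost.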
Describe alternatives you've considered
Switch to a pull model from replicas. - #4577