[Segment Replication] Add background checkpoint sync making SegRep more resilient to network failure #10652
Labels: enhancement, Indexing:Replication
Is your feature request related to a problem? Please describe.
Today SegRep relies on a transport layer call from primary shards to replicas, alerting them that there are new segments to sync. When a replica finishes a sync, it reports its completion back to the primary shard. This ensures that primaries track the state of all of their replicas and can use that state to enforce SegRep backpressure if replicas fall too far behind.
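The tracking and backpressure behavior described above can be sketched as follows. This is a minimal, illustrative model, not the actual OpenSearch classes; the names `ReplicaTrackingSketch`, `onSyncCompleted`, and `MAX_ALLOWED_LAG` are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of primary-side replica tracking: the primary records
// each replica's last acknowledged checkpoint and rejects writes when any
// replica lags too far behind (SegRep backpressure).
class ReplicaTrackingSketch {
    static final long MAX_ALLOWED_LAG = 4;  // illustrative limit, not a real setting
    long primaryCheckpoint;                 // latest checkpoint on the primary
    final Map<String, Long> replicaCheckpoints = new HashMap<>();

    // Called when a replica reports completion of a sync.
    void onSyncCompleted(String replicaId, long checkpoint) {
        replicaCheckpoints.put(replicaId, checkpoint);
    }

    // Backpressure decision: block writes if any replica is too far behind.
    // If a publish is lost in transit, the replica's tracked checkpoint never
    // advances, and this check can block writes indefinitely.
    boolean shouldRejectWrites() {
        for (long cp : replicaCheckpoints.values()) {
            if (primaryCheckpoint - cp > MAX_ALLOWED_LAG) {
                return true;
            }
        }
        return false;
    }
}
```

Note how the primary's view only advances when a replica's completion report arrives, which is why a lost transport call leaves the replica permanently "stale" from the primary's perspective.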
These transport layer calls are made using a RetryableTransportClient, but if they fail outright, the replica would not fully sync until a future write triggers another primary refresh. Until then, replicas may never catch up, or even know that they need to sync.
SegRep backpressure was implemented to help mitigate lagging replicas by blocking writes, giving replicas time to catch up. However, in this scenario the replicas would never catch up, and writes on the affected shard could be blocked indefinitely until a flush is triggered.
Describe the solution you'd like
As a safety mechanism, add a background sync, similar to the RetentionLease sync, in which each primary periodically sends its latest replication checkpoint to its known stale replicas; each replica returns its current checkpoint so the primary can update its tracking state.
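The proposed background sync could look roughly like the sketch below. All names here (`CheckpointSyncSketch`, `publishCheckpoint`, `backgroundSync`) are illustrative assumptions, not the actual implementation; in particular, `publishCheckpoint` stands in for the transport round trip:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed background checkpoint sync: a periodic
// task on the primary re-publishes the latest checkpoint to replicas whose
// tracked checkpoint lags behind it, and updates tracking state from the
// replica's response.
class CheckpointSyncSketch {
    long primaryCheckpoint;                                   // latest checkpoint on the primary
    final Map<String, Long> trackedReplicas = new HashMap<>(); // replicaId -> last known checkpoint

    // Simulates the transport round trip: send the latest checkpoint to the
    // replica; the replica syncs and responds with its current checkpoint.
    // In the real system this would be a RetryableTransportClient call that
    // may fail; the background task exists to recover from lost publishes.
    long publishCheckpoint(String replicaId) {
        return primaryCheckpoint; // here, the replica reports it has caught up
    }

    // The periodic safety task: only contact replicas known to be stale,
    // and fold their responses back into the primary's tracking state.
    void backgroundSync() {
        for (Map.Entry<String, Long> e : trackedReplicas.entrySet()) {
            if (e.getValue() < primaryCheckpoint) {
                long replicaCheckpoint = publishCheckpoint(e.getKey());
                trackedReplicas.put(e.getKey(), replicaCheckpoint);
            }
        }
    }
}
```

Because the task targets only replicas whose tracked checkpoint is behind, a healthy cluster pays essentially no extra cost; the sync only fires when a publish has been lost.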
Describe alternatives you've considered
Switch to a pull model from replicas. - #4577