
[Segment Replication] Remote store integration high level design. #4555

Closed
Tracked by #4448
mch2 opened this issue Sep 19, 2022 · 14 comments
Labels: distributed framework, enhancement

mch2 (Member) commented Sep 19, 2022

Before starting a POC for this integration, we should start with a high-level design.

Some questions that come to mind:

  1. How do we compute and send the list of segments to fetch from the store?
  2. How do we fetch the required SegmentInfos byte[]?
  3. Are there any changes to failover steps? Currently we commit on replicas during failover and replay from the xlog; can we ensure the newly selected replica always has the latest segments from the store before promotion?
  4. How would we guarantee segments are present before fetching from remote store?
  5. Should we use existing push segrep design or have replicas poll remote store for updates? If so, how would this work?

Lower-level questions:

  1. Do we still need commit points on replicas?
  2. Can we now disable fsyncs on primaries?
ankitkala (Member) commented:

Also, how do we ensure that segments which are being copied by replicas don't get deleted by the primary?

itiyama commented Sep 23, 2022

Based on the discussion on writes to an NRT replica with the translog, it does not look like the primary and replicas are completely decoupled for translog writes. What would we then gain by decoupling the primary and replicas completely during segment copy? I believe we should start from this question, as it will inform and potentially simplify a lot of the design trade-offs captured above.

Bukhtawar (Collaborator) commented:

Should we use existing push segrep design or have replicas poll remote store for updates? If so, how would this work?

I definitely think we should evaluate a poll-based model for checkpoints and segment transfer on the replica. This should simplify cases not only with the remote segment store but also with CCR use cases, where we currently have the remote cluster poll for operations on the leader cluster.

How do we compute and send the list of segments to fetch from the store?

If we go with a polling mechanism, the replicas can request segments from the primary or the remote store alike, based on the diff between the latest checkpoint and the last processed checkpoint.

Are there any changes to failover steps? Currently we commit on replicas during failover and replay from the xlog; can we ensure the newly selected replica always has the latest segments from the store before promotion?

I guess we should do that; @sachinpkale will be sharing the related issue around this. Translog replay might be expensive.

@itiyama I guess remote translogs would only rely on a replication proxy for the control flow (single-writer invariant); the data flow, however, is decoupled. This is only to support existing blob stores which don't support OCC, but eventually the goal is to provide extension points for stores that do support OCC for single-writer use cases, which might totally decouple the translog flow in such cases.

Decoupling the primary and replica for segment copy will help optimise many cases around recovery, primarily deployments/failovers, CCR replication, etc. With a remote store we could also think about publishing checkpoints to a remote queue like Kafka or ActiveMQ, which would effectively decouple the segment copy; checkpoints could then be consumed not only internally within the cluster but also across clusters for CCR, with the appropriate access controls.

sachinpkale (Member) commented:

I guess we should do that; @sachinpkale will be sharing the related issue around this. Translog replay might be expensive.

Created an issue to track this: #4578. Downloading segments for a lagging replica from the remote segment store would definitely be faster than replaying all the operations from the translog.

We may need to think about a common abstraction for downloading segments from the remote segment store. Once SegRep is integrated with the remote segment store, we can use the same APIs to download segments during failover.

itiyama commented Sep 29, 2022

@itiyama I guess remote translogs would only rely on a replication proxy for the control flow (single-writer invariant); the data flow, however, is decoupled. This is only to support existing blob stores which don't support OCC, but eventually the goal is to provide extension points for stores that do support OCC for single-writer use cases, which might totally decouple the translog flow in such cases.

Do you mean that the primary term check is control flow and the actual documents or segments are data flow?

Bukhtawar (Collaborator) commented:

Do you mean that the primary term check is control flow and the actual documents or segments are data flow?

Yes, that's how I would look at it :)

dreamer-89 (Member) commented Sep 29, 2022

Thank you, folks, for the input. The comments really helped in identifying the different problem areas and probable solutions.

From a POC perspective, a pull-based model, where the replica directly polls the remote store for new replication checkpoints, seems a better fit. Data deletion from the remote store is not considered as part of this POC, though the remote store currently purges commit points older than the last N. Also, there are a few changes which may be needed on the remote store side; created an issue for tracking: #4628. Captured the design choices below; please review and provide feedback.


Design considerations

Replication model

A pull-based mechanism provides certain advantages: it works well with the CCR use case (#4577) and provides simplification (#4555 (comment)). It is better to provide both options and evaluate them based on benchmarking results. As part of this POC, the poll-based mechanism will be attempted; a minimal poller sketch follows the two options below.

1. Push-based mechanism

The primary is in charge of starting replication by sending shard refresh checkpoints. The replica compares these checkpoints with its store and initiates replication when it falls behind. The primary maintains the state of ongoing replications to prevent data deletion. This is the existing approach.

Pros

  • Existing design, may need minor changes for remote store integration

Cons

  • Does not gel well with CCR.

2. Pull-based mechanism

The replica is in charge of updating its local copy of the data. The replica polls the remote store to check whether there is any new data. This trades off network consumption against how soon data becomes available.

Pros

  • Works well with CCR
  • Aligns better with de-coupling primary (writer) and replica (reader) shards.

Cons

  • Not implemented right now. May need more effort.
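
To make the poll-based option concrete, here is a rough sketch of what the replica-side loop could look like. The RemoteStoreMetadata and ReplicationTrigger interfaces below are purely illustrative stand-ins, not actual OpenSearch APIs.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Illustrative replica-side poller: periodically asks the remote store for the
 * latest commit generation and triggers a segment copy when the store is ahead
 * of the last generation this replica has processed.
 */
public class RemoteStorePoller implements AutoCloseable {

    /** Hypothetical view of the remote segment store metadata. */
    public interface RemoteStoreMetadata {
        long latestCommitGeneration();          // newest commit visible on the store
    }

    /** Hypothetical hook into the replica's replication machinery. */
    public interface ReplicationTrigger {
        void startReplication(long generation); // copy the files of this commit point
    }

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final AtomicLong lastProcessedGeneration = new AtomicLong(-1);

    public RemoteStorePoller(RemoteStoreMetadata store, ReplicationTrigger trigger, long pollIntervalMillis) {
        scheduler.scheduleWithFixedDelay(() -> {
            long latest = store.latestCommitGeneration();
            // Only start a copy when the store has moved past what we already have.
            if (latest > lastProcessedGeneration.get()) {
                trigger.startReplication(latest);
                lastProcessedGeneration.set(latest);
            }
        }, 0, pollIntervalMillis, TimeUnit.MILLISECONDS);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

The poll interval is the knob for the network-consumption vs. freshness trade-off mentioned above.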

Refresh Vs Commit

The existing implementation of segment replication supports incremental backup of segment files on the replica. This is meaningful for disk-based stores, as it saves fsyncs on the primary. The primary shares the in-memory SegmentInfos over transport with the replica nodes, which is cumbersome. With a remote store, where fsyncs are not mandatory on the primary, it is better to perform a commit rather than a refresh: replicas can reconstruct SegmentInfos from the commit points, and there is no need for the primary to share the SegmentInfos separately with the replicas (see the sketch below). With commit points, the failover & recovery workflows are also simplified; this will be attempted in the POC.
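
Since replicas would work from committed segments, they can rebuild SegmentInfos with Lucene's standard commit-point API instead of receiving serialized bytes from the primary. A minimal sketch; the class and path parameter are illustrative, only the Lucene calls are real:

```java
import java.io.IOException;
import java.nio.file.Path;

import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class CommitPointReader {

    /**
     * Rebuilds SegmentInfos from the latest commit point (segments_N) present in the
     * replica's local directory, e.g. after the files referenced by a remote commit
     * have been downloaded. No serialized SegmentInfos from the primary is needed.
     */
    public static SegmentInfos readLatestCommit(Path localShardDataPath) throws IOException {
        Directory directory = FSDirectory.open(localShardDataPath);
        // Lucene finds the highest-generation segments_N file and parses it.
        SegmentInfos latest = SegmentInfos.readLatestCommit(directory);
        // In a real flow the Directory would be owned by the shard's Store and kept
        // open for the subsequent reader refresh; it is deliberately not closed here.
        return latest;
    }
}
```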

Data purging

Data stored on the remote store eventually needs to be cleaned up as well. This deletion shouldn't impact ongoing file copy operations on the replica nodes. From a POC perspective, data will not be deleted from the remote store. Below are some options which can be considered as future enhancements.

1. Background process on primary

On the primary node, a background process pings replicas to fetch their ongoing replication checkpoints and cleans up commit points older than N, where N is the lowest replication checkpoint active in replication among all replicas. This approach couples primary and replica nodes and is not a scalable solution.

2. Distributed queue

Similar to the previous approach, but instead of transport calls from the primary to the replicas, the primary uses a distributed queue to synchronize active replication checkpoints. When an event has been processed by all replicas, it can be purged (exact-N semantics) and the corresponding data cleaned up from the remote store. A minimal publish sketch follows the cons below.

Pros:

  • Primary & replica nodes are decoupled

Cons:

  • If the queue is stored on the primary, it leaves a memory & storage footprint on the primary
  • Need to identify behavior on small instance types (t2/t3)
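
A rough sketch of what publishing a checkpoint event could look like if Kafka were used as the queue. The topic name, payload format, and CheckpointPublisher class are illustrative only; the Kafka client calls themselves are standard.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CheckpointPublisher implements AutoCloseable {

    private final KafkaProducer<String, String> producer;
    private final String topic;

    public CheckpointPublisher(String bootstrapServers, String topic) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        this.producer = new KafkaProducer<>(props);
        this.topic = topic;
    }

    /** Publish one checkpoint event, keyed by shard so per-shard ordering is preserved. */
    public void publish(String shardId, long primaryTerm, long commitGeneration) {
        String payload = String.format("{\"shard\":\"%s\",\"term\":%d,\"generation\":%d}",
                shardId, primaryTerm, commitGeneration);
        producer.send(new ProducerRecord<>(topic, shardId, payload));
    }

    @Override
    public void close() {
        producer.close();
    }
}
```

Consumers (replicas within the cluster, or follower clusters for CCR) would track their own offsets, which is what makes the purge-once-everyone-has-processed semantics possible.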

3. Deletion policies (clean up after M days, or keep only the last N commits)

No synchronization among nodes is needed to identify active replication checkpoints. Data is instead deleted based on a deletion policy. This completely decouples primary and replica nodes and simplifies the design. The remote store by default keeps the last 10 commits. Replica nodes handle file-not-found exceptions and retry against the latest commit point (a retry sketch is shown below). This solution can result in replica starvation when the primary is committing aggressively. This option is attempted for the POC and doesn't need any extra work on the integration side.
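
Because the remote store prunes old commits independently, a replica can race with deletion and hit missing files mid-copy. A rough sketch of the retry behavior described above, with hypothetical fetch/list helpers standing in for the real remote store directory API:

```java
import java.io.FileNotFoundException;
import java.nio.file.NoSuchFileException;
import java.util.List;

public class CommitFetcher {

    /** Hypothetical view of the remote segment store. */
    public interface RemoteSegmentStore {
        long latestCommitGeneration();
        List<String> listFiles(long commitGeneration);
        void download(String fileName) throws FileNotFoundException, NoSuchFileException;
    }

    /**
     * Tries to copy the files of the latest commit. If that commit was pruned while we
     * were copying, re-resolve the latest commit and try again, up to maxAttempts.
     */
    public static long fetchLatestCommit(RemoteSegmentStore store, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            long generation = store.latestCommitGeneration();
            try {
                for (String file : store.listFiles(generation)) {
                    store.download(file);
                }
                return generation; // all files of this commit are now local
            } catch (FileNotFoundException | NoSuchFileException e) {
                // The commit we were copying was deleted under us; fall through and
                // retry against whatever the latest commit is now.
            }
        }
        throw new IllegalStateException("Could not copy a complete commit after " + maxAttempts + " attempts");
    }
}
```

The starvation risk noted above shows up here as repeatedly exhausting maxAttempts when the primary commits (and prunes) faster than the replica can copy.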


ankitkala (Member) commented Oct 6, 2022

Overall, the proposal looks good to me. A few open questions that we probably need to elaborate on:

  • How often do we want to commit? Wouldn't it make data consistency worse between primary and replicas?
  • Can we also expand on the control flow for a replica to access the primary's remote store?
  • Just to confirm, for the pull-based method, we'd be syncing the replication checkpoint to the remote store as well, right?

Bukhtawar (Collaborator) commented:

Refresh Vs Commit

We need to make sure we are not breaking the refresh semantics with the remote store. So in essence it's not one vs. the other; we need to ensure we add features without changing staleness.

Avoid update disk store

I think we also need to think from a failover perspective; we cannot have commits only in memory, as that will increase the failover time and impact availability.

Deletion policy

A reasonable retention policy should be fine as long as there are no additional costs or trade-offs.

sachinpkale (Member) commented:

With a remote store, where fsyncs are not mandatory on the primary, it is better to perform a commit rather than a refresh

Does this replace the refresh operation with a commit? A commit also performs other operations, like translog purge. Let's take a look at all the operations involved in the commit flow.

Also, this would require the recovery flow to be integrated with the remote store. Otherwise, recovery can cause data loss.

dreamer-89 (Member) commented Oct 10, 2022

Thanks @ankitkala for the comments. Please find my response below.

  • How often do we want to commit?

The commit is performed on shard refreshes.

Wouldn't it make data consistency worse between primary and replicas?

Can you please elaborate more on this?

  • Can we also expand on the control flow for a replica to access the primary's remote store?

I am thinking of having a poller process on replica nodes which scans the remote store for commit points. The poller captures the latest commit point and passes it on to the TargetService on the replica node. This service compares the commit point with the last one seen and starts file copy operations for new commit points.

  • Just to confirm, for the pull-based method, we'd be syncing the replication checkpoint to the remote store as well, right?

I think the poller process on the replica should be able to read commit points from the remote store. With this, separate replication checkpoints should not be needed.

mch2 (Member, Author) commented Oct 11, 2022

The existing implementations of both segrep & remote store operate on refresh, not commit. So in order for replicas to fetch the latest checkpoint (which is only in memory on the primary after a refresh operation with no flush), we'd need to create a new file representing the in-memory SegmentInfos at a refresh point and write/upload it. Right now, with the node->node implementation, that is sent directly from memory as a byte[]. The idea is to create an fsync-less commit instead of the normal refresh and avoid creating the new file. Removing the fsync makes this about as expensive as a regular refresh, which we are able to do with the durability guarantee of the remote store.

So the flow would be (roughly sketched below):
Commit -> push to remote store -> on upload ack, purge the xlog & push checkpoint to replicas.

During failover, replicas can either fetch from the remote store & then replay from the xlog, or directly replay a larger set of ops from the xlog. All ops should be durably persisted.
It's not guaranteed the primary will have all segments during a restart, so we'd have to fetch any diff in segments from the store before starting the engine.

The alternative is to replicate after a normal commit, which would cause unacceptable staleness.
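
To make the ordering explicit, here's a rough sketch; none of the interfaces below are actual OpenSearch classes, they just capture the sequencing (fsync-less commit, upload, and only on ack the xlog purge + checkpoint publish):

```java
/** Illustrative only: the ordering matters, not the names. */
public final class RefreshAndUploadFlow {

    public interface Engine {
        long commitWithoutFsync();                 // write segments_N, skip the fsync
    }

    public interface RemoteSegmentStoreClient {
        boolean uploadCommit(long generation);     // returns true once the store acks the upload
    }

    public interface Translog {
        void trimOperationsBelow(long generation); // safe only after a durable upload
    }

    public interface CheckpointPublisher {
        void publish(long generation);             // notify replicas (push) or leave for pollers
    }

    public static void onRefresh(Engine engine,
                                 RemoteSegmentStoreClient remoteStore,
                                 Translog translog,
                                 CheckpointPublisher publisher) {
        // 1. Refresh is extended to produce a commit point, minus the fsync cost.
        long generation = engine.commitWithoutFsync();

        // 2. Upload the new/changed segment files plus segments_N to the remote store.
        boolean acked = remoteStore.uploadCommit(generation);

        // 3. Only after the remote store acknowledges durability do we purge the xlog
        //    and advertise the checkpoint to replicas.
        if (acked) {
            translog.trimOperationsBelow(generation);
            publisher.publish(generation);
        }
    }
}
```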

We need to make sure we are not breaking the refresh semantics with the remote store. So in essence it's not one vs. the other; we need to ensure we add features without changing staleness.

With what we are proposing, the refresh operation would need to be extended, when these features are enabled, to effectively execute what a flush does today minus the xlog purge. This should not impact staleness, however.

I think we also need to think from a failover perspective; we cannot have commits only in memory, as that will increase the failover time and impact availability.

@Bukhtawar Are you referring to the absence of fsync here, or to the need to write the incremental refresh's SegmentInfos to disk if we are pushing a new file to the store?

dreamer-89 (Member) commented Oct 14, 2022

Refresh Vs Commit

We need to make sure we are not breaking the refresh semantics with the remote store. So in essence it's not one vs. the other; we need to ensure we add features without changing staleness.

Thank you @Bukhtawar for the feedback and for calling this out. Yes, there will be changes in the refresh mechanism, but the advantages we seek from commit-only backup seem to outweigh this shortcoming, as also mentioned in the last comment by @mch2. We plan to call this out clearly in the documentation and perform benchmarking/tests to verify the claim.

Avoid update disk store

I think we also need to think from a failover perspective; we cannot have commits only in memory, as that will increase the failover time and impact availability.

In order to quantify the failover time and availability impact, we need to add tests/benchmarks. We can bound the number of fsync-less commits in order to keep failover/recovery time bounded; one way is to ensure that at least every Nth commit is fsynced (a small sketch is below).
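
A tiny sketch of the bounding idea: count fsync-less commits and force a real fsync at least every N commits, so the amount of un-fsynced local state stays bounded. The class and the N knob are illustrative only.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Decides whether the next commit should be fsynced, forcing one at least every N commits. */
public class BoundedFsyncPolicy {

    private final int fsyncEveryNCommits;
    private final AtomicInteger commitsSinceLastFsync = new AtomicInteger(0);

    public BoundedFsyncPolicy(int fsyncEveryNCommits) {
        this.fsyncEveryNCommits = fsyncEveryNCommits;
    }

    public boolean shouldFsync() {
        if (commitsSinceLastFsync.incrementAndGet() >= fsyncEveryNCommits) {
            commitsSinceLastFsync.set(0);
            return true;   // bound reached: pay the fsync cost on this commit
        }
        return false;      // fsync-less commit; durability comes from the remote store
    }
}
```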

Deletion policy

A reasonable retention policy should be fine as long as there are no additional costs or trade-offs.

The deletion-policy approach (clean up after M days, or keep only the last N commits) de-couples primary and replica. The remaining approaches need some sort of collaboration among nodes. For the POC, I am planning to rely on the existing remote store cleanup policy (keep only the last N commits), which keeps only the last 10 commits. A PITR implementation (14 days of data retention) on the remote store side would provide the clean-up-after-M-days functionality.

dreamer-89 (Member) commented:

Closing this issue; the POC work is tracked in #4536.
