
[Segment Replication] Remote store integration high level design. #4555

Closed
Tracked by #4448
mch2 opened this issue Sep 19, 2022 · 14 comments
Labels: distributed framework, enhancement

mch2 (Member) commented Sep 19, 2022

Before starting a POC for this integration, we should start with a high-level design.

Some questions that come to mind:

  1. How do we compute and send the list of segments to fetch from the store?
  2. How do we fetch the required SegmentInfos byte[]?
  3. Are there any changes to failover steps? Currently we commit on replicas during failover and replay from the xlog; can we ensure the newly selected replica always has the latest segments from the store before promotion?
  4. How would we guarantee segments are present before fetching from remote store?
  5. Should we use existing push segrep design or have replicas poll remote store for updates? If so, how would this work?

Lower-level questions:

  1. Do we still need commit points on replicas?
  2. Can we now disable fsyncs on primaries?
ankitkala (Member) commented:

Also, how do we ensure that segments which are being copied by replicas don't get deleted by the primary?

itiyama commented Sep 23, 2022

Based on the discussion on writes to an NRT replica with the translog, it does not look like the primary and replicas are completely decoupled for translog writes. What would we then gain by decoupling the primary and replicas completely during segment copy? I believe we should start from this question, as it will inform and potentially simplify a lot of the design trade-offs captured above.

Bukhtawar (Collaborator) commented:

Should we use existing push segrep design or have replicas poll remote store for updates? If so, how would this work?

I definitely think we should evaluate a poll-based model for checkpoints and segment transfer on the replica. This should simplify cases not only with the remote segment store but also with CCR use cases, where we currently have the remote cluster poll for operations on the leader cluster.

How do we compute and send the list of segments to fetch from the store?

If we go with a polling mechanism, the replicas can request segments from the primary or the remote store alike, based on the diff between the latest checkpoint and the last processed checkpoint.

Are there any changes to failover steps? Currently we commit on replicas during failover and replay from the xlog; can we ensure the newly selected replica always has the latest segments from the store before promotion?

I guess we should do that; @sachinpkale will be sharing the related issue around this. Translog replay might be expensive.

@itiyama I guess remote translogs would only rely on a replication proxy for the control flow (single-writer invariant); the data flow, however, is decoupled. This is only to support existing blob stores which don't support OCC, but eventually the goal is to provide extension points for stores that do support OCC for single-writer use cases, which might totally decouple the translog flow in such cases.

Decoupling the primary and replica for segment copy will help optimise many cases around recovery, primarily deployments/failovers, CCR replication, etc. With a remote store we could also think about publishing checkpoints to a remote queue like Kafka or ActiveMQ, which would effectively decouple the segment copy; checkpoints could then be consumed not only internally within the cluster but also across clusters for CCR, with the appropriate access controls.

sachinpkale (Member) commented:

I guess we should do that; @sachinpkale will be sharing the related issue around this. Translog replay might be expensive.

Created an issue to track this: #4578. Downloading segments for a lagging replica from the remote segment store would definitely be faster than replaying all the operations from the translog.

We may need to think about a common abstraction for downloading segments from the remote segment store. Once SegRep is integrated with the remote segment store, we can use the same APIs to download segments during failover.

itiyama commented Sep 29, 2022

@itiyama I guess remote translogs would only rely on a replication proxy for the control flow (single-writer invariant); the data flow, however, is decoupled. This is only to support existing blob stores which don't support OCC, but eventually the goal is to provide extension points for stores that do support OCC for single-writer use cases, which might totally decouple the translog flow in such cases.

Do you mean that the primary term check is control flow and the actual documents or segments are data flow?

Bukhtawar (Collaborator) commented:

Do you mean that the primary term check is control flow and the actual documents or segments are data flow?

Yes, that's how I would look at it :)

dreamer-89 (Member) commented Sep 29, 2022

Thank you, folks, for the input. The comments really helped in identifying the different problem areas and probable solutions.

From a POC perspective, a pull-based model, where the replica directly polls the remote store for new replication checkpoints, seems a better fit. Data deletion from the remote store is not considered as part of this POC, though the remote store currently purges commit points older than the last N. Also, there are a few changes which may be needed on the remote store side; created an issue for tracking: #4628. Captured the design choices below; please review and provide feedback.


Design considerations

Replication model

A pull-based mechanism provides certain advantages: it works well with the CCR use case (#4577) and provides simplification (#4555 (comment)). It is better to provide both options and evaluate them based on benchmarking results. As part of this POC, the poll-based mechanism will be attempted; a minimal poller sketch follows the two options below.

1. Push-based mechanism

The primary is in charge of starting replication by sending shard refresh checkpoints. The replica compares these checkpoints with its store and initiates replication when it falls behind. The primary maintains the state of ongoing replications to prevent data deletion. This is the existing approach.

Pros

  • Existing design, may need minor changes for remote store integration

Cons

  • Does not gel well with CCR.

2. Pull-based mechanism

The replica is in charge of updating its local copy of the data. The replica polls the remote store to check whether there is any new data. This trades off network consumption against how soon data becomes available.

Pros

  • Works well with CCR
  • Aligns better with de-coupling primary (writer) and replica (reader) shards.

Cons

  • Not implemented right now. May need more effort.
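
To make the poll-based option concrete, here is a rough sketch of what the replica-side loop could look like. The RemoteStoreMetadata and ReplicationTrigger interfaces below are purely illustrative stand-ins, not actual OpenSearch APIs.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Illustrative replica-side poller: periodically asks the remote store for the
 * latest commit generation and triggers a segment copy when the store is ahead
 * of the last generation this replica has processed.
 */
public class RemoteStorePoller implements AutoCloseable {

    /** Hypothetical view of the remote segment store metadata. */
    public interface RemoteStoreMetadata {
        long latestCommitGeneration();          // newest commit visible on the store
    }

    /** Hypothetical hook into the replica's replication machinery. */
    public interface ReplicationTrigger {
        void startReplication(long generation); // copy the files of this commit point
    }

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final AtomicLong lastProcessedGeneration = new AtomicLong(-1);

    public RemoteStorePoller(RemoteStoreMetadata store, ReplicationTrigger trigger, long pollIntervalMillis) {
        scheduler.scheduleWithFixedDelay(() -> {
            long latest = store.latestCommitGeneration();
            // Only start a copy when the store has moved past what we already have.
            if (latest > lastProcessedGeneration.get()) {
                trigger.startReplication(latest);
                lastProcessedGeneration.set(latest);
            }
        }, 0, pollIntervalMillis, TimeUnit.MILLISECONDS);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

The poll interval is the knob for the network-consumption vs. freshness trade-off mentioned above.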

Refresh Vs Commit

The existing implementation of segment replication supports incremental backup of segment files on the replica. This is meaningful for disk-based stores, as it saves fsyncs on the primary. The primary shares the in-memory SegmentInfos over transport with the replica nodes, which is cumbersome. With a remote store, where fsyncs are not mandatory on the primary, it is better to perform a commit rather than a refresh: replicas can reconstruct SegmentInfos from the commit points, and there is no need for the primary to share the SegmentInfos separately with the replicas (see the sketch below). With commit points, the failover & recovery workflows are also simplified; this will be attempted in the POC.
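
Since replicas would work from committed segments, they can rebuild SegmentInfos with Lucene's standard commit-point API instead of receiving serialized bytes from the primary. A minimal sketch; the class and path parameter are illustrative, only the Lucene calls are real:

```java
import java.io.IOException;
import java.nio.file.Path;

import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class CommitPointReader {

    /**
     * Rebuilds SegmentInfos from the latest commit point (segments_N) present in the
     * replica's local directory, e.g. after the files referenced by a remote commit
     * have been downloaded. No serialized SegmentInfos from the primary is needed.
     */
    public static SegmentInfos readLatestCommit(Path localShardDataPath) throws IOException {
        Directory directory = FSDirectory.open(localShardDataPath);
        // Lucene finds the highest-generation segments_N file and parses it.
        SegmentInfos latest = SegmentInfos.readLatestCommit(directory);
        // In a real flow the Directory would be owned by the shard's Store and kept
        // open for the subsequent reader refresh; it is deliberately not closed here.
        return latest;
    }
}
```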

Data purging

Data stored on the remote store eventually needs to be cleaned up as well. This deletion shouldn't impact ongoing file copy operations on the replica nodes. From a POC perspective, data will not be deleted from the remote store. Below are some options which can be considered as future enhancements.

1. Background process on primary

On the primary node, a background process pings replicas to fetch their ongoing replication checkpoints and cleans up commit points older than N, where N is the lowest replication checkpoint active in replication among all replicas. This approach couples primary and replica nodes and is not a scalable solution.

2. Distributed queue

Similar to the previous approach, but instead of transport calls from the primary to the replicas, the primary uses a distributed queue to synchronize active replication checkpoints. When an event has been processed by all replicas, it can be purged (exact-N semantics) and the corresponding data cleaned up from the remote store. A minimal publish sketch follows the cons below.

Pros:

  • Primary & replica nodes are decoupled

Cons:

  • If the queue is stored on the primary, it leaves a memory & storage footprint on the primary
  • Need to identify behavior on small instance types (t2/t3)
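
A rough sketch of what publishing a checkpoint event could look like if Kafka were used as the queue. The topic name, payload format, and CheckpointPublisher class are illustrative only; the Kafka client calls themselves are standard.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CheckpointPublisher implements AutoCloseable {

    private final KafkaProducer<String, String> producer;
    private final String topic;

    public CheckpointPublisher(String bootstrapServers, String topic) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        this.producer = new KafkaProducer<>(props);
        this.topic = topic;
    }

    /** Publish one checkpoint event, keyed by shard so per-shard ordering is preserved. */
    public void publish(String shardId, long primaryTerm, long commitGeneration) {
        String payload = String.format("{\"shard\":\"%s\",\"term\":%d,\"generation\":%d}",
                shardId, primaryTerm, commitGeneration);
        producer.send(new ProducerRecord<>(topic, shardId, payload));
    }

    @Override
    public void close() {
        producer.close();
    }
}
```

Consumers (replicas within the cluster, or follower clusters for CCR) would track their own offsets, which is what makes the purge-once-everyone-has-processed semantics possible.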

3. Deletion policies (clean up after M days, or keep only the last N commits)

No synchronization among nodes is needed to identify active replication checkpoints. Data is instead deleted based on a deletion policy. This completely decouples primary and replica nodes and simplifies the design. The remote store by default keeps the last 10 commits. Replica nodes handle file-not-found exceptions and retry against the latest commit point (a retry sketch is shown below). This solution can result in replica starvation when the primary is committing aggressively. This option is attempted for the POC and doesn't need any extra work on the integration side.
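
Because the remote store prunes old commits independently, a replica can race with deletion and hit missing files mid-copy. A rough sketch of the retry behavior described above, with hypothetical fetch/list helpers standing in for the real remote store directory API:

```java
import java.io.FileNotFoundException;
import java.nio.file.NoSuchFileException;
import java.util.List;

public class CommitFetcher {

    /** Hypothetical view of the remote segment store. */
    public interface RemoteSegmentStore {
        long latestCommitGeneration();
        List<String> listFiles(long commitGeneration);
        void download(String fileName) throws FileNotFoundException, NoSuchFileException;
    }

    /**
     * Tries to copy the files of the latest commit. If that commit was pruned while we
     * were copying, re-resolve the latest commit and try again, up to maxAttempts.
     */
    public static long fetchLatestCommit(RemoteSegmentStore store, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            long generation = store.latestCommitGeneration();
            try {
                for (String file : store.listFiles(generation)) {
                    store.download(file);
                }
                return generation; // all files of this commit are now local
            } catch (FileNotFoundException | NoSuchFileException e) {
                // The commit we were copying was deleted under us; fall through and
                // retry against whatever the latest commit is now.
            }
        }
        throw new IllegalStateException("Could not copy a complete commit after " + maxAttempts + " attempts");
    }
}
```

The starvation risk noted above shows up here as repeatedly exhausting maxAttempts when the primary commits (and prunes) faster than the replica can copy.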


ankitkala (Member) commented Oct 6, 2022

Overall, the proposal looks good to me. A few open questions that we probably need to elaborate on:

  • How often do we want to commit? Wouldn't it make data consistency worse between primary and replicas?
  • Can we also expand on the control flow for a replica to access the primary's remote store?
  • Just to confirm, for the pull-based method, we'd be syncing the replication checkpoint to the remote store as well, right?

Bukhtawar (Collaborator) commented:

Refresh Vs Commit

We need to make sure we are not breaking the refresh semantics with the remote store. So in essence it's not one vs. the other; we need to ensure we add features without changing staleness.

Avoid update disk store

I think we also need to think from a failover perspective; we cannot have commits only in memory, as that will increase the failover time and impact availability.

Deletion policy

A reasonable retention policy should be fine as long as there are no additional costs or trade-offs.

sachinpkale (Member) commented:

With a remote store, where fsyncs are not mandatory on the primary, it is better to perform a commit rather than a refresh

Does this replace the refresh operation with a commit? A commit also performs other operations, like translog purge. Let's take a look at all the operations involved in the commit flow.

Also, this would require the recovery flow to be integrated with the remote store. Otherwise, recovery can cause data loss.

dreamer-89 (Member) commented Oct 10, 2022

Thanks @ankitkala for the comments. Please find my response below.

  • How often do we want to commit?

The commit is performed on shard refreshes.

Wouldn't it make data consistency worse between primary and replicas?

Can you please elaborate more on this?

  • Can we also expand on the control flow for a replica to access the primary's remote store?

I am thinking of having a poller process on replica nodes which scans the remote store for commit points. The poller captures the latest commit point and passes it on to the TargetService on the replica node. This service compares the commit point with the last one seen and starts file copy operations for new commit points.

  • Just to confirm, for the pull-based method, we'd be syncing the replication checkpoint to the remote store as well, right?

I think the poller process on the replica should be able to read commit points from the remote store. With this, separate replication checkpoints should not be needed.

mch2 (Member, Author) commented Oct 11, 2022

The existing implementations of both segrep & remote store operate on refresh, not commit. So in order for replicas to fetch the latest checkpoint (which is only in memory on the primary after a refresh operation with no flush), we'd need to create a new file representing the in-memory SegmentInfos at a refresh point and write/upload it. Right now, with the node->node implementation, that is sent directly from memory as a byte[]. The idea is to create an fsync-less commit instead of the normal refresh and avoid creating the new file. Removing the fsync makes this about as expensive as a regular refresh, which we are able to do with the durability guarantee of the remote store.

So the flow would be (roughly sketched below):
Commit -> push to remote store -> on upload ack, purge the xlog & push checkpoint to replicas.

During failover, replicas can either fetch from the remote store & then replay from the xlog, or directly replay a larger set of ops from the xlog. All ops should be durably persisted.
It's not guaranteed the primary will have all segments during a restart, so we'd have to fetch any diff in segments from the store before starting the engine.

The alternative is to replicate after a normal commit, which would cause unacceptable staleness.
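
To make the ordering explicit, here's a rough sketch; none of the interfaces below are actual OpenSearch classes, they just capture the sequencing (fsync-less commit, upload, and only on ack the xlog purge + checkpoint publish):

```java
/** Illustrative only: the ordering matters, not the names. */
public final class RefreshAndUploadFlow {

    public interface Engine {
        long commitWithoutFsync();                 // write segments_N, skip the fsync
    }

    public interface RemoteSegmentStoreClient {
        boolean uploadCommit(long generation);     // returns true once the store acks the upload
    }

    public interface Translog {
        void trimOperationsBelow(long generation); // safe only after a durable upload
    }

    public interface CheckpointPublisher {
        void publish(long generation);             // notify replicas (push) or leave for pollers
    }

    public static void onRefresh(Engine engine,
                                 RemoteSegmentStoreClient remoteStore,
                                 Translog translog,
                                 CheckpointPublisher publisher) {
        // 1. Refresh is extended to produce a commit point, minus the fsync cost.
        long generation = engine.commitWithoutFsync();

        // 2. Upload the new/changed segment files plus segments_N to the remote store.
        boolean acked = remoteStore.uploadCommit(generation);

        // 3. Only after the remote store acknowledges durability do we purge the xlog
        //    and advertise the checkpoint to replicas.
        if (acked) {
            translog.trimOperationsBelow(generation);
            publisher.publish(generation);
        }
    }
}
```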

We need to make sure we are not breaking the refresh semantics with the remote store. So in essence it's not one vs. the other; we need to ensure we add features without changing staleness.

With what we are proposing, the refresh operation would need to be extended, when these features are enabled, to effectively execute what a flush does today minus the xlog purge. This should not impact staleness, however.

I think we also need to think from a failover perspective; we cannot have commits only in memory, as that will increase the failover time and impact availability.

@Bukhtawar Are you referring to the absence of fsync here, or to the need to write the incremental refresh's SegmentInfos to disk if we are pushing a new file to the store?

dreamer-89 (Member) commented Oct 14, 2022

Refresh Vs Commit

We need to make sure we are not breaking the refresh semantics with the remote store. So in essence it's not one vs. the other; we need to ensure we add features without changing staleness.

Thank you @Bukhtawar for the feedback and for calling this out. Yes, there will be changes in the refresh mechanism, but the advantages we seek from commit-only backup seem to outweigh this shortcoming, as also mentioned in the last comment by @mch2. We plan to call this out clearly in the documentation and perform benchmarking/tests to verify the claim.

Avoid update disk store

I think we also need to think from a failover perspective; we cannot have commits only in memory, as that will increase the failover time and impact availability.

In order to quantify the failover time and availability impact, we need to add tests/benchmarks. We can bound the number of fsync-less commits in order to keep failover/recovery time bounded; one way is to ensure that at least every Nth commit is fsynced (a small sketch is below).
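
A tiny sketch of the bounding idea: count fsync-less commits and force a real fsync at least every N commits, so the amount of un-fsynced local state stays bounded. The class and the N knob are illustrative only.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Decides whether the next commit should be fsynced, forcing one at least every N commits. */
public class BoundedFsyncPolicy {

    private final int fsyncEveryNCommits;
    private final AtomicInteger commitsSinceLastFsync = new AtomicInteger(0);

    public BoundedFsyncPolicy(int fsyncEveryNCommits) {
        this.fsyncEveryNCommits = fsyncEveryNCommits;
    }

    public boolean shouldFsync() {
        if (commitsSinceLastFsync.incrementAndGet() >= fsyncEveryNCommits) {
            commitsSinceLastFsync.set(0);
            return true;   // bound reached: pay the fsync cost on this commit
        }
        return false;      // fsync-less commit; durability comes from the remote store
    }
}
```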

Deletion policy

A reasonable retention policy should be fine as long as there are no additional costs or trade-offs.

The deletion-policy approach (clean up after M days, or keep only the last N commits) de-couples primary and replica. The remaining approaches need some sort of collaboration among nodes. For the POC, I am planning to rely on the existing remote store cleanup policy (keep only the last N commits), which keeps only the last 10 commits. A PITR implementation (14 days of data retention) on the remote store side would provide the clean-up-after-M-days functionality.

dreamer-89 (Member) commented:

Closing this issue; the POC work is tracked in #4536.
