[Discuss] Writes on NRT Replica with Remote Translog #3706

Closed
Tracked by #5671
Bukhtawar opened this issue Jun 25, 2022 · 33 comments
Labels: distributed framework, enhancement

Comments

@Bukhtawar
Collaborator

Bukhtawar commented Jun 25, 2022

Is your feature request related to a problem? Please describe.
For a remote translog in NRTReplicationEngine we ideally don't need to forward requests to the replica, since the translog has already been durably persisted to a remote store. However, in the event of a network partition where the primary and replica get partitioned off, the primary might continue to ack writes while the replica (now promoted to a new primary) does the same. This might continue for longer, especially with #3045, even when there is no active master. As a result we can run into divergent writes and data integrity issues.
To avoid this we need a proposal that mitigates this risk.
Also see #3237

Describe the solution you'd like

  1. We do a no-op replication to all replica copies; this is similar to what we do today, except that the replicated request carries no operation.
  2. We could have a mechanism in the remote store that ensures the primary and replicas reconcile their views through a commonly agreed, versioning-based protocol. If a replica gets promoted, it bumps up the version; the stale primary should observe this and fail to ack subsequent writes.


@ashking94
Member

ashking94 commented Jul 7, 2022

Well, I was thinking about this problem and realised that it would exist with segment uploads as well. Below is a use case -

  1. Writer 1 creates segments and issues a write to storage. Filename: f1, contents: c1.
  2. Writer 1 dies.
  3. Writer 2 wakes up and issues a write to storage. Filename: f1, contents: c2.
  4. With no OCC, we really can’t ensure that c1 does not override c2. Even if the file names are different, we will need a metadata location that defines the segment location or the segment-to-writer mapping. The writes for that metadata should be conditional to ensure consistency (see the sketch below).

Let me know your thoughts.
cc @itiyamas
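
A minimal sketch of the conditional metadata write mentioned in point 4 (a hypothetical `MetadataStore` class, not an OpenSearch or blob-store API): the segment-to-writer mapping is updated only if the caller still holds the version it read, so a late write from a dead writer cannot silently clobber the newer mapping.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model of a metadata object guarded by optimistic concurrency control.
// A real implementation would need the remote store itself to support conditional writes.
class MetadataStore {
    private final Map<String, String> segmentToWriter = new ConcurrentHashMap<>();
    private final AtomicLong version = new AtomicLong(0);

    long currentVersion() {
        return version.get();
    }

    // Update the segment-to-writer mapping only if the caller observed the latest version.
    synchronized boolean putIfVersionMatches(long expectedVersion, String segmentFile, String writerId) {
        if (version.get() != expectedVersion) {
            return false; // another writer updated the metadata since we read it
        }
        segmentToWriter.put(segmentFile, writerId);
        version.incrementAndGet();
        return true;
    }

    public static void main(String[] args) {
        MetadataStore store = new MetadataStore();
        long v0 = store.currentVersion();

        // Writer 2 (the new primary) publishes f1 -> writer-2 first.
        boolean w2 = store.putIfVersionMatches(v0, "f1", "writer-2");
        // Writer 1 (dead/partitioned) retries with the stale version and is rejected.
        boolean w1 = store.putIfVersionMatches(v0, "f1", "writer-1");

        System.out.println("writer-2 accepted: " + w2 + ", writer-1 accepted: " + w1);
    }
}
```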

@ashking94
Member

ashking94 commented Jul 15, 2022

The above scenario has been detailed and elaborated further in #3906 by @sachinpkale.

@ashking94
Member

ashking94 commented Jul 15, 2022

Continuing further on the original point of discussion - Avoiding divergent writes with segment replication & remote translog

1. Context

Along with segment replication, we plan to have only one source of truth for the translog, stored in the remote store. Currently, in NRT segment replication mode, segments are replicated asynchronously and the primary is responsible for forwarding write requests to the replicas, where the request is replayed without queuing it in the indexing buffer, only keeping track of the translog. Having the translog on the replicas helps keep durability in check and also lets the shard validate whether it is still the primary. Only the primary is responsible for creating segments, and it notifies the replicas to sync segments using NRT checkpoints.

With a remote translog on top of NRT segment replication, the remote store becomes the single source of truth for the translog, and only the current primary uploads the translog files. This obsoletes 1) the need to generate the translog on replicas, as it can easily be replayed from the remote store in case of failovers/failures, and 2) storing the translog on local disk on the primary. However, this does not necessarily eliminate the need for replication, as the replication call also helps the shard validate whether it is the real primary or not.

Now, if we decide to use remote store for translog and also stop replication requests, we get into this situation where due to network partition, the older primary can continue to accept write requests and cause divergent writes. The issue is further accentuated due to changes done to allow writes in #3621.

Invariants

  • Durability - An acknowledged request will always be honoured in case of machine failures, failovers, and recoveries.
  • Correctness & Integrity - Durability guarantees that the data does not get lost once acknowledged; however, when replica promotion happens, all the translogs must get replayed and must be correct in terms of ordering.

As mentioned in the current issue description, there are 2 possible options, which are discussed in further detail in the following sections -

  1. No-op replication - This is similar to replication, except that the replicated request is a no-op. This way the partitioned primary can realise that it is no longer the primary whenever the primary term changes. There is still a concern that must be addressed while dealing with indices with no replicas - more details in section 4. Other areas of concern.
  2. Using a remote store that guarantees read-after-write consistency (like S3) - We will use a remote store for storing the primary term, which can help decide which is the active primary and which is the stale primary.

More details on the above 2 possible options follow below -

2. No-op replication

Currently, if there are replicas configured, write requests are replicated synchronously across the in-sync replicas. This offers durability if the primary shard copy fails: when this happens, the master promotes one of the in-sync replicas to become primary. If there are no in-sync replicas, the index shows a red status. Now, with a remote store for the translog along with segment replication, we don’t need the translogs to be stored locally on the primary or the replicas. So, technically we can skip keeping the write request in the indexing buffer and also skip the sync/async persistence of translogs on the replica’s data volume. This would require additional changes as mentioned in #3786. We can continue to use the primary term validation logic that helps the stale partitioned primary realise that it is no longer the primary.
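
As a rough illustration of the primary term validation retained by no-op replication (a plain Java model, not the actual OpenSearch replication code): a replica remembers the highest primary term it has seen and rejects a validation call carrying an older term, which tells the stale primary not to acknowledge the write.

```java
// Simplified model of "no-op replication": the primary ships only its primary term,
// and each replica rejects terms older than the highest one it has seen.
class PrimaryTermValidationSketch {

    static class Replica {
        private long highestSeenTerm;

        Replica(long initialTerm) {
            this.highestSeenTerm = initialTerm;
        }

        // Called when the cluster manager promotes this replica / publishes a new term.
        synchronized void advanceTerm(long newTerm) {
            highestSeenTerm = Math.max(highestSeenTerm, newTerm);
        }

        // The "no-op" validation request: no document payload, just the sender's term.
        synchronized boolean validate(long senderPrimaryTerm) {
            return senderPrimaryTerm >= highestSeenTerm;
        }
    }

    public static void main(String[] args) {
        Replica replica = new Replica(1);

        long stalePrimaryTerm = 1;
        System.out.println("before promotion, term 1 accepted: " + replica.validate(stalePrimaryTerm));

        // Replica gets promoted; the cluster manager bumps the primary term to 2.
        replica.advanceTerm(2);

        // The partitioned old primary still believes it is on term 1 and must not ack.
        System.out.println("after promotion, term 1 accepted: " + replica.validate(stalePrimaryTerm));
    }
}
```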

2.1 Concerns

  • If there are no replicas, the stale primary would keep accepting write requests given #3621 (changing the default no_master_block from write to metadata_write). If auto restore kicks in, both the stale and new primary can continue to accept requests because the connection between the stale primary and the master (or other nodes) is gone, and the stale primary would not know when to stop accepting requests. How do we make sure that the stale primary has stopped accepting write requests?

3. Using remote store that guarantees read-after-write consistency

The primary term validation that we want to achieve in the no-op replication strategy of section 2 above can also be modelled using a remote store. The following could be a strategy that the primary can follow -

  • The flow whenever a replica is promoted to primary would look like below -
    • It reads the CurrentPrimaryTerm from remote store and validates the new primary term to be published is greater than the current primary term. Leader/Primary term metadata prefix for remote store: Base64UUID/IndexUUID/ShardID with filename as CurrentPrimaryTerm.
    • It publishes the new primary term to the remote store.
    • It starts downloading translogs based on the NRT checkpoint of the replica, as segment replication would have copied over the segment files up to a certain NRT checkpoint.
    • There might be a need to copy over the translog files from the previous primary term path to the new current primary term path in case of the primary failing before segment creation happens on the new primary. But this would depend on how translog replaying happens. I need to read more on this. TODO.
  • The flow during a write request would look like below (a rough sketch follows this list) -
    • Primary receives the write request.
    • It stores the request in the indexing buffer and uploads the translog to the remote store. Translog prefix for remote store: Base64UUID/IndexUUID/ShardID/PrimaryTerm.
    • It reads the CurrentPrimaryTerm and validates against the local primary term.
    • If the local primary term and the CurrentPrimaryTerm don’t match, we respond with failure back to the client. However, since the data has been persisted and the new primary could have read the unacknowledged translog, we can allow that to happen and live without the need for a rollback.
      • There is a possibility that unacknowledged or failed write requests can get persisted. For such cases, we can use the last modified time of the translog file to determine whether to keep or discard it for replaying the translogs.
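
A minimal sketch of the write-path check above, with the remote store modelled as an in-memory map (the `RemoteTranslogWriteSketch` class is hypothetical and the Base64UUID/IndexUUID/ShardID layout is simplified to plain keys): the translog entry is uploaded under the writer's primary term path, then CurrentPrimaryTerm is re-read before the write is acknowledged.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model of the remote-store approach: upload the translog under the
// writer's primary-term path, then re-read CurrentPrimaryTerm before acking.
class RemoteTranslogWriteSketch {

    // Stands in for a read-after-write consistent blob store, keyed by object path.
    static final Map<String, String> remoteStore = new ConcurrentHashMap<>();

    static boolean indexOperation(String shardPath, long localPrimaryTerm, String translogEntry) {
        // 1. Upload the translog entry under <shardPath>/<primaryTerm>/...
        remoteStore.put(shardPath + "/" + localPrimaryTerm + "/translog-" + translogEntry.hashCode(), translogEntry);

        // 2. Re-read CurrentPrimaryTerm and only ack if it still matches our local term.
        long currentTerm = Long.parseLong(remoteStore.getOrDefault(shardPath + "/CurrentPrimaryTerm", "0"));
        return localPrimaryTerm == currentTerm; // false => persisted but unacknowledged (dirty write)
    }

    public static void main(String[] args) {
        String shardPath = "base64uuid/index-uuid/0";
        remoteStore.put(shardPath + "/CurrentPrimaryTerm", "1");

        System.out.println("term 1 write acked: " + indexOperation(shardPath, 1, "doc-1"));

        // A replica is promoted and publishes primary term 2.
        remoteStore.put(shardPath + "/CurrentPrimaryTerm", "2");

        // The stale primary's write lands in the store but is not acknowledged.
        System.out.println("stale term 1 write acked: " + indexOperation(shardPath, 1, "doc-2"));
    }
}
```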

3.1 Concerns

  • The remote store can become a single point of failure for all indexing operations and replica promotion. If the remote store is down, there can be repetitive failovers of the primary shard. There will be a need to devise a strategy for how the cluster behaves when the remote store is down, irrespective of which strategy is selected to avoid divergent writes.

4. Other areas of concern

  • With just segment replication and the changes to allow writes in the absence of a master (#3621, changing the default no_master_block from write to metadata_write), the primary on a partitioned node will continue to accept writes until it learns from the replicas that it is no longer the primary. In the case of no replicas, the partitioned primary will continue to accept writes indefinitely in spite of the master not being reachable. This behaviour is due to the change made in #3621.
  • This needs to be validated with code - during peer recovery handoff, can there be a case where a replica gets marked in-sync and the old primary gets partitioned off just before the cluster state is published? Could this lead to a scenario where the old primary and the new primary continue to act as primary in their individual capacities?

5. Considerations

  • Translog durability guarantees
    • Like today, we can continue to use a setting similar to index.translog.durability, with which the user can decide how frequently the translogs are persisted to the remote store. Both approaches provide this, but at a cost in performance and correctness.
  • Performance
    • Transport calls would be cheaper in terms of latency in comparison to remote store calls (which checks the latest primary term).
    • Remote store latency, throughput and throttling limits should be in check so that the remote store does not become the weakest link in the system.
  • Plugin Extensibility (of primary term validation check)
    • No-op replication can continue to work irrespective of choice of remote store for segment & translog.
    • Remote store should be able to handle the abstraction of storing primary term metadata.
  • Correctness
    • Since replication has already been the backbone for avoiding divergent writes, the no-op replication approach can be relied upon.
    • For remote store approach, the flow discussed above will work as long as the underlying remote store offers read-after-write consistency.
  • Security
    • No changes in No-op replication approach.
    • OS would require CRUD access on the S3 bucket with the appropriate prefix.
  • Operational Work
    • Ease of implementation & management
      • The no-op replication approach requires minimal work to make the existing engine perform no-op replication for replicas. It will require changes in the Internal Engine (for the primary) so that concepts like the global/local checkpoint can be refined. There will be no extra management overhead beyond what exists today with replication.
      • The remote store approach will also require changes in the Engine that runs on the primary and on the replicas. It would have similar effort around changes in the Internal Engine (for the primary) as the no-op replication mode. The Engine on the replica would be such that no calls land on it - be it index()/delete(); it exists only to be promoted to primary when the master asks it to become one. The abstraction would be built such that there are methods that persist/fetch the latest primary term, but the onus will be on remote store plugin implementors to extend the methods as per the abstraction.
    • Complexity
      • Both approaches have the same kind of complexity.

5.1 Evaluation of approaches simplified

| Considerations \ Approaches | No Op Replication | Remote Store |
| --- | --- | --- |
| Correctness | High | Dependent on underlying remote store |
| Performance | High | Low to Medium |
| Plugin Extensibility | Works for any choice of remote store for segments / translog | Requires abstraction of storing primary term metadata per index shard |
| Security | Status quo | OS requires CRUD on remote store for appropriate directory/paths/keys |
| Operational - Ease of implementation & management | Low to Medium | Medium to High |
| Operational - Complexity | Medium | Medium |

Based on the evaluation, both approaches can be used to handle divergent writes with a remote translog. However, weighing the major invariants (durability and correctness), performance, and the other considerations, we can modify the current replication to make it a no-op and achieve the overall goal.

Please feel free to correct me if I have stated any facts incorrectly; feedback/suggestions are welcome.


@ashking94
Member

cc: @mch2 @sachinpkale @Bukhtawar @kartg @andrross @reta @nknize. Looking for thoughts and suggestions.

@andrross
Member

Caveat: I'm certainly not an expert on this part of OpenSearch so please correct me wherever I'm wrong

One nitpick that might help clarify things is to replace the phrase "no op replication" with something like "primary term validation". If I understand correctly, you wouldn't need to send the document over the wire and instead would just be asking the replicas "is the primary term x?".

One concern I have about that approach is that today when a document is replicated via the translog, the replicas are in effect performing two operations transactionally: 1) validating that the primary term hasn't changed, and 2) persisting the document. With the proposed change those operations will be split apart. The replicas will confirm the primary term hasn't changed and the remote store will persist the document. However, those operations will be split apart to different distributed components, which makes me a little nervous. Can you provide more details about how this will work? e.g. do you upload to remote store before or after confirming primary term from the replicas? What happens if the primary term changes in between those operations? etc

Having said that, it does seem like a more natural fit to have the remote translog provide the same conceptual functionality provided by replicas today (persist this document only if the primary term is still x). It obviously puts a stronger requirement on the actual remote store because it has to be more than just dumb durable storage.

@reta
Collaborator

reta commented Jul 15, 2022

A few comments from my side.

We can tweak the changes done in #3621 for the stale primary to stop accepting write requests. Auto restore can kick in then. This will make sure that, when the restore happens, the stale primary has failed totally.

I think this is the right thing to do, even in the case of a primary with no replicas: when the connection with the master is lost (split brain etc.), it looks right for the primary to go into read-only mode and not accept writes anymore.

Leader/Primary term metadata prefix for remote store: Base64UUID/IndexUUID/ShardID with filename as CurrentPrimaryTerm.

It looks clever to store the CurrentPrimaryTerm (under Base64UUID/IndexUUID/ShardID) and check it before every write operation (so the primary could detect if it is stale or not). Storing translog under Base64UUID/IndexUUID/ShardID/PrimaryTerm would eliminate overlaps of old & new primary trying to update the translog simultaneously. But in this case the old translogs have to be recycled somehow, right?

@Bukhtawar
Collaborator Author

One concern I have about that approach is that today when a document is replicated via the translog, the replicas are in effect performing two operations transactionally: 1) validating that the primary term hasn't changed, and 2) persisting the document. With the proposed change those operations will be split apart. The replicas will confirm the primary term hasn't changed and the remote store will persist the document. However, those operations will be split apart to different distributed components, which makes me a little nervous. Can you provide more details about how this will work? e.g. do you upload to remote store before or after confirming primary term from the replicas? What happens if the primary term changes in between those operations? etc

Thanks @andrross for the observation. Here is how we intend to solve the above, but before that, some context on how document replication does it; below is the sequence of steps.

  1. An isolated primary indexes operations to Lucene and writes to its local translog.
  2. Once it forwards the request to the replicas, it realises it has been partitioned off.
  3. The isolated primary would fail to ack this request to the client; however, the operation has been committed to the translog and runs the risk of exposing these operations for reads.
  4. Soon after, the no_master_block would kick in and help mitigate this by stopping further reads from being accepted.

What follows from above is,

  1. It's possible to have dirty reads which are unacknowledged
  2. No acknowledged writes should be lost.

So if we keep the same status quo, we can always write operations to the remote xlog; once the write completes, but before we acknowledge it, we can check the primary term.

Having said that, it does seem like a more natural fit to have the remote translog provide the same conceptual functionality provided by replicas today (persist this document only if the primary term is still x). It obviously puts a stronger requirement on the actual remote store because it has to be more than just dumb durable storage.

This part is something we should discuss more. The implementation would probably require some leasing protocol (a rough sketch follows the list below):

  1. Writer node (W1) hosting the primary shard tries to acquire a lease for shard id (S1) in primary term (P1) for (x) seconds
  2. Once acquired, it uploads the translog to the remote store and renews the lease subsequently
  3. Now, let's say the remote store gets throttled and the plugin client retries
  4. Meanwhile, a replica gets promoted to primary and writer node (W2) in primary term (P2) acquires the lease for shard id (S1) and starts the upload
  5. W1 and W2 at this point could be concurrently uploading, unless W1's upload gets interrupted (JVM halted) once W1 fails to renew the lease
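
A rough sketch of such a leasing protocol under assumed semantics (a toy lease service; nothing here is an existing OpenSearch or remote-store API): the writer acquires a time-bound lease keyed by shard id, renews it while uploading, and refuses to acknowledge once it can no longer prove the lease is held.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of a lease service keyed by shard id. A real implementation would live
// in (or next to) the remote store and would need its own consistency guarantees.
class ShardLeaseSketch {

    static class Lease {
        final String holder;        // e.g. "W1@P1"
        final long expiresAtMillis;
        Lease(String holder, long expiresAtMillis) {
            this.holder = holder;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    static class LeaseService {
        private final Map<String, Lease> leases = new ConcurrentHashMap<>();

        // Grant the lease if it is free, expired, or already held by this holder (renewal).
        synchronized boolean acquire(String shardId, String holder, long ttlMillis) {
            Lease existing = leases.get(shardId);
            long now = System.currentTimeMillis();
            if (existing == null || existing.expiresAtMillis <= now || existing.holder.equals(holder)) {
                leases.put(shardId, new Lease(holder, now + ttlMillis));
                return true;
            }
            return false;
        }

        // The writer must confirm it still holds an unexpired lease before acking a write.
        synchronized boolean isHeldBy(String shardId, String holder) {
            Lease existing = leases.get(shardId);
            return existing != null && existing.holder.equals(holder)
                    && existing.expiresAtMillis > System.currentTimeMillis();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        LeaseService service = new LeaseService();

        // W1 (primary term P1) takes a short lease and uploads translogs while renewing it.
        System.out.println("W1 acquired: " + service.acquire("S1", "W1@P1", 100));

        Thread.sleep(150); // W1 stalls (GC pause / throttled retries) and misses a renewal.

        // W2 (primary term P2) acquires the now-expired lease and becomes the writer.
        System.out.println("W2 acquired: " + service.acquire("S1", "W2@P2", 100));

        // W1 must check the lease before acknowledging its in-flight upload.
        System.out.println("W1 may ack: " + service.isHeldBy("S1", "W1@P1"));
    }
}
```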

We can always go with the "Remote Store" based proposal, which still provides us the limited guarantees we need. The part I like about this approach is that it provides a nice property to better guard against isolated primaries, as long as the cluster can guarantee the primary term invariant (no two primaries at any point can have the same primary term) at all times.

The "No Op replication" approach inherently treats remote xlog store an extension of the "primary" translog storage instead of treating it as a separate distributed store in a bid to ensure that writes ever happens from the primary of the shard instead of guaranteeing "a single writer" at all times.

The caveat here is that if the writer version (in this case the primary term) change itself isn't guaranteed to be serialised, the remote store itself wouldn't be able to guarantee correctness, unless it implements some form of the leasing protocol mentioned above.

One thing to note here is that the current snapshot mechanism (only primaries snapshot) also uses a remote store while keeping the remote store logic dumb. I think we should confirm how cases like isolated primaries are dealt with there, and whether the remote store or remote xlog can leverage some of those properties. It might be possible that snapshots have certain trade-offs which might not be acceptable for the current use case.

It looks clever to store the CurrentPrimaryTerm (under Base64UUID/IndexUUID/ShardID) and check it before every write operation (so the primary could detect if it is stale or not). Storing translog under Base64UUID/IndexUUID/ShardID/PrimaryTerm would eliminate overlaps of old & new primary trying to update the translog simultaneously. But in this case the old translogs have to be recycled somehow, right?

@reta Remote translog pruning would be discussed separately #3766

@Bukhtawar
Collaborator Author

Bukhtawar commented Jul 18, 2022

/cc: @itiyamas @muralikpbhat @backslasht

@sachinpkale
Member

Thanks @ashking94 for the detailed approaches.

Regarding the "Using remote store that guarantees read-after-write consistency" approach, I have a few questions: Are we thinking of using the same remote store that will be used to store the translog, or can this be a different store? If it is the same, will this approach make read-after-write a prerequisite for stores to be considered for the remote translog?

@ashking94
Member

Regarding the "Using remote store that guarantees read-after-write consistency" approach, I have a few questions: Are we thinking of using the same remote store that will be used to store the translog, or can this be a different store? If it is the same, will this approach make read-after-write a prerequisite for stores to be considered for the remote translog?

@sachinpkale Yes, we are thinking of using the same remote store that will be used to store the translog. Read-after-write is not just a requirement for primary term validation, but has to be a mandate for remote storage in general, as we want subsequent reads (after writes) to give us the data written so far; otherwise we can get into an inconsistent state in the case of auto restore/failovers/recoveries and the like.

@andrross
Member

andrross commented Jul 18, 2022

@Bukhtawar Thanks for calling out that OpenSearch today allows for dirty reads. I think it is probably okay to stay with the status quo behavior here. (side note: this is where something like a TLA+ model would be helpful, mentioned in #3866, because it could be a more formal definition of the various consistency properties offered by the system).

So if we keep the same status quo, we can always write operations to the remote xlog; once the write completes, but before we acknowledge it, we can check the primary term.

With the current system, an isolated primary may briefly write to its local translog and serve dirty reads, but those writes will not be acknowledged and will not be replicated/persisted. In the remote case, the isolated primary will still write to the remote xlog before discovering it is no longer the primary. Is there a mechanism to ensure those writes will not be picked up by the new primary? Is that a concern here? (update: I see now this is discussed in the "remote store" option, which I commented on below, but I think this is also an issue for the "no op replication" case too?)

@andrross
Member

andrross commented Jul 18, 2022

Having said that, it does seem like a more natural fit to have the remote translog provide the same conceptual functionality provided by replicas today (persist this document only if the primary term is still x). It obviously puts a stronger requirement on the actual remote store because it has to be more than just dumb durable storage.

This part is something we should discuss more. The implementation would probably require some leasing protocol

For this I was thinking of relying on transactional properties of the remote store. I think it's unlikely that any of the remote object stores used in OpenSearch today would be able to meet that contract. Given that we can accept uncommitted reads (the current behavior), we probably don't need such a strong guarantee.

@andrross
Member

andrross commented Jul 18, 2022

Good write up @ashking94! Just some comments on "Using remote store that guarantees read-after-write consistency":

It reads the CurrentPrimaryTerm from remote store and validates the new primary term to be published is greater than the current primary term. Leader/Primary term metadata prefix for remote store: Base64UUID/IndexUUID/ShardID with filename as CurrentPrimaryTerm.
It publishes the new primary term to the remote store.

For this to work properly I think you might need proper check-and-set semantics in the remote store (i.e. conditional write). The above describes two steps: 1) read and validate, then 2) write new value. Is there protection here to ensure that that CurrentPrimaryTerm still has the same value observed in step 1 when it does the write in step 2?

If the local primary term and the CurrentPrimaryTerm don’t match, we respond with failure back to the client. However, since the data has been persisted and the new primary could have read the unacknowledged translog, we can allow that to happen and live without the need for a rollback.
There is a possibility that unacknowledged or failed write requests can get persisted. For such cases, we can use the last modified time of the translog file to determine whether to keep or discard it for replaying the translogs.

I don't think we want to use timestamps for logical ordering like this because clock skew can happen, clocks can go backwards, etc. Are sequence numbers serialized relative to primary term changes? If we know the sequence number at which a primary term changes, then any sequence numbers newer than that from the previous primary term could be ignored.
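
A small illustration of ordering by (primaryTerm, seqNo) rather than wall-clock time (a hypothetical `Operation` record, not the actual translog format): once the sequence number at which the term changed is known, operations from the older term beyond that point can simply be dropped during replay.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: filter translog operations during replay using (primaryTerm, seqNo)
// instead of file timestamps, which are vulnerable to clock skew.
class TermAwareReplaySketch {

    record Operation(long primaryTerm, long seqNo, String doc) {}

    // Keep operations from older terms only up to the seqNo at which the term changed.
    static List<Operation> replayable(List<Operation> ops, long newTerm, long termChangeSeqNo) {
        return ops.stream()
                .filter(op -> op.primaryTerm() >= newTerm || op.seqNo() <= termChangeSeqNo)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The new primary took over at term 2 having seen everything up to seqNo 10.
        List<Operation> ops = List.of(
                new Operation(1, 10, "acked on the old primary"),
                new Operation(1, 11, "dirty write from the stale primary, never acked"),
                new Operation(2, 11, "first write on the new primary"));

        System.out.println(replayable(ops, 2, 10));
    }
}
```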

@Bukhtawar
Collaborator Author

Bukhtawar commented Jul 19, 2022

It reads the CurrentPrimaryTerm from remote store and validates the new primary term to be published is greater than the current primary term. Leader/Primary term metadata prefix for remote store: Base64UUID/IndexUUID/ShardID with filename as CurrentPrimaryTerm.
It publishes the new primary term to the remote store.

For this to work properly I think you might need proper check-and-set semantics in the remote store (i.e. conditional write). The above describes two steps: 1) read and validate, then 2) write new value. Is there protection here to ensure that that CurrentPrimaryTerm still has the same value observed in step 1 when it does the write in step 2?

@andrross check-and-set semantics are probably not needed since, as discussed, it's fine for the translog to have dirty writes; we just need to guarantee we don't ack them back. So essentially what we need is to write the translog in a primaryTerm (which could be stale) but ensure we do a read on the primary term again (this could be another piece of metadata on a remote store that guarantees read-after-write consistency) after writing the translog but before acking back.

One way to achieve that could be for a writer in primaryTerm (p+1) to write a file (stale.manifest) into all previous primaryTerm paths, whose mere existence indicates whether the path is writeable or not, instead of using a common primaryTerm file. The protocol then changes to checking whether the (stale.manifest) exists in the given primaryTerm path.

What is guaranteed by the cluster is that once a higher primaryTerm is known to exist, all older primaryTerm writers should go stale. What follows from here is that, alternatively, even if we write a new file in a common directory prefixed by primaryTerm before we start the shard in that primaryTerm, all writers should do a LIST and confirm that there are no other writers active in a higher term before acknowledging writes.
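
A sketch of the stale.manifest variant, modelling the blob store as a set of object keys (the marker name comes from the comment above; everything else is assumed): on promotion the new primary drops a marker into every older term path, and a writer checks for that marker in its own term path after uploading but before acknowledging.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the stale.manifest protocol on top of a read-after-write consistent blob store.
class StaleManifestSketch {

    // Object keys present in the remote store, e.g. "shard-0/term-1/stale.manifest".
    static final Set<String> objects = ConcurrentHashMap.newKeySet();

    // On promotion, the new primary marks every previous term path as no longer writeable.
    static void promote(String shardPath, long newTerm) {
        for (long term = 1; term < newTerm; term++) {
            objects.add(shardPath + "/term-" + term + "/stale.manifest");
        }
    }

    // After uploading a translog file, the writer checks its own term path before acking.
    static boolean mayAck(String shardPath, long writerTerm) {
        return !objects.contains(shardPath + "/term-" + writerTerm + "/stale.manifest");
    }

    public static void main(String[] args) {
        String shardPath = "shard-0";

        System.out.println("term 1 may ack: " + mayAck(shardPath, 1));

        promote(shardPath, 2); // replica promoted to primary in term 2

        objects.add(shardPath + "/term-1/translog-42"); // stale primary's in-flight upload still lands
        System.out.println("stale term 1 may ack: " + mayAck(shardPath, 1));
        System.out.println("term 2 may ack: " + mayAck(shardPath, 2));
    }
}
```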

If the local primary term and the CurrentPrimaryTerm don’t match, we respond with failure back to the client. However, since the data has been persisted and the new primary could have read the unacknowledged translog, we can allow that to happen and live without the need for a rollback.
There is a possibility that unacknowledged or failed write requests can get persisted. For such cases, we can use the last modified time of the translog file to determine whether to keep or discard it for replaying the translogs.

I don't think we want to use timestamps for logical ordering like this because clock skew can happen, clocks can go backwards, etc. Are sequence numbers serialized relative to primary term changes? If we know the sequence number at which a primary term changes, then any sequence numbers newer than that from the previous primary term could be ignored.

Firstly, it's still fine to take or discard an unacknowledged write (since dirty writes are the status quo). Totally agree that timestamp resolution would be incorrect to start with; we would use sequence numbers for all practical purposes of resolving conflicts.

@Bukhtawar
Collaborator Author

Given there could be rough edges in a new "remote store" protocol with a mix of cluster state primitives and distributed store locking, especially when file system stores lack optimistic concurrency control properties, it seems safer and optimal to use the no-op replication protocol, which has already been battle tested and relies on a consensus among all writers per request before making a call on acknowledging the request. I would seek feedback on the cases that would probably not be covered by this protocol before we start ironing out the rough edges of the "remote store" protocol.

@muralikpbhat

muralikpbhat commented Jul 19, 2022

I agree that no-op replication is safer, but it introduces a dependency from the primary to the replicas on the write path, which otherwise doesn't have to be there in the case of NRT with remote storage.

I get the challenges involved in depending on the properties of the remote store. There are a couple of requirements here: read-after-write consistency and a uniqueness guarantee for the primary term. Since primary term uniqueness is guaranteed by the cluster management system, all we need is a way for the older primary to know that a new primary has started writing. Two options: 1) The old primary can read a common file (or do a list) after every write (before ack), which is expensive. 2) The new primary can write only after there is a guarantee that the old primary has stopped. This can also be expensive if it has to do it for every write; however, if it does this only once after primary promotion, it can be optimal. For example, the primary, when it is promoted, can make the previous term's remote store locations non-writable. This way, in steady state, there is no tax; only during a primary switch is there one-time work of marking the old location non-writable. For example, in the case of S3, we can use object tags to have a condition that only objects with newPrimaryTerm as a tag can be written to that location of the shard. Hopefully most remote stores support that kind of ACL. However, if this ACL policy propagation takes time, it could delay new primary promotion.

@ashking94
Member

Thanks @andrross & @Bukhtawar for ideating on this.

@andrross check-and-set semantics are probably not needed since, as discussed, it's fine for the translog to have dirty writes; we just need to guarantee we don't ack them back. So essentially what we need is to write the translog in a primaryTerm (which could be stale) but ensure we do a read on the primary term again (this could be another piece of metadata on a remote store that guarantees read-after-write consistency) after writing the translog but before acking back.

One way to achieve that could be for a writer in primaryTerm (p+1) to write a file (stale.manifest) into all previous primaryTerm paths, whose mere existence indicates whether the path is writeable or not, instead of using a common primaryTerm file. The protocol then changes to checking whether the (stale.manifest) exists in the given primaryTerm path.

What is guaranteed by the cluster is that once a higher primaryTerm is known to exist, all older primaryTerm writers should go stale. What follows from here is that, alternatively, even if we write a new file in a common directory prefixed by primaryTerm before we start the shard in that primaryTerm, all writers should do a LIST and confirm that there are no other writers active in a higher term before acknowledging writes.

Adding on to @Bukhtawar's point, we can revise the flow when the replica promotion happens -> The initialising shard (in primary mode) can fetch the common path (Base64UUID/IndexUUID/ShardID/PrimaryTerms) and list all the primary terms in the folder -> It can proactively fail itself if it sees that there are primary terms bigger than its own -> It publishes its own primary term (<primary-term>.active) in the common path and starts downloading the translogs.

The flow during a write request becomes -> Primary receives the request -> It stores the request in the indexing buffer and uploads the translog to the remote store. Translog prefix for remote store: Base64UUID/IndexUUID/ShardID/PrimaryTerm -> It does a list on the common path and validates that there are no active terms higher than its own.
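
A compact sketch of these revised checks with an in-memory listing (the `<primary-term>.active` naming comes from the flow above; everything else is assumed): the initialising primary lists the common path, fails itself if a higher term is already active, otherwise publishes its own marker, and the write path repeats the listing before acknowledging.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the listing-based primary term check on a common remote-store path.
class ActiveTermListingSketch {

    // Markers under the common path, e.g. "idx-uuid/0/PrimaryTerms/2.active" -> 2.
    static final Set<Long> activeTerms = ConcurrentHashMap.newKeySet();

    // Promotion: list existing markers, fail if a higher term exists, else publish ours.
    static boolean startAsPrimary(long myTerm) {
        boolean higherExists = activeTerms.stream().anyMatch(t -> t > myTerm);
        if (higherExists) {
            return false; // proactively fail this shard copy
        }
        activeTerms.add(myTerm);
        return true;
    }

    // Write path: after uploading the translog, list again and ack only if we are still highest.
    static boolean mayAck(long myTerm) {
        return activeTerms.stream().noneMatch(t -> t > myTerm);
    }

    public static void main(String[] args) {
        System.out.println("term 1 starts: " + startAsPrimary(1));
        System.out.println("term 1 may ack: " + mayAck(1));

        System.out.println("term 2 starts: " + startAsPrimary(2)); // replica promoted

        System.out.println("stale term 1 may ack: " + mayAck(1));
        System.out.println("term 2 may ack: " + mayAck(2));
    }
}
```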

Firstly its still fine to take or discard an unacknowledged write(since dirty writes are status quo), totally agree timestamp resolutions would be incorrect to start with and we would be using sequence numbers for all practical purposes of resolving conflicts

Agreed to this point.

@muralikpbhat

Regarding the no-op replica write option: how do you handle the case of the 'only' copy? What is the guarantee that a primary recovery won't happen unless the partitioned-out primary has stopped writing?

@Bukhtawar
Collaborator Author

Bukhtawar commented Jul 19, 2022

For example, the primary, when it is promoted, can make the previous term's remote store locations non-writable. This way, in steady state, there is no tax; only during a primary switch is there one-time work of marking the old location non-writable. For example, in the case of S3, we can use object tags to have a condition that only objects with newPrimaryTerm as a tag can be written to that location of the shard. Hopefully most remote stores support that kind of ACL. However, if this ACL policy propagation takes time, it could delay new primary promotion.

Thanks @muralikpbhat, yes that's a good point. While tagging improves and helps with performance, it makes correctness and failover/propagation delays depend on support for these features across the various cloud providers and their consistency models.

The "Noop Replication" also serves to ping replica copies and tries to fail stale copies proactively which otherwise would continue to stay in the cluster for longer and cause problems on failovers

Regarding the no-op replica write option: how do you handle the case of the 'only' copy? What is the guarantee that a primary recovery won't happen unless the partitioned-out primary has stopped writing?

We would rely on the no-master block to kick in after X seconds (this time can be relaxed to include GC pauses as well). Once this duration has expired, we would begin with the restore. Essentially we are relying on an implicit lease granted by the master, which expires once there is no master and the write block sets in. Today too, we wait 60s for a failed node to join back, as defined by the delayed unassigned-primary timeout.

@andrross
Member

I agree that no-op replication is safer, but it introduces a dependency from the primary to the replicas on the write path, which otherwise doesn't have to be there in the case of NRT with remote storage.

I think this point is the crux of the issue. Keeping the dependency from primary to replicas means that slow or otherwise misbehaving replicas can impact indexing, as you can only be as fast as your slowest replica. "No-op replication" will improve things substantially compared to the status quo, given that the replicas only have to do a trivial validation (hence the "no-op"), but completely breaking this dependency is potentially a big win. That being said, the choice here isn't a one-way door. I think it might be reasonable to start with "no-op replication" as it is a safer, less intrusive change with fewer requirements on the remote store. @muralikpbhat What do you think? Is it reasonable to start with "no-op replication" or should we try to eliminate the primary->replica dependency from the start?

@itiyama

itiyama commented Jul 20, 2022

  1. What if the user is fine with paying the cost of storing a checkpoint in a store that supports conditional writes? We should keep that option in mind for extensibility in future while implementing any of the approaches above. Support for conditional writes of a checkpoint (term + generation combination) can bring in a lot of simplicity and (maybe) performance gains, as the checkpoint will be smaller. There should be a mechanism to decouple the checkpoint write from the actual data so that different store technologies can be selected for them. Relying on a store with conditional writes will even prevent dirty reads.
  2. Writing files in a term-specific location is agnostic of any of the approaches described above since the store does not support conditional writes; hence it is important for the application to maintain immutability in some way. It could be that all segment or translog files are written with a unique name in the store and a manifest maintains the actual-file-to-store-file-name mapping.
  3. No-op replication - I agree that this is a much safer approach, with the caveat that indices with 0 replicas will turn red (hence the store is not a true durability fallback anymore).
  4. The tag-based approach on S3 may not work since S3 tags are on objects and not on paths. The same could be true for other cloud providers as well.
  5. It would be interesting to see how you do the garbage collection without making it expensive (since LIST calls are expensive). One way could be to collect the data using the older checkpoint diffs, but that would still leave some garbage data in the storage.
  6. Leasing will work, but it is difficult to get right :) Leasing will not ensure that there are no spurious writes to the store - just that the bad writes are not acknowledged. The application will need to check the lease expiry just before returning the response. Another thing that worries me about leases is that promotions will have to wait for lease expiry.
  7. The write-check protocol introduces an extra read for every call to the store. The check has to be done on every call; selectively making the call on writer switches etc. is not going to work.

@muralikpbhat

Great points. I am also supportive of beginning with no-op replication. However, it would be ideal to have the design in such a way that a remote storage developer can actually implement this differently if desired. Is it possible to abstract this out and give no-op replication as a base implementation which can be overridden?

@ashking94
Member

Thanks @itiyama for the critical points.

Addressing a couple of points -

What if the user is fine with paying the cost of storing a checkpoint in a store that supports conditional writes? We should keep that option in mind for extensibility in future while implementing any of the approaches above. Support for conditional writes of a checkpoint (term + generation combination) can bring in a lot of simplicity and (maybe) performance gains, as the checkpoint will be smaller. There should be a mechanism to decouple the checkpoint write from the actual data so that different store technologies can be selected for them. Relying on a store with conditional writes will even prevent dirty reads.

Noted.

No-op replication - I agree that this is a much safer approach, with the caveat that indices with 0 replicas will turn red (hence the store is not a true durability fallback anymore).

In line with @Bukhtawar's response here, the stale writer would fail and auto restore (an enhancement to #3145) would kick in, reviving the red index - thus adhering to the durability tenet.

It would be interesting to see how you do the garbage collection without making it expensive (since LIST calls are expensive). One way could be to collect the data using the older checkpoint diffs, but that would still leave some garbage data in the storage.

This would be taken care of as part of #3766. It needs to be thought through, but there would definitely be a need for an optimised approach here.

@ashking94
Member

Great points. I am also supportive of beginning with no-op replication. However, it would be ideal to have the design in such a way that a remote storage developer can actually implement this differently if desired. Is it possible to abstract this out and give no-op replication as a base implementation which can be overridden?

@muralikpbhat Appreciate your suggestions. So, here is the proposal: we can go with no-op replication for 2.3 and keep it till GA. This will ensure that existing remote stores support segment/translog remote storage. Since correctness is super crucial and core to OpenSearch, it would make sense to get the no-op replication approach implemented with durability and correctness honoured definitively. After GA, we can have an interface that we can expose as a plugin, independent of the remote store for segments and translog. That interface would need stricter guarantees from plugin extensions, like conditional writes or ACL/path-based writes. However, we will focus on implementing no-op replication (as the only mechanism for handling divergent writes) correctly and with durability as the main tenet for 2.3 and GA.
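
One way such an extension point could look (an illustrative interface only; none of these names exist in OpenSearch): the default implementation keeps the no-op replication behaviour, and a remote store plugin with stronger guarantees could substitute its own primary term check.

```java
// Illustrative extension point only -- these interfaces and names do not exist in OpenSearch.
interface PrimaryTermValidator {
    // Return true if the calling shard is still the primary for the given term and the write may be acked.
    boolean validate(String shardId, long primaryTerm);
}

// Default behaviour: delegate to the no-op replication round trip to the in-sync replicas.
class NoOpReplicationValidator implements PrimaryTermValidator {
    @Override
    public boolean validate(String shardId, long primaryTerm) {
        // Placeholder for: send a no-op replication request carrying primaryTerm to all
        // in-sync replicas and succeed only if none of them reports a higher term.
        return true;
    }
}

// A plugin backed by a remote store with conditional writes / path ACLs could override it.
class RemoteStoreValidator implements PrimaryTermValidator {
    @Override
    public boolean validate(String shardId, long primaryTerm) {
        // Placeholder for: read CurrentPrimaryTerm (or list <term>.active markers) from the
        // remote store and succeed only if primaryTerm is still the highest.
        return true;
    }
}

class ValidatorDemo {
    public static void main(String[] args) {
        PrimaryTermValidator validator = new NoOpReplicationValidator(); // default wiring
        System.out.println("may ack: " + validator.validate("shard-0", 1));
    }
}
```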

@itiyama

itiyama commented Jul 23, 2022

We would rely on the no-master block to kick in after X seconds (this time can be relaxed to include GC pauses as well). Once this duration has expired, we would begin with the restore. Essentially we are relying on an implicit lease granted by the master, which expires once there is no master and the write block sets in. Today too, we wait 60s for a failed node to join back, as defined by the delayed unassigned-primary timeout.

Let us be clear on why we are choosing the No-op replication approach. We rely on that approach because it is an existing mechanism of handling divergent writes and we don't want to build a new protocol at this point as any new protocol introduces complexity. The approach relies on primary replica interaction for its correctness. When the only copy of the data is lost, it marks the shard as unassigned. Now, we are digressing from that algorithm where we will recover from data store by introducing leasing selectively on "only copy" failures. Since the data store is not used as part of the agreement(primary term check) during writes in the regular case, it is difficult to say that this approach would work. We are swapping out one part of a complex distributed systems algorithm by introducing a well defined technique. I am very nervous about doing it without building a TLA+ model.

If you fail the primary and mark it as unassigned, the protocol does not diverge from existing one. We can then provide customers an option of auto recovering from store via a setting by accepting data loss. Similarly, if you include the remote store in the agreement(primary term check by verifying the term on remote store), the protocol does not diverge.

Another question I have about this: If you can make the leasing protocol work correctly for a single primary + store, what stops you from making it work correctly with primary + replica + store? In that case, can we get rid of no-op replication altogether?

@Bukhtawar
Collaborator Author

Let us be clear on why we are choosing the No-op replication approach. We rely on that approach because it is an existing mechanism of handling divergent writes and we don't want to build a new protocol at this point as any new protocol introduces complexity. The approach relies on primary replica interaction for its correctness. When the only copy of the data is lost, it marks the shard as unassigned. Now, we are digressing from that algorithm where we will recover from data store by introducing leasing selectively on "only copy" failures. Since the data store is not used as part of the agreement(primary term check) during writes in the regular case, it is difficult to say that this approach would work. We are swapping out one part of a complex distributed systems algorithm by introducing a well defined technique. I am very nervous about doing it without building a TLA+ model.

Any new changes in the replication protocol will need a TLA+ model regardless, IMO.
Having said that, yes, we are digressing from that algorithm but also making the leasing stronger by checking on the master block/reachability during every lease refresh and failing to ack back writes once we are unable to renew the lease or the lease is close to expiring. It's possible that the writer thread is in-flight and the lease expires, but that's probably fine: since the status quo is to allow dirty reads, all we need to ensure is that we check the lease before we ack back.

This can be implemented via a daemon thread that checks for leader reachability and subsequently checks for lease ownership before acknowledging writes.
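
A rough sketch of that daemon thread (the renewal call is a placeholder; none of this is existing OpenSearch code): a scheduled task refreshes an "ack allowed" flag based on how recently the lease was renewed, and the write path consults the flag after persisting but before acknowledging.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a background lease check: writes are only acked while the lease is fresh,
// i.e. while the leader was recently reachable and the lease could be renewed.
class LeaseCheckDaemonSketch {

    static final long LEASE_TTL_MILLIS = 5_000;

    static final AtomicLong lastSuccessfulRenewal = new AtomicLong(System.currentTimeMillis());
    static final AtomicBoolean ackAllowed = new AtomicBoolean(true);

    // Placeholder for a real leader-reachability / lease-renewal check.
    static boolean renewLeaseWithClusterManager() {
        return true; // assume reachable in this sketch
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService daemon = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "lease-check");
            t.setDaemon(true);
            return t;
        });

        daemon.scheduleAtFixedRate(() -> {
            if (renewLeaseWithClusterManager()) {
                lastSuccessfulRenewal.set(System.currentTimeMillis());
            }
            boolean fresh = System.currentTimeMillis() - lastSuccessfulRenewal.get() < LEASE_TTL_MILLIS;
            ackAllowed.set(fresh);
        }, 0, 1, TimeUnit.SECONDS);

        // Write path: persist to the remote translog first, then consult the flag before acking.
        Thread.sleep(100);
        System.out.println("may ack write: " + ackAllowed.get());

        daemon.shutdownNow();
    }
}
```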

If you fail the primary and mark it as unassigned, the protocol does not diverge from existing one. We can then provide customers an option of auto recovering from store via a setting by accepting data loss. Similarly, if you include the remote store in the agreement(primary term check by verifying the term on remote store), the protocol does not diverge.

We would ideally target a solution that guarantees durability.
The problem with putting this responsibility on the remote store is:

  1. Correctness guarantees: we can't always ensure that a remote store (currently a blob store) will have the same consistency model across cloud providers.
  2. Latency: there would be additional latency overhead for every write if we have a remote store dependency, and getting ACLs and other access control mechanisms to work might not be possible across all cloud providers.

The problem with putting the remote store in the agreement is that the remote store is just a storage extension of the respective current primary or isolated primary copies; however, in the no-replica-copies case this becomes exactly the second option, "Remote Store".

The plan going forward is to allow cloud providers to leverage durability without constraints on their consistency models, while also providing extensions in future for specific remote stores which can guarantee strong consistency and offer concurrency controls at low latencies.

@itiyama

itiyama commented Jul 26, 2022

  1. I do not have any correctness concerns if you are building a TLA+ model for this protocol. The model you shared has the downside of coupling primary and replicas and the upside of reduced latency for writes. None of the models are simple anymore :)
  2. If you are already doing the hard work of getting the lease right for the single-primary case, why not do it for the primary+replica case? That will also allow you to decouple primary and replica completely and rely entirely on the store. What concerns do you see? The only trade-off is that during a primary switch, latency for writes will increase. Note that latency will not increase for every write. The trade-off here is latency during primary switch vs primary/replica decoupling; it is no longer about algorithm simplicity since you are building leasing anyway.
  3. Relying on the store for primary term agreement will increase latency. You can still achieve correctness via the write-check protocol on a read-after-write consistent store, which is anyway a constraint. The trade-off is latency per write vs complete decoupling of primary/replica.

@ashking94
Member

We need to handle the single-shard-copy case, where an isolated shard fails itself so that auto restore from the remote store can happen.

@ashking94
Member

Closing this and opening #5672 for tracking failing single shards when they are isolated.

@shwetathareja
Member

@ashking94 to your original concern here

We can tweak the changes done in #3621 for the stale primary to stop accepting write requests. Auto restore can kick in then. This will make sure that, when the restore happens, the stale primary has failed totally.

How are we going to stop writes on a stale primary during the no-master block when it is partitioned from the master? Can you please share the PR for it?

@ashking94
Member

@shwetathareja This has been planned for 2.7 release. I will be working on this shortly. #5672 is the issue for the same that you can use to track.

@shwetathareja
Member

shwetathareja commented Feb 28, 2023

Thanks @ashking94. I am expecting #5672 to address generically the case where the last copy of a shard was unassigned and the index went red, as there could be cases where the primary & replica got partitioned away from the master together, or the replica was unassigned.

@ashking94
Member

Fail requests on isolated shards - #6737
