[RFC] Modeling writes as an extensible workflow #3237

kartg · 2022-05-06T22:46:18Z

Note 1 - The discussion in this issue focuses on the write paths in OpenSearch, though the assertions herein are probably true for other parts of the codebase.

Note 2 - For clarity, the term “replication” will only be used to describe code paths that result in segment files being created on a shard. The act of sending requests from the primary shard to replicas will be termed “forwarding”.

What's the problem?

Much of the code in the write path for segments is tightly coupled together. For example:

The implementation of index and delete operations uses a class hierarchy that mandates that these operations be replicated.
The notion of “replication” is coupled to “forwarding” (see Note 2 above).
The notion of a “write” is tightly coupled with a translog update.

(I'll talk about why we would want to solve for these in a moment.)

This coupling is driven by the code architecture. In technical terms, I'd say it uses a compile-time, top-down (inheritance-based) chain of responsibility (CoR) design pattern. Put more simply, it's like spaghetti lasagna - lots of layers encasing lots of noodley code.

Such a design pattern poses two problems:

The compile-time nature precludes the ability to configure behavior at run-time
The inheritance-based CoR pattern implicitly defines a fixed set of steps for the code, but misses out on the benefits of a unified orchestrator class or workflow definition - for example, the ability for a step to react to the result of a previous step
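
To illustrate the shape of the problem, here is a deliberately tiny sketch of an inheritance-based chain. The class names are hypothetical and are not the actual OpenSearch hierarchy; the point is only that the step order, the translog update, and the forwarding are all baked into the parent classes at compile time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class names for illustration; not the actual OpenSearch hierarchy.
abstract class WriteAction {
    protected final List<String> trace = new ArrayList<>();

    // Template method: the order of steps is fixed here at compile time.
    final List<String> execute(String request) {
        trace.add("reroute:" + request);
        performOnPrimary(request);
        return trace;
    }

    abstract void performOnPrimary(String request);
}

// Every subclass inherits translog and forwarding behavior whether it wants it or not.
abstract class ReplicatedWriteAction extends WriteAction {
    @Override
    final void performOnPrimary(String request) {
        applyToEngine(request);
        trace.add("translog:" + request); // a write always implies a translog update
        trace.add("forward:" + request);  // and is always sent on to the replicas
    }

    abstract void applyToEngine(String request);
}

class IndexAction extends ReplicatedWriteAction {
    @Override
    void applyToEngine(String request) {
        trace.add("index:" + request);
    }
}
```

In this toy version, `IndexAction` cannot skip forwarding or the translog without changing its parent class, and no layer can inspect what a previous layer did.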

I think we can make this better (though that recipe is beyond saving IMO).

What should we do about it?

We should rearchitect write-path operations as a workflow composed of the following configurable steps:

Reroute (route the incoming request to the correct shard/node)
Ingest (process the request)
Persist (make the results of the request durable)
- This would include separate configuration/extension points for storage and translog
Forward (send the request to another node)

Persist and Forward will be conditional steps that rely on the output of prior steps to determine if they should execute.
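
A minimal sketch of what such a workflow could look like, assuming a single orchestrator that runs an ordered list of steps and a shared context object. Every name here (`WriteContext`, `WriteStep`, `WriteWorkflow`, `standard()`) is hypothetical, and the step bodies are placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// All names below are hypothetical; none of them exist in OpenSearch.

/** State passed from step to step so later steps can react to earlier results. */
class WriteContext {
    final String request;
    final boolean hasReplicas;                       // would come from cluster state in practice
    boolean changedShard = false;                    // set by the Ingest step
    final List<String> executed = new ArrayList<>(); // names of the steps that ran

    WriteContext(String request, boolean hasReplicas) {
        this.request = request;
        this.hasReplicas = hasReplicas;
    }
}

/** One configurable step: a name, a run condition, and a body. */
class WriteStep {
    final String name;
    final Predicate<WriteContext> shouldRun;
    final Consumer<WriteContext> body;

    WriteStep(String name, Predicate<WriteContext> shouldRun, Consumer<WriteContext> body) {
        this.name = name;
        this.shouldRun = shouldRun;
        this.body = body;
    }
}

/** The orchestrator: runs the configured steps in order, skipping those whose condition fails. */
class WriteWorkflow {
    private final List<WriteStep> steps;

    WriteWorkflow(List<WriteStep> steps) {
        this.steps = List.copyOf(steps);
    }

    WriteContext execute(WriteContext ctx) {
        for (WriteStep step : steps) {
            if (step.shouldRun.test(ctx)) {
                step.body.accept(ctx);
                ctx.executed.add(step.name);
            }
        }
        return ctx;
    }

    /** A toy wiring of the four proposed steps. */
    static WriteWorkflow standard() {
        return new WriteWorkflow(List.of(
            new WriteStep("reroute", ctx -> true, ctx -> { /* resolve target shard/node */ }),
            new WriteStep("ingest", ctx -> true, ctx -> ctx.changedShard = !ctx.request.isEmpty()),
            new WriteStep("persist", ctx -> ctx.changedShard, ctx -> { /* storage + translog */ }),
            new WriteStep("forward", ctx -> ctx.changedShard && ctx.hasReplicas,
                ctx -> { /* send to replicas */ })
        ));
    }
}
```

Because the step list is plain data, it can be assembled at run time, and the Persist and Forward conditions read what Ingest wrote into the context.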

Why do this now?

Because extensibility is one of the key themes for OpenSearch (#2095). It is essential that we start tackling this architectural limitation now, since we have multiple ongoing initiatives for OpenSearch extensibility that require more run-time configurability:

Now that replication strategies such as segment replication can be defined per index, write code paths can no longer simply mandate replication. Segment replication also no longer needs “replication” to be coupled with “forwarding”.
With a remote translog, the need for “forwarding” is removed entirely.
The introduction of remote storage will affect the behavior of both replication and recovery.
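
One way to read the first two points is as a planner that picks steps from per-index settings. The sketch below is only my interpretation under stated assumptions: the enum and method names are hypothetical, forwarding is dropped when the translog is remote (as asserted above), and a replica under segment replication is assumed to need only the translog part of a forwarded request because its segments arrive by file copy.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical names throughout; this is one reading of the RFC, not OpenSearch code.
enum ReplicationStrategy { DOCUMENT, SEGMENT }

enum Step { REROUTE, INGEST, PERSIST_STORAGE, PERSIST_TRANSLOG, FORWARD }

final class WritePlanner {
    private WritePlanner() {}

    /** Steps the primary runs. Forwarding is dropped when the translog is remote. */
    static Set<Step> primarySteps(boolean remoteTranslog) {
        EnumSet<Step> steps = EnumSet.of(
            Step.REROUTE, Step.INGEST, Step.PERSIST_STORAGE, Step.PERSIST_TRANSLOG);
        if (!remoteTranslog) {
            steps.add(Step.FORWARD);
        }
        return steps;
    }

    /** Steps a replica runs for a forwarded request. */
    static Set<Step> replicaSteps(ReplicationStrategy strategy) {
        if (strategy == ReplicationStrategy.SEGMENT) {
            // Assumption: segments arrive by file copy, so only the translog is written.
            return EnumSet.of(Step.PERSIST_TRANSLOG);
        }
        return EnumSet.of(Step.INGEST, Step.PERSIST_STORAGE, Step.PERSIST_TRANSLOG);
    }
}
```

With the inheritance-based design, each of these combinations would need its own branch of the class hierarchy; here they are two small functions over settings.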

Open Questions (aka things I'm mulling over)

What situations/architectures (if any) would require the Reroute step to be optional/configurable?
Is there a way to remove the need for an Engine class, so that ingest and translog can be configured independent of one another?
How do this workflow and the decoupling of replication from forwarding affect sync actions?

Given the sheer breadth of functionality in the OpenSearch codebase, there are probably other coupled components that I haven't considered. Please comment below if there are things that would break with this workflow approach, or other areas that may benefit from a similar approach.

Bukhtawar · 2022-05-07T18:44:27Z

A few observations:

> With #1319, the need for “forwarding” is removed entirely.

There are a few cases today that inherently rely on forwarding writes for correctness. For instance, if the primary is partitioned off from the rest of the cluster, it continues to accept writes. Once it forwards a request to the replica, which has meanwhile been promoted to primary, it becomes aware of the problem and responds with a failure. Without forwarding, the write would have been acknowledged and the two copies would have diverged.
Forwarding is a nice property for detecting such anomalies. We need to give it more thought before we get rid of it.
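
To make the scenario concrete, here is a toy sketch of how a forwarded write can expose a stale primary. It assumes each shard copy tracks a monotonically increasing term that is bumped on promotion; the class and method names are hypothetical, and this is not the OpenSearch implementation.

```java
// Hypothetical names; a toy model of term-based stale-primary detection.
class StalePrimaryException extends RuntimeException {
    StalePrimaryException(String message) {
        super(message);
    }
}

class ShardCopy {
    private long term;

    ShardCopy(long term) {
        this.term = term;
    }

    /** The cluster promotes this copy; bumping the term invalidates the old primary. */
    void promoteToPrimary() {
        term++;
    }

    /** Handles a write forwarded by a node that believes it is the primary. */
    void acceptForwardedWrite(long senderTerm, String request) {
        if (senderTerm < term) {
            throw new StalePrimaryException(
                "sender term " + senderTerm + " is older than " + term);
        }
        // A real replica would apply the operation here.
    }
}

class PrimaryShard {
    private final long term;

    PrimaryShard(long term) {
        this.term = term;
    }

    /** Acknowledges the write only if the forward succeeds. */
    boolean write(String request, ShardCopy replica) {
        try {
            replica.acceptForwardedWrite(term, request);
            return true;
        } catch (StalePrimaryException e) {
            return false; // fail the request instead of acknowledging a divergent write
        }
    }
}
```

In this model, a partitioned primary that never forwards would have no way to learn that it has been superseded, which is the concern with removing forwarding outright.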

> Is there a way to remove the need for an Engine class, so that ingest and translog can be configured independent of one another?

With #1319, translogs would be decoupled from the Engine, made optional, and extracted out.
Would that simplify the Engine?

dblock · 2022-05-10T15:37:41Z

At a high level this makes a lot of sense to me. If you see a refactor increment that would make the code better, I would PR that on main. An alternative could be to try to set up the new workflow as proposed without touching the existing implementation; that could be a good start in a feature branch.

kartg · 2022-05-11T04:04:59Z

> There are a few cases today that inherently rely on forwarding writes for correctness. For instance, if the primary is partitioned off from the rest of the cluster, it continues to accept writes. Once it forwards a request to the replica, which has meanwhile been promoted to primary, it becomes aware of the problem and responds with a failure. Without forwarding, the write would have been acknowledged and the two copies would have diverged.

Thanks @Bukhtawar! These are exactly the kind of blind spots I was hoping to get feedback on. I'd like to understand this particular behavior in more depth. Are you aware of any reading material on it? Or could you point me to other resources (code, exceptions, etc.) where I could start learning?

> With #1319, translogs would be decoupled from the Engine, made optional, and extracted out.
> Would that simplify the Engine?

Decoupling translog behavior from Engine would indeed simplify its code. Assuming the next step is to decouple ingest/storage from Engine, I'm left wondering what value the Engine class itself is adding by encapsulating those two extension points.

To use a food analogy again 😉 - what I'm asking is if the two extension points are like a bagel 🥯 and cream cheese (where it doesn't make sense to have them separately) or like fries 🍟 and soda (where it does)

kartg · 2022-05-12T18:21:31Z

@nknize I saw this PR comment from you so I'd love to hear what you think about the ideas here, especially around what value the Engine class provides (as discussed above 👆 )

kartg added the enhancement, discuss, distributed framework, and RFC labels May 6, 2022

Bukhtawar mentioned this issue Jun 27, 2022

[Discuss] Writes on NRT Replica with Remote Translog #3706

Closed

anasalkouz added the Indexing label and removed the distributed framework label Sep 19, 2023

msfroh added the Roadmap:Cost/Performance/Scale label May 31, 2024
