Add Restate architecture documentation #100

Merged
merged 3 commits into from
Aug 7, 2023

Conversation

tillrohrmann
Contributor

This fixes #92.

@netlify

netlify bot commented Aug 2, 2023

Deploy Preview for docsrestatedev ready!

Name Link
🔨 Latest commit 9ffae9f
🔍 Latest deploy log https://app.netlify.com/sites/docsrestatedev/deploys/64d0ecf3d8fb100008298dfd
😎 Deploy Preview https://deploy-preview-100--docsrestatedev.netlify.app

@igalshilman igalshilman left a comment

Looks good, thanks @tillrohrmann !

Contributor

@gvdongen gvdongen left a comment

Thank you Till! I think there are a few topics that we could still discuss here:

  • Journal: I think the journal warrants a dedicated section explaining how it enables suspension and replay, and that it logs not only invocations but also context calls.
  • How invocations work: the request goes to the ingress, state is eagerly attached together with the journal, the runtime knows where the service is running (via the service registry) and sets up the connection... Also mention suspensions.
  • Service registry: the Metas maintain the service registry based on discovery, so services no longer need to do this themselves. Their requests just go via the runtime.
    Just some thoughts...
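
To make the journal point concrete, here is a minimal sketch of how a journal can give replay after a suspension or crash. It assumes a simplified in-memory journal; the names `Journal`, `Context`, `side_effect`, and `run_handler` are hypothetical and are not the Restate SDK API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Journal:
    """Append-only log of completed context calls for one invocation (hypothetical)."""
    entries: List[Any] = field(default_factory=list)


class Context:
    """Replays recorded results first, then executes and records new calls."""

    def __init__(self, journal: Journal) -> None:
        self.journal = journal
        self.position = 0  # index of the next journal entry to replay

    def side_effect(self, fn: Callable[[], Any]) -> Any:
        if self.position < len(self.journal.entries):
            # Replay: return the recorded result without re-running fn.
            result = self.journal.entries[self.position]
        else:
            # First execution: run fn and record its result in the journal.
            result = fn()
            self.journal.entries.append(result)
        self.position += 1
        return result


def run_handler(journal: Journal, calls: List[str]) -> int:
    """Toy handler: two context calls whose results are summed."""
    ctx = Context(journal)
    a = ctx.side_effect(lambda: calls.append("a") or 1)
    b = ctx.side_effect(lambda: calls.append("b") or 2)
    return a + b
```

Running `run_handler` a second time with the same `Journal` returns the same result without executing the two calls again, which is the property that lets an invocation be suspended and later resumed.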

The *Metas* are responsible for managing the service meta information and coordinating the *Workers*.

The *Workers* are responsible for invoking services, storing their journal and service state as well as maintaining processing order.
Contributor

We haven't introduced the term journal yet. It would probably be better to describe the responsibility from a higher-level perspective and then say that we accomplish it via a central journal.

Contributor

I think the journal approach in general deserves a section here. To explain how we do all our magic: resiliency etc.

@tillrohrmann
Contributor Author

Thanks for the feedback @gvdongen. I will add sections for the journal and service invocation process.

@tillrohrmann
Contributor Author

I've pushed another commit including the description of durable execution via journaling, the service registry and the service invocation flow @gvdongen.

Contributor

@gvdongen gvdongen left a comment

Thank you for adding the new sections @tillrohrmann
I think this adds a lot of useful information for the user!
I think the reading flow could be improved by shuffling some sections around... I would propose changing the order of the sections to

  1. Durable execution via journaling (because from the user perspective this is the most important building block that he needs to understand)
  2. Service invocation flow (includes service registry section... not sure if that would improve the reading flow)
  3. Scalability
  4. Consistency & fault tolerance (although in the mental model this belongs together with the journal for durable execution for me...)
  5. State storage (include state queries into the section or maybe skip that for this page... For me this is more like a feature than an architecture component...)
    What do you think?

## Service registry

All servie meta information is maintained by the *Metas* via the service registry.
Contributor

Suggested change
All servie meta information is maintained by the *Metas* via the service registry.
All service meta information is maintained by the *Metas* via the service registry.

@tillrohrmann
Contributor Author

tillrohrmann commented Aug 4, 2023

It seems that you have different expectations for the architecture page than I did @gvdongen. My understanding was that this page should describe the runtime's architecture (basic principles and design ideas) in order to give credibility to what we are doing (for example, that the runtime is built with scalability, consistency and fault tolerance in mind). Maybe you had Restate as a whole in mind (what are the basic concepts you as a user need to understand, how do things work end-to-end from a higher level)?

  1. Durable execution via journaling (because from the user perspective this is the most important building block that he needs to understand)

I am wondering whether durable execution belongs on the architecture page about how the runtime works, or whether it should be closer to the "Services" section. Technically speaking, one could also implement durable execution by taking a memory snapshot, or it could be a pure SDK concept. What matters from the runtime perspective is that the service endpoint can durably store bytes (not 100% correct, because the runtime also needs to understand a few commands like calls or sleeps). Also, my description talks more about the service endpoint than the runtime, which might be an indicator.

  2. Service invocation flow (includes service registry section... not sure if that would improve the reading flow)

Moving the service invocation flow up would mean that the definition of partitions and partition processors would only come later. It might also not be clear why one needs to route the invocation to the right Worker running a specific partition processor at this point.
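
As a rough illustration of the routing concern, the sketch below maps an invocation key to a partition and then to the Worker currently running that partition's processor. The thread does not specify the scheme; hashing the key, the fixed `NUM_PARTITIONS`, and the names `partition_for`, `route`, and `partition_leaders` are all assumptions made for illustration.

```python
import hashlib
from typing import Dict

NUM_PARTITIONS = 4  # assumed fixed partition count (hypothetical)


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministically map a service key to a partition id."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def route(key: str, partition_leaders: Dict[int, str]) -> str:
    """Return the Worker that runs the partition processor for this key."""
    return partition_leaders[partition_for(key)]
```

Because the mapping is deterministic, every invocation for the same key reaches the same partition processor, which is what allows that processor to maintain processing order for the key.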

  3. Scalability
  4. Consistency & fault tolerance (although in the mental model this belongs together with the journal for durable execution for me...)

For me these are two different things. What I want to describe here is how the runtime achieves consistency and fault tolerance (by running replicated state machines using Raft). What is built on top of it (durable execution via journaling) is certainly related, but it is just one way to achieve durable execution. If we could take a memory snapshot of the service endpoint, then storing these bytes would work equally well.

  5. State storage (include state queries into the section or maybe skip that for this page... For me this is more like a feature than an architecture component...)

I would like to keep the state query part because for me it is a major architectural component (exposing internal state via a SQL interface by running a SQL execution engine), and it is technically independent of the actual state storage.

@tillrohrmann
Contributor Author

I've pushed a commit that groups scalability and consistency & fault tolerance under principles and state storage, state query and service registry under components. Not sure whether this makes the reading experience easier.

@gvdongen
Contributor

gvdongen commented Aug 4, 2023

First of all, sorry for being so difficult there... I think all the content here is good so feel free to merge it. I don't want to block this...

Besides that, it seems indeed that we have slightly different views of the scope of what should be on there... I mainly saw this page as a page where we describe how Restate makes sure that it can do what it does, so that it doesn't just seem like magic to users. I think what you wrote until now is definitely content that should be covered there... But I saw this page as slightly broader: more as the architecture of Restate-as-a-larger-product instead of being focused only on the runtime.

The way I saw it, was to have:

  • an overview page which is the docs landing page which basically lists the main features of Restate, how Restate sits in your stack and the key use cases (I am working on that)
  • an architecture page which gives more insight into how those features are accomplished. I know that from an implementation perspective some things sit closer to the SDK. But I think the user will see the split differently, as "my application logic" vs. "Restate". From that perspective it is not important whether a feature is enabled by the SDK or by the runtime. That's why I saw this page as one which gives a bit more info on the key things which make Restate possible (central log, distributed runtime, RocksDB state store, consistency, and all the other topics you discussed)

Anyway, let's not block this on my feedback here. Because this page contains a lot of useful information for the user and we can always iterate to improve the story that we are telling across the docs 👍

@tillrohrmann
Contributor Author

I'll try to give it another pass to improve the overall reading experience by highlighting what matters most to users. If I don't manage to improve it, then we'll iterate on what we have in this PR here.

This commit restructures the architecture section to start
with durable execution and the service invocation flow. The
runtime specific sections are now grouped under "Runtime".
@tillrohrmann
Contributor Author

I've re-arranged the sections into:

  1. Durable execution via journaling
  2. Service invocation flow
  3. Runtime
    a. Scalability
    b. Consistency
    c. Storage
    d. State query
    e. Service registry

@tillrohrmann tillrohrmann merged commit 9ffae9f into restatedev:main Aug 7, 2023
5 checks passed
@tillrohrmann tillrohrmann deleted the issue#92 branch August 7, 2023 13:11
Development

Successfully merging this pull request may close these issues.

Restate architecture
3 participants