From 8c2abb09c2986687bd2a580b9680be0fa1375fef Mon Sep 17 00:00:00 2001 From: Pavel Tcholakov Date: Fri, 6 Sep 2024 14:35:21 +0200 Subject: [PATCH] Add sections on backing up and upgrading Restate --- docs/operate/data-backup.mdx | 30 ++++ docs/operate/upgrading.mdx | 28 ++++ docs/references/errors.md | 213 --------------------------- docs/references/sql-introspection.md | 132 ----------------- 4 files changed, 58 insertions(+), 345 deletions(-) create mode 100644 docs/operate/data-backup.mdx create mode 100644 docs/operate/upgrading.mdx diff --git a/docs/operate/data-backup.mdx b/docs/operate/data-backup.mdx new file mode 100644 index 00000000..722028cf --- /dev/null +++ b/docs/operate/data-backup.mdx @@ -0,0 +1,30 @@ +--- +sidebar_position: 8 +description: "Strategies for backing up and restoring the Restate data store" +--- + +import Admonition from '@theme/Admonition'; + +# Data backup + + + Future versions of Restate will support distributed deployments spanning multiple machines, enhancing the availability you can achieve with your Restate cluster. This document only covers single-node Restate deployments. + + +The Restate server persists both metadata (such as the details of deployed services and in-flight invocations) and data (e.g. virtual object and workflow state keys) in its datastore, which is located in its base directory (by default, the `restate-data` path relative to the startup working directory). Restate is configured to perform write-ahead logging with fsync to the log enabled to ensure that effects are fully persisted before being acknowledged to participating services. + +Backing up the full contents of the Restate base directory will ensure that you can recover this state in the event of a server loss. We recommend placing the data directory on fast block storage that supports atomic snapshots, such as [Amazon EBS volume snapshots](https://docs.aws.amazon.com/ebs/latest/userguide/ebs-snapshots.html). 
Alternatively, we recommend stopping the `restate-server` process and archiving the base directory contents before restarting it. This will ensure that the backup contains an atomic view of the persisted state. + +In addition to state, you should also back up the Restate configuration used. + +## Restoring backups + + + Restate cannot guarantee that it is the only instance of the given node. You must take care to only run one instance of any given Restate node when restoring copies of the data store from backup, as running multiple instances might lead to "split-brain" scenarios where different servers process invocations for the same set of services, causing state to diverge. + + +Restoring from backup requires: + +- a Restate server release compatible with the version which produced the data store snapshot being restored; see the section on [version upgrade and rollback](upgrading) +- compatible [Restate server configuration](/operate/configuration/server) - in particular, ensure that the `cluster-name` and `node-name` attributes match +- exclusive access to a data directory restored from the most recent atomic snapshot of the previous Restate installation diff --git a/docs/operate/upgrading.mdx b/docs/operate/upgrading.mdx new file mode 100644 index 00000000..86bbe7db --- /dev/null +++ b/docs/operate/upgrading.mdx @@ -0,0 +1,28 @@ +--- +sidebar_position: 7 +description: "Restate installation software version upgrades, compatibility policy, rollback strategy" +--- + +# Version upgrades + +Restate follows [Semantic Versioning](https://semver.org/). The server persists compatibility markers which enable it to detect incompatible data versions. However, you should be careful to follow supported version migration paths and perform [data backups](data-backup) when performing software upgrades. + +## What is the Restate compatibility promise? 
+ +Migrating to the latest patch level should always be possible and is recommended to benefit from the latest bugfixes and enhancements available. + +Incremental minor version upgrades will retain functional compatibility with the immediately prior version. That is, for any minor version update, you will be able to upgrade from `x.y` to `x.(y+1)` while retaining all persisted data and metadata. You must not skip minor version upgrades as this might cause you to miss one-time datastore migrations required for preserving forward compatibility. + +We recognize that sometimes unexpected compatibility scenarios might occur. For this reason, downgrading a Restate installation to the latest patch level of the previous minor version is also supported. For example, you can safely roll back the Restate server version from `x.(y).0` to `x.(y-1).z` if you encounter unexpected compatibility issues elsewhere in your services. + +Consult the release notes for specific details of any new version when planning upgrades. + +## Service compatibility + +Registered Restate services must use an SDK which is compatible with the service protocol version(s) of the running Restate server. Note that Restate SDK artifacts are versioned independently of the server. You can find the latest SDK compatibility matrix in the SDKs' respective repositories: + +* [Restate Java SDK](https://github.com/restatedev/sdk-java#versions) +* [Restate TypeScript SDK](https://github.com/restatedev/sdk-typescript#versions) +* [Restate Go SDK](https://github.com/restatedev/sdk-go#versions) +* [Restate Python SDK](https://github.com/restatedev/sdk-python#versions) +* [Restate Rust SDK](https://github.com/restatedev/sdk-rust#versions) diff --git a/docs/references/errors.md b/docs/references/errors.md index 6e4dd7d5..a6e4feed 100644 --- a/docs/references/errors.md +++ b/docs/references/errors.md @@ -7,216 +7,3 @@ slug: errors This page contains the list of error codes emitted by Restate components. -

META0003

- -Cannot reach the service endpoint to execute discovery. Make sure: - -* The provided `URI`/`ARN` is correct -* The deployment is up and running -* Restate can reach the deployment through the configured `URI`/`ARN` -* If additional authentication is required, make sure it's configured through `additional_headers` - -

META0004

- -Cannot register the provided deployment, because it conflicts with the uri of an already registered deployment. - -In Restate deployments have a unique uri/arn and are immutable, thus it's not possible to discover the same deployment twice. -Make sure, when updating a deployment, to assign it a new uri/arn. - -You can force the override using the `"force": true` field in the discover request, but beware that this can lead in-flight invocations to an unrecoverable error state. - -See the [versioning documentation](https://docs.restate.dev/operate/versioning) for more information. - -

META0005

- -Cannot propagate deployment/service metadata to Restate services. If you see this error when starting Restate, this might indicate a corrupted Meta storage. - -We recommend wiping the Meta storage and recreating it by registering deployments in the same order they were registered before. - -

META0006

- -Cannot register the newly discovered service revision in the provided deployment, because it conflicts with an already existing service revision. - -When implementing a new service revision, make sure that: - -* The service type is the same as the previous revision. -* The new revision contains at least all the handlers of the previous revision. - -See the [versioning documentation](https://docs.restate.dev/operate/versioning) for more information. - -

META0009

- -The provided subscription is invalid. Subscriptions should have: - -* A `source` field in the format of `kafka:///`. When registering, the Kafka cluster should be configured in the Restate configuration. -* A `sink` field in the format of `service:///`. When registering, service and handler should be available already in the registry, meaning they have been previously registered. -* Additional constraints may apply depending on the sink service type. - -Please look at the Kafka documentation (for [TypeScript](https://docs.restate.dev/develop/ts/kafka) and [Java](https://docs.restate.dev/develop/java/kafka)) for more details on subscriptions and event handlers. - -

META0010

- -Trying to open meta storage directory, configured via `meta.storage_path`, which contains incompatible data. This indicates that your data was written with a different Restate version than you are running right now. - -Suggestions: - -* Up/Downgrade your Restate server to the requested version. -* Migrate your data to the requested version by running the migration scripts. -* Wipe your meta storage directory to start afresh via `rm -rf //local-metadata-store`. -* Configure a different meta storage directory via `meta.storage_path`. - -

META0011

- -Non-empty meta storage directory, configured via `meta.storage_path`, is missing the version file. This indicates data corruption or that the data has been written with an incompatible Restate version < 0.8. - -Suggestions: - -* Wipe your meta storage directory to start afresh via `rm -rf //local-metadata-store`. -* Configure a different meta storage directory via `meta.storage_path`. -* Downgrade your Restate server to {'<='} 0.7. - -

META0012

- -Trying to register a service endpoint whose supported service protocol versions is incompatible with the server. This indicates that you have to upgrade your server to make it work together with the deployed SDK. - -Suggestions: - -* Check the compatibility matrix between SDK and server versions - * Try upgrading to a server version which is compatible with your SDK - * Try using an SDK version which is compatible with your server - -

META0013

- -Received a bad service discovery response from the specified service endpoint. This indicates that you are trying to register a service endpoint with an incompatible server. - -Suggestions: - -* Check the compatibility matrix between SDK and server versions - * Either deploy a server version which is compatible with your SDK - * Or use an SDK version which is compatible with your server - -

META0014

- -Service discovery response failed, and the server may have responded in HTTP1.1. -This can happen when discovering locally running dev servers from Faas platforms -eg `wrangler dev`. FaaS platforms in generally will support HTTP2, however, so -this is only a local development concern. - -You can try to discover the endpoint with `--use-http1.1` when working -with these local dev servers. This should not be needed in production. - -

META0015

- -The service discovery response suggested that the SDK is serving in -bidirectional protocol mode, but discovery is going over a protocol that does -not support it (currently only Lambda). - -Lambda endpoints do not support the bidirectional protocol mode and should be -configured to announce themselves as being in request-response mode upon -discovery. - -

RT0001

- -The invocation response stream was aborted due to the timeout configured in `worker.invoker.abort_timeout`. -This timeout is fired when Restate has an open invocation, and it's waiting only for response messages, but no message is seen for the configured time. - -Suggestions: - -* Check for bugs in your code. Most likely no message was sent to Restate because your code is blocked and/or reached a deadlock. -* If your code is supposed to not send any message to Restate for longer than the configured timeout, because for example is doing a blocking operation that takes a long time, change the configuration accordingly. - -

RT0002

- -Cannot start Restate because the configuration cannot be parsed. Check the configuration file and the environment variables provided. - -For a complete list of configuration options, and a sample configuration, check https://docs.restate.dev/operate/configuration - -

RT0003

- -The invocation failed because Restate received a message from a service larger than the `worker.invoker.message_size_limit`. - -Suggestions: - -* Check in your code whether there is a case where a very large message can be generated, such as a state entry being too large, a request payload being too large, etc. -* Increase the limit by tuning the `worker.invoker.message_size_limit` config entry, eventually tuning the memory of your operating system/machine where Restate is running. - -

RT0004

- -Failed starting process because it could not bind to configured address. -This happens usually if another process has already bound to this address. - -Suggestions: - -* Select an address that is free. -* Stop the process that has bound to the specified address. -* Make sure you have the permissions to bind to the configured port. Some operating systems require admin/root privileges to bind to ports lower than 1024. - -

RT0005

- -Failed opening RocksDB, because the db file is currently locked. -This happens usually if another process still holds the lock. - -Suggestions: - -* Check no other Restate process is running and using the same db file. -* Configure a different RocksDB storage directory via `worker.storage_rocksdb.path`. - -

RT0006

- -A generic error occurred while invoking the service. -We suggest checking the service/deployment logs as well to get any hint on the error cause. - -

RT0007

- -A retry-able error was received from the service while processing the invocation. Suggestions: - -* Check the component/deployment logs to get more info about the error cause, like the stacktrace. -* Look at the error handling docs for more info about error handling in components (for [TypeScript](https://docs.restate.dev/develop/ts/error-handling) or [Java](https://docs.restate.dev/develop/java/error-handling)). - -

RT0009

- -Trying to open worker storage directory, configured via `worker.storage_rocksdb.path`, which contains no storage format version information. This indicates data corruption or that the data has been written with an incompatible Restate version < 0.8. - -Suggestions: - -* Wipe your meta storage directory to start afresh via `rm -rf //db`. -* Configure a different worker storage directory via `worker.storage_rocksdb.path`. -* Downgrade your Restate server to < 0.8. - -

RT0010

- -Network error when interacting with the service endpoint. This can be caused by a variety of reasons including: - -* The service is (temporarily) down -* The service is (temporarily) not reachable over the network -* Your network security setup blocks Restate from reaching the service -* A config error where the registered service endpoint and the actually deployed service endpoint differ - -

RT0011

- -No deployment found for the given service. -This might indicate that the service and/or the associated deployment was removed from the schema registry before starting to process the invocation. -Check whether the schema registry contains the related service and deployment. - -

RT0012

- -Protocol violation error. This can be caused by an incompatible runtime and SDK version. If the error persists, please file a [bug report](https://github.com/restatedev/restate/issues). - -

RT0013

- -The service endpoint does not support any of the supported service protocol versions of the server. Therefore, the server cannot talk to this endpoint. Please make sure that the service endpoint's SDK and the Restate server are compatible. - -Suggestions: - -* Register a service endpoint which uses an SDK which is compatible with the used server -* Upgrade the server to a version which is compatible with the used SDK - -

RT0014

- -The server cannot resume an in-flight invocation which has been started with a now incompatible service protocol version. Restate does not support upgrading service protocols yet. - -Suggestions: - -* Downgrade the server to a version which is compatible with the used service protocol version -* Kill the affected invocation via the CLI. - diff --git a/docs/references/sql-introspection.md b/docs/references/sql-introspection.md index 3f4de438..01b89ce8 100644 --- a/docs/references/sql-introspection.md +++ b/docs/references/sql-introspection.md @@ -2,135 +2,3 @@ sidebar_position: 3 description: "API reference for inspecting the invocation status and service state." --- -# SQL Introspection API - -This page contains the reference of the introspection tables. -To learn how to access the instrospection interface, check out the [instrospection documentation](/operate/introspection). - -## Table: `state` - -| Column name | Type | Description | -|-------------|------|-------------| -| `partition_key` | `UInt64` | Internal column that is used for partitioning the services invocations. Can be ignored. | -| `service_name` | `Utf8` | The name of the invoked service. | -| `service_key` | `Utf8` | The key of the Virtual Object. | -| `key` | `Utf8` | The `utf8` state key. | -| `value_utf8` | `Utf8` | Only contains meaningful values when a service stores state as `utf8`. This is the case for services that serialize state using JSON (default for Typescript SDK, Java/Kotlin SDK if using JsonSerdes). | -| `value` | `Binary` | A binary, uninterpreted representation of the value. You can use the more specific column `value_utf8` if the value is a string. | - -## Table: `sys_journal` - -| Column name | Type | Description | -|-------------|------|-------------| -| `partition_key` | `UInt64` | Internal column that is used for partitioning the services invocations. Can be ignored. | -| `id` | `Utf8` | [Invocation ID](/operate/invocation#invocation-identifier). 
| -| `index` | `UInt32` | The index of this journal entry. | -| `entry_type` | `Utf8` | The entry type. You can check all the available entry types in [`entries.rs`](https://github.com/restatedev/restate/blob/main/crates/types/src/journal/entries.rs). | -| `name` | `Utf8` | The name of the entry supplied by the user, if any. | -| `completed` | `Boolean` | Indicates whether this journal entry has been completed; this is only valid for some entry types. | -| `invoked_id` | `Utf8` | If this entry represents an outbound invocation, indicates the ID of that invocation. | -| `invoked_target` | `Utf8` | If this entry represents an outbound invocation, indicates the invocation Target. Format for plain services: `ServiceName/HandlerName`, e.g. `Greeter/greet`. Format for virtual objects/workflows: `VirtualObjectName/Key/HandlerName`, e.g. `Greeter/Francesco/greet`. | -| `sleep_wakeup_at` | `Date64` | If this entry represents a sleep, indicates wakeup time. | -| `promise_name` | `Utf8` | If this entry is a promise related entry (GetPromise, PeekPromise, CompletePromise), indicates the promise name. | -| `raw` | `Binary` | Raw binary representation of the entry. Check the [service protocol](https://github.com/restatedev/service-protocol) for more details to decode it. | - -## Table: `sys_keyed_service_status` - -| Column name | Type | Description | -|-------------|------|-------------| -| `partition_key` | `UInt64` | Internal column that is used for partitioning the services invocations. Can be ignored. | -| `service_name` | `Utf8` | The name of the invoked virtual object/workflow. | -| `service_key` | `Utf8` | The key of the virtual object/workflow. | -| `invocation_id` | `Utf8` | [Invocation ID](/operate/invocation#invocation-identifier). | - -## Table: `sys_inbox` - -| Column name | Type | Description | -|-------------|------|-------------| -| `partition_key` | `UInt64` | Internal column that is used for partitioning the services invocations. Can be ignored. 
| -| `service_name` | `Utf8` | The name of the invoked virtual object/workflow. | -| `service_key` | `Utf8` | The key of the virtual object/workflow. | -| `id` | `Utf8` | [Invocation ID](/operate/invocation#invocation-identifier). | -| `sequence_number` | `UInt64` | Sequence number in the inbox. | -| `created_at` | `Date64` | Timestamp indicating the start of this invocation. | - -## Table: `sys_idempotency` - -| Column name | Type | Description | -|-------------|------|-------------| -| `partition_key` | `UInt64` | Internal column that is used for partitioning the services invocations. Can be ignored. | -| `service_name` | `Utf8` | The name of the invoked service. | -| `service_key` | `Utf8` | The key of the virtual object or the workflow ID. Null for regular services. | -| `service_handler` | `Utf8` | The invoked handler. | -| `idempotency_key` | `Utf8` | The user provided idempotency key. | -| `invocation_id` | `Utf8` | [Invocation ID](/operate/invocation#invocation-identifier). | - -## Table: `sys_promise` - -| Column name | Type | Description | -|-------------|------|-------------| -| `partition_key` | `UInt64` | Internal column that is used for partitioning the services invocations. Can be ignored. | -| `service_name` | `Utf8` | The name of the workflow service. | -| `service_key` | `Utf8` | The workflow ID. | -| `key` | `Utf8` | The promise key. | -| `completed` | `Boolean` | True if the promise was completed. | -| `completion_success_value` | `Binary` | The completion success, if any. | -| `completion_success_value_utf8` | `Utf8` | The completion success as UTF-8 string, if any. | -| `completion_failure` | `Utf8` | The completion failure, if any. | - -## Table: `sys_service` - -| Column name | Type | Description | -|-------------|------|-------------| -| `name` | `Utf8` | The name of the registered user service. | -| `revision` | `UInt64` | The latest deployed revision. 
| -| `public` | `Boolean` | Whether the service is accessible through the ingress endpoint or not. | -| `ty` | `Utf8` | The service type. Either `service` or `virtual_object` or `workflow`. | -| `deployment_id` | `Utf8` | The ID of the latest deployment | - -## Table: `sys_deployment` - -| Column name | Type | Description | -|-------------|------|-------------| -| `id` | `Utf8` | The ID of the service deployment. | -| `ty` | `Utf8` | The type of the endpoint. Either `http` or `lambda`. | -| `endpoint` | `Utf8` | The address of the endpoint. Either HTTP URL or Lambda ARN. | -| `created_at` | `Date64` | Timestamp indicating the deployment registration time. | - -## Table: `sys_invocation` - -| Column name | Type | Description | -|-------------|------|-------------| -| `id` | `Utf8` | [Invocation ID](/operate/invocation#invocation-identifier). | -| `target` | `Utf8` | Invocation Target. Format for plain services: `ServiceName/HandlerName`, e.g. `Greeter/greet`. Format for virtual objects/workflows: `VirtualObjectName/Key/HandlerName`, e.g. `Greeter/Francesco/greet`. | -| `target_service_name` | `Utf8` | The name of the invoked service. | -| `target_service_key` | `Utf8` | The key of the virtual object or the workflow ID. Null for regular services. | -| `target_handler_name` | `Utf8` | The invoked handler. | -| `target_service_ty` | `Utf8` | The service type. Either `service` or `virtual_object` or `workflow`. | -| `invoked_by` | `Utf8` | Either `ingress` if the service was invoked externally or `service` if the service was invoked by another Restate service. | -| `invoked_by_service_name` | `Utf8` | The name of the invoking service. Or `null` if invoked externally. | -| `invoked_by_id` | `Utf8` | The caller [Invocation ID](/operate/invocation#invocation-identifier) if the service was invoked by another Restate service. Or `null` if invoked externally. 
| -| `invoked_by_target` | `Utf8` | The caller invocation target if the service was invoked by another Restate service. Or `null` if invoked externally. | -| `pinned_deployment_id` | `Utf8` | The ID of the service deployment that started processing this invocation, and will continue to do so (e.g. for retries). This gets set after the first journal entry has been stored for this invocation. | -| `trace_id` | `Utf8` | The ID of the trace that is assigned to this invocation. Only relevant when tracing is enabled. | -| `journal_size` | `UInt32` | The number of journal entries durably logged for this invocation. | -| `created_at` | `Date64` | Timestamp indicating the start of this invocation. | -| `modified_at` | `Date64` | Timestamp indicating the last invocation status transition. For example, last time the status changed from `invoked` to `suspended`. | -| `inboxed_at` | `Date64` | Timestamp indicating when the invocation was inboxed, if ever. | -| `scheduled_at` | `Date64` | Timestamp indicating when the invocation was scheduled, if ever. | -| `running_at` | `Date64` | Timestamp indicating when the invocation first transitioned to running, if ever. | -| `completed_at` | `Date64` | Timestamp indicating when the invocation was completed, if ever. | -| `retry_count` | `UInt64` | The number of invocation attempts since the current leader started executing it. Increments on start, so a value greater than 1 means a failure occurred. Note: the value is not a global attempt counter across invocation suspensions and leadership changes. | -| `last_start_at` | `Date64` | Timestamp indicating the start of the most recent attempt of this invocation. | -| `next_retry_at` | `Date64` | Timestamp indicating the start of the next attempt of this invocation. | -| `last_attempt_deployment_id` | `Utf8` | The ID of the service deployment that executed the most recent attempt of this invocation; this is set before a journal entry is stored, but can change later. 
| -| `last_attempt_server` | `Utf8` | Server/SDK version, e.g. `restate-sdk-java/1.0.1` | -| `last_failure` | `Utf8` | An error message describing the most recent failed attempt of this invocation, if any. | -| `last_failure_error_code` | `Utf8` | The error code of the most recent failed attempt of this invocation, if any. | -| `last_failure_related_entry_index` | `UInt64` | The index of the journal entry that caused the failure, if any. It may be out-of-bound of the currently stored entries in `sys_journal`. | -| `last_failure_related_entry_name` | `Utf8` | The name of the journal entry that caused the failure, if any. | -| `last_failure_related_entry_type` | `Utf8` | The type of the journal entry that caused the failure, if any. You can check all the available entry types in [`entries.rs`](https://github.com/restatedev/restate/blob/main/crates/types/src/journal/entries.rs). | -| `status` | `Utf8` | Either `pending` or `scheduled` or `ready` or `running` or `backing-off` or `suspended` or `completed`. | -| `completion_result` | `Utf8` | If `status = 'completed'`, this contains either `success` or `failure` | -| `completion_failure` | `Utf8` | If `status = 'completed' AND completion_result = 'failure'`, this contains the error cause | -