
Expose a partition worker CreateSnapshot RPC #1998

Open
Wants to merge 1 commit into main from pcholakov/stack/1

Conversation

@pcholakov (Contributor) commented Sep 27, 2024

@pcholakov pcholakov changed the title Expose a partition worker CreateSnapshot RPC [Snapshots] Expose a partition worker CreateSnapshot RPC Sep 27, 2024

github-actions bot commented Sep 27, 2024

Test Results

15 files  ±0  15 suites  ±0   9m 15s ⏱️ -29s
 6 tests ±0   6 ✅ ±0  0 💤 ±0  0 ❌ ±0 
18 runs  ±0  18 ✅ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit 8df89b3. ± Comparison against base commit 13b6fda.

♻️ This comment has been updated with latest results.

@pcholakov pcholakov changed the title [Snapshots] Expose a partition worker CreateSnapshot RPC Expose a partition worker CreateSnapshot RPC Sep 27, 2024
@pcholakov (Contributor · Author) commented:

Testing notes

With the new admin RPC exposed, we can now request a snapshot of the worker's partition store on demand:

> restatectl snapshots
Partition snapshots

Usage: restatectl snapshots [OPTIONS] <COMMAND>

Commands:
  create-snapshot  Create [aliases: create]
  help             Print this message or the help of the given subcommand(s)

Options:
  -v, --verbose...                               Increase logging verbosity
  -q, --quiet...                                 Decrease logging verbosity
      --table-style <TABLE_STYLE>                Which table output style to use [default: compact] [possible values: compact, borders]
      --time-format <TIME_FORMAT>                [default: human] [possible values: human, iso8601, rfc2822]
  -y, --yes                                      Auto answer "yes" to confirmation prompts
      --connect-timeout <CONNECT_TIMEOUT>        Connection timeout for network calls, in milliseconds [default: 5000]
      --request-timeout <REQUEST_TIMEOUT>        Overall request timeout for network calls, in milliseconds [default: 13000]
      --cluster-controller <CLUSTER_CONTROLLER>  Cluster Controller host:port (e.g. http://localhost:5122/) [default: http://localhost:5122/]
  -h, --help                                     Print help (see more with '--help')
> restatectl snapshots create -p 1
Snapshot created: snap_12PclG04SN8eVSKYXCFgXx7

The server writes the snapshot to db-snapshots, relative to the base directory:

2024-09-26T07:31:49.261080Z INFO restate_admin::cluster_controller::service
  Create snapshot command received
    partition_id: PartitionId(1)
on rs:worker-0
2024-09-26T07:31:49.261133Z INFO restate_admin::cluster_controller::service
  Asking node to snapshot partition
    node_id: GenerationalNodeId(PlainNodeId(0), 3)
    partition_id: PartitionId(1)
on rs:worker-0
2024-09-26T07:31:49.261330Z INFO restate_worker::partition_processor_manager
  Received 'CreateSnapshotRequest { partition_id: PartitionId(1) }' from N0:3
on rs:worker-9
  in restate_core::network::connection_manager::network-reactor
    peer_node_id: N0:3
    protocol_version: 1
    task_id: 32
2024-09-26T07:31:49.264763Z INFO restate_worker::partition::snapshot_producer
  Partition snapshot written
    lsn: 3
    metadata: "/Users/pavel/restate/test/n1/db-snapshots/1/snap_12PclG04SN8eVSKYXCFgXx7/metadata.json"
on rt:pp-1

The snapshot metadata contains output similar to the following:

{
  "version": "V1",
  "cluster_name": "snap-test",
  "partition_id": 1,
  "node_name": "n1",
  "created_at": "2024-09-26T07:31:49.264522000Z",
  "snapshot_id": "snap_12PclG04SN8eVSKYXCFgXx7",
  "key_range": {
    "start": 9223372036854775808,
    "end": 18446744073709551615
  },
  "min_applied_lsn": 3,
  "db_comparator_name": "leveldb.BytewiseComparator",
  "files": [
    {
      "column_family_name": "",
      "name": "/000030.sst",
      "directory": "/Users/pavel/restate/test/n1/db-snapshots/1/snap_12PclG04SN8eVSKYXCFgXx7",
      "size": 1267,
      "level": 0,
      "start_key": "64650000000000000001010453454c46",
      "end_key": "667300000000000000010000000000000002",
      "smallest_seqno": 11,
      "largest_seqno": 12,
      "num_entries": 0,
      "num_deletions": 0
    },
    {
      "column_family_name": "",
      "name": "/000029.sst",
      "directory": "/Users/pavel/restate/test/n1/db-snapshots/1/snap_12PclG04SN8eVSKYXCFgXx7",
      "size": 1142,
      "level": 6,
      "start_key": "64650000000000000001010453454c46",
      "end_key": "667300000000000000010000000000000002",
      "smallest_seqno": 0,
      "largest_seqno": 0,
      "num_entries": 0,
      "num_deletions": 0
    }
  ]
}

This can later be used to restore the column family to the same state.
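For reference, a minimal sketch of how this metadata file could be parsed; the structs below simply mirror the JSON above and are not the actual types from the codebase:

// Minimal sketch of reading the snapshot metadata shown above. The struct and
// field names mirror the JSON but are illustrative, not the actual Restate types.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct KeyRange {
    start: u64,
    end: u64,
}

#[derive(Debug, Deserialize)]
struct SnapshotFile {
    column_family_name: String,
    name: String,
    directory: String,
    size: u64,
    level: u32,
    start_key: String,
    end_key: String,
    smallest_seqno: u64,
    largest_seqno: u64,
    num_entries: u64,
    num_deletions: u64,
}

#[derive(Debug, Deserialize)]
struct SnapshotMetadata {
    version: String,
    cluster_name: String,
    partition_id: u64,
    node_name: String,
    created_at: String,
    snapshot_id: String,
    key_range: KeyRange,
    min_applied_lsn: u64,
    db_comparator_name: String,
    files: Vec<SnapshotFile>,
}

// Read and deserialize a metadata.json file from a snapshot directory.
fn read_snapshot_metadata(path: &std::path::Path) -> anyhow::Result<SnapshotMetadata> {
    let raw = std::fs::read_to_string(path)?;
    let metadata: SnapshotMetadata = serde_json::from_str(&raw)?;
    Ok(metadata)
}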

type MessageType = CreateSnapshotRequest;

async fn on_message(&self, msg: Incoming<Self::MessageType>) {
info!("Received '{:?}' from {}", msg.body(), msg.peer());
Contributor:

Intentional? If so, does this need to be at INFO level?

Contributor Author:

It was, but I'm happy to turn it down or eliminate it altogether. It seemed like something that happens infrequently enough that it wouldn't hurt to log, but perhaps debug is much more appropriate here.

/// Default timeout for internal cluster RPC calls.
#[serde_as(as = "serde_with::DisplayFromStr")]
#[cfg_attr(feature = "schemars", schemars(with = "String"))]
pub rpc_call_timeout: humantime::Duration,
Contributor:

Unsure if this can be a single universal value. Expectations might vary a lot depending on the RPC itself. Perhaps we can define the timeout at the operation level instead?

Contributor Author:

I read that as: it shouldn't even be configurable then? For the create-snapshot internal operation, a sane value of a couple of seconds should suffice. I think it might make sense to have a general intra-cluster RPC timeout, but if we don't use it now, there's no need to pollute the config keys with it yet.

Contributor Author:

I wonder if we could integrate a default timeout into the define_rpc! macro, which could optionally be overridden per operation in config. The right config section then seems more like [worker], since it is the "server" handling the RPC. Thoughts on this approach?
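For illustration only, a rough sketch of what a per-operation timeout could look like on the worker options (hypothetical struct and field names, loosely modelled on the rpc_call_timeout field quoted above):

// Hypothetical per-operation timeout, sketched after the existing
// rpc_call_timeout field; these are illustrative names, not actual
// Restate configuration keys.
use serde::{Deserialize, Serialize};
use serde_with::serde_as;

#[serde_as]
#[derive(Debug, Serialize, Deserialize)]
pub struct WorkerOptions {
    /// Timeout for handling a create-snapshot request. A built-in default keeps
    /// this out of the config surface until somebody actually needs to tune it.
    #[serde_as(as = "serde_with::DisplayFromStr")]
    pub create_snapshot_timeout: humantime::Duration,
}

impl Default for WorkerOptions {
    fn default() -> Self {
        Self {
            // A couple of seconds should be plenty for producing a local snapshot.
            create_snapshot_timeout: std::time::Duration::from_secs(5).into(),
        }
    }
}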

@tillrohrmann (Contributor) left a comment:

Thanks for creating this PR @pcholakov. The changes look good to me. I think if we remove the general-purpose RPC timeout from the scope of this PR, it is good to get merged.

    node_id: GenerationalNodeId,
    partition_id: PartitionId,
) -> anyhow::Result<SnapshotId> {
    let snapshot_timeout = self.networking_options.rpc_call_timeout.as_ref();
Contributor:

Wondering whether the generic RPC call timeout is a good fit for the snapshot timeout. If the snapshot is large, then uploading it to S3 will probably take a bit of time.

Contributor Author:

Initially, I had the idea that the scope of the RPC is purely to produce the snapshot and write it to the filesystem, with a completely separate feedback mechanism for communicating the "offloaded" snapshot LSN back to (possibly) a metadata store entry. But I haven't given this a lot of thought, and we might well prefer for the upload to happen as part of CreateSnapshot. In any event, I have already made the change to eliminate the generic timeout - will push it shortly.
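For illustration, a minimal sketch of bounding the snapshot request with an operation-specific timeout; request_create_snapshot here is a hypothetical stand-in for the actual networking call, not the real API:

// Illustrative sketch of bounding the snapshot RPC with an operation-specific
// timeout rather than a generic rpc_call_timeout. request_create_snapshot is
// a hypothetical placeholder for the real networking call.
use std::time::Duration;

use anyhow::Context;

// Placeholder for the actual RPC to the partition processor manager.
async fn request_create_snapshot(partition_id: u16) -> anyhow::Result<String> {
    let _ = partition_id;
    Ok("snap_example".to_owned())
}

async fn create_snapshot_with_timeout(
    partition_id: u16,
    timeout: Duration,
) -> anyhow::Result<String> {
    // Fail with a clear error if producing (or later uploading) the snapshot
    // exceeds the budget, instead of hanging the admin command indefinitely.
    tokio::time::timeout(timeout, request_create_snapshot(partition_id))
        .await
        .context("timed out waiting for the partition worker to create a snapshot")?
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let snapshot_id = create_snapshot_with_timeout(1, Duration::from_secs(5)).await?;
    println!("Snapshot created: {snapshot_id}");
    Ok(())
}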

Three further review comments on crates/worker/src/partition_processor_manager.rs were marked outdated and resolved.
@pcholakov force-pushed the pcholakov/stack/1 branch 2 times, most recently from d621a61 to 76ddc3b on October 4, 2024 08:02
Comment on lines +68 to +69
PARTITION_CREATE_SNAPSHOT_REQUEST = 42;
PARTITION_CREATE_SNAPSHOT_RESPONSE = 43;
Contributor Author:

@AhmedSoliman I see you left some gaps elsewhere - would you like me to start these in a new range, e.g. 50+? I think we'll have more RPCs for managing worker nodes that could be grouped together.

stack-info: PR: #1998, branch: pcholakov/stack/1