[Snapshots] 1: Introduce partition store snapshotting support #1954
Conversation
Force-pushed from 6656ce0 to e692969
Really nice. I've a few questions and small nits but this looks pretty good overall.
```rust
let cf_name = cf_for_partition(partition_id);
let already_exists = self.rocksdb.inner().cf_handle(&cf_name).is_some();
if already_exists {
    warn!(?partition_id, "already exists, refusing to import snapshot");
```
Suggested change:

```diff
- warn!(?partition_id, "already exists, refusing to import snapshot");
+ warn!(?partition_id, "Already exists, refusing to import snapshot");
```
Can we print `cf_name` in the log message?
The warn message here could perhaps be phrased to indicate that the actual reason is that the column family already exists in the database. It should indicate an inconsistency of some sort: since we are holding the lookup lock, and you checked up there that we don't have the partition_id, it must mean that the partition_id is loaded but from a different column family. No?
Exactly right! Updated the message - this error probably deserves a dedicated error code with a detailed explanation of how this situation could arise. The most likely scenario is that a previous import failed, or we are operating on an outdated data store, and we have a misalignment between the worker node configuration and what is on disk.
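For illustration, a dedicated error variant for this case could look roughly like the sketch below. The names (`SnapshotImportError`, `ColumnFamilyExists`) are hypothetical, not from the PR; the point is that a dedicated variant can carry both identifiers and hint at the likely causes.

```rust
use std::fmt;

// Hypothetical error type for the snapshot import path; not the PR's actual
// code. The variant documents the likely causes of the inconsistency.
#[derive(Debug)]
pub enum SnapshotImportError {
    /// The column family already exists on disk while the partition is not
    /// loaded: likely a previously failed import, or a data store that does
    /// not match the worker node configuration.
    ColumnFamilyExists { partition_id: u64, cf_name: String },
}

impl fmt::Display for SnapshotImportError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SnapshotImportError::ColumnFamilyExists { partition_id, cf_name } => {
                write!(f, "column family '{cf_name}' for partition {partition_id} already exists; refusing to import snapshot")
            }
        }
    }
}
```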
```rust
import_metadata.set_db_comparator_name(snapshot.db_comparator_name.as_str());
import_metadata.set_files(&snapshot.files);

info!(?partition_id, ?snapshot.min_applied_lsn, "initializing partition store from snapshot");
```
Suggested change:

```diff
- info!(?partition_id, ?snapshot.min_applied_lsn, "initializing partition store from snapshot");
+ info!(?partition_id, ?snapshot.min_applied_lsn, "Initializing partition store from snapshot");
```
It'd be nice if you could also include the last applied LSN of this snapshot in the log message.
```rust
pub directory: String,
pub size: usize,
pub level: i32,
#[serde_as(as = "Option<serde_with::hex::Hex>")]
```
Curious, why `Hex` and not `Bytes`?
You'll laugh - I was anchored on hex from staring at the C++ LiveFile struct in the debugger view, projecting that it might be nicer for humans to look at. No real preference - `Option<Bytes>` is just as good for me.
Actually, I do have a strong preference for it not being an array, i.e. just the default `Bytes` representation. Since this metadata is likely to be seen by a human in its raw form, it's nice for it to be pretty-printed, and I think there's a big difference between a 15-20 line array like this:
```json
"start_key": [
  100,
  101,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  4,
  83,
  69,
  76,
  70
],
```
and a shorter hex or base64 representation. Hex made more sense as slightly easier to interpret - if I saw this and was trying to make sense of the key without a Restate-specific tool to render it, my first reaction would be to hex dump it:

```json
"start_key": "ZGUAAAAAAAAAAQEEU0VMRg==",
```
EDIT: I like the compactness of hex but the downside is it's not going to be obvious that it is NOT just a string value:
```json
{
  "column_family_name": "",
  "name": "/000048.sst",
  "directory": "/Users/pavel/tmp/restate-data/Pavels-MacBook-Pro.local/db-snapshots/1/1726555628950",
  "size": 1267,
  "level": 0,
  "start_key": "64650000000000000001010453454c46",
  "end_key": "667300000000000000010000000000000002",
  "smallest_seqno": 19,
  "largest_seqno": 20,
  "num_entries": 0,
  "num_deletions": 0
},
```
So that was my thought process - what would be your first choice? :-)
I made this base64 in the end - but still interested to know what others think!
The positive point for hex is that we can compare the values to the output from rocks' `ldb` CLI tool.
Ok, that sells me on hex!
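To make the comparison concrete, here is a small std-only sketch (not code from the PR) that hex-encodes a key the way the metadata example above renders it. The byte values are taken from the `start_key` hex string in that example; note the ASCII tail `SELF` (`0x53 0x45 0x4c 0x46`), which is easy to spot in a hex dump.

```rust
// Std-only sketch: hex-encode bytes the way the snapshot metadata example
// above renders its keys.
fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

// 16-byte start key from the metadata example; the last four bytes are the
// ASCII string "SELF", which stays recognizable in the hex form.
const START_KEY: [u8; 16] = [
    0x64, 0x65, 0, 0, 0, 0, 0, 0, 0, 0x01, 0x01, 0x04, 0x53, 0x45, 0x4c, 0x46,
];
```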
Force-pushed from 92e0320 to 54651d0

Force-pushed from 5388476 to 3237e69
I am still trying to understand why Build and test (ubuntu-22.04) is failing in CI - I can't reproduce it away from GitHub yet - but this is now ready for re-review.
```rust
/// The minimum LSN guaranteed to be applied in this snapshot. The actual
/// LSN may be >= [min_applied_lsn].
pub min_applied_lsn: Lsn,
```
Perhaps we can get the version from the checkpoint and then create the export; this way we know the exact value rather than just a minimum.
I was hoping to implement exactly that; unfortunately the checkpointing and export are bundled in the RocksDB API - constructing the Checkpoint[^1] C++ object does nothing, it's just a handle. The alternative is to import the snapshot as a temporary column family, possibly in a separate database, which didn't seem worth it. Can you think of another way?
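As a sketch of what the lower bound implies for the consumer (my reading of the discussion, not code from the PR): since only `min_applied_lsn` is known, replay after an import has to resume just above it, and can only treat records at or below it as certainly applied.

```rust
// Illustrative sketch of the "minimum applied LSN" contract; names and
// types are simplified (real Lsn is a dedicated newtype in the codebase).
type Lsn = u64;

/// First LSN to request from the log after restoring a snapshot whose
/// applied state is known only to be >= `min_applied_lsn`.
fn resume_from(min_applied_lsn: Lsn) -> Lsn {
    min_applied_lsn + 1
}

/// Whether a record at `lsn` is guaranteed to be reflected in the snapshot.
/// Records above the minimum may or may not be applied, since the true
/// applied LSN can be higher than the advertised lower bound.
fn certainly_applied(min_applied_lsn: Lsn, lsn: Lsn) -> bool {
    lsn <= min_applied_lsn
}
```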
I just pushed a pretty substantial revision - mostly addressing the initial PR feedback, but also updating to the latest restate/rust-rocksdb to pick up the memory leak + corruption bugfix I pushed last week, and a few other cleanups. The only notable change is that the snapshot metadata struct is now extended, and I've also introduced a

Not dependent on #1980 - this just stacks on top of it. The two can be merged independently into main. There will be other snapshots-related PRs to follow that depend on this one.
Force-pushed from 7e56a65 to ab6fad5
This commit adds the foundational support to export and import partition store snapshots. With this change we add the basic ability for the PartitionProcessor to export its state into a snapshot, and later re-create it by importing the same snapshot.

Going to merge this once all the checks pass on the latest rebase, to keep the open PR stack in check.