Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Snapshots] 1: Introduce partition store snapshotting support #1954

Merged
merged 1 commit into from
Sep 26, 2024
Merged

Conversation

pcholakov
Copy link
Contributor

@pcholakov pcholakov commented Sep 11, 2024

Copy link

github-actions bot commented Sep 11, 2024

Test Results

15 files  ±0  15 suites  ±0   17m 58s ⏱️ -22s
 6 tests ±0   6 ✅ ±0  0 💤 ±0  0 ❌ ±0 
18 runs  ±0  18 ✅ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit 84dfea6. ± Comparison against base commit a121f49.

♻️ This comment has been updated with latest results.

Copy link
Contributor

@AhmedSoliman AhmedSoliman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice. I've a few questions and small nits but this looks pretty good overall.

crates/partition-store/src/partition_store.rs Outdated Show resolved Hide resolved
crates/partition-store/src/partition_store_manager.rs Outdated Show resolved Hide resolved
crates/partition-store/src/partition_store_manager.rs Outdated Show resolved Hide resolved
crates/partition-store/src/partition_store_manager.rs Outdated Show resolved Hide resolved
let cf_name = cf_for_partition(partition_id);
let already_exists = self.rocksdb.inner().cf_handle(&cf_name).is_some();
if already_exists {
warn!(?partition_id, "already exists, refusing to import snapshot");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
warn!(?partition_id, "already exists, refusing to import snapshot");
warn!(?partition_id, "Already exists, refusing to import snapshot");

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we print cf_name in the log message?

The warn message here maybe can be phrased to indicate that the actual reason is that the column_family exists in the database. It should indicate an inconsistency of some sort since we are holding the lookup lock you checked up there that we don't have the partition_id, so it must mean that the partition_id is loaded but from a different column_family. No?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly right! Updated message - this error probably deserves a dedicated error code with a detailed explanation how this situation could arise. The most likely scenario is that a previous import has failed, or we are operating on an outdated datastore, and we have misalignment between the worker node configuration and what is on disk.

import_metadata.set_db_comparator_name(snapshot.db_comparator_name.as_str());
import_metadata.set_files(&snapshot.files);

info!(?partition_id, ?snapshot.min_applied_lsn, "initializing partition store from snapshot");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
info!(?partition_id, ?snapshot.min_applied_lsn, "initializing partition store from snapshot");
info!(?partition_id, ?snapshot.min_applied_lsn, "Initializing partition store from snapshot");

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it'd be nice if you can also include the last applied LSN of this snapshot in the log message

crates/partition-store/src/partition_store_manager.rs Outdated Show resolved Hide resolved
pub directory: String,
pub size: usize,
pub level: i32,
#[serde_as(as = "Option<serde_with::hex::Hex>")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious, why Hex and not Bytes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll laugh, I was anchored on hex from staring at the C++ LiveFile struct in the debugger view projecting that it might be nicer for humans to look at. No real preference - Option<Bytes> is just as good for me.

Copy link
Contributor Author

@pcholakov pcholakov Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I do have a strong preference for it not being an array, i.e. just the default Bytes representation. Since this metadata is likely to be seen by a human in its raw form, it's nice to be pretty-printed and I think there's a big difference between a 15-20 line array like this:

17        "start_key": [
  1         100,
  2         101,
  3         0,
  4         0,
  5         0,
  6         0,
  7         0,
  8         0,
  9         0,
 10         0,
 11         1,
 12         4,
 13         83,
 14         69,
 15         76,
 16         70
 17       ],

and a shorter hex or base64 representation. Hex made more sense as slightly easier to interpret - if I saw this and was trying to make sense of the key without a Restate-specific tool to render the key, my first reaction would be to hex dump it:

"start_key": "ZGUAAAAAAAAAAQEEU0VMRg==",

EDIT: I like the compactness of hex but the downside is it's not going to be obvious that it is NOT just a string value:

  1     {
  2       "column_family_name": "",
  3       "name": "/000048.sst",
  4       "directory": "/Users/pavel/tmp/restate-data/Pavels-MacBook-Pro.local/db-snapshots/1/1726555628950",
  5       "size": 1267,
  6       "level": 0,
  7       "start_key": "64650000000000000001010453454c46",
  8       "end_key": "667300000000000000010000000000000002",
  9       "smallest_seqno": 19,
 10       "largest_seqno": 20,
 11       "num_entries": 0,
 12       "num_deletions": 0
 13     },

So that was my thought process - what would be your first choice? :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this base64 in the end - but still interested to know what others think!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The positive point for hex is that we can compare the values to the output from rocks' ldb CLI tool

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that sells me on hex!

crates/partition-store/tests/snapshots_test/mod.rs Outdated Show resolved Hide resolved
crates/rocksdb/src/error.rs Outdated Show resolved Hide resolved
@pcholakov pcholakov marked this pull request as draft September 16, 2024 20:12
@pcholakov pcholakov marked this pull request as ready for review September 17, 2024 09:49
@pcholakov
Copy link
Contributor Author

I am still trying to understand why Build and test (ubuntu-22.04) is failing in CI - can't reproduce away from GitHub yet - but this is now ready for re-review.

crates/partition-store/src/partition_store_manager.rs Outdated Show resolved Hide resolved
Comment on lines +27 to +46
/// The minimum LSN guaranteed to be applied in this snapshot. The actual
/// LSN may be >= [minimum_lsn].
pub min_applied_lsn: Lsn,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can get the version from the checkpoint, then create the export, this way we know the exact value rather than the minimum.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping to implement exactly that; unfortunately the checkpointing and export are bundled in the RocksDB API - constructing the Checkpoint[^1] C++ object does nothing, it's just a handle. The alternative is to import the snapshot as a temporary column family, possibly in a separate database, which didn't seem worth it. Can you think of another way?

[1] https://github.com/facebook/rocksdb/blob/0e04ef1a96d75acdcf2cf1802692375d62aea64c/include/rocksdb/utilities/checkpoint.h#L24-L25

crates/rocksdb/src/lib.rs Outdated Show resolved Hide resolved
@pcholakov pcholakov changed the base branch from main to feat/restatectl-logs-trim September 21, 2024 16:26
@pcholakov
Copy link
Contributor Author

I just pushed a pretty substantial revision - mostly addressing the initial PR feedback but also updating to the latest restate/rust-rocksdb to pick up the memory leak+corruption bugfix I pushed last week, and a few other cleanups. The only notable change is that the snapshot metadata struct is now extended and I've also introduced a SnapshotId identifier.

Not dependent on #1980, this just stacks on top of it. The two can be merged independently into main. There will be other snapshots-related PRs to follow that depend on this one.

@pcholakov pcholakov changed the title Introduce partition store snapshotting [Snapshots] Introduce partition store snapshotting support Sep 25, 2024
Base automatically changed from feat/restatectl-logs-trim to main September 25, 2024 10:04
This commit adds the foundational support to export and import partition store
snapshots.
@pcholakov
Copy link
Contributor Author

Going to merge this once all the checks pass on the latest rebase to keep the open PR stack in check.

@pcholakov pcholakov changed the title [Snapshots] Introduce partition store snapshotting support [Snapshots] 1: Introduce partition store snapshotting support Sep 26, 2024
@pcholakov pcholakov merged commit 4d6717c into main Sep 26, 2024
12 checks passed
@pcholakov pcholakov deleted the pr1954 branch September 26, 2024 07:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants