Feedback after complete mini-lsm. #88

Foreverhighness · 2024-07-20T08:30:41Z

First of all, thank you for such a great tutorial on LSM-Tree storage engine.

After finishing 3 weeks project, I have got a lot of to share, including some questions and suggestions.

1. StorageIterator specification

Is there an implicit StorageIterator specification?

In week 1 project, there are several iterators that implement StorageIterator.

As I implemented these iterators, I often asked myself, "Have I implemented these iterators in a coherent way?"

It turns out that I wrote the same assertions again and again, so I feel there is some implicit specification that I should apply.

The spec I summarized is listed below:

fn key(&self) -> KeyType<'_>
- This function can be called iff the iterator is valid, if it is called from an invalid iterator, we could panic.
- The return value must be an valid key, which means it is an non-empty key.
fn value(&self) -> &[u8]
- This function can be called iff the iterator is valid, if it is called from an invalid iterator, we could panic.
fn is_valid(&self) -> bool
- This function return false iff its underlying storage cannot produce a new element.
- For FuseIterator, it also return false after next return an error.
fn num_active_iterators(&self) -> usize
- For invalid iterator, it should return 0
- For valid iterator, it should not return 0.
- For FuseIterator, when the iterator is valid, it should not return 0. when the iterator is invalid, the return value is unspecified.
fn next(&mut self) -> Result<()>
- For invalid iterator, it is no-op.
- For valid iterator, if return Ok(())
  - If underlying storage cannot produce a new element, turn into invalid iterator. return Ok(())
  - Otherwise, store the new element. return Ok(())
  - The key from new element should strict greater than old key.
- For valid iterator, if return Err(e)
  - The key, value, is_valid call should return same value as before, that means act like no-op.
  - For compound of iterators like MergeIterator, SstConcatIterator and TwoMergeIterator, it should remove the underlying iterator which cause the error. (I'm not sure whether is sound for SstConcatIterator and TwoMergeIterator)
  - For FuseIterator, it should turn into invalid iterator.

I'm not sure if it covers all situations.

2. Inconsistent function signature

As mentioned in #72. There is an inconsistent function signature in decode_block_meta.

mini-lsm/mini-lsm-starter/src/table.rs

Lines 45 to 46 in dd333ca

    
           /// Decode block meta from a buffer. 
        
           pub fn decode_block_meta(buf: impl Buf) -> Vec<BlockMeta> {

mini-lsm/mini-lsm/src/table.rs

Lines 63 to 64 in dd333ca

    
           /// Decode block meta from a buffer. 
        
           pub fn decode_block_meta(mut buf: &[u8]) -> Result<Vec<BlockMeta>> {

Until week 2 day 7, the signature is fine, but it turns out that this is not sufficient for checksum functionality.

The original signature implied that this function is infallible. But checksum will cause an error, so the return type should be wrapped in Result,
&[u8] is more convenient than impl Buf when implementing checksum functionality.

3. key always non-empty

It seems that an empty key is considered invalid. So we can enforce it by adding ParsedKey type, we could use nutype crate to achieve this goal.

But it may not be necessary for an educational project, since it is convenient to use an empty key instead of Option wrapper.

4. `apply_result` does not point out what it should return

mini-lsm/mini-lsm-starter/src/compact/simple_leveled.rs

Lines 41 to 48 in dd333ca

    
           /// Apply the compaction result. 
        
           /// 
        
           /// The compactor will call this function with the compaction task and the list of SST ids generated. This function applies the 
        
           /// result and generates a new LSM state. The functions should only change `l0_sstables` and `levels` without changing memtables 
        
           /// and `sstables` hash map. Though there should only be one thread running compaction jobs, you should think about the case 
        
           /// where an L0 SST gets flushed while the compactor generates new SSTs, and with that in mind, you should do some sanity checks 
        
           /// in your implementation. 
        
           pub fn apply_compaction_result(

As the document does point out that LsmStorageState should be the new state, but it does not point out what Vec<usize> means.
Only after I exam solution code, I figure out that should be sst_ids which need be removed.

5. Seedable level compaction simulator

The level compaction simulator uses rand to generate the key range, so each run will produce different output, making it difficult to verify that the output is identical to the reference solution by using command line tools such as diff cmp.

One solution is to add the seed flag to the simulator, and use rand::SeedableRng to generate random numbers.

6. `println!()` to log

In the reference solution, the implementation of apply_result contains some of println!() to log information, making it difficult to verify that the output is identical to the reference solution by using command line tools such as diff, cmp.

I think it should be replaced with eprintln!(), which will output to stderr and does not lose logging functionality.

7. Encourage to add test

In rust project, adding test is pretty easy! and adding tests is an easiest way to contribute to repository.

For example, when implementing manifest functionality, it is good to test serde_json library first.
So I can quickly write relative test within manifest.rs.

#[cfg(test)]
#[test]
fn test_record_serde() {
    let record1 = ManifestRecord::Flush(1);
    let json = serde_json::to_vec(&record1).unwrap();
    let record2 = serde_json::from_slice(&json).unwrap();
    assert_eq!(record1, record2);
}

Most of IDEs have the ability to run tests directly in Rust, which is different from C/C++ or Java, where you need some sort of test framework or use the main function to run a single simple test.

8. extract apply result logic to bind specific impl

When implementing apply_result function, I use a lot of related method of Vec in my solution,
e.g. Vec::splice, Vec::drain. But the levels Vec<(usize, Vec<usize>)> seems inefficient for doing compaction.

We could generalize levels impl like levels: Impl LevelsImpl and create a type alias type Levels = Vec<(usize, Vec<usize>)>;, and impl LevelsImpl trait for Vec<(usize, Vec<usize>)>, and we have the ability to easily switch to another implementation.

But it is of low priority, since this project is just for educational purposes.

9. immutable memtable is not immutated

Due to the interior mutability of MemTable, there is no way to forbid programmers from modifying immutable memtables in Vec<Arc<MemTable>>.

Which may (low probability) cause a nasty bug. But could be avoided with a little care.

The solution is quite simple: Use the Newtype pattern and expose only the read-only interface.

10. Migration to mvcc version is considered painful

It is kind of dirty work, and frustrating, to refactor code to use the key+ts representation.

The code that needs to be changed is scattered across many files.

Asking students to write unsafe code in Rust is awkward.

Learning on code that does not complied is quite inefficient.

11. use `Arc_cycle` to store weak pointer into LsmMvccInner

pub fn new_txn(&self, inner: Arc<LsmStorageInner>, serializable: bool) -> Arc<Transaction>

Strictly speaking, LsmMvccInner::new_txn can only be called with the LsmStorageInner that created it, but its semantics do not restrict this.

For example, if there are two different LsmStorageInner named storageA and storageB with associated LsmMvccInner named mvccA and mvccB,
nobody forbids us to call mvccB.new_txn(storageA) or mvccA.new_txn(storageB), which is obviously a logical error.

However, as project developers, we know that there is only one LsmStorageInner instance, so this won't be a critical issue.

But ideally, we can do better by using Arc::Arc_cycle to pass a weak pointer directly into LsmMvccInner::new.

12. TxnIterator, LsmIterator is not compatible on non-mvcc version

Currently TxnIterator uses TwoMergeIterator<TxnLocalIterator, LsmIterator> to iterate over the underlying storage.

When switching to the non-mvcc version, creating an empty transaction without mvcc can be tricky.

It might be better to use Option<A> in TwoMergeIterator and Option<Arc<Transaction>> in TxnIterator to handle this situation.

Again, thank you for such a great tutorial on LSM-Tree storage engine.

The text was updated successfully, but these errors were encountered:

skyzh · 2024-07-22T20:17:42Z

thanks for the suggestions :) I'll incorporate them into the tutorial upon the next revision!

Foreverhighness mentioned this issue Jul 20, 2024

Add fh solution. #89

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feedback after complete mini-lsm. #88

Feedback after complete mini-lsm. #88

Foreverhighness commented Jul 20, 2024

skyzh commented Jul 22, 2024

Feedback after complete mini-lsm. #88

Feedback after complete mini-lsm. #88

Comments

Foreverhighness commented Jul 20, 2024

1. StorageIterator specification

2. Inconsistent function signature

3. key always non-empty

4. apply_result does not point out what it should return

5. Seedable level compaction simulator

6. println!() to log

7. Encourage to add test

8. extract apply result logic to bind specific impl

9. immutable memtable is not immutated

10. Migration to mvcc version is considered painful

11. use Arc_cycle to store weak pointer into LsmMvccInner

12. TxnIterator, LsmIterator is not compatible on non-mvcc version

skyzh commented Jul 22, 2024

4. `apply_result` does not point out what it should return

6. `println!()` to log

11. use `Arc_cycle` to store weak pointer into LsmMvccInner