This file tells the brainstormy/rambly story of planning more granular locking
for ShmCache. Locking the whole memory (hash table region and value region) for
all operations is currently really slow whenever multiple processes try to do
operations in parallel.
##############################################################################
# Hash table with open addressing #
##############################################################################
.........................................................................
| Hash table global lock |
`````````````````````````````````````````````````````````````````````````
Hash table region
0-------1-------2-------3-------4-------5-------6-------7-------8-------9
|valAddr|valAddr|valAddr|valAddr|valAddr|valAddr|valAddr|valAddr|valAddr|
-------------------------------------------------------------------------
__________________|
| _____ Ring buffer
| | pointer
Value region v v
0----------------------------1----------------------------2-----------------------------
|key,allocSize,size,flags,val|key,allocSize,size,flags,val|key,allocSize,size,flags,val|
----------------------------------------------------------------------------------------
| | |
v v v
........................................................................................
| Lock X | Lock Y |
````````````````````````````````````````````````````````````````````````````````````````
Each lock is associated with multiple values, because we're using flock(), and
we want to limit the number of lock files.
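As a rough sketch of that "few lock files, many values" idea: striped flock()
locks, where many value chunks map onto a small fixed number of lock files.
The paths, stripe count and function names below are made up for illustration,
not actual ShmCache code.

<?php
// Minimal sketch of lock striping with flock(); names and paths are hypothetical.
const LOCK_STRIPES = 8;

function lockPathForValue(int $valueIndex): string {
    // Many value chunks share one of a few lock files
    return sys_get_temp_dir() . '/shmcache-lock-' . ($valueIndex % LOCK_STRIPES);
}

function acquireValueLock(int $valueIndex, bool $exclusive = true) {
    $fp = fopen(lockPathForValue($valueIndex), 'c');
    if (!$fp || !flock($fp, $exclusive ? LOCK_EX : LOCK_SH)) {
        return null;
    }
    return $fp; // Keep the handle around; releasing the lock needs it
}

function releaseValueLock($fp): void {
    flock($fp, LOCK_UN);
    fclose($fp);
}

// Values 1 and 9 map to the same stripe, so they block each other
$lock = acquireValueLock(9);
// ... read/write the value region chunk here ...
releaseValueLock($lock);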
Example steps for set() (a short probing sketch follows these steps):
- Compute hash table bucket index (4)
- Lock hash table for read
- Lock X for write
- Read key from value 0
- Key doesn't match; release lock X
- Read next hash table entry (index 5)
- Lock Y for write
- Read key from value 2
- Key doesn't match; release lock Y
- Read next hash table entry (index 6)
- Lock X for write
- Read key from value 1
- Key matches
- If new value fits into allocSize of value 1
- Release hash table lock
- Update value size
- Write new value
- If new size leaves a large-enough gap, split value into new value and free space
- Release lock X
- Else
- Release hash table lock
- (race condition???)
- Lock hash table for write
- (deadlock, because locking value first, then hash table???)
- Can fix deadlock and race condition by locking initially for write, but
that loses read-parallelism
- Remove old value
- Set value size to 0 (free)
- Set hash table entry valAddr to 0 (free)
- Merge this space with adjacent free space
- Acquire next value space's lock for write
- Read next value
- If next value is free, merge it to this one
- Continue until non-free value is found
- Release all free spaces' locks (keep a stack)
- Release lock X
- (race condition???)
- Lock ring buffer location (lock Y) for write
- Read value size at ring buffer location
- If not large enough for value
- Lock ringbuffer+1..N for write until we have enough space for the new value
- Remove all old values
- Merge now-free space into one chunk
- Lock ringbuffer+N+Nsize-(leftover space); split the chunk into new value and leftover free space
- Release all locks Y->
- Release hash table lock
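A minimal sketch of the lookup part of the steps above: hash the key to a
bucket index, then probe forward with wraparound until the key or a NULL
entry is found. A plain PHP array stands in for the shared-memory hash table
region here; in ShmCache the reads would go through shmop_read()/unpack(),
and the locking described in the steps above is left out.

<?php
// Open-addressing lookup sketch; the $readValueKey callback stands in for
// reading the key from the value region chunk at the given address.

function findBucketIndex(string $key, array $hashTable, callable $readValueKey): ?int {
    $bucketCount = count($hashTable);
    $index = crc32($key) % $bucketCount;

    for ($i = 0; $i < $bucketCount; $i++) {
        $probe = ($index + $i) % $bucketCount; // Wrap around the table end
        $valAddr = $hashTable[$probe];

        if ($valAddr === null) {
            return null;                       // Hit a free slot: key not stored
        }
        if ($readValueKey($valAddr) === $key) {
            return $probe;                     // Found the entry for this key
        }
        // Key mismatch: keep probing the next slot
    }
    return null; // Table full of other keys (shouldn't happen below 100% load)
}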
*********** Why use a global lock for the hash table? ************************
It's difficult/impossible to implement bucket-specific locks with open
addressing because of deadlocks.
Example:
.........................................
| Lock A | Lock B |
`````````````````````````````````````````
^ ^ ^ ^ ^
| | | | |
0-------1-------2-------3-------4--------
|valAddr| NULL |valAddr|valAddr|valAddr|
-----------------------------------------
Note that there is always at least one NULL in this table, as the table must
never reach a 100% load factor when using open addressing.
- Find table index for key "foo"
- Hash "foo" to 2
- Acquire lock A
- Iterate to 3 (the correct location)
- Try to acquire lock B
- Find table index for key "bar"
- Hash "bar" to 4
- Acquire lock B
- Iterate to 0 (the correct location)
- Try to acquire lock A
- Deadlock
- "foo" has A, tries to lock B
- "bar" has B, tries to lock A
Solutions:
- Require locks to always be acquired in a certain order (A->B)
- Not realistic, as "bar" would beed to acquire A and all preceding locks before B
- Drop any locks you're holding, if you're trying to acquire a lock
- Doesn't allow us to keep locks of multiple entries, which would be required
when merging (non-free) value region chunks
- Make each hash table entry have its own lock. The presence of at least one
NULL in the table ensures that locks are always acquired left-to-right.
- Would probably require us to create 10k..100k files for flock() or sem_get().
Would we hit number-of-open-files limits?
- TODO: this doesn't actually fix the problem; delete this
******************************************************************************
Open addressing doesn't seem to be a good solution in terms of having more
granular locks for the hash table entries. The root problem is that one
bucket's entries can be associated with multiple locks, so to lock a single
bucket, you might need multiple locks, and the order in which those are
acquired is difficult to enforce, leading to deadlocks. Implementing the
hash table with separate chaining is a solution (see the next section).
When planning a locking hierarchy, it looks like you should never keep two
locks of the same level of hierarchy (where the hash table is one level of
hierarchy, and the value memory region is another level) at the same time, or
if you do, you need to be 100% sure that the locks are always acquired in the
same order (A->B->C). It seems a bit easier to keep multiple locks when all of
them are in different levels of hierarchy, as usually those locks are acquired
in the same order.
(These deadlocks occur because of _lock-order inversion_: two processes acquire
the same locks in opposite orders.)
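A small sketch of that "fixed acquisition order within one hierarchy level"
rule, built on flock() lock files like the earlier sketch; the helper names
and paths are made up:

<?php
// When a process needs several locks from the same hierarchy level, it sorts
// the lock indexes and acquires them in ascending order, so two processes can
// never each hold the lock the other one wants next.

function acquireManyInOrder(array $lockIndexes): array {
    sort($lockIndexes);                 // Fixed global order prevents inversion
    $handles = [];
    foreach (array_unique($lockIndexes) as $i) {
        $fp = fopen(sys_get_temp_dir() . "/shmcache-lock-$i", 'c');
        flock($fp, LOCK_EX);            // May block, but never in a cycle
        $handles[$i] = $fp;
    }
    return $handles;
}

function releaseMany(array $handles): void {
    // Release in reverse order (not strictly required, but conventional)
    foreach (array_reverse($handles, true) as $fp) {
        flock($fp, LOCK_UN);
        fclose($fp);
    }
}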
##############################################################################
# Hash table with separate chaining #
##############################################################################
With separate chaining we can have one lock fully wrap _all_ of the entries
in a single hash table bucket, unlike with open addressing, because separate
chaining gives each bucket and its entries a clear hierarchy.
...............................................................................
| Lock A | Lock B
```````````````````````````````````````````````````````````````````````````````
^ ^ ^ ^ ^
Hash table | | | | |
buckets region | | | |
0------------1------------2------------3------------4------------5-------------
|valAddr,next|valAddr,next|valAddr,next|valAddr,next|valAddr,next|valAddr,next|
-------------------------------------------------------------------------------
|___ ________________
Hash table | | |
overflow region v | v
0------------1------------2------------3------------4------------5-------------
|valAddr,next|valAddr,next|valAddr,next|valAddr,next|valAddr,next|valAddr,next|
-------------------------------------------------------------------------------
______________________|
| _____ Ring buffer
| | pointer
Value region v v
0----------------------------1----------------------------2-----------------------------
|key,allocSize,size,flags,val|key,allocSize,size,flags,val|key,allocSize,size,flags,val|
----------------------------------------------------------------------------------------
| | |
v v v
........................................................................................
| Lock X | Lock Y |
````````````````````````````````````````````````````````````````````````````````````````
Each lock is associated with multiple buckets or values, because we're using
flock(), and we want to limit the number of lock files.
Example steps for set():
- Compute hash table bucket index (1)
- Lock A for write
- Lock X for write
- Read key from value 0
- Key doesn't match; release lock X
- Read next hash table entry from overflow region (index 2)
- Lock Y for write
- Read key from value 2
- Key doesn't match; release lock Y
- Read next hash table entry from overflow region (index 4)
- Lock X for write
- Read key from value 1
- Key matches
- If new value fits into allocSize of value 1
- Release lock A
- Update value size
- Write new value
- If new size leaves a large-enough gap, split value into new value and free space
- Release lock X
- Else
- Remove old value
- Set value size to 0 (free)
- Set hash table entry valAddr to 0 (free) (OK as we still have lock A)
- Merge this space with adjacent free space
- Acquire next value space's lock for write
- Read next value
- If next value is free, merge it to this one
- Continue until non-free value is found
- Release all free spaces' locks (keep a stack)
- Release lock X
- (race condition???)
- If this is actually a race condition, we can move the lock release
further down, with slightly worse performance
- Lock ring buffer location (lock Y) for write
- Read value key and size at ring buffer location
- note: to prevent a deadlock, we can't just temporarily release lock Y here,
lock the hash table bucket first, and then re-lock the value, because that
causes a race condition
- some other process could, for example, split the value at the ring buffer
location while Y is released, so we'd have to re-read the value after
re-acquiring lock Y, or we'd be working with a stale copy of that chunk's size
- Lock hash table bucket for value's key at ring buffer's location, if value is non-free
- (deadlock??? locking hash table bucket _after_ value)
- There seems to be no way to lock the hash table bucket first without first
locking and reading the value's key. The only idea I have is to introduce
a global everything-blocking lock for cases like this. A global lock could
work if the current shmop_read() and unpack() calls were faster (could try
to optimize these).
- If not large enough for value
- Lock ringbuffer+1..N for write until we have enough space for the new value
- Remove all old values at ringbuffer+0..N
- Merge now-free space into one chunk
- Lock ringbuffer+N+Nsize-(leftover space); split the chunk into new value and leftover free space
- Release all locks Y->
- Release lock A
Separate chaining should make the hash table locking more performant.
We currently use a ring buffer where values are (mostly) in insertion order
(FIFO).
When removing oldest items in the ring buffer, we need to lock a hash table
entry, but can only do that after locking a value, which results in deadlocks.
Hash table entries must always be locked _before_ locking values.
Maybe some other value memory region architecture makes locking easier.
- How do you easily find/create a new memory chunk for a new value without
complicated locking of old values and their hash table entries?
- Hash table currently contains pointers to "structs" in value memory region
- Clearing a value "struct" without clearing the pointer creates a dangling pointer
- Need to always clear the pointer first, then the struct
- A memory pool or slab allocation, i.e., fixed-size block allocation, would
help, as there always exists a chunk large enough for a value, and no
merging of oldest value chunks would have to be made
- What other ways are there to keep track of oldest cache items?
- A separate LRU linked list (like Memcached does)
- A separate FIFO queue (ring buffer) that points to values that are spread
across the value region in non-insertion-order
- Doesn't allow easily merging two adjacent value region chunks
- Maybe this queue contains groups of value chunks, and not the chunks
themselves; each group only contains _contiguous_ value chunks
- In case we still use dynamic-size block allocation, how do you merge multiple
value chunks while keeping a lock for the shortest possible duration?
- A separate LRU/FIFO list is problematic, as adjacent queue items don't
always point to adjacent value region chunks
- Optimize the current merging code
- Don't merge at all: always keep a free chunk of the largest possible item size available.
- Is there an easy way to always keep at least one chunk of the largest possible
item size available?
- A naive implementation would just delete and merge oldest items until there
is enough space, like we do now. This is difficult in terms of locking.
- Make each chunk a fixed size (largest possible item size). This causes huge
internal fragmentation.
- Different slabs for most popular chunk sizes fixes this
- Can we use swaps (moves) to make something easier? Swap two value chunks.
Swaps don't require pack()/unpack().
A realization: in C you can map structs in shared memory, and when some
member's data changes in the shared memory block, you can just read the new
value with obj->member. Our way of using PHP's unpack() is a lot slower and
creates a thick, slow layer between PHP and shared memory, where we basically
always read copies of the shared memory, and not the shared memory directly.
This relates to the note above about temporarily releasing lock Y and then
re-reading the value region chunk.
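For reference, this is roughly what that shmop_read()+unpack() layer looks
like; the header format and field names here are just an illustration, not
ShmCache's actual layout:

<?php
// Every field access copies bytes out of shared memory and decodes them in
// PHP, unlike a C struct pointer into the mapped block.

const CHUNK_HEADER_FORMAT = 'a32key/NallocSize/Nsize/Nflags'; // 32-byte key + 3 uint32s
const CHUNK_HEADER_SIZE   = 32 + 4 + 4 + 4;

function readChunkHeader($shmId, int $chunkOffset): array {
    // shmop_read() returns a *copy* of the bytes; unpack() decodes that copy
    $raw = shmop_read($shmId, $chunkOffset, CHUNK_HEADER_SIZE);
    return unpack(CHUNK_HEADER_FORMAT, $raw);
}

// If another process modifies the chunk after this call, the decoded array is
// stale, which is why the value has to be re-read after re-acquiring a lock.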
##############################################################################
# Not locking the hash table for the whole duration of value modifications #
##############################################################################
Do we actually need to lock the hash table entry (or its wrapping bucket) for
the whole duration of modifying the entry's associated value?
Tl;dr: yes, because there are subtle race conditions that must be prevented by
locking all properties (hash table entry, value region chunk) of a cache item
for the whole duration of an operation.
...............................................................................
| Lock A | Lock B
```````````````````````````````````````````````````````````````````````````````
^ ^ ^ ^ ^
Hash table | | | | |
buckets region | | | |
0------------1------------2------------3------------4------------5-------------
|valAddr,next|valAddr,next|valAddr,next|valAddr,next|valAddr,next|valAddr,next|
-------------------------------------------------------------------------------
|_______________
| _____ Ring buffer
| | pointer
Value region v v
0----------------------------1----------------------------2-----------------------------
|key,allocSize,size,flags,val|key,allocSize,size,flags,val|key,allocSize,size,flags,val|
----------------------------------------------------------------------------------------
| | |
v v v
........................................................................................
| Lock X | Lock Y |
````````````````````````````````````````````````````````````````````````````````````````
Example steps for set():
- Compute hash table bucket index (1)
- Lock A for (!) read
- Read value address from hash table entry 1
- (!) Release lock A
- Possible changes in another process:
- add() another cache item
- Lock A for write
- Set "next" to some address in the separate hash table overflow region
- Release lock A
- remove() the cache item
- Lock A for write
- Set hash table entry 1 valAddr to NULL
- Release lock A
- Lock X for write
- Set value chunk 1 size to 0 (free)
- Release lock X
- Change valAddr of hash table entry 1 to point to value 2
- Set valAddr of entry 1 to NULL
- Lock X for write
- Read value chunk 1 and see that its key matches
- If existing chunk is big enough for new value
- Write new value to existing chunk
- If existing chunk now has enough free space, split the second half into free space
- Else
- Remove value chunk 1
- Lock A for write
- Set hash table entry 1 valAddr to NULL
- Set bucketPreviousEntry->next to this->next
- Release lock A
- Lock Y+0..N for write
- If the already-acquired lock X is among these, it's OK, as we'll use recursive locks
- Remove values after the ring buffer pointer, as usual, in a loop:
- Remove a value chunk
- Lock hash table bucket for write
- Set its associated hash table entry valAddr to NULL
- Release hash table bucket lock
- Release locks Y+0..N
- Release lock X
There's at least one possible race condition:
- Process A
- Update value with a new value
- Update value size with new size
- Process B
- Remove hash table entry (set valAddr to NULL)
- Set value size to 0 (free)
The problem occurs if those steps are run in this order:
- A: Read valAddr from hash table entry
- B: Set hash table entry valAddr to NULL
- B: Update size from 456 to 0
- A: Update size from 0 to 123
The hash table thinks the cache item is removed, but the value memory doesn't.
If the hash table entry is locked for the whole duration of the operation, it
essentially acts as a lock on the whole cache item:
hash_table_entry_lock {
cache_item {
hash_table_entry
value_chunk
}
}
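A minimal sketch of that conclusion: the bucket lock is held for the whole
set(), so the hash table entry and its value chunk change as one atomic cache
item. The lock file path and bucket count are made up, and the actual
entry/chunk manipulation is abstracted into a callback:

<?php
// The bucket lock wraps the entire operation, so the entry cannot be unlinked
// or repointed while we modify the value chunk it refers to.

const BUCKET_COUNT = 16;

function lockFile(string $path) {
    $fp = fopen($path, 'c');
    flock($fp, LOCK_EX);
    return $fp;
}

function unlockFile($fp): void {
    flock($fp, LOCK_UN);
    fclose($fp);
}

function setItem(string $key, callable $updateEntryAndChunk): void {
    $bucket = crc32($key) % BUCKET_COUNT;
    $bucketLock = lockFile(sys_get_temp_dir() . "/shmcache-bucket-$bucket");
    try {
        // Hash table entry and value chunk stay consistent under this lock
        $updateEntryAndChunk();
    } finally {
        unlockFile($bucketLock); // Only now can other processes see the item
    }
}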
##############################################################################
# What features should the memory allocator have? #
##############################################################################
- No/minimal external fragmentation
- Minimal internal fragmentation
- Fast insertion of data
- Fast freeing of chunks
- Granular locking (don't lock the whole value region for write operations)
- No merging of chunks (thus not requiring keeping multiple chunk locks
simultaneously, which prevents some deadlocks)
- However, we can already prevent these deadlocks by acquiring each chunk's
lock always in the same order, which is easy to do. Still, it's simpler
to just need one value chunk lock when removing a cache item.
Memory allocation strategies:
- A memory pool: https://en.wikipedia.org/wiki/Memory_pool
- Probably the same thing as Memcached's slab allocator in practice
- A slab allocator like in Memcached, where memory is partitioned into
fixed-size (1MB) slabs, which contain as many fixed-size
(1MB, 512kB, ..., 80 b) chunks as can fit into one slab. Locking can be done
per slab, which locks multiple chunks, and therefore multiple cache items
with one lock.
- This is basically half fixed-size allocator, half variable-size allocator
How to make removing values fast:
- Free-list: a linked list to keep track of free value chunks and their sizes
- Some kind of a fixed-size or step-sized memory layout for not having to loop
through the whole free-list or oldest items list to find a proper size block
- Slab allocator (Memcached)
- Same principle as in a hash table: buckets with their own free-lists
Note that splitting value chunks is easy, but merging them is hard. A slab
allocator doesn't split chunks, and instead has large-enough chunks for any
value size (up to max size), which is simple but allows internal fragmentation.
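A small sketch of Memcached-style size classes, to make the internal
fragmentation trade-off concrete; the base size, growth factor and 1 MB cap
are illustrative values:

<?php
// Chunk sizes grow by a fixed factor, and a value goes into the smallest
// class it fits into; the unused tail of that chunk is internal fragmentation.

function buildSizeClasses(int $base = 80, float $factor = 1.25, int $max = 1048576): array {
    $classes = [];
    for ($size = $base; $size < $max; $size = (int)ceil($size * $factor)) {
        $classes[] = $size;
    }
    $classes[] = $max;
    return $classes;
}

function pickSizeClass(int $valueSize, array $classes): ?int {
    foreach ($classes as $chunkSize) {
        if ($valueSize <= $chunkSize) {
            return $chunkSize; // Smallest class that fits; the rest is wasted
        }
    }
    return null; // Larger than the largest possible item
}

// Example: a 101-byte value lands in the 125-byte class -> 24 bytes wasted
$classes = buildSizeClasses();
printf("101 bytes -> %d-byte chunk\n", pickSizeClass(101, $classes));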
Why is merging chunks hard?
- One or both of the merged chunks might be in use (pointed to by a hash table entry)
- The pointers need to be NULLed before freeing the value chunks
- Removing and merging currently takes a lot of time (partly due to difficulty
in implementing safe granular locking)
***** Idea: a variable-size allocator wrapped by a fixed-size allocator ******
|__ ____ _ ____|_________ ____|__ _ _________
|__|____|_|____|_________|____|__|_|_________|
|Zone 0 |Zone 1 |Zone 2
^
|
ringBufferPtr
Group chunks into zones (pages) like Memcached does, but when removing oldest
cache items, don't remove a single chunk, but the whole zone. This gives space
for any size value (up to the largest possible cache item value). A fixed-size
zone can act as a mini-region inside which we can do variable-size allocations.
When inserting a new value, insert it into the newest zone that has enough
space, so that the value isn't evicted soon afterwards when an old zone is
freed to make space for new values.
The whole value memory region can be a ring buffer, with a pointer pointing to
the newest zone (and the oldest zone can easily be found with
ringBufferPtr - ZONE_SIZE).
A global free-list keeps track of either the free space of zones, or the free
space of the chunks in zones. If it keeps track of only the free space of
a whole zone, then the zone is essentially a stack, and all chunks inside it
must be contiguous.
If a zone is a stack, a stack pointer points to the start of free space inside
the zone. Freeing a chunk would require moving memory, though, and PHP has no
shmop_move() function. Consider some options:
- A zone is not a stack, but contains variable-size chunks anywhere in it
- Instead of the global free-list, each zone contains its own free-list,
like Memcached
- A zone is a stack, but when freeing a chunk, and there is an adjacent
non-free chunk to the right of it, don't free it, and mark it as "dirty".
When the chunk to the right of it is freed, also free this chunk. In short,
only free a chunk when it's the top chunk in the non-free stack.
I like the second one for its simplicity.
This whole mechanism gives us:
- Minimal/some fragmentation (each zone is filled with variable-size chunks;
some internal fragmentation in terms of having unused space in zones, but no
internal fragmentation in terms of the top-level allocator, which has
fixed-size zones)
- Fast insertion of data (write the data and advance a zone's internal
free-space-starts-here pointer)
- Fast freeing of chunks (a whole zone is freed at once; basically works like
a zone allocator)
- Granular locking (per zone)
- No merging of chunks (if no chunk large enough is found, we free the oldest
zone and therefore all chunks in it)
******************************************************************************
A problem remains: how to quickly, and with minimal locking, remove all
corresponding hash table entries, when a zone's chunks are freed.
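Setting that problem aside, here's a minimal sketch of the zone idea itself:
fixed-size zones, bump-pointer (stack) allocation inside the newest zone, and
whole-zone eviction at the ring buffer pointer. Sizes and names are
illustrative, and clearing the evicted chunks' hash table entries is left out.

<?php
const ZONE_SIZE  = 1048576; // 1 MB zones
const ZONE_COUNT = 8;

class ZoneAllocator {
    private array $zoneUsed;        // Bytes used in each zone (its stack pointer)
    private int $ringBufferPtr = 0; // Index of the newest zone

    public function __construct() {
        $this->zoneUsed = array_fill(0, ZONE_COUNT, 0);
    }

    /** Returns [zoneIndex, offsetInZone] for a new chunk of $size bytes. */
    public function allocate(int $size): array {
        if ($size > ZONE_SIZE) {
            throw new InvalidArgumentException('Larger than the largest possible item');
        }
        // Fill the newest zone first, so fresh values aren't evicted soon after
        if ($this->zoneUsed[$this->ringBufferPtr] + $size > ZONE_SIZE) {
            $this->evictOldestZone();
        }
        $zone   = $this->ringBufferPtr;
        $offset = $this->zoneUsed[$zone];
        $this->zoneUsed[$zone] += $size;   // Advance the zone's stack pointer
        return [$zone, $offset];
    }

    private function evictOldestZone(): void {
        // Advance the ring buffer pointer; the zone it lands on is the oldest
        // one, and all chunks in it are freed at once (no chunk merging)
        $this->ringBufferPtr = ($this->ringBufferPtr + 1) % ZONE_COUNT;
        $this->zoneUsed[$this->ringBufferPtr] = 0;
    }
}

// Usage: allocations fill the newest zone; when it's full, the oldest zone is
// freed wholesale and becomes the new newest zone.
$alloc = new ZoneAllocator();
[$zone, $offset] = $alloc->allocate(300);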
##############################################################################
# How Memcached handles locking #
##############################################################################
typedef struct _stritem {
    /* Protected by LRU locks */
    struct _stritem *next;
    struct _stritem *prev;
    /* Rest are protected by an item lock */
    struct _stritem *h_next;    /* hash chain next */
    rel_time_t      time;       /* least recent access */
    rel_time_t      exptime;    /* expire time */
    int             nbytes;     /* size of data */
    unsigned short  refcount;
    uint8_t         nsuffix;    /* length of flags-and-length string */
    uint8_t         it_flags;   /* ITEM_* above */
    uint8_t         slabs_clsid;/* which slab class we're in */
    uint8_t         nkey;       /* key length, w/terminating null and padding */
    /* this odd type prevents type-punning issues when we do
     * the little shuffle to save space when not using CAS. */
    union {
        uint64_t cas;
        char end;
    } data[];
    /* if it_flags & ITEM_CAS we have 8 bytes CAS */
    /* then null-terminated key */
    /* then " flags length\r\n" (no terminating null) */
    /* then data with terminating \r\n (no terminating null; it's binary!) */
} item;

static void process_delete_command(conn *c, token_t *tokens, const size_t ntokens) {
    char *key = tokens[KEY_TOKEN].value;
    size_t nkey = tokens[KEY_TOKEN].length;
    item *it = item_get(key, nkey, c, DONT_UPDATE);
    if (it) {
        item_unlink(it);
        item_remove(it);      /* release our reference */
    }
}

/*
 * Returns an item if it hasn't been marked as expired,
 * lazy-expiring as needed.
 */
item *item_get(const char *key, const size_t nkey, conn *c, const bool do_update) {
    uint32_t hv = hash(key, nkey);
    item_lock(hv);
    item *it = do_item_get(key, nkey, hv, c, do_update);
    item_unlock(hv);
    return it;
}

/** wrapper around assoc_find which does the lazy expiration logic */
item *do_item_get(const char *key, const size_t nkey, const uint32_t hv, conn *c, const bool do_update) {
    item *it = assoc_find(key, nkey, hv);
    if (it != NULL)
        refcount_incr(it);
    if (it != NULL) {
        was_found = 1;
        if (item_is_flushed(it)) {
            do_item_unlink(it, hv);
            do_item_remove(it);
            it = NULL;
        } else if (it->exptime != 0 && it->exptime <= current_time) {
            do_item_unlink(it, hv);
            do_item_remove(it);
            it = NULL;
        }
    }
    return it;
}

/*
 * Unlinks an item from the LRU and hashtable.
 */
void item_unlink(item *item) {
    uint32_t hv = hash(ITEM_key(item), item->nkey);
    item_lock(hv);
    do_item_unlink(item, hv);
    item_unlock(hv);
}

/*
 * Decrements the reference count on an item and adds it to the freelist if
 * needed.
 */
void item_remove(item *item) {
    uint32_t hv;
    hv = hash(ITEM_key(item), item->nkey);
    item_lock(hv);
    do_item_remove(item);
    item_unlock(hv);
}

void do_item_unlink(item *it, const uint32_t hv) {
    if ((it->it_flags & ITEM_LINKED) != 0) {
        it->it_flags &= ~ITEM_LINKED;
        assoc_delete(ITEM_key(it), it->nkey, hv);
        item_unlink_q(it);
        do_item_remove(it);
    }
}

void assoc_delete(const char *key, const size_t nkey, const uint32_t hv) {
    item **before = _hashitem_before(key, nkey, hv);
    if (*before) {
        item *nxt = (*before)->h_next;
        (*before)->h_next = 0;    /* probably pointless, but whatever. */
        *before = nxt;
    }
}

static void item_unlink_q(item *it) {
    pthread_mutex_lock(&lru_locks[it->slabs_clsid]);
    do_item_unlink_q(it);
    pthread_mutex_unlock(&lru_locks[it->slabs_clsid]);
}

static void do_item_unlink_q(item *it) {
    item **head, **tail;
    head = &heads[it->slabs_clsid];
    tail = &tails[it->slabs_clsid];
    if (*head == it)
        *head = it->next;
    if (*tail == it)
        *tail = it->prev;
    if (it->next) it->next->prev = it->prev;
    if (it->prev) it->prev->next = it->next;
    sizes[it->slabs_clsid]--;
    sizes_bytes[it->slabs_clsid] -= ITEM_ntotal(it);
}

void do_item_remove(item *it) {
    if (refcount_decr(it) == 0)
        item_free(it);
}

void item_free(item *it) {
    size_t ntotal = ITEM_ntotal(it);
    /* so slab size changer can tell later if item is already free or not */
    unsigned int clsid = ITEM_clsid(it);
    slabs_free(it, ntotal, clsid);
}

void slabs_free(void *ptr, size_t size, unsigned int id) {
    pthread_mutex_lock(&slabs_lock);
    do_slabs_free(ptr, size, id);
    pthread_mutex_unlock(&slabs_lock);
}

static void do_slabs_free(void *ptr, const size_t size, unsigned int id) {
    if (id < POWER_SMALLEST || id > power_largest)
        return;
    slabclass_t *p = &slabclass[id];
    item *it = (item *)ptr;
    if ((it->it_flags & ITEM_CHUNKED) == 0) {
        it->it_flags = ITEM_SLABBED;
        it->slabs_clsid = 0;
        it->prev = 0;
        it->next = p->slots;
        if (it->next) it->next->prev = it;
        p->slots = it;
        p->sl_curr++;
        p->requested -= size;
    }
    else {
        do_slabs_free_chunked(it, size);
    }
}
From https://github.com/memcached/memcached/blob/master/doc/threads.txt:
- item_lock must be held while modifying an item.
- slabs_lock must be held while modifying the ITEM_SLABBED flag bit within an item.
- ITEM_LINKED must not be set before an item has a key copied into it.
- items without ITEM_SLABBED set cannot have their memory zeroed out.
LOCK ORDERS:
item_lock -> lru_lock -> slabs_lock
lru_lock -> item_trylock
From thread.c:
/* item_lock() must be held for an item before any modifications to either its
 * associated hash bucket, or the structure itself.
 * LRU modifications must hold the item lock, and the LRU lock.
 * LRU's accessing items must item_trylock() before modifying an item.
 * Items accessible from an LRU must not be freed or modified
 * without first locking and removing from the LRU.
 */
void item_lock(uint32_t hv) {
    pthread_mutex_lock(&item_locks[hv & hashmask(item_lock_hashpower)]);
}

void *item_trylock(uint32_t hv) {
    pthread_mutex_t *lock = &item_locks[hv & hashmask(item_lock_hashpower)];
    if (pthread_mutex_trylock(lock) == 0)
        return lock;
    return NULL;
}
"If successful, pthread_mutex_lock() and pthread_mutex_unlock() return zero"
"pthread_mutex_trylock() returns zero if a lock is acquired"
When removing an item with process_delete_command():
item_get()
  item_lock() <------------------------------------------------------- Lock
  assoc_find()
    Returns a hash table entry like:
    struct {
      *next;      <-- refers to the slab's LRU list's next item
      *prev;      <-- refers to the slab's LRU list's prev item
      *h_next;    /* hash chain next */
      time;       /* least recent access */
      exptime;    /* expire time */
      nbytes;     /* size of data */
      refcount;
      nsuffix;    /* length of flags-and-length string */
      it_flags;   /* ITEM_* above */
      slabs_clsid;/* which slab class we're in */
      nkey;       /* key length, w/terminating null and padding */
      /* Key and data lives somewhere after this struct's end */
    }
  item_unlock() <--------------------------------------------------- Unlock
item_unlink()
  item_lock() <------------------------------------------------------- Lock
  it->it_flags &= ~ITEM_LINKED;
  Remove this item from hash table bucket linked-list
  pthread_mutex_lock(&lru_locks[it->slabs_clsid]); <------------------- Lock
    Remove this item from slab's LRU linked-list
    (See https://github.com/memcached/memcached/blob/master/doc/new_lru.txt)
    if (it->next) it->next->prev = it->prev;
    if (it->prev) it->prev->next = it->next;
  pthread_mutex_unlock(&lru_locks[it->slabs_clsid]); <--------------- Unlock
  item_free()
    pthread_mutex_lock(&slabs_lock); <--------------------------------- Lock
      slabclass_t *p = &slabclass[it->slabs_clsid];
      it->it_flags = ITEM_SLABBED;
      it->slabs_clsid = 0;
      it->prev = 0;
      it->next = p->slots;
      if (it->next) it->next->prev = it;
      p->slots = it;
    pthread_mutex_unlock(&slabs_lock); <----------------------------- Unlock
  item_unlock() <--------------------------------------------------- Unlock
##############################################################################
# Simpler hash table with separate chaining #
##############################################################################
A sidenote: it might get a little awkward to refer to hash table entries when
some of the entries are in the main hash table bucket region, and some are in
the hash table overflow region, like I initially mentioned above when introducing
separate chaining:
Hash table
buckets region
0------------1------------2------------3------------4------------5-------------
|valAddr,next|valAddr,next|valAddr,next|valAddr,next|valAddr,next|valAddr,next|
-------------------------------------------------------------------------------
|___ ________________
Hash table | | |
overflow region v | v
0------------1------------2------------3------------4------------5-------------
|valAddr,next|valAddr,next|valAddr,next|valAddr,next|valAddr,next|valAddr,next|
-------------------------------------------------------------------------------
It might be better to have all the entries in the same region, and then have
one more layer of indirection from the bucket region to the separate table
entries region (i.e. even the first entry in a bucket lives in the entries
region, not in the buckets region).
Hash table [entryAddr][entryAddr][entryAddr]
buckets __________|
| ____________________
v | v
Hash table [valAddr,next][valAddr,next][valAddr,next]
overflow ______________________________|
v
Value region [key,allocSize,size,flags,val][key,allocSize,size,flags,val]
However, with separate chaining we don't need extra entries that are always
null, like we had to have with open addressing, so we can make the hash table
entries as big as needed without wasting memory. The value region chunk itself
can be the entry:
Hash table [entryAddr][entryAddr][entryAddr]
buckets __________|
|
v
Value region [hashNext,key,allocSize,size,flags,val][hashNext,key,allocSize,size,flags,val]
(entry zones) |______________________________________^
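A sketch of what packing such a combined entry+chunk header could look like
with pack()/unpack(); the field widths and the format string are illustrative
only, not ShmCache's actual layout:

<?php
// The chunk header carries hashNext in addition to the usual fields, so the
// bucket region only needs to store one entryAddr per bucket.

const CHUNK_FORMAT = 'NhashNext/a250key/NallocSize/Nsize/Nflags';
const CHUNK_HEADER_SIZE = 4 + 250 + 4 + 4 + 4;

function packChunkHeader(int $hashNext, string $key, int $allocSize, int $size, int $flags): string {
    // pack() has no named-field syntax, so the order must match CHUNK_FORMAT
    return pack('Na250NNN', $hashNext, $key, $allocSize, $size, $flags);
}

function unpackChunkHeader(string $raw): array {
    return unpack(CHUNK_FORMAT, $raw);
}

// Following a bucket's chain means reading the next chunk header at hashNext
$raw = packChunkHeader(0, 'mykey', 1024, 300, 0);
var_dump(unpackChunkHeader($raw)['hashNext']); // int(0)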
##############################################################################
# How to implement locking when evicting an item? #
##############################################################################
Current situation:
Hash table _A__ _B__
buckets |____|____|
______| |_________
| |
Entry |____v__ ______|__________ _v_|
zones |_______|______|__________|___|
|Zone|X ^ |Zone Y ^ | |
| | ^ | |
| | | | |
| ringBufferPtr | |
|_____|____________| |
|________________|
We have these locks:
- Each bucket has its own lock
- Each zone has its own lock
- Ring buffer pointer lock
Example steps when trying to update an item in bucket A and in zone X with
a new value, and the existing chunk doesn't have enough space for it:
- Lock bucket A and zone X
- Remove existing old chunk in zone X
- Unlock zone X
- Race condition?
- No, because we only use entry->hashNext after unlocking, and the chunk at
entry->hashNext is never moved or removed while bucket A is locked
- RULE 1: to modify anything other than the ok-to-free dirty-bit in
a zone chunk, you need to hold the bucket lock
- RULE 2: to modify _anything_ in zones, you need to hold the zone's lock
- Lock zone Y (oldest zone at ring buffer pointer)
- Remove all items in zone Y
- If we naively lock each item's associated bucket here, we deadlock. See
"Deadlock" below.
- Unlock bucket A
##############################################################################
# Deadlock: bucket->zone vs zone->bucket #
##############################################################################
Hash table _A__ _B__
buckets 1|____|____|2______
_3____| ^ |5
| |___6__ |
Entry |____v__ ______|________|_ _v_|
zones |_______|______|__________|___|
|Zone|X |Zone Y ^ |
| ^ |
| | |
| ringBufferPtr |
|__________________|
4
Summary:
- Process 1 has locks A->Y, and wants locks A->Y->B
- Process 2 has lock B, and wants locks B->Y
Details:
Process 1 wants to set a new value to an existing item. Process 2 wants to
read an existing item.
1 P1: LOCK bucket A
2 P2: LOCK bucket B
3 P1: LOCK zone X
- P1: new value doesn't fit into existing chunk in zone X
- P1: find zone for entry->hashNext (zone Y)
- P1: free existing chunk in zone X
- P1: UNLOCK zone X
- P1: unlink entry from bucket A
- P1: LOCK zone for entry->hashNext (zone Y)
- P1: set bucket A head to be entry->hashNext->hashNext
- No race condition: zone Y's chunk at entry->hashNext->hashNext can't have
moved or been removed after unlocking X, as we still have lock A (see RULE 1)
- P1: UNLOCK zone Y
- P1: LOCK ringBufferPtr (for oldest zone eviction)
4 P1: LOCK zone Y (oldest zone via ringBufferPtr)
5 P2: LOCK zone Y
- Waits until P1 unlocks Y below
- P1: evict zone Y to make space for new values (free all chunks)
- P1: free zone Y first chunk
- P1: LOCK bucket A (again; recursive lock makes this OK)
- P1: unlink entry from bucket A (like above)
- P1: UNLOCK bucket A (still holds earlier lock of A)
- P1: free zone Y second chunk
6 P1: LOCK bucket B
- DEADLOCK: P1 has Y, wants B; P2 has B, wants Y
- P1: unlink entry from bucket B
- P1: UNLOCK bucket B
- P1: ++ringBufferPtr
- P1: write new value to zone Y
- P1: UNLOCK zone Y
- P1: UNLOCK ringBufferPtr
- P1: UNLOCK bucket A
- P2: read second chunk of zone Y
- RACE CONDITION due to P1 having freed all chunks; see "RACE CONDITION" below
- However, this is also fixed when the DEADLOCK is fixed; see "TRYLOCK" below
- P2: UNLOCK zone Y
- P2: UNLOCK bucket B
RULE 3: the lock order between a bucket and a zone lock must be
bucket->zone, not zone->bucket. For ringBufferPtr it's ringBufferPtr->zone.
So, lock order:
Bucket lock -> ringBufferPtr lock -> zone lock
RULE 4: multiple zones can't be locked at the same time by a single process.
RULE 5: multiple buckets can be locked at the same time, but _only_ by using
a trylock when trying to lock the 2nd..Nth bucket (see RULE 6). If the trylock
fails, the process must drop its zone lock and ringBufferPtr lock before trying
again.
To fix the DEADLOCK, instead of:
---------------------------------------
P1: LOCK bucket A
P2: LOCK bucket B
P1: LOCK RBP
P1: LOCK zone Y
(for eviction)
P2: LOCK zone Y
P1: LOCK bucket B
Use a TRYLOCK in the other process:
---------------------------------------
P1: LOCK bucket A
P2: LOCK bucket B
P1: LOCK RBP
P1: LOCK zone Y
(for eviction)
P2: LOCK zone Y
P1: TRYLOCK bucket B
(fail)
P1: UNLOCK zone Y
(acquired Y)
P1: LOCK zone Y
(for eviction)
P2: UNLOCK zone Y
P2: UNLOCK bucket B
P1: TRYLOCK bucket B
(success)
P1: UNLOCK bucket B
P1: UNLOCK zone Y
P1: UNLOCK RBP
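With flock(), the TRYLOCK above could look roughly like this (LOCK_NB makes
flock() non-blocking); the file paths, helper names and the callback-based
zone locks are made up for this sketch:

<?php
// The eviction path tries to take the 2nd..Nth bucket lock without blocking,
// and on failure backs off by dropping its zone lock before retrying.

function tryLockBucket(int $bucket) {
    $fp = fopen(sys_get_temp_dir() . "/shmcache-bucket-$bucket", 'c');
    $wouldBlock = 0;
    if (flock($fp, LOCK_EX | LOCK_NB, $wouldBlock)) {
        return $fp;            // Got the lock without waiting
    }
    fclose($fp);
    return null;               // Someone else holds it; caller must back off
}

// Eviction loop (P1): holds the zone lock, needs another bucket's lock too.
// $lockZone() / $unlockZone() stand in for the regular blocking zone locks.
function unlinkEvictedChunk(int $bucket, callable $lockZone, callable $unlockZone): void {
    while (($bucketLock = tryLockBucket($bucket)) === null) {
        // Back off: drop the zone lock so the other process (which locks
        // bucket first, then zone) can finish, then re-acquire and retry
        $unlockZone();
        $lockZone();
    }
    // ... unlink the chunk's hash table entry here ...
    flock($bucketLock, LOCK_UN);
    fclose($bucketLock);
}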
To prevent livelock, only one of the processes should use a trylock, and the
other should use a regular (blocking) lock. Otherwise you'll have this problem:
---------------------------------------
P1: LOCK bucket A
P2: LOCK bucket B
P1: LOCK RBP
P1: LOCK zone Y
(for eviction)
P2: TRYLOCK zone Y
(fail)
P1: TRYLOCK bucket B
(fail)
P2: UNLOCK bucket B
P2: LOCK bucket B
P1: UNLOCK zone Y
P1: LOCK zone Y
(for eviction)
P2: TRYLOCK zone Y
(fail)
P1: TRYLOCK bucket B
(fail)
P2: UNLOCK bucket B
P2: LOCK bucket B
P1: UNLOCK zone Y
P1: LOCK zone Y
(for eviction)
... ad infinitum ...