This file tells the brainstormy/rambly story of planning more granular locking
for ShmCache. Locking the whole memory (hash table region and value region) for
all operations is currently really slow whenever multiple processes try to do
operations in parallel.
##############################################################################
# Hash table with open addressing #
##############################################################################
.........................................................................
| Hash table global lock |
`````````````````````````````````````````````````````````````````````````
Hash table region
0-------1-------2-------3-------4-------5-------6-------7-------8-------9
|valAddr|valAddr|valAddr|valAddr|valAddr|valAddr|valAddr|valAddr|valAddr|
-------------------------------------------------------------------------
__________________|
| _____ Ring buffer
| | pointer
Value region v v
0----------------------------1----------------------------2-----------------------------
|key,allocSize,size,flags,val|key,allocSize,size,flags,val|key,allocSize,size,flags,val|
----------------------------------------------------------------------------------------
| | |
v v v
........................................................................................
| Lock X | Lock Y |
````````````````````````````````````````````````````````````````````````````````````````
Each lock is associated with multiple values, because we're using flock(), and
we want to limit the number of lock files.
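As a rough sketch of that "few lock files, many values" idea: striped flock()
locks, where many value chunks map onto a small fixed number of lock files.
The paths, stripe count and function names below are made up for illustration,
not actual ShmCache code.

<?php
// Minimal sketch of lock striping with flock(); names and paths are hypothetical.
const LOCK_STRIPES = 8;

function lockPathForValue(int $valueIndex): string {
    // Many value chunks share one of a few lock files
    return sys_get_temp_dir() . '/shmcache-lock-' . ($valueIndex % LOCK_STRIPES);
}

function acquireValueLock(int $valueIndex, bool $exclusive = true) {
    $fp = fopen(lockPathForValue($valueIndex), 'c');
    if (!$fp || !flock($fp, $exclusive ? LOCK_EX : LOCK_SH)) {
        return null;
    }
    return $fp; // Keep the handle around; releasing the lock needs it
}

function releaseValueLock($fp): void {
    flock($fp, LOCK_UN);
    fclose($fp);
}

// Values 1 and 9 map to the same stripe, so they block each other
$lock = acquireValueLock(9);
// ... read/write the value region chunk here ...
releaseValueLock($lock);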
Example steps for set() (a short probing sketch follows these steps):
- Compute hash table bucket index (4)
- Lock hash table for read
- Lock X for write
- Read key from value 0
- Key doesn't match; release lock X
- Read next hash table entry (index 5)
- Lock Y for write
- Read key from value 2
- Key doesn't match; release lock Y
- Read next hash table entry (index 6)
- Lock X for write
- Read key from value 1
- Key matches
- If new value fits into allocSize of value 1
- Release hash table lock
- Update value size
- Write new value
- If new size leaves a large-enough gap, split value into new value and free space
- Release lock X
- Else
- Release hash table lock
- (race condition???)
- Lock hash table for write
- (deadlock, because locking value first, then hash table???)
- Can fix deadlock and race condition by locking initially for write, but
that loses read-parallelism
- Remove old value
- Set value size to 0 (free)
- Set hash table entry valAddr to 0 (free)
- Merge this space with adjacent free space
- Acquire next value space's lock for write
- Read next value
- If next value is free, merge it to this one
- Continue until non-free value is found
- Release all free spaces' locks (keep a stack)
- Release lock X
- (race condition???)
- Lock ring buffer location (lock Y) for write
- Read value size at ring buffer location
- If not large enough for value
- Lock ringbuffer+1..N for write until we have enough space for the new value
- Remove all old values
- Merge now-free space into one chunk
- Lock ringbuffer+N+Nsize-(leftover space); split the chunk into new value and leftover free space
- Release all locks Y->
- Release hash table lock
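A minimal sketch of the lookup part of the steps above: hash the key to a
bucket index, then probe forward with wraparound until the key or a NULL
entry is found. A plain PHP array stands in for the shared-memory hash table
region here; in ShmCache the reads would go through shmop_read()/unpack(),
and the locking described in the steps above is left out.

<?php
// Open-addressing lookup sketch; the $readValueKey callback stands in for
// reading the key from the value region chunk at the given address.

function findBucketIndex(string $key, array $hashTable, callable $readValueKey): ?int {
    $bucketCount = count($hashTable);
    $index = crc32($key) % $bucketCount;

    for ($i = 0; $i < $bucketCount; $i++) {
        $probe = ($index + $i) % $bucketCount; // Wrap around the table end
        $valAddr = $hashTable[$probe];

        if ($valAddr === null) {
            return null;                       // Hit a free slot: key not stored
        }
        if ($readValueKey($valAddr) === $key) {
            return $probe;                     // Found the entry for this key
        }
        // Key mismatch: keep probing the next slot
    }
    return null; // Table full of other keys (shouldn't happen below 100% load)
}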
*********** Why use a global lock for the hash table? ************************
It's difficult/impossible to implement bucket-specific locks with open
addressing because of deadlocks.
Example:
.........................................
| Lock A | Lock B |
`````````````````````````````````````````
^ ^ ^ ^ ^
| | | | |
0-------1-------2-------3-------4--------
|valAddr| NULL |valAddr|valAddr|valAddr|
-----------------------------------------
Note that there is always at least one NULL in this table, as the table must
never reach a 100% load factor when using open addressing.
- Find table index for key "foo"
- Hash "foo" to 2
- Acquire lock A
- Iterate to 3 (the correct location)
- Try to acquire lock B
- Find table index for key "bar"
- Hash "bar" to 4
- Acquire lock B
- Iterate to 0 (the correct location)
- Try to acquire lock A
- Deadlock
- "foo" has A, tries to lock B
- "bar" has B, tries to lock A
Solutions:
- Require locks to always be acquired in a certain order (A->B)
- Not realistic, as "bar" would beed to acquire A and all preceding locks before B
- Drop any locks you're holding, if you're trying to acquire a lock
- Doesn't allow us to keep locks of multiple entries, which would be required
when merging (non-free) value region chunks
- Make each hash table entry have its own lock. The presence of at least one
NULL in the table ensures that locks are always acquired left-to-right.
- Would probably require us to create 10k..100k files for flock() or sem_get().
Would we hit number-of-open-files limits?
- TODO: this doesn't actually fix the problem; delete this
******************************************************************************
Open addressing doesn't seem to be a good solution in terms of having more
granular locks for the hash table entries. The root problem is that one
bucket's entries can be associated with multiple locks, so to lock a single
bucket, you might need multiple locks, and the order in which those are
acquired is difficult to enforce, leading to deadlocks. Implementing the
hash table with separate chaining is a solution (see the next section).
When planning a locking hierarchy, it looks like you should never keep two
locks of the same level of hierarchy (where the hash table is one level of
hierarchy, and the value memory region is another level) at the same time, or
if you do, you need to be 100% sure that the locks are always acquired in the
same order (A->B->C). It seems a bit easier to keep multiple locks when all of
them are in different levels of hierarchy, as usually those locks are acquired
in the same order.
(These deadlocks occur because of _lock-order inversion_: two processes acquire
the same locks in opposite orders.)
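A small sketch of that "fixed acquisition order within one hierarchy level"
rule, built on flock() lock files like the earlier sketch; the helper names
and paths are made up:

<?php
// When a process needs several locks from the same hierarchy level, it sorts
// the lock indexes and acquires them in ascending order, so two processes can
// never each hold the lock the other one wants next.

function acquireManyInOrder(array $lockIndexes): array {
    sort($lockIndexes);                 // Fixed global order prevents inversion
    $handles = [];
    foreach (array_unique($lockIndexes) as $i) {
        $fp = fopen(sys_get_temp_dir() . "/shmcache-lock-$i", 'c');
        flock($fp, LOCK_EX);            // May block, but never in a cycle
        $handles[$i] = $fp;
    }
    return $handles;
}

function releaseMany(array $handles): void {
    // Release in reverse order (not strictly required, but conventional)
    foreach (array_reverse($handles, true) as $fp) {
        flock($fp, LOCK_UN);
        fclose($fp);
    }
}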
##############################################################################
# Hash table with separate chaining #
##############################################################################
With separate chaining we can have one lock fully wrap _all_ of the entries
in a single hash table bucket, unlike with open addressing, because separate
chaining gives each bucket and its entries a clear hierarchy.
...............................................................................
| Lock A | Lock B
```````````````````````````````````````````````````````````````````````````````
^ ^ ^ ^ ^
Hash table | | | | |
buckets region | | | |
0------------1------------2------------3------------4------------5-------------
|valAddr,next|valAddr,next|valAddr,next|valAddr,next|valAddr,next|valAddr,next|
-------------------------------------------------------------------------------
|___ ________________
Hash table | | |
overflow region v | v
0------------1------------2------------3------------4------------5-------------
|valAddr,next|valAddr,next|valAddr,next|valAddr,next|valAddr,next|valAddr,next|
-------------------------------------------------------------------------------
______________________|
| _____ Ring buffer
| | pointer
Value region v v
0----------------------------1----------------------------2-----------------------------
|key,allocSize,size,flags,val|key,allocSize,size,flags,val|key,allocSize,size,flags,val|
----------------------------------------------------------------------------------------
| | |
v v v
........................................................................................
| Lock X | Lock Y |
````````````````````````````````````````````````````````````````````````````````````````
Each lock is associated with multiple buckets or values, because we're using
flock(), and we want to limit the number of lock files.
Example steps for set():
- Compute hash table bucket index (1)
- Lock A for write
- Lock X for write
- Read key from value 0
- Key doesn't match; release lock X
- Read next hash table entry from overflow region (index 2)
- Lock Y for write
- Read key from value 2
- Key doesn't match; release lock Y
- Read next hash table entry from overflow region (index 4)
- Lock X for write
- Read key from value 1
- Key matches
- If new value fits into allocSize of value 1
- Release lock A
- Update value size
- Write new value
- If new size leaves a large-enough gap, split value into new value and free space
- Release lock X
- Else
- Remove old value
- Set value size to 0 (free)
- Set hash table entry valAddr to 0 (free) (OK as we still have lock A)
- Merge this space with adjacent free space
- Acquire next value space's lock for write
- Read next value
- If next value is free, merge it to this one
- Continue until non-free value is found
- Release all free spaces' locks (keep a stack)
- Release lock X
- (race condition???)
- If this is actually a race condition, we can move the lock release
further down, with slightly worse performance
- Lock ring buffer location (lock Y) for write
- Read value key and size at ring buffer location
- note: to prevent a deadlock, we can't just temporarily release lock Y here,
lock the hash table bucket first, and then re-lock the value, because that
causes a race condition
- some other process could, for example, split the value at the ring buffer
location while Y is released, so we'd have to re-read the value after
re-acquiring lock Y, or we'd be working with a stale copy of that chunk's size
- Lock hash table bucket for value's key at ring buffer's location, if value is non-free
- (deadlock??? locking hash table bucket _after_ value)
- There seems to be no way to lock the hash table bucket first without first
locking and reading the value's key. The only idea I have is to introduce
a global everything-blocking lock for cases like this. A global lock could
work if the current shmop_read() and unpack() calls were faster (could try
to optimize these).
- If not large enough for value
- Lock ringbuffer+1..N for write until we have enough space for the new value
- Remove all old values at ringbuffer+0..N
- Merge now-free space into one chunk
- Lock ringbuffer+N+Nsize-(leftover space); split the chunk into new value and leftover free space
- Release all locks Y->
- Release lock A
Separate chaining should make the hash table locking more performant.
We currently use a ring buffer where values are (mostly) in insertion order
(FIFO).
When removing oldest items in the ring buffer, we need to lock a hash table
entry, but can only do that after locking a value, which results in deadlocks.
Hash table entries must always be locked _before_ locking values.
Maybe some other value memory region architecture makes locking easier.
- How do you easily find/create a new memory chunk for a new value without
complicated locking of old values and their hash table entries?
- Hash table currently contains pointers to "structs" in value memory region
- Clearing a value "struct" without clearing the pointer creates a dangling pointer
- Need to always clear the pointer first, then the struct
- A memory pool or slab allocation, i.e., fixed-size block allocation, would
help, as there always exists a chunk large enough for a value, and no
merging of oldest value chunks would have to be made
- What other ways are there to keep track of oldest cache items?
- A separate LRU linked list (like Memcached does)
- A separate FIFO queue (ring buffer) that points to values that are spread
across the value region in non-insertion-order
- Doesn't allow easily merging two adjacent value region chunks
- Maybe this queue contains groups of value chunks, and not the chunks
themselves; each group only contains _contiguous_ value chunks
- In case we still use dynamic-size block allocation, how do you merge multiple
value chunks while keeping a lock for the shortest possible duration?
- A separate LRU/FIFO list is problematic, as adjacent queue items don't
always point to adjacent value region chunks
- Optimize the current merging code
- Don't merge at all: always keep a free chunk of the largest possible item size available.
- Is there an easy way to always keep at least one chunk of the largest possible
item size available?
- A naive implementation would just delete and merge oldest items until there
is enough space, like we do now. This is difficult in terms of locking.
- Make each chunk a fixed size (largest possible item size). This causes huge
internal fragmentation.
- Different slabs for most popular chunk sizes fixes this
- Can we use swaps (moves) to make something easier? Swap two value chunks.
Swaps don't require pack()/unpack().
A realization: in C you can map structs in shared memory, and when some
member's data changes in the shared memory block, you can just read the new
value with obj->member. Our way of using PHP's unpack() is a lot slower and
creates a thick, slow layer between PHP and shared memory, where we basically
always read copies of the shared memory, and not the shared memory directly.
This relates to the note above about temporarily releasing lock Y and then
re-reading the value region chunk.
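For reference, this is roughly what that shmop_read()+unpack() layer looks
like; the header format and field names here are just an illustration, not
ShmCache's actual layout:

<?php
// Every field access copies bytes out of shared memory and decodes them in
// PHP, unlike a C struct pointer into the mapped block.

const CHUNK_HEADER_FORMAT = 'a32key/NallocSize/Nsize/Nflags'; // 32-byte key + 3 uint32s
const CHUNK_HEADER_SIZE   = 32 + 4 + 4 + 4;

function readChunkHeader($shmId, int $chunkOffset): array {
    // shmop_read() returns a *copy* of the bytes; unpack() decodes that copy
    $raw = shmop_read($shmId, $chunkOffset, CHUNK_HEADER_SIZE);
    return unpack(CHUNK_HEADER_FORMAT, $raw);
}

// If another process modifies the chunk after this call, the decoded array is
// stale, which is why the value has to be re-read after re-acquiring a lock.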
##############################################################################
# Not locking the hash table for the whole duration of value modifications #
##############################################################################
Do we actually need to lock the hash table entry (or its wrapping bucket) for
the whole duration of modifying the entry's associated value?
Tl;dr: yes, because there are subtle race conditions that must be prevented by
locking all properties (hash table entry, value region chunk) of a cache item
for the whole duration of an operation.
...............................................................................
| Lock A | Lock B
```````````````````````````````````````````````````````````````````````````````
^ ^ ^ ^ ^
Hash table | | | | |
buckets region | | | |
0------------1------------2------------3------------4------------5-------------
|valAddr,next|valAddr,next|valAddr,next|valAddr,next|valAddr,next|valAddr,next|
-------------------------------------------------------------------------------
|_______________
| _____ Ring buffer
| | pointer
Value region v v
0----------------------------1----------------------------2-----------------------------
|key,allocSize,size,flags,val|key,allocSize,size,flags,val|key,allocSize,size,flags,val|
----------------------------------------------------------------------------------------
| | |
v v v
........................................................................................
| Lock X | Lock Y |
````````````````````````````````````````````````````````````````````````````````````````
Example steps for set():
- Compute hash table bucket index (1)
- Lock A for (!) read
- Read value address from hash table entry 1
- (!) Release lock A
- Possible changes in another process:
- add() another cache item
- Lock A for write
- Set "next" to some address in the separate hash table overflow region
- Release lock A
- remove() the cache item
- Lock A for write
- Set hash table entry 1 valAddr to NULL
- Release lock A
- Lock X for write
- Set value chunk 1 size to 0 (free)
- Release lock X
- Change valAddr of hash table entry 1 to point to value 2
- Set valAddr of entry 1 to NULL
- Lock X for write
- Read value chunk 1 and see that its key matches
- If existing chunk is big enough for new value
- Write new value to existing chunk
- If existing chunk now has enough free space, split the second half into free space
- Else
- Remove value chunk 1
- Lock A for write
- Set hash table entry 1 valAddr to NULL
- Set bucketPreviousEntry->next to this->next
- Release lock A
- Lock Y+0..N for write
- If the already-acquired lock X is among these, it's OK, as we'll use recursive locks
- Remove values after the ring buffer pointer, as usual, in a loop:
- Remove a value chunk
- Lock hash table bucket for write
- Set its associated hash table entry valAddr to NULL
- Release hash table bucket lock
- Release locks Y+0..N
- Release lock X
There's at least one possible race condition:
- Process A
- Update value with a new value
- Update value size with new size
- Process B
- Remove hash table entry (set valAddr to NULL)
- Set value size to 0 (free)
The problem occurs if those steps are run in this order:
- A: Read valAddr from hash table entry
- B: Set hash table entry valAddr to NULL
- B: Update size from 456 to 0
- A: Update size from 0 to 123
The hash table thinks the cache item is removed, but the value memory doesn't.
If the hash table entry is locked for the whole duration of the operation, it
essentially acts as a lock on the whole cache item:
hash_table_entry_lock {
cache_item {
hash_table_entry
value_chunk
}
}
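A minimal sketch of that conclusion: the bucket lock is held for the whole
set(), so the hash table entry and its value chunk change as one atomic cache
item. The lock file path and bucket count are made up, and the actual
entry/chunk manipulation is abstracted into a callback:

<?php
// The bucket lock wraps the entire operation, so the entry cannot be unlinked
// or repointed while we modify the value chunk it refers to.

const BUCKET_COUNT = 16;

function lockFile(string $path) {
    $fp = fopen($path, 'c');
    flock($fp, LOCK_EX);
    return $fp;
}

function unlockFile($fp): void {
    flock($fp, LOCK_UN);
    fclose($fp);
}

function setItem(string $key, callable $updateEntryAndChunk): void {
    $bucket = crc32($key) % BUCKET_COUNT;
    $bucketLock = lockFile(sys_get_temp_dir() . "/shmcache-bucket-$bucket");
    try {
        // Hash table entry and value chunk stay consistent under this lock
        $updateEntryAndChunk();
    } finally {
        unlockFile($bucketLock); // Only now can other processes see the item
    }
}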
##############################################################################
# What features should the memory allocator have? #
##############################################################################
- No/minimal external fragmentation
- Minimal internal fragmentation
- Fast insertion of data
- Fast freeing of chunks
- Granular locking (don't lock the whole value region for write operations)
- No merging of chunks (thus not requiring keeping multiple chunk locks
simultaneously, which prevents some deadlocks)
- However, we can already prevent these deadlocks by acquiring each chunk's
lock always in the same order, which is easy to do. Still, it's simpler
to just need one value chunk lock when removing a cache item.
Memory allocation strategies:
- A memory pool: https://en.wikipedia.org/wiki/Memory_pool
- Probably the same thing as Memcached's slab allocator in practice
- A slab allocator like in Memcached, where memory is partitioned into
fixed-size (1MB) slabs, which contain as many fixed-size
(1MB, 512kB, ..., 80 b) chunks as can fit into one slab. Locking can be done
per slab, which locks multiple chunks, and therefore multiple cache items
with one lock.
- This is basically half fixed-size allocator, half variable-size allocator
How to make removing values fast:
- Free-list: a linked list to keep track of free value chunks and their sizes
- Some kind of a fixed-size or step-sized memory layout for not having to loop
through the whole free-list or oldest items list to find a proper size block
- Slab allocator (Memcached)
- Same principle as in a hash table: buckets with their own free-lists
Note that splitting value chunks is easy, but merging them is hard. A slab
allocator doesn't split chunks, and instead has large-enough chunks for any
value size (up to max size), which is simple but allows internal fragmentation.
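A small sketch of Memcached-style size classes, to make the internal
fragmentation trade-off concrete; the base size, growth factor and 1 MB cap
are illustrative values:

<?php
// Chunk sizes grow by a fixed factor, and a value goes into the smallest
// class it fits into; the unused tail of that chunk is internal fragmentation.

function buildSizeClasses(int $base = 80, float $factor = 1.25, int $max = 1048576): array {
    $classes = [];
    for ($size = $base; $size < $max; $size = (int)ceil($size * $factor)) {
        $classes[] = $size;
    }
    $classes[] = $max;
    return $classes;
}

function pickSizeClass(int $valueSize, array $classes): ?int {
    foreach ($classes as $chunkSize) {
        if ($valueSize <= $chunkSize) {
            return $chunkSize; // Smallest class that fits; the rest is wasted
        }
    }
    return null; // Larger than the largest possible item
}

// Example: a 101-byte value lands in the 125-byte class -> 24 bytes wasted
$classes = buildSizeClasses();
printf("101 bytes -> %d-byte chunk\n", pickSizeClass(101, $classes));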
Why is merging chunks hard?
- One or both of the merged chunks might be in use (pointed to by a hash table entry)
- The pointers need to be NULLed before freeing the value chunks
- Removing and merging currently takes a lot of time (partly due to difficulty
in implementing safe granular locking)
***** Idea: a variable-size allocator wrapped by a fixed-size allocator ******
|__ ____ _ ____|_________ ____|__ _ _________
|__|____|_|____|_________|____|__|_|_________|
|Zone 0 |Zone 1 |Zone 2
^
|
ringBufferPtr
Group chunks into zones (pages) like Memcached does, but when removing oldest
cache items, don't remove a single chunk, but the whole zone. This gives space
for any size value (up to the largest possible cache item value). A fixed-size
zone can act as a mini-region inside which we can do variable-size allocations.
When inserting a new value, insert it into the newest zone that has enough
space, so that the value isn't evicted soon afterwards when an old zone is
freed to make space for new values.
The whole value memory region can be a ring buffer, with a pointer pointing to
the newest zone (and the oldest zone can easily be found with
ringBufferPtr - ZONE_SIZE).
A global free-list keeps track of either the free space of zones, or the free
space of the chunks in zones. If it keeps track of only the free space of
a whole zone, then the zone is essentially a stack, and all chunks inside it
must be contiguous.
If a zone is a stack, a stack pointer points to the start of free space inside
the zone. Freeing a chunk would require moving memory, though, and PHP has no
shmop_move() function. Consider some options:
- A zone is not a stack, but contains variable-size chunks anywhere in it
- Instead of the global free-list, each zone contains its own free-list,
like Memcached
- A zone is a stack, but when freeing a chunk, and there is an adjacent
non-free chunk to the right of it, don't free it, and mark it as "dirty".
When the chunk to the right of it is freed, also free this chunk. In short,
only free a chunk when it's the top chunk in the non-free stack.
I like the second one for its simplicity.
This whole mechanism gives us:
- Minimal/some fragmentation (each zone is filled with variable-size chunks;
some internal fragmentation in terms of having unused space in zones, but no
internal fragmentation in terms of the top-level allocator, which has
fixed-size zones)
- Fast insertion of data (write the data and advance a zone's internal
free-space-starts-here pointer)
- Fast freeing of chunks (a whole zone is freed at once; basically works like
a zone allocator)
- Granular locking (per zone)
- No merging of chunks (if no chunk large enough is found, we free the oldest
zone and therefore all chunks in it)
******************************************************************************
A problem remains: how to quickly, and with minimal locking, remove all
corresponding hash table entries, when a zone's chunks are freed.
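Setting that problem aside, here's a minimal sketch of the zone idea itself:
fixed-size zones, bump-pointer (stack) allocation inside the newest zone, and
whole-zone eviction at the ring buffer pointer. Sizes and names are
illustrative, and clearing the evicted chunks' hash table entries is left out.

<?php
const ZONE_SIZE  = 1048576; // 1 MB zones
const ZONE_COUNT = 8;

class ZoneAllocator {
    private array $zoneUsed;        // Bytes used in each zone (its stack pointer)
    private int $ringBufferPtr = 0; // Index of the newest zone

    public function __construct() {
        $this->zoneUsed = array_fill(0, ZONE_COUNT, 0);
    }

    /** Returns [zoneIndex, offsetInZone] for a new chunk of $size bytes. */
    public function allocate(int $size): array {
        if ($size > ZONE_SIZE) {
            throw new InvalidArgumentException('Larger than the largest possible item');
        }
        // Fill the newest zone first, so fresh values aren't evicted soon after
        if ($this->zoneUsed[$this->ringBufferPtr] + $size > ZONE_SIZE) {
            $this->evictOldestZone();
        }
        $zone   = $this->ringBufferPtr;
        $offset = $this->zoneUsed[$zone];
        $this->zoneUsed[$zone] += $size;   // Advance the zone's stack pointer
        return [$zone, $offset];
    }

    private function evictOldestZone(): void {
        // Advance the ring buffer pointer; the zone it lands on is the oldest
        // one, and all chunks in it are freed at once (no chunk merging)
        $this->ringBufferPtr = ($this->ringBufferPtr + 1) % ZONE_COUNT;
        $this->zoneUsed[$this->ringBufferPtr] = 0;
    }
}

// Usage: allocations fill the newest zone; when it's full, the oldest zone is
// freed wholesale and becomes the new newest zone.
$alloc = new ZoneAllocator();
[$zone, $offset] = $alloc->allocate(300);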
##############################################################################
# How Memcached handles locking #
##############################################################################
typedef struct _stritem {
    /* Protected by LRU locks */
    struct _stritem *next;
    struct _stritem *prev;
    /* Rest are protected by an item lock */
    struct _stritem *h_next;    /* hash chain next */
    rel_time_t      time;       /* least recent access */
    rel_time_t      exptime;    /* expire time */
    int             nbytes;     /* size of data */
    unsigned short  refcount;
    uint8_t         nsuffix;    /* length of flags-and-length string */
    uint8_t         it_flags;   /* ITEM_* above */
    uint8_t         slabs_clsid;/* which slab class we're in */
    uint8_t         nkey;       /* key length, w/terminating null and padding */
    /* this odd type prevents type-punning issues when we do
     * the little shuffle to save space when not using CAS. */
    union {
        uint64_t cas;
        char end;
    } data[];
    /* if it_flags & ITEM_CAS we have 8 bytes CAS */
    /* then null-terminated key */
    /* then " flags length\r\n" (no terminating null) */
    /* then data with terminating \r\n (no terminating null; it's binary!) */
} item;

static void process_delete_command(conn *c, token_t *tokens, const size_t ntokens) {
    char *key = tokens[KEY_TOKEN].value;
    size_t nkey = tokens[KEY_TOKEN].length;
    item *it = item_get(key, nkey, c, DONT_UPDATE);
    if (it) {
        item_unlink(it);
        item_remove(it);      /* release our reference */
    }
}

/*
 * Returns an item if it hasn't been marked as expired,
 * lazy-expiring as needed.
 */
item *item_get(const char *key, const size_t nkey, conn *c, const bool do_update) {
    uint32_t hv = hash(key, nkey);
    item_lock(hv);
    item *it = do_item_get(key, nkey, hv, c, do_update);
    item_unlock(hv);
    return it;
}

/** wrapper around assoc_find which does the lazy expiration logic */
item *do_item_get(const char *key, const size_t nkey, const uint32_t hv, conn *c, const bool do_update) {
    item *it = assoc_find(key, nkey, hv);
    if (it != NULL)
        refcount_incr(it);
    if (it != NULL) {
        was_found = 1;
        if (item_is_flushed(it)) {
            do_item_unlink(it, hv);
            do_item_remove(it);
            it = NULL;
        } else if (it->exptime != 0 && it->exptime <= current_time) {
            do_item_unlink(it, hv);
            do_item_remove(it);
            it = NULL;
        }
    }
    return it;
}

/*
 * Unlinks an item from the LRU and hashtable.
 */
void item_unlink(item *item) {
    uint32_t hv = hash(ITEM_key(item), item->nkey);
    item_lock(hv);
    do_item_unlink(item, hv);
    item_unlock(hv);
}

/*
 * Decrements the reference count on an item and adds it to the freelist if
 * needed.
 */
void item_remove(item *item) {
    uint32_t hv;
    hv = hash(ITEM_key(item), item->nkey);
    item_lock(hv);
    do_item_remove(item);
    item_unlock(hv);
}

void do_item_unlink(item *it, const uint32_t hv) {
    if ((it->it_flags & ITEM_LINKED) != 0) {
        it->it_flags &= ~ITEM_LINKED;
        assoc_delete(ITEM_key(it), it->nkey, hv);
        item_unlink_q(it);
        do_item_remove(it);
    }
}

void assoc_delete(const char *key, const size_t nkey, const uint32_t hv) {
    item **before = _hashitem_before(key, nkey, hv);
    if (*before) {
        item *nxt = (*before)->h_next;
        (*before)->h_next = 0;    /* probably pointless, but whatever. */
        *before = nxt;
    }
}

static void item_unlink_q(item *it) {
    pthread_mutex_lock(&lru_locks[it->slabs_clsid]);
    do_item_unlink_q(it);
    pthread_mutex_unlock(&lru_locks[it->slabs_clsid]);
}

static void do_item_unlink_q(item *it) {
    item **head, **tail;
    head = &heads[it->slabs_clsid];
    tail = &tails[it->slabs_clsid];
    if (*head == it)
        *head = it->next;
    if (*tail == it)
        *tail = it->prev;
    if (it->next) it->next->prev = it->prev;
    if (it->prev) it->prev->next = it->next;
    sizes[it->slabs_clsid]--;
    sizes_bytes[it->slabs_clsid] -= ITEM_ntotal(it);
}

void do_item_remove(item *it) {
    if (refcount_decr(it) == 0)
        item_free(it);
}

void item_free(item *it) {
    size_t ntotal = ITEM_ntotal(it);
    /* so slab size changer can tell later if item is already free or not */
    unsigned int clsid = ITEM_clsid(it);
    slabs_free(it, ntotal, clsid);
}

void slabs_free(void *ptr, size_t size, unsigned int id) {
    pthread_mutex_lock(&slabs_lock);
    do_slabs_free(ptr, size, id);
    pthread_mutex_unlock(&slabs_lock);
}

static void do_slabs_free(void *ptr, const size_t size, unsigned int id) {
    if (id < POWER_SMALLEST || id > power_largest)
        return;
    slabclass_t *p = &slabclass[id];
    item *it = (item *)ptr;
    if ((it->it_flags & ITEM_CHUNKED) == 0) {
        it->it_flags = ITEM_SLABBED;
        it->slabs_clsid = 0;
        it->prev = 0;
        it->next = p->slots;
        if (it->next) it->next->prev = it;
        p->slots = it;
        p->sl_curr++;
        p->requested -= size;
    }
    else {
        do_slabs_free_chunked(it, size);
    }
}
From https://github.com/memcached/memcached/blob/master/doc/threads.txt:
- item_lock must be held while modifying an item.
- slabs_lock must be held while modifying the ITEM_SLABBED flag bit within an item.
- ITEM_LINKED must not be set before an item has a key copied into it.
- items without ITEM_SLABBED set cannot have their memory zeroed out.
LOCK ORDERS:
item_lock -> lru_lock -> slabs_lock
lru_lock -> item_trylock
From thread.c:
/* item_lock() must be held for an item before any modifications to either its
 * associated hash bucket, or the structure itself.
 * LRU modifications must hold the item lock, and the LRU lock.
 * LRU's accessing items must item_trylock() before modifying an item.
 * Items accessible from an LRU must not be freed or modified
 * without first locking and removing from the LRU.
 */
void item_lock(uint32_t hv) {
    pthread_mutex_lock(&item_locks[hv & hashmask(item_lock_hashpower)]);
}

void *item_trylock(uint32_t hv) {
    pthread_mutex_t *lock = &item_locks[hv & hashmask(item_lock_hashpower)];
    if (pthread_mutex_trylock(lock) == 0)
        return lock;
    return NULL;
}
"If successful, pthread_mutex_lock() and pthread_mutex_unlock() return zero"
"pthread_mutex_trylock() returns zero if a lock is acquired"
When removing an item with process_delete_command():
item_get()
  item_lock() <------------------------------------------------------- Lock
  assoc_find()
    Returns a hash table entry like:
    struct {
      *next;      <-- refers to the slab's LRU list's next item
      *prev;      <-- refers to the slab's LRU list's prev item
      *h_next;    /* hash chain next */
      time;       /* least recent access */
      exptime;    /* expire time */
      nbytes;     /* size of data */
      refcount;
      nsuffix;    /* length of flags-and-length string */
      it_flags;   /* ITEM_* above */
      slabs_clsid;/* which slab class we're in */
      nkey;       /* key length, w/terminating null and padding */
      /* Key and data lives somewhere after this struct's end */
    }
  item_unlock() <--------------------------------------------------- Unlock
item_unlink()
  item_lock() <------------------------------------------------------- Lock
  it->it_flags &= ~ITEM_LINKED;
  Remove this item from hash table bucket linked-list
  pthread_mutex_lock(&lru_locks[it->slabs_clsid]); <------------------- Lock
    Remove this item from slab's LRU linked-list
    (See https://github.com/memcached/memcached/blob/master/doc/new_lru.txt)
    if (it->next) it->next->prev = it->prev;
    if (it->prev) it->prev->next = it->next;
  pthread_mutex_unlock(&lru_locks[it->slabs_clsid]); <--------------- Unlock
  item_free()
    pthread_mutex_lock(&slabs_lock); <--------------------------------- Lock
      slabclass_t *p = &slabclass[it->slabs_clsid];
      it->it_flags = ITEM_SLABBED;
      it->slabs_clsid = 0;
      it->prev = 0;
      it->next = p->slots;
      if (it->next) it->next->prev = it;
      p->slots = it;
    pthread_mutex_unlock(&slabs_lock); <----------------------------- Unlock
  item_unlock() <--------------------------------------------------- Unlock
##############################################################################
# Simpler hash table with separate chaining #
##############################################################################
A sidenote: it might get a little awkward to refer to hash table entries when
some of the entries are in the main hash table bucket region, and some are in
the hash table overflow region, like I initially mentioned above when introducing
separate chaining:
Hash table
buckets region
0------------1------------2------------3------------4------------5-------------
|valAddr,next|valAddr,next|valAddr,next|valAddr,next|valAddr,next|valAddr,next|
-------------------------------------------------------------------------------
|___ ________________
Hash table | | |
overflow region v | v
0------------1------------2------------3------------4------------5-------------
|valAddr,next|valAddr,next|valAddr,next|valAddr,next|valAddr,next|valAddr,next|
-------------------------------------------------------------------------------
It might be better to have all the entries in the same region, and then have
one more layer of indirection from the bucket region to the separate table
entries region (i.e. even the first entry in a bucket lives in the entries
region, not in the buckets region).
Hash table [entryAddr][entryAddr][entryAddr]
buckets __________|
| ____________________
v | v
Hash table [valAddr,next][valAddr,next][valAddr,next]
overflow ______________________________|
v
Value region [key,allocSize,size,flags,val][key,allocSize,size,flags,val]
However, with separate chaining we don't need extra entries that are always
null, like we had to have with open addressing, so we can make the hash table
entries as big as needed without wasting memory. The value region chunk itself
can be the entry:
Hash table [entryAddr][entryAddr][entryAddr]
buckets __________|
|
v
Value region [hashNext,key,allocSize,size,flags,val][hashNext,key,allocSize,size,flags,val]
(entry zones) |______________________________________^
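A sketch of what packing such a combined entry+chunk header could look like
with pack()/unpack(); the field widths and the format string are illustrative
only, not ShmCache's actual layout:

<?php
// The chunk header carries hashNext in addition to the usual fields, so the
// bucket region only needs to store one entryAddr per bucket.

const CHUNK_FORMAT = 'NhashNext/a250key/NallocSize/Nsize/Nflags';
const CHUNK_HEADER_SIZE = 4 + 250 + 4 + 4 + 4;

function packChunkHeader(int $hashNext, string $key, int $allocSize, int $size, int $flags): string {
    // pack() has no named-field syntax, so the order must match CHUNK_FORMAT
    return pack('Na250NNN', $hashNext, $key, $allocSize, $size, $flags);
}

function unpackChunkHeader(string $raw): array {
    return unpack(CHUNK_FORMAT, $raw);
}

// Following a bucket's chain means reading the next chunk header at hashNext
$raw = packChunkHeader(0, 'mykey', 1024, 300, 0);
var_dump(unpackChunkHeader($raw)['hashNext']); // int(0)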
##############################################################################
# How to implement locking when evicting an item? #
##############################################################################
Current situation:
Hash table _A__ _B__
buckets |____|____|
______| |_________
| |
Entry |____v__ ______|__________ _v_|
zones |_______|______|__________|___|
|Zone|X ^ |Zone Y ^ | |
| | ^ | |
| | | | |
| ringBufferPtr | |
|_____|____________| |
|________________|
We have these locks:
- Each bucket has its own lock
- Each zone has its own lock
- Ring buffer pointer lock
Example steps when trying to update an item in bucket A and in zone X with
a new value, and the existing chunk doesn't have enough space for it:
- Lock bucket A and zone X
- Remove existing old chunk in zone X
- Unlock zone X
- Race condition?
- No, because we only use entry->hashNext after unlocking, and the chunk at
entry->hashNext is never moved or removed while bucket A is locked
- RULE 1: to modify anything other than the ok-to-free dirty-bit in
a zone chunk, you need to hold the bucket lock
- RULE 2: to modify _anything_ in zones, you need to hold the zone's lock
- Lock zone Y (oldest zone at ring buffer pointer)
- Remove all items in zone Y
- If we naively lock each item's associated bucket here, we deadlock. See
"Deadlock" below.
- Unlock bucket A
##############################################################################
# Deadlock: bucket->zone vs zone->bucket #
##############################################################################
Hash table _A__ _B__
buckets 1|____|____|2______
_3____| ^ |5
| |___6__ |
Entry |____v__ ______|________|_ _v_|
zones |_______|______|__________|___|
|Zone|X |Zone Y ^ |
| ^ |
| | |
| ringBufferPtr |
|__________________|
4
Summary:
- Process 1 has locks A->Y, and wants locks A->Y->B
- Process 2 has lock B, and wants locks B->Y
Details:
Process 1 wants to set a new value to an existing item. Process 2 wants to
read an existing item.
1 P1: LOCK bucket A
2 P2: LOCK bucket B
3 P1: LOCK zone X
- P1: new value doesn't fit into existing chunk in zone X
- P1: find zone for entry->hashNext (zone Y)
- P1: free existing chunk in zone X
- P1: UNLOCK zone X
- P1: unlink entry from bucket A
- P1: LOCK zone for entry->hashNext (zone Y)
- P1: set bucket A head to be entry->hashNext->hashNext
- No race condition: zone Y's chunk at entry->hashNext->hashNext can't have
moved or been removed after unlocking X, as we still have lock A (see RULE 1)
- P1: UNLOCK zone Y
- P1: LOCK ringBufferPtr (for oldest zone eviction)
4 P1: LOCK zone Y (oldest zone via ringBufferPtr)
5 P2: LOCK zone Y
- Waits until P1 unlocks Y below
- P1: evict zone Y to make space for new values (free all chunks)
- P1: free zone Y first chunk
- P1: LOCK bucket A (again; recursive lock makes this OK)
- P1: unlink entry from bucket A (like above)
- P1: UNLOCK bucket A (still holds earlier lock of A)
- P1: free zone Y second chunk
6 P1: LOCK bucket B
- DEADLOCK: P1 has Y, wants B; P2 has B, wants Y
- P1: unlink entry from bucket B
- P1: UNLOCK bucket B
- P1: ++ringBufferPtr
- P1: write new value to zone Y
- P1: UNLOCK zone Y
- P1: UNLOCK ringBufferPtr
- P1: UNLOCK bucket A
- P2: read second chunk of zone Y
- RACE CONDITION due to P1 having freed all chunks; see "RACE CONDITION" below
- However, this is also fixed when the DEADLOCK is fixed; see "TRYLOCK" below
- P2: UNLOCK zone Y
- P2: UNLOCK bucket B
RULE 3: the lock order between a bucket and a zone lock must be
bucket->zone, not zone->bucket. For ringBufferPtr it's ringBufferPtr->zone.
So, lock order:
Bucket lock -> ringBufferPtr lock -> zone lock
RULE 4: multiple zones can't be locked at the same time by a single process.
RULE 5: multiple buckets can be locked at the same time, but _only_ by using
a trylock when trying to lock the 2nd..Nth bucket (see RULE 6). If the trylock
fails, the process must drop its zone lock and ringBufferPtr lock before trying
again.
To fix the DEADLOCK, instead of:
---------------------------------------
P1: LOCK bucket A
P2: LOCK bucket B
P1: LOCK RBP
P1: LOCK zone Y
(for eviction)
P2: LOCK zone Y
P1: LOCK bucket B
Use a TRYLOCK in the other process:
---------------------------------------
P1: LOCK bucket A
P2: LOCK bucket B
P1: LOCK RBP
P1: LOCK zone Y
(for eviction)
P2: LOCK zone Y
P1: TRYLOCK bucket B
(fail)
P1: UNLOCK zone Y
(acquired Y)
P1: LOCK zone Y
(for eviction)
P2: UNLOCK zone Y
P2: UNLOCK bucket B
P1: TRYLOCK bucket B
(success)
P1: UNLOCK bucket B
P1: UNLOCK zone Y
P1: UNLOCK RBP
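With flock(), the TRYLOCK above could look roughly like this (LOCK_NB makes
flock() non-blocking); the file paths, helper names and the callback-based
zone locks are made up for this sketch:

<?php
// The eviction path tries to take the 2nd..Nth bucket lock without blocking,
// and on failure backs off by dropping its zone lock before retrying.

function tryLockBucket(int $bucket) {
    $fp = fopen(sys_get_temp_dir() . "/shmcache-bucket-$bucket", 'c');
    $wouldBlock = 0;
    if (flock($fp, LOCK_EX | LOCK_NB, $wouldBlock)) {
        return $fp;            // Got the lock without waiting
    }
    fclose($fp);
    return null;               // Someone else holds it; caller must back off
}

// Eviction loop (P1): holds the zone lock, needs another bucket's lock too.
// $lockZone() / $unlockZone() stand in for the regular blocking zone locks.
function unlinkEvictedChunk(int $bucket, callable $lockZone, callable $unlockZone): void {
    while (($bucketLock = tryLockBucket($bucket)) === null) {
        // Back off: drop the zone lock so the other process (which locks
        // bucket first, then zone) can finish, then re-acquire and retry
        $unlockZone();
        $lockZone();
    }
    // ... unlink the chunk's hash table entry here ...
    flock($bucketLock, LOCK_UN);
    fclose($bucketLock);
}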
To prevent livelock, only one of the processes should use a trylock, and the
other should use a regular (blocking) lock. Otherwise you'll have this problem:
---------------------------------------
P1: LOCK bucket A
P2: LOCK bucket B
P1: LOCK RBP
P1: LOCK zone Y
(for eviction)
P2: TRYLOCK zone Y
(fail)
P1: TRYLOCK bucket B
(fail)
P2: UNLOCK bucket B
P2: LOCK bucket B
P1: UNLOCK zone Y
P1: LOCK zone Y
(for eviction)
P2: TRYLOCK zone Y
(fail)
P1: TRYLOCK bucket B
(fail)
P2: UNLOCK bucket B
P2: LOCK bucket B
P1: UNLOCK zone Y
P1: LOCK zone Y
(for eviction)
... ad infinitum ...