Concurrent hash table #32
Here's a design which keeps all values in the table itself rather than behind a pointer. (EDIT: there is actually a flaw with this design, which could allow the same key to be inserted twice in the map; see below.) This is a very efficient layout since it doesn't suffer from pointer indirection; in effect, it would look just like hashbrown's existing table layout. Once an element has been added to the table it can be marked as erased, but the value itself will only be dropped when the table is destroyed (or when the table grows and the old table is GCed). The table will use 8-bit control bytes, just like hashbrown.
Lookup is wait-free: just ignore the top bit when scanning the table. Insertion first tries to decrement the table's capacity counter, and resizes the table if the capacity is exhausted. It then searches for an empty control byte, CASes it to "reserve" the slot, writes the value to the data slot, and finally CASes the control byte again to "publish" the element. If at any point the resize bit is set, the operation is aborted. Deletion is even simpler: just perform a CAS to mark the control byte as empty, again aborting if the resize bit is set. Aborted operations basically block until the resize has finished and then retry. Resizing walks the entire table and sets the resize bit on every element while cloning them to a new table. After the resize is complete, any threads waiting on it are woken up.
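Here is a minimal sketch of that reserve/publish handshake on a single slot. The control-byte values (EMPTY, RESERVED, FULL) and the resize bit are assumptions made up for illustration, not hashbrown's actual encoding, and a real table would scan whole groups of control bytes with SIMD rather than one slot at a time:

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicU8, Ordering};

// Hypothetical control-byte encoding (not hashbrown's real one). The top
// bit is the "resize in progress" flag: once it is set, slots no longer
// match the clean states below, so every CAS fails and the operation aborts.
const EMPTY: u8 = 0x00;
const RESERVED: u8 = 0x01;
const FULL: u8 = 0x02;

struct Slot<T> {
    ctrl: AtomicU8,
    value: UnsafeCell<Option<T>>,
}

// Safety: the reserve/publish protocol ensures a single writer per slot.
unsafe impl<T: Send + Sync> Sync for Slot<T> {}

impl<T> Slot<T> {
    /// Try to insert `value` into this slot. Gives the value back if the
    /// slot is occupied or a resize is in progress.
    fn try_publish(&self, value: T) -> Result<(), T> {
        // Step 1: "reserve" the slot (EMPTY -> RESERVED). This fails both
        // when the slot is occupied and when the resize bit is set.
        if self
            .ctrl
            .compare_exchange(EMPTY, RESERVED, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            return Err(value);
        }
        // Step 2: write the value; the reservation gives us exclusive access.
        unsafe { *self.value.get() = Some(value) };
        // Step 3: "publish" (RESERVED -> FULL) so lookups can see the entry.
        // This only fails if a resize started meanwhile; a real table would
        // then wait for the resize to finish and retry in the new table.
        self.ctrl
            .compare_exchange(RESERVED, FULL, Ordering::Release, Ordering::Relaxed)
            .map(|_| ())
            .map_err(|_| unsafe { (*self.value.get()).take().unwrap() })
    }
}
```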
Interesting! What about insertion when the same key already has a value in the map?
(Aside: While reading this I was wondering if we could get away with moves instead of
An easy way to get rid of the

As for insertion when a key already exists, I forgot to mention that part: essentially, after a successful insertion, you need to perform another scan and delete all equal elements that come "before" the newly inserted one.
If we are careful with the memory orderings then we should be able to ensure that a thread performing lookups will never see a "gap" where the item it is looking for is not found.
@Amanieu Interesting, that sounds like a pretty good design. I wonder, would it make sense to make resizing lock-free like in Cliff Click's hash table (but maybe not in the first release of the hash table)? Another concern I have is what if we have a

Do you have an opinion on adaptive radix trees? The good thing about this tree is that it doesn't need resizing, so we could just have
Hmm, actually, after thinking about it for a bit, my design is broken because it allows 2 concurrent inserts of the same key to create 2 separate entries in the table, so please disregard it. I think the only way to keep using the fast SIMD lookups from SwissTable is to use a sharded lock for inserts & deletes, where the shard is selected based on the hash. This also avoids the need for a resize bit in the control bytes.
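As a rough sketch of the sharded-lock idea, with all names hypothetical: pick the shard from the key's hash and lock only that shard. (In the design described above only inserts and deletes would take the lock; lookups would keep using the lock-free SIMD scan, which this toy version does not attempt.)

```rust
use std::collections::hash_map::RandomState;
use std::collections::HashMap;
use std::hash::{BuildHasher, Hash, Hasher};
use std::sync::Mutex;

/// Toy sharded map: writers only contend with writers on the same shard.
struct ShardedMap<K, V> {
    build_hasher: RandomState,
    shards: Box<[Mutex<HashMap<K, V>>]>,
}

impl<K: Hash + Eq, V> ShardedMap<K, V> {
    fn with_shards(n: usize) -> Self {
        Self {
            build_hasher: RandomState::new(),
            shards: (0..n).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    /// Select a shard based on the key's hash.
    fn shard(&self, key: &K) -> &Mutex<HashMap<K, V>> {
        let mut hasher = self.build_hasher.build_hasher();
        key.hash(&mut hasher);
        &self.shards[hasher.finish() as usize % self.shards.len()]
    }

    fn insert(&self, key: K, value: V) -> Option<V> {
        self.shard(&key).lock().unwrap().insert(key, value)
    }

    fn remove(&self, key: &K) -> Option<V> {
        self.shard(key).lock().unwrap().remove(key)
    }
}

fn main() {
    let map = ShardedMap::with_shards(16);
    map.insert("k", 1);
    assert_eq!(map.remove(&"k"), Some(1));
}
```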
Yes, the tradeoffs would have to be explained in the documentation, since |
Suppose we go with cloning keys & values when resizing the table. What do you think the
I was thinking of returning something along the lines of a
Useful comment with links: crossbeam-rs/crossbeam#41 (comment). It might also be worth looking into https://github.com/efficient/libcuckoo: it's a high-performance concurrent hash table, but it uses locking here and there and locks the whole table during resizes. Apart from the "cuckoo" part, it seems very similar to what we're aiming for here.
In general, I'm biased towards trie structures for this use case since they provide very good performance overall and tend to fare better in many-writer scenarios, whereas hash maps can scale very poorly with many writers. The ART paper is especially compelling in this regard given its implementation simplicity.
Relativistic hash tables from the Linux kernel should be easy to port to Rust, but they require RCU. This hash table is great because it does not block readers during a table resize, so it gives stable, low-latency access.
I would propose using a variant of a Jagged Array™ as the backing structure of a concurrent hash-table. The structure is pretty simple:
Where each pointer points to an array:
This means

Note: I came up with the name, and I am unaware of any prior art. I would definitely appreciate pointers from more knowledgeable people, as I am doubtful that such a simple concept is not already used somewhere.

It is potentially feasible to grow a hash map in a lock-free manner using this structure:
Or, a crazy alternative, never move:
This means: N, 7xN, 56xN, and 448xN, for a total of 512x the initial capacity (1 + 7 + 56 + 448 = 512). Advantages:
Disadvantages:
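The diagrams and the advantage/disadvantage lists for this proposal appear to have been lost in formatting. To illustrate the core mechanic, here is a sketch of indexing into a jagged array whose buckets are published once and never moved, assuming for simplicity that each new bucket doubles the total capacity (N, N, 2N, 4N, ...) rather than the 7x/8x progression above:

```rust
/// Map a global index to (bucket, offset), where bucket 0 holds `n` slots
/// and bucket k >= 1 holds `n * 2^(k-1)` slots. Because buckets are never
/// reallocated, readers can follow `buckets[b][off]` without any lock.
fn locate(index: usize, n: usize) -> (usize, usize) {
    if index < n {
        return (0, index); // bucket 0 covers [0, n)
    }
    let chunk = index / n;
    // bucket k covers [n * 2^(k-1), n * 2^k), so k = floor(log2(chunk)) + 1
    let bucket = (usize::BITS - chunk.leading_zeros()) as usize;
    let start = n << (bucket - 1);
    (bucket, index - start)
}

fn main() {
    let n = 4;
    assert_eq!(locate(3, n), (0, 3)); // still in the first bucket
    assert_eq!(locate(5, n), (1, 1)); // bucket 1 covers [4, 8)
    assert_eq!(locate(9, n), (2, 1)); // bucket 2 covers [8, 16)
    assert_eq!(locate(16, n), (3, 0)); // bucket 3 covers [16, 32)
}
```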
BTW: I am not sure that cloning is such a good idea if handing out references. For example, while

This can be solved in many ways, but memory-stable elements are probably the easiest solution. Wrapping them in
This isn't really a problem in practice since
In case people are not aware of quiescent-state-based reclamation (QSBR): here is a concurrent map implemented in C++ on top of QSBR that outperforms libcuckoo: https://preshing.com/20160222/a-resizable-concurrent-map/ When implementing such primitives, I would also recommend looking at the state of the art in game development.
Isn't QSBR just epoch-based reclamation? It may be necessary, but putting the onus on the user to find a good place to bump the epoch on each thread is not as ergonomic as handling it for them automatically. Unless it is impossible to come up with an ergonomic design, or the performance cost is too high, I'd favor the more ergonomic approach. Thanks for the link; I think the idea of the "Reclaimed" marker, whether there or in the index table, is an excellent way of facilitating incremental migration, especially in Rust where we can perform a bitwise copy from the old table to the new one and "forget" to call the destructors of the elements in the old table.
They are similar but not the same. On the readers' side, EBR uses something like:
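The snippet itself appears to have been lost in formatting. Reader-side EBR, as implemented by the crossbeam-epoch crate for example, looks roughly like this:

```rust
use std::sync::atomic::Ordering;

use crossbeam_epoch as epoch;

fn reader(shared: &epoch::Atomic<u64>) {
    // "Pin" the current thread: announce that it is inside a critical
    // section so that memory it can still see is not reclaimed.
    let guard = epoch::pin();
    let ptr = shared.load(Ordering::Acquire, &guard);
    if let Some(value) = unsafe { ptr.as_ref() } {
        println!("read {value}");
    }
    // Dropping `guard` un-pins the thread; once every thread has moved
    // past an epoch, objects retired in that epoch can be freed.
}
```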
whereas QSBR identifies quiescent states by having the application report them when convenient. As you pointed out, that is not as ergonomic, but the ease of use of EBR comes at a performance cost. Here is a good reference comparing EBR, QSBR, and hazard pointers: http://www.rdrop.com/users/paulmck/RCU/hart_ipdps06.pdf
@stjepang By the way, while lock-free/wait-free are good properties, they say nothing about contention. For an extreme example, imagine 96 threads (2x24 cores, multithreaded) all concurrently trying to clone the same Arc: every clone is an atomic increment of the same reference count, so all 96 threads hammer the same cache line. Furthermore, still with Arc,

How concerned should we be about such contention/false sharing in the design?
I'd say a "good enough" solution is whatever matches the performance and simplicity of libcuckoo. Let that be our baseline: it has strong academic support, fares well in all published benchmarks, and from what I've heard it has been very successful in industry, becoming the go-to concurrent hash table in C++.

Lookups into the hash table are a bit tricky. Returning a clone of

Libcuckoo API: http://efficient.github.io/libcuckoo/classcuckoohash__map.html
But if
@SimonSapin True, what you're saying makes total sense. Not sure why libcuckoo doesn't return a guard as well... possibly the only reason is that they didn't want to expose the internal locking to the outside world.
Thanks for the libcuckoo code. It seems pretty lock-heavy (using spin-locks, but still). In Rust, the absence of copy constructors should allow optimistic reads, which avoid locking.
A

This would be a very good reason for libcuckoo to avoid it and prefer cloning.
I've been thinking about it, and the stumbling block was deletion. You can allocate
Indeed; there are potentially better ways than locking for reads, though.
I would really prefer NOT to return a guard.

In libcuckoo, the maximum number of locked elements at any time is 2x or 3x the number of threads. With a

Furthermore, a

Rust has a BIG advantage over C++ here: the absence of copy constructors. It means you can copy a block of raw bytes elsewhere, then treat them as a

This same advantage also allows optimistic reads:
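The enumerated steps appear to have been lost in formatting, but the scheme described here is essentially a seqlock. A rough sketch with a hypothetical per-shard version counter follows; memory-ordering subtleties are glossed over, and a real implementation must perform the speculative copy with atomic or volatile reads to avoid a data race:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Writers bump `version` to an odd value before mutating and back to an
/// even value after, so readers can detect a concurrent write and retry.
fn optimistic_read<T: Copy>(version: &AtomicU64, copy_slot: impl Fn() -> T) -> T {
    loop {
        let before = version.load(Ordering::Acquire);
        if before % 2 == 1 {
            continue; // a writer is mid-update; try again
        }
        // Speculatively copy the raw bytes. No copy constructor can run,
        // so a torn copy is harmless as long as we throw it away.
        let value = copy_slot();
        if version.load(Ordering::Acquire) == before {
            return value; // the version did not move: the copy is consistent
        }
        // Otherwise a writer interleaved with our copy; retry.
    }
}
```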
This is relatively simple to implement, I would say, and it offers lock-free reads. In Rust, it also works with any kind of

Note: you may also pick one lock/epoch per element if supporting updates is desired. Updates allow unbounded, and possibly panicking, user code to run, though.
👋 Scala developer here.
Scala uses a TrieMap. In my experience, traversing elements is often much slower on tree-based maps; this might be something to consider. Just out of curiosity, how does libcuckoo's performance compare to Java's ConcurrentHashMap? It might be an apples-vs-oranges comparison, but it's interesting nonetheless. Here is a quick introduction to the design principles of the TrieMap. Feel free to move or delete my post if it's not adding any value 🙂
I was considering the cost of a reader being the last owner of an Arc. It turns out that as long as the concurrent map is generic over the key and value types, and does not enforce

The exact details of such a type do not have to be worked out here; however, I believe this makes a strong case for leaving the user in charge of deciding exactly which type should be stored in the map, rather than arbitrarily using
Thanks everyone for your comments! I still haven't found the time to really dig into all of them, but they are appreciated <3

Just a quick update: @Amanieu is prototyping a concurrent hash table based on hashbrown that uses a combination of locks and epochs. We discussed this during the Rust All Hands and figured it'd be good to provide a low-level concurrent hash table on top of which higher-level hash tables with different tradeoffs can then be built. For example, we could have a higher-level layer that clones
There is a design and a good chunk of a Rust implementation for a concurrent hash table based on nested hash tables, so that reallocation would not block all threads: http://ticki.github.io/blog/an-atomic-hash-table/
@Shnatsel: The idea is ingenious; however, following a chain of pointers can lead to poor cache behavior, and repeatedly hashing has an adverse performance impact (or did I misunderstand the seeded hashing?). If looking up in multiple sub-tables is an option, I'd favor simply having N tables, where the table at index

The real challenge, though, is handling deletions and dangling references without putting an undue burden on the user or incurring too much overhead. That's a tough one :/
A very simplistic concurrent hash map based on Abseil's Swiss map: https://greg7mdp.github.io/parallel-hashmap/. It simply uses an array of Abseil's maps, each with its own lock.
I was just reading A Practical Analysis of Rust's Concurrency Story and the brief claims:
I'm the advisor on that project, and I'd say that claim is probably unfounded, the result of the authors being particularly excited about their results :) The design does generally avoid locking, which helps it scale pretty well, but I don't think it's a design that can't be heavily optimized further (for example, it currently uses lock-free linked lists for buckets).
Concurrent hashmap projects that have sprung up in the meanwhile: |
Another example that seems to be more maintained would be evmap. |
evmap separates readers and writers, so it fits different use cases than a single type that implements
Hey! We are making progress on this, I think. Jonhoo has ported the Java ConcurrentHashMap over to Rust, and it is now competing with my https://docs.rs/dashmap/3.5.1/dashmap and some other implementations. Jonhoo and I are putting together benchmarks of all of these using a port of the libcuckoo benchmark suite.
I will make sure to post here once they are finished. |
An initial benchmark suite covering the current concurrent-map implementations, based on a port of the libcuckoo benchmark harness, is available here. I would refrain from drawing conclusions from it right now, though: I still have a few more specialized benchmarks to implement, DashMap is receiving a big rework soon that improves performance significantly, and we are investigating Flurry's bad performance.
I ran the same benchmarks against libcuckoo and the experimental, unreleased DashMap, and I was surprised to see libcuckoo heavily outperformed. I still have lots of exciting optimizations to make, so this is really promising.
Have you tried
I have indeed; performance improved a bit, but the general upward trend remained the same. The reason is that the workload also performs a small amount of write updates to simulate real-world scenarios.
What is the state of this issue? Is
https://github.com/xacrimon/dashmap is a fast and production-ready concurrent hashmap implementation.
This is just a brainstorming issue for concurrent hash tables.

@Amanieu, do you think we could make hashbrown concurrent by having a lock per bucket? When resizing the table we'd probably have to lock all buckets. That sounds like a rather straightforward approach for building a totally generic concurrent hash table without much fuss.

I'm sure we could devise a faster hash table for more specialized use cases. Apart from a fully generic HashMap<K, V>, we'd probably want to build a special HashMap<K: Copy, V> too that can avoid locks in some places.

Thoughts? What do you think would be the best way forward here? Ideally something not overly ambitious that at least gets us started. :)
cc servo/servo#22334
cc @SimonSapin