Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Used a shared reference for target hashes #33

Open
jdidion opened this issue Sep 13, 2023 · 1 comment
Open

Used a shared reference for target hashes #33

jdidion opened this issue Sep 13, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

@jdidion
Copy link
Collaborator

jdidion commented Sep 13, 2023

Thinking more about this - assume the default kmer size of 12. A kmer Vec<u8> (or ByteStr) is 16 bytes (12 u8s + a usize length), while a slice is 8 bytes (a usize pointer + a usize length). That means, with 2 or more threads, it consumes as much or more memory to build a separate set of target hashes for each thread, rather than build one and share it (behind an Arc). It would also make the API cleaner to use Vec<u8>/ByteStr keys as we can get rid of the lifetime parameter. I'm not sure if there will be any impact on performance.

We could also do something fancier - compute the key size (threads * 8 vs k + 4) to decide whether to use a single shared target hashes or one per thread, and put it behind e.g. a Cow (although then we need to keep the lifetime parameter).

@nh13 nh13 added the enhancement New feature or request label Sep 13, 2023
@jdidion
Copy link
Collaborator Author

jdidion commented Sep 13, 2023

Furthermore, TargetHash should switch from using std::collections::HashMap with FxHash to hashbrown (which is inherently faster, and also uses AHash, which is faster than FxHash for string keys).

One cool feature of hashbrown is the ability to dynamically compute keys using RawEntryBuilder. This means that the hash table could store keys as u32 (the index into the sequence). When a new key is added, the string slice associated with that index (for a fixed length k) is used to compute the hash (and compare amongst keys in the case of a collision). In this case, I'd suggest must moving the hash tables into the TargetSeq struct.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants