Replies: 3 comments 2 replies
-
I am also so lost with the KV cache and hoping a maintainer will reply. I have noticed the delta being used indirectly in
Also, I think there is a bug when you generate more tokens than n_ctx. When the KV cache exceeds n_ctx = 4096, you would expect the offset at which new tokens are written into the cache (i.e. kv_head) to go to 1 because of n_keep = 1, or maybe 0 if that entry got shifted too. Instead, the first token after shifting is stored at 3584, and only then does it go to 1. This happens because each time the KV cache graph is built, the memory context is created with a worst_case parameter that is true whenever needs_reserve is true. I'm not sure what needs_reserve really means, but it is true whenever we need to shift.
kv_head is then initialized like this:
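(The actual snippet didn't paste here, so the lines below are a rough reconstruction from memory; treat identifiers like kv_self.size and kv_self.head as my assumptions, not a verbatim quote of the source.)

// when building the graph for the worst case (the reserve pass), kv_head is placed
// so that the batch ends at the last cell of the cache, instead of at the current head
const int32_t kv_head = worst_case ? kv_self.size - n_tokens   // e.g. 4096 - 512 = 3584
                                   : kv_self.head;             // normal decode path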
So for some reason, when we are shifting, n_tokens is 4096 - 3584 = 512. A while ago n_ctx was 512... so I'm stuck in another rabbit hole lol
-
I too find the KV cache manipulation very hard to understand; I'd love an overview and a simplified example of how to manipulate it. I tried reading the source of the main example to see how the context shifting works, and I do not understand it. If anyone knows where I can find a minimal example of context shifting when using llama.cpp as a library, I'd love to know about it!
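For reference, the part of the main example I was trying to follow looks roughly like this (paraphrased from memory, with ctx, n_past and n_keep assumed from the surrounding program; the exact variable names are my assumptions, not a verbatim quote):

// when the context is full: keep the first n_keep tokens, discard half of the
// rest, and slide the remaining entries left so new tokens can be appended
const int n_left    = n_past - n_keep;
const int n_discard = n_left / 2;

// erase the K/V entries for positions [n_keep, n_keep + n_discard) in sequence 0
llama_kv_cache_seq_rm (ctx, 0, n_keep, n_keep + n_discard);
// then subtract n_discard from the positions of everything after the hole
llama_kv_cache_seq_add(ctx, 0, n_keep + n_discard, n_past, -n_discard);

n_past -= n_discard;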
-
With PyTorch and the transformers framework, I can get the KV cache like this:
-
My understanding is that each generated token has a corresponding entry in the KV cache, which is maintained inside llama.cpp. It does not cache each token_id; it caches the K and V values instead, but either way it is a one-to-one relationship. The context size is the size of the KV cache. When the context is full, llama.cpp discards previously cached K/V values to make room for new content. Is my understanding correct?
LLAMA_API void llama_kv_cache_seq_add(
struct llama_context * ctx,
llama_seq_id seq_id,
llama_pos p0,
llama_pos p1,
llama_pos delta);
In addition, what does the llama_pos delta parameter mean, how is it used, and why can it be negative? It feels like this "add" is a bit like a move operation.
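My guess from reading the examples (treat this as an assumption, not an authoritative answer) is that delta is added to the positions of the cached entries in [p0, p1), so a negative delta slides them toward the start of the sequence, which is effectively a move:

// after llama_kv_cache_seq_rm has erased positions [n_keep, n_keep + n_discard),
// close the resulting gap by shifting everything after it left by n_discard;
// I believe p1 < 0 is interpreted as "up to the end of the sequence"
llama_kv_cache_seq_add(ctx, /*seq_id=*/0,
                       /*p0=*/n_keep + n_discard,
                       /*p1=*/-1,
                       /*delta=*/-n_discard);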
In addition, is it possible to traverse the KV cache and view the specific values stored in it and the corresponding token IDs? I feel very uneasy about deleting and discarding things without being able to see the content.
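I think there is a debug-only view API in llama.h (llama_kv_cache_view_init / llama_kv_cache_view_update / llama_kv_cache_view_free). The sketch below is from memory, so take the function and field names as assumptions; as far as I can tell it only exposes per-cell positions and sequence IDs, not token IDs or the raw K/V tensors:

// allocate a view, refresh it from the current cache state, then iterate the cells
struct llama_kv_cache_view view = llama_kv_cache_view_init(ctx, /*n_seq_max=*/1);
llama_kv_cache_view_update(ctx, &view);

for (int i = 0; i < view.n_cells; i++) {
    const llama_pos pos = view.cells[i].pos;   // position stored in cell i (negative if empty)
    if (pos >= 0) {
        // print the first sequence id associated with this cell
        printf("cell %d: pos = %d, seq = %d\n", i, (int) pos, (int) view.cells_sequences[i * view.n_seq_max]);
    }
}

llama_kv_cache_view_free(&view);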