So what does KV CACHE look like? #7625

jygmysoul · 2024-05-29T17:53:35Z

jygmysoul
May 29, 2024

My understanding is that each generated TOKEN has a corresponding KV CACHE, which is maintained inside llama.cpp. It should not cache each token_id, but choose to cache the kv value. Anyway, it is a one-to-one relationship. The context size is the size of the kv cache. When the context is full, llama.cpp will discard the previously cached kv value to make room for new content. Is my understanding correct?

LLAMA_API void llama_kv_cache_seq_add(
struct llama_context * ctx,
llama_seq_id seq_id,
llama_pos p0,
llama_pos p1,
llama_pos delta);

In addition, what does this llama_pos delta parameter mean, how to use it, and why it can be negative? It feels like this add is a bit like a move operation.

In addition, can you traverse the kv cache and view the specific values stored in it and the corresponding tokenid? I feel very uneasy about deleting and discarding without seeing the content.

mjkpolo · 2024-06-11T19:51:23Z

mjkpolo
Jun 11, 2024

I am also so lost with kv cache and hoping a maintainer will reply.

I have noticed the delta indirectly used in build_k_shift() because it uses lctx.inp_K_shift which is filled in by llama_set_k_shift. But I don't see where v_l is shifted. Also how the heck does the int32_t * data = (int32_t *) lctx.inp_K_shift->data; work is that because you set the tensor as an input tensor?

~~A while ago, n_ctx using the same model changed from 512 to 4096, so something major changed kv cache wise.~~
They changed the default I think

Also I think there is a bug when you generate more tokens than n_ctx. when kv cache exceeds n_ctx = 4096 you'd expect the offset to put new tokens into the kv cache, aka kv_head, to go to 1 because of n_keep = 1 or maybe 0 if that got shifted, but instead it stores the first token after shifting at 3584, then goes to 1.

This happens because each time the kv cache graph is made, when the memory context is generated, it has a worst_case parameter which is true if needs_reserve is true. Not sure what needs_reserve means really, but it's true whenever we need to shift.

    if (need_reserve) {
        // TODO: extract to a function
        // build worst-case graph
        int n_tokens = (int)std::min(lctx.cparams.n_ctx, lctx.cparams.n_ubatch);
        int n_past = lctx.cparams.n_ctx - n_tokens;
        llama_token token = llama_token_bos(&lctx.model); // not actually used by llama_build_graph, but required to choose between token and embedding inputs graph
        ggml_cgraph * gf = llama_build_graph(lctx, llama_batch_get_one(&token, n_tokens, n_past, 0), true);

then kv_head is initialized like this:

    llm_build_context(
        llama_context  & lctx,
    const llama_batch  & batch,
    const llm_build_cb & cb,
                  bool   worst_case) :
...
        kv_head          (worst_case ? (kv_self.recurrent ? 0 : kv_self.size - n_tokens) : kv_self.head),

so for some reason when we are shifting n_tokens is 4096 - 3584 = 512. A while ago n_ctx was 512... so I'm stuck in another rabbit hole lol

0 replies

Yorizuka · 2024-06-20T19:06:20Z

Yorizuka
Jun 20, 2024

I too find the kv cache cache manipulation very hard to understand, I'd love an overview and a simplified example of how to manipulate it. I tried reading the src of main example to see how the context shifting works, and I do not understand it. If anyone knows where I can find a minimal example of context shifting when using llama.cpp as a library I'd love to know about!

1 reply

mjkpolo Jun 20, 2024

I think this discussion has a good explanation for the context shifting #7887

hitdra · 2024-09-27T19:20:46Z

hitdra
Sep 27, 2024

With pytorch and the transformer framework, I can get the kv cache like this
generated = model.generate(input_ids, max_new_tokens = 1, return_dict_in_generate=True)
kv = generated['past_key_values']
How to get the corresponding kv cache under llamacpp?
I really appreciate everyone being able to answer my questions, thanks a million!

1 reply

walker-ai Sep 29, 2024

Maybe you can check the function llama.cpp/llm_build_kv_store, llm_build_kqv, it prepare the kv cache for storing and loading for use in decoding.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

So what does KV CACHE look like? #7625

{{title}}

Replies: 3 comments 2 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

So what does KV CACHE look like? #7625

jygmysoul May 29, 2024

Replies: 3 comments · 2 replies

mjkpolo Jun 11, 2024

Yorizuka Jun 20, 2024

mjkpolo Jun 20, 2024

hitdra Sep 27, 2024

walker-ai Sep 29, 2024

jygmysoul
May 29, 2024

Replies: 3 comments 2 replies

mjkpolo
Jun 11, 2024

Yorizuka
Jun 20, 2024

hitdra
Sep 27, 2024