
[Feature Request] MLX_lm.cache_prompt | Save cached_prompt as plaintext in the kv-cache-file metadata #978

Open
mark-lord opened this issue Sep 6, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

mark-lord commented Sep 6, 2024

Currently, when you run MLX_lm.cache_prompt, the produced kv-cache file contains the chat template, tokenizer config, model, and max_kv_size in its metadata. It would be great if the actual text passed in via the --prompt flag were saved there as well. That would make it easier to rule out prompt formatting issues when debugging unexpected LLM behaviour.

Now, I'm not fully sure I'm doing this right; this is my first time suggesting a change to a repo! But I believe that if you edit line 140 of mlx_lm/cache_prompt.py, you can add this functionality quite trivially by including metadata["chat_history"] = prompt:

    cache_dict = {}
    for i, c in enumerate(cache):
        cache_dict[f"{i}_keys"] = c.state[0]
        cache_dict[f"{i}_values"] = c.state[1]
    metadata = {}
    metadata["model"] = args.model
    metadata["chat_template"] = tokenizer.chat_template
    metadata["tokenizer_config"] = json.dumps(tokenizer_config)
    metadata["max_kv_size"] = str(args.max_kv_size)
    metadata["chat_history"] = prompt  # Add this line to save the prompt
    mx.save_safetensors(args.kv_cache_file, cache_dict, metadata)

(Might be more appropriate to save it as prompt_history, or cached_prompt, or such.)
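For the debugging side, the cached prompt could then be read straight back out of the safetensors metadata, something along these lines (a minimal sketch; the filename is just a placeholder, and it assumes mx.load can return safetensors metadata via return_metadata=True):

    import mlx.core as mx

    # Load the cached arrays and the string metadata from the kv-cache file
    # ("prompt_cache.safetensors" is a placeholder filename).
    cache_dict, metadata = mx.load("prompt_cache.safetensors", return_metadata=True)

    print(metadata["model"])
    print(metadata["chat_history"])  # the exact prompt text that was cached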

Like I say, this would be very helpful for checking and debugging model behaviour, and also for managing chat history in chatbot applications. For instance, say you have a 10-turn conversation with a model spanning 4-8k tokens. If you want to minimize time-to-first-token latency for the user on every turn, the best way is to save the entire chat history up to that point into the KV cache. Doing this at the moment requires maintaining a separate file, e.g. chat_history.json, and keeping it updated as the chat goes on. It would be easier to manage if the chat history were instead kept within the metadata of the kv-cache itself: that way the chatbot application could simply extract the chat history, append the most recent user and model turns, and then run cache_prompt on the result, roughly as sketched below.
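Here's a rough sketch of what that loop could look like. The file name is a placeholder, extend_cache is a hypothetical helper, and the CLI flags are assumed from the argparse names in the snippet above, so treat all of them as assumptions rather than the actual API:

    import subprocess
    import mlx.core as mx

    KV_FILE = "chat_cache.safetensors"  # placeholder cache file name


    def extend_cache(user_turn, model_turn, model):
        # Pull the previously cached conversation out of the kv-cache metadata
        _, metadata = mx.load(KV_FILE, return_metadata=True)
        history = metadata.get("chat_history", "")

        # Append the newest turns, then re-run cache_prompt on the full conversation
        new_history = history + f"\nUser: {user_turn}\nAssistant: {model_turn}"
        subprocess.run(
            [
                "mlx_lm.cache_prompt",
                "--model", model,
                "--prompt", new_history,
                "--kv-cache-file", KV_FILE,
            ],
            check=True,
        )

No separate chat_history.json needed: the kv-cache file itself carries the conversation forward between turns.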

Sorry if that's not super clear! Basically I just think it'd be useful for debugging and chat history management.

Currently I'm managing this in my own chatbot app in pretty much this same way, and by and large it seems to be working :)

@awni awni added the enhancement New feature or request label Nov 1, 2024