
[Feature Request] MLX_lm.cache_prompt | Save cached_prompt as plaintext in the kv-cache-file metadata #978

Open
mark-lord opened this issue Sep 6, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

mark-lord commented Sep 6, 2024

Currently, when you run MLX_lm.cache_prompt, the produced kv-cache file contains the chat template, tokenizer config, model, and max_kv_size in its metadata. It would be great if the actual text passed in via the --prompt flag were saved there as well. That would make it easier to rule out prompt formatting issues when debugging unexpected LLM behaviour.

Now, I'm not fully sure I'm doing this right; this is my first time suggesting a change to a repo! But I believe that if you edit line 140 of mlx_lm/cache_prompt.py, you can add this functionality quite trivially by including metadata["chat_history"] = prompt:

    cache_dict = {}
    for i, c in enumerate(cache):
        cache_dict[f"{i}_keys"] = c.state[0]
        cache_dict[f"{i}_values"] = c.state[1]
    metadata = {}
    metadata["model"] = args.model
    metadata["chat_template"] = tokenizer.chat_template
    metadata["tokenizer_config"] = json.dumps(tokenizer_config)
    metadata["max_kv_size"] = str(args.max_kv_size)
    metadata["chat_history"] = prompt  # Add this line to save the prompt
    mx.save_safetensors(args.kv_cache_file, cache_dict, metadata)

(Might be more appropriate to save it as prompt_history, or cached_prompt, or such.)
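For the debugging side, the cached prompt could then be read straight back out of the safetensors metadata, something along these lines (a minimal sketch; the filename is just a placeholder, and it assumes mx.load can return safetensors metadata via return_metadata=True):

    import mlx.core as mx

    # Load the cached arrays and the string metadata from the kv-cache file
    # ("prompt_cache.safetensors" is a placeholder filename).
    cache_dict, metadata = mx.load("prompt_cache.safetensors", return_metadata=True)

    print(metadata["model"])
    print(metadata["chat_history"])  # the exact prompt text that was cached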

Like I say, this would be very helpful for checking and debugging model behaviour, and also for managing chat history in chatbot applications. For instance, say you have a 10-turn conversation with a model spanning 4-8k tokens. If you want to minimize time-to-first-token latency for the user on every turn, the best way is to save the entire chat history up to that point into the KV cache. Doing this at the moment requires maintaining a separate file, e.g. chat_history.json, and keeping it updated as the chat goes on. It would be easier to manage if the chat history were instead kept within the metadata of the kv-cache itself: that way the chatbot application could simply extract the chat history, append the most recent user and model turns, and then run cache_prompt on the result, roughly as sketched below.
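Here's a rough sketch of what that loop could look like. The file name is a placeholder, extend_cache is a hypothetical helper, and the CLI flags are assumed from the argparse names in the snippet above, so treat all of them as assumptions rather than the actual API:

    import subprocess
    import mlx.core as mx

    KV_FILE = "chat_cache.safetensors"  # placeholder cache file name


    def extend_cache(user_turn, model_turn, model):
        # Pull the previously cached conversation out of the kv-cache metadata
        _, metadata = mx.load(KV_FILE, return_metadata=True)
        history = metadata.get("chat_history", "")

        # Append the newest turns, then re-run cache_prompt on the full conversation
        new_history = history + f"\nUser: {user_turn}\nAssistant: {model_turn}"
        subprocess.run(
            [
                "mlx_lm.cache_prompt",
                "--model", model,
                "--prompt", new_history,
                "--kv-cache-file", KV_FILE,
            ],
            check=True,
        )

No separate chat_history.json needed: the kv-cache file itself carries the conversation forward between turns.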

Sorry if that's not super clear! Basically I just think it'd be useful for debugging and chat history management.

Currently I'm managing this in my own chatbot app in pretty much this same way, and by and large it seems to be working :)

@awni awni added the enhancement New feature or request label Nov 1, 2024