Currently when you run mlx_lm.cache_prompt, the produced kv-cache file contains the chat template, tokenizer config, model, and max_kv_size in its metadata. It would be great if the actual text passed in via the --prompt flag were also saved there. That would make it easier to rule out prompt-formatting issues when debugging unexpected LLM behaviour.
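For context, a minimal sketch of inspecting what cache_prompt currently writes (this assumes mx.load(..., return_metadata=True) returns the safetensors metadata as a dict of strings, and "prompt.safetensors" is just a hypothetical path):

import mlx.core as mx

# Peek at the metadata cache_prompt stores alongside the kv-cache arrays.
_, metadata = mx.load("prompt.safetensors", return_metadata=True)
print(sorted(metadata.keys()))  # model, chat_template, tokenizer_config, max_kv_size
# ...but not the prompt text itself, which is what this issue proposes adding.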
Now, I'm not fully sure I'm doing this right; this is my first time suggesting a change to a repo! But I believe that around line 140 of mlx_lm/cache_prompt.py this could be added quite trivially with metadata["chat_history"] = prompt:
cache_dict = {}
for i, c in enumerate(cache):
    cache_dict[f"{i}_keys"] = c.state[0]
    cache_dict[f"{i}_values"] = c.state[1]
metadata = {}
metadata["model"] = args.model
metadata["chat_template"] = tokenizer.chat_template
metadata["tokenizer_config"] = json.dumps(tokenizer_config)
metadata["max_kv_size"] = str(args.max_kv_size)
metadata["chat_history"] = prompt  # add this line to save the prompt
mx.save_safetensors(args.kv_cache_file, cache_dict, metadata)
(Might be more appropriate to save it as prompt_history, or cached_prompt, or such.)
Like I say, this would be very helpful both for checking and debugging model behaviour and for managing chat history in chatbot applications. For instance, say you have a 10-turn conversation with a model spanning 4–8k tokens. If you want to minimise time-to-first-token latency for the user on every turn, the best way to do this is to save the entire chat history up to that point into the KV cache. Doing this at the moment requires maintaining a separate file, e.g. chat_history.json, and keeping it updated as the chat goes on. It would be easier to manage if the chat history were instead kept within the metadata of the kv-cache itself; that way the chatbot application could simply extract the chat history, append the most recent user and model turns, and then run cache_prompt on the result, as sketched below.
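To make that concrete, here's a rough sketch of what that loop could look like with this change. The "chat_history" key, the plain-text turn formatting, and the --model / --prompt / --kv-cache-file flags are assumptions based on the snippet above, not an existing API:

import subprocess

import mlx.core as mx

CACHE_FILE = "chat.safetensors"  # hypothetical path

# 1. Recover the conversation so far from the kv-cache metadata.
_, metadata = mx.load(CACHE_FILE, return_metadata=True)
history = metadata.get("chat_history", "")

# 2. Append the latest user and model turns (formatting is app-specific).
history += "\nUser: ...latest question...\nAssistant: ...latest answer..."

# 3. Re-run cache_prompt on the updated history so the next turn starts
#    from a warm cache, minimising time-to-first-token.
subprocess.run(
    [
        "python", "-m", "mlx_lm.cache_prompt",
        "--model", metadata["model"],
        "--prompt", history,
        "--kv-cache-file", CACHE_FILE,
    ],
    check=True,
)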
Sorry if that's not super clear! Basically I just think it'd be useful for debugging and chat history management.
Currently I'm managing this in my own chatbot app in pretty much this same way, and by and large it seems to be working :)