Inferencing on CPU (using fine-tuned version of Llama 3.1) #1012
Comments
Use llama.cpp to run the GGUF.

It will load multiple parts just like loading one file...
You have to quantize using this snippet:

```python
# Save locally.
model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")

# To push to HF.
model.push_to_hub_gguf(".../...", tokenizer, quantization_method = "q4_k_m", token = "...")
```

Then load this using llama.cpp.
Is it mandatory to push to HF? If possible, could you please provide a sample code snippet to load the model using llama.cpp?
Nobody said it's mandatory to push to HF. If you looked closely:

```python
# Save locally.
model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
```

Then you can load this GGUF file using llama.cpp. Please look for tutorials on llama.cpp inference elsewhere, as Unsloth is a finetuning framework. This will be helpful: ggerganov/llama.cpp#2094 (comment).
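For reference, here is a minimal CPU-inference sketch using the llama-cpp-python bindings. The file name under `model/` and the prompt are assumptions; point `model_path` at whatever GGUF file `save_pretrained_gguf` actually wrote:

```python
from llama_cpp import Llama

# Assumed path: save_pretrained_gguf("model", ...) writes the quantized GGUF
# into the "model" folder; adjust the filename to match what was produced.
llm = Llama(
    model_path="model/unsloth.Q4_K_M.gguf",
    n_ctx=2048,    # context window size
    n_threads=8,   # number of CPU threads to use
)

# Plain text completion on CPU.
output = llm(
    "Q: What is the capital of France? A:",
    max_tokens=64,
    stop=["Q:"],
)
print(output["choices"][0]["text"])
```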
Thanks for the link; it was helpful for understanding quantization methods. I am still looking for tutorials on using llama.cpp for inference. Could you please help me with it?
Could you please provide a sample snippet for the same?
When you load the first part in llama.cpp, it will load all the other parts automatically. Or you can merge all the split parts on Linux if that's easier for you (a sketch of one way to do this follows).
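For example, a merge sketch that invokes llama.cpp's gguf-split tool from Python. The binary name and file names are assumptions; they depend on your llama.cpp build and on how the split files were named:

```python
import subprocess

# Assumed names: the gguf-split binary from a llama.cpp build (older builds
# call it "gguf-split"), and split files following the usual
# <model>-00001-of-0000N.gguf naming convention.
subprocess.run(
    [
        "./llama-gguf-split",
        "--merge",
        "model/unsloth.Q4_K_M-00001-of-00002.gguf",  # first split part
        "model/unsloth.Q4_K_M.gguf",                 # merged output file
    ],
    check=True,
)
```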
@ApurvPujari Would it be possible to ask on our Discord server, rather than on the issue? Sorry! We can help you async :)
I have fine-tuned the "meta-llama-3.1-8b-bnb-4bit" model using Unsloth. I have downloaded the LoRA weights and am able to run inference with them on a Colab GPU.
But I want to use this fine-tuned model for inference on CPU. How do I do it? I have executed "model.save_pretrained_gguf" for conversion to GGUF. (It's giving me a folder with 4-5 PyTorch files (.bin) along with other files.)
I need guidance on the next steps to use this model on CPU for inference. (I am new to LLMs.) (I know I am asking for a lot, but a code snippet would be great!)