
Inferencing on CPU (using fine tuned version of llama 3.1) #1012

Open
ApurvPujari opened this issue Sep 11, 2024 · 14 comments

Comments

@ApurvPujari

I have fine-tuned the "meta-llama-3.1-8b-bnb-4bit" model using Unsloth. I have downloaded the LoRA weights and am able to run inference with them on a Colab GPU.

But I want to use this fine-tuned model for inference on CPU. How do I do that? I have executed "model.save_pretrained_gguf" to convert to GGUF. (It gives me a folder containing 4-5 PyTorch files (.bin) along with other files.)

I need guidance on the next steps to use this model on CPU for inference. (I am new to LLMs.) (I know I am asking for a lot, but a code snippet would be great!)

@rohhro

rohhro commented Sep 11, 2024

Use llama.cpp to use GGUF.

@ApurvPujari
Author

Use llama.cpp to use GGUF.
But how? Most of the videos I have watched load a single GGUF file, but all I have is a folder containing 4 .bin files!

@rohhro

rohhro commented Sep 11, 2024

But how? Most of the videos I have watched load a single GGUF file, but all I have is a folder containing 4 .bin files!

It will load all the parts just like loading a single file...

@mahiatlinux
Contributor

But I want to use this fine-tuned model for inference on CPU. How do I do that? I have executed "model.save_pretrained_gguf" to convert to GGUF. (It gives me a folder containing 4-5 PyTorch files (.bin) along with other files.)

You have to quantize using this snippet:

# Save locally.
model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
# To push to HF.
model.push_to_hub_gguf(".../...", tokenizer, quantization_method = "q4_k_m", token = "...")

Then load the resulting GGUF using llama.cpp.
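For reference, a minimal sketch of running that locally saved GGUF on CPU with the llama-cpp-python bindings (the path model/unsloth.Q4_K_M.gguf is an assumption about what save_pretrained_gguf names the file; adjust it to whatever actually landed in your folder):

# pip install llama-cpp-python
from llama_cpp import Llama

# Load the quantized GGUF entirely on the CPU; n_threads sets how many CPU threads to use.
llm = Llama(
    model_path = "model/unsloth.Q4_K_M.gguf",  # assumed file name - check your "model" folder
    n_ctx = 2048,      # context window
    n_threads = 8,     # CPU threads
)

# Plain text completion.
out = llm("Explain what a GGUF file is in one sentence.", max_tokens = 128)
print(out["choices"][0]["text"])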

@ApurvPujari
Author


Is it mandatory to push to HF? If possible, could you please provide a sample code snippet to load the model using llama.cpp?

@mahiatlinux
Contributor

mahiatlinux commented Sep 11, 2024

Is it mandatory to push to HF? If possible, could you please provide a sample code snippet to load the model using llama.cpp?

Nobody said it's mandatory to push to HF. If you look closely:

# Save locally.
model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")

Then you can load this GGUF file using llama.cpp. Please look for tutorials on llama.cpp inference elsewhere, as Unsloth is a fine-tuning framework. This will be helpful: ggerganov/llama.cpp#2094 (comment).

@ApurvPujari
Author

Please look for tutorials on Llama.cpp inference elsewhere

Thanks for the link; it's helpful for understanding quantization methods. I am still looking for tutorials on using llama.cpp for inference. Could you please help me with that?

@ApurvPujari
Author

It will load all the parts just like loading a single file...

Could you please provide a sample snippet for that?
(I am new to this...)

@rohhro

rohhro commented Sep 11, 2024

Could you please provide a sample snippet for that? (I am new to this...)

When you load the first part in llama.cpp, it will load all the other parts automatically.

Or you can merge all the split parts if that's easier for you.
Merge with gguf-split:
~/llama.cpp/gguf-split --merge infile-00001-of-0000N.gguf outfile.gguf

Or merge on Linux:
cat YourModel.gguf-split-* > YourModel.gguf
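As a sketch of the first option (assuming the parts were produced by gguf-split; the shard names below are illustrative), pointing llama-cpp-python at the first shard is enough and llama.cpp finds the remaining parts on its own:

from llama_cpp import Llama

# Load only the first shard; the sibling -0000N-of-0000N parts in the same folder are picked up automatically.
llm = Llama(model_path = "YourModel-00001-of-00004.gguf", n_threads = 8)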

@ApurvPujari
Author

Or you can merge all the split parts if it's easier for you. For example, on Linux: cat YourModel.Q4_K.gguf-split-* > YourModel.Q4_K.gguf

In the Colab notebook, Unsloth saved the model in the following way (when I executed model.save_pretrained_gguf(...)):
[screenshot of the saved model folder]

In llama.cpp, they have only these conversion files:
[screenshot of the llama.cpp conversion scripts]

So how should I proceed?

@mahiatlinux
Contributor

mahiatlinux commented Sep 12, 2024


You only need to use the ones with .GGUF! Try the Q4_K_M only (for now).
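A hedged example of chat-style CPU inference with llama-cpp-python on that Q4_K_M file (the path is again an assumption; point it at your actual GGUF). create_chat_completion applies the chat template stored in the GGUF metadata when one is present, which matters for instruct-tuned models:

from llama_cpp import Llama

llm = Llama(
    model_path = "model/unsloth.Q4_K_M.gguf",  # assumed name - use your actual Q4_K_M file
    n_ctx = 2048,
    n_threads = 8,
)

# Chat-style generation; raw, untemplated prompts often produce garbled output with instruct models.
out = llm.create_chat_completion(
    messages = [{"role": "user", "content": "Hello! Who are you?"}],
    max_tokens = 128,
)
print(out["choices"][0]["message"]["content"])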

@ApurvPujari
Author

You only need to use the ones with .GGUF! Try the Q4_K_M only (for now).

So I don't need all the files in the folder, right?
Just that .gguf one... and it's already in GGUF format, so I don't have to convert it into anything... right?
Just load that file using llama.cpp and I'm good to go... right?

@ApurvPujari
Author


I tried it the following way, but I am getting a garbage response:
[screenshots of the llama.cpp commands and the garbled model output]

@danielhanchen
Contributor

@ApurvPujari Would it be possible to ask on our Discord server - sorry on the issue! We can help you async :)
