Inferencing on CPU (using fine-tuned version of Llama 3.1) #1012
Comments
Use llama.cpp to run the GGUF.

It will load multiple parts just like loading one file...
You have to quantize using this snippet:

```python
# Save locally.
model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")

# To push to HF.
model.push_to_hub_gguf(".../...", tokenizer, quantization_method = "q4_k_m", token = "...")
```

Then load this using llama.cpp.
Is it mandatory to push to HF? If possible, could you please provide a sample code snippet to load the model using llama.cpp?
Nobody said it's mandatory to push to HF. If you looked closely:

```python
# Save locally.
model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
```

Then you can load this GGUF file using llama.cpp. Please look for tutorials on llama.cpp inference elsewhere, as Unsloth is a finetuning framework. This will be helpful: ggerganov/llama.cpp#2094 (comment).
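For reference, here is a minimal CPU-inference sketch using the llama-cpp-python bindings. The file name under `model/` and the prompt are assumptions; point `model_path` at whatever GGUF file `save_pretrained_gguf` actually wrote:

```python
from llama_cpp import Llama

# Assumed path: save_pretrained_gguf("model", ...) writes the quantized GGUF
# into the "model" folder; adjust the filename to match what was produced.
llm = Llama(
    model_path="model/unsloth.Q4_K_M.gguf",
    n_ctx=2048,    # context window size
    n_threads=8,   # number of CPU threads to use
)

# Plain text completion on CPU.
output = llm(
    "Q: What is the capital of France? A:",
    max_tokens=64,
    stop=["Q:"],
)
print(output["choices"][0]["text"])
```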
Thanks for the link; it was helpful for understanding quantization methods. I am still looking for tutorials on using llama.cpp for inference. Could you please help me with it?
Could you please provide a sample snippet for the same?
When you load the first part in llama.cpp, it will load all the other parts automatically. Or you can merge all the split parts on Linux if that's easier for you (a sketch of one way to do this follows).
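For example, a merge sketch that invokes llama.cpp's gguf-split tool from Python. The binary name and file names are assumptions; they depend on your llama.cpp build and on how the split files were named:

```python
import subprocess

# Assumed names: the gguf-split binary from a llama.cpp build (older builds
# call it "gguf-split"), and split files following the usual
# <model>-00001-of-0000N.gguf naming convention.
subprocess.run(
    [
        "./llama-gguf-split",
        "--merge",
        "model/unsloth.Q4_K_M-00001-of-00002.gguf",  # first split part
        "model/unsloth.Q4_K_M.gguf",                 # merged output file
    ],
    check=True,
)
```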
@ApurvPujari Would it be possible to ask on our Discord server, rather than on the issue? Sorry! We can help you async :)
I have fine-tuned the "meta-llama-3.1-8b-bnb-4bit" model using Unsloth. I have downloaded the LoRA weights and am able to run inference with them on a Colab GPU.
But I want to use this fine-tuned model for inference on CPU. How do I do it? I have executed "model.save_pretrained_gguf" for conversion to GGUF. (It's giving me a folder with 4-5 PyTorch files (.bin) along with other files.)
I need guidance on the next steps to use this model on CPU for inference. (I am new to LLMs.) (I know I am asking for a lot, but a code snippet would be great!)