Model llama-2-7b.Q4_0.gguf Loads with llama.cpp but Fails with whisper.cpp #1316
Replies: 2 comments
-
Further Investigation on Model Compatibility: I've done some additional testing to narrow down the problem. I converted a llama model from the Meta repo on Hugging Face using the following command: […] Pleasingly, the converted model […]
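For context, a GGUF conversion with llama.cpp's scripts at the time typically looked like the lines below. This is a hypothetical reconstruction, since the exact command and paths in the comment above were cut off; the input path and output filenames are placeholders:

# Hypothetical example, not the exact command from this comment:
python3 convert.py /path/to/llama-2-7b --outtype f16 --outfile llama-2-7b-f16.gguf
./quantize llama-2-7b-f16.gguf llama-2-7b.Q4_0.gguf q4_0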
-
The issue seems to be with the newer quantization models; a Q8 GGUF model works fine.
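One way to check what a given file actually is, assuming it is GGUF, is to dump its header bytes:

xxd -l 8 ../llama.cpp/models/llama-2-7b.Q4_0.gguf
# The first 4 bytes should read as the ASCII magic "GGUF" (47 47 55 46);
# the next 4 are the little-endian format version. A version newer than
# what the llama.cpp copy bundled with talk-llama understands would
# explain a load failure like this one.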
-
Description:
Hello! When I try to run the model llama-2-7b.Q4_0.gguf (TheBloke repo) using llama.cpp, everything works fine. However, when I attempt to use the same model with whisper.cpp's talk-llama, I encounter an error. Additionally, I'd like to mention that executing ./main -m models/ggml-small.en.bin -f samples/jfk.wav works correctly without any issues.

Steps to Reproduce:
1. Load the llama-2-7b.Q4_0.gguf model using llama.cpp (works without issues).
2. Attempt to use the same model with whisper.cpp's talk-llama using the following command:
./talk-llama -mw ./models/ggml-small.en.bin -ml ../llama.cpp/models/llama-2-7b.Q4_0.gguf -p "Hey, there" -t 4
Expected Behavior:
The model should load and work without any issues, just as it does with llama.cpp.

Actual Behavior:
An error message is displayed, stating: […]
This is followed by a segmentation fault.
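A backtrace for the segmentation fault can be captured under lldb on macOS, reusing the arguments from the reproduce command above (a sketch, not output from this report):

lldb -- ./talk-llama -mw ./models/ggml-small.en.bin -ml ../llama.cpp/models/llama-2-7b.Q4_0.gguf -p "Hey, there" -t 4
(lldb) run
# after the crash:
(lldb) bt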
Additional Information:
Device: Apple M2
Model file: llama-2-7b.Q4_0.gguf
Whisper model file: ./models/ggml-small.en.bin
I would appreciate any guidance or insights into why this might be happening and how to resolve it. Thanks for your time!
Full Error Message:
[…]