Bug: Unable to load GGUF models after update #9852

Open
FitzWM opened this issue Oct 11, 2024 · 8 comments
Labels
bug-unconfirmed, critical severity (Used to report critical severity bugs in llama.cpp, e.g. Crashing, Corrupted, Dataloss)

Comments

FitzWM commented Oct 11, 2024

What happened?

This started as a problem with Ooba (text-generation-webui), but I'm seeing the same issue with KoboldCPP and llama.cpp itself. I updated Ooba the other day after maybe a week or two of not doing so. The update itself seemed to go fine and the UI opens without errors, but I'm now unable to load several GGUF models (Command-R, 35b-beta-long, New Dawn) that worked fine just before on my RTX 4070 Ti Super. It has 16 GB of VRAM, which isn't major leagues, I know, but all of these models worked perfectly with these same settings a few days ago. I can still load smaller models in Ooba via ExLlamav2_HF, and I get the same problem with KoboldCPP and plain llama.cpp. I posted on the Ooba GitHub but haven't gotten any responses in several days, so I thought I would try here.

Models and settings (flash-attention and tensorcores enabled):

  • Command-R (35b): 16k context, 10 layers, default 8000000 RoPE base
  • 35b-beta-long (35b): 16k context, 10 layers, default 8000000 RoPE base
  • New Dawn (70b): 16k context, 20 layers, default 3000000 RoPE base

Things I've tried:

  • Ran models at 12k and 8k context.
  • Lowered GPU layers.
  • Set GPU layers to 0 and tried to load on CPU only. Crashed fastest of all.
  • Disabled flash-attention, tensorcores, and both.
  • Manually updated Ooba by entering its Python env and running pip install -r requirements.txt --upgrade. This updated several things, including llama.cpp and llama-cpp-python, but no change.
  • Updated llama.cpp and KoboldCPP and tried with them. Exact same issue with any GGUF model.
  • Checked for any NVIDIA or CUDA updates for my OS. None.
  • Restarted KWin to clear out my VRAM.
  • Swapped from KDE to XFCE to minimize VRAM load and any possible KWin / Wayland weirdness. Crashed even faster, somehow.
  • Restarted my PC.
  • Cloned fresh instances of Ooba, KoboldCPP, and llama.cpp (build sketch after this list).
  • Reinstalled my NVIDIA drivers and CUDA.
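
For reference, the fresh llama.cpp checkout was rebuilt roughly like this (a sketch of a standard CUDA build from around that time; the exact flags I used may have differed):

$ git clone https://github.com/ggerganov/llama.cpp
$ cd llama.cpp
$ make GGML_CUDA=1 -j    # build with the CUDA backend; llama-cli / llama-server land in the repo root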

System Info:

OS: Arch Linux 6.11.2
GPU: NVIDIA RTX 4070 Ti Super
GPU Driver: nvidia-dkms 560.35.03-5
CUDA version: 12.6.1
RAM: 64 GB DDR4-4000

Name and Version

$ ./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes
version: 3907 (9677640)
built with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

ggml_cuda_host_malloc: failed to allocate 9984.00 MiB of pinned memory: invalid argument

The above is from loading 1 GPU layer of 35b-beta-long; the exact size requested varies with the model and layer count. Given that I have 16 GB of available VRAM - and that it worked perfectly before - this seems like a bug to me.
FitzWM added the bug-unconfirmed and critical severity labels on Oct 11, 2024
slaren (Collaborator) commented Oct 11, 2024

ggml_cuda_host_malloc: failed to allocate 9984.00 MiB of pinned memory: invalid argument

This is not an error and it is not likely to be the cause. Please include the full output.

Also try running with the environment variable GGML_CUDA_NO_PINNED=1.
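
For example, something along these lines (a sketch; the model path and options are placeholders, substitute your own):

$ GGML_CUDA_NO_PINNED=1 ./llama-server -m models/35b-beta-long-Q5_K_M.gguf -c 8192 -ngl 10
# or export it for the whole shell session:
$ export GGML_CUDA_NO_PINNED=1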

FitzWM (Author) commented Oct 11, 2024

I'm doing my best, but the crash literally blows up my whole terminal when it happens. It doesn't just exit out of the program, so it's hard to get anything. I tried several different terminal emulators, as well. Is there a log I can grab somewhere?

Edit: Same crash with that env, sadly. OK, I take it back, I think? It crashed the first time I tried it, but now it seems to be working every time?

Edit 2: Nope. Crashing again, sometimes on load, always when trying to generate. I also tried redirecting the output to a file and using tee, but neither captured the output properly.
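
For the record, the redirection I tried looked roughly like this (a sketch, with the model path shortened; llama-server writes most of its logging to stderr, so 2>&1 has to come before the pipe, and even then the tail can be lost if the whole terminal session dies):

# merge stderr into the pipe so tee sees the llama.cpp log lines
$ ./llama-server -m models/35b-beta-long-Q5_K_M.gguf -c 8192 -ngl 10 2>&1 | tee llama.log
# if the end of the log keeps getting cut off, line-buffer stdout with stdbuf
$ stdbuf -oL ./llama-server -m models/35b-beta-long-Q5_K_M.gguf -c 8192 -ngl 10 2>&1 | tee llama.log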

slaren (Collaborator) commented Oct 11, 2024

It's hard to diagnose this if we don't even know how it crashes. You can try running it under gdb to get a stack trace when it crashes. That said, it seems likely that it is crashing within the CUDA driver or library, and if that's the case, chances are that it is a driver issue. You can also try running a git bisect to find the commit that introduced the issue, if any.
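
A bisect would look roughly like this (a sketch; the known-good ref is a placeholder for whichever older build still worked, and each step needs a rebuild plus a manual load test):

$ git bisect start
$ git bisect bad HEAD       # the current build crashes
$ git bisect good b3800     # placeholder: last tag/commit known to work
# git checks out a commit; rebuild, try loading a model, then mark the result:
$ make GGML_CUDA=1 -j && ./llama-server -m models/35b-beta-long-Q5_K_M.gguf -ngl 10
$ git bisect good           # or: git bisect bad
$ git bisect reset          # when finished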

FitzWM (Author) commented Oct 11, 2024

Alright, it took a little fiddling to figure out how to get gdb working, but here's the output. Nothing really jumps out at me, but I'm hoping there's something I'm not seeing.

(gdb) run
Starting program: /home/fitz/ai/llama.cpp/llama-server -t 7 -m /home/fitz/ai/text-generation-webui/models/35b-beta-long-gguf/35b-beta-long-Q5_K_M.gguf -c 8192 --rope-freq-base 8000000 --port 8888 -ngl 10

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.archlinux.org>
Enable debuginfod for this session? (y or [n]) y
Debuginfod has been enabled.
To make this setting permanent, add 'set debuginfod enabled on' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
[New Thread 0x7fffce200000 (LWP 138675)]
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes
[New Thread 0x7fffc1e00000 (LWP 138679)]
build: 3907 (96776405) with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu
system info: n_threads = 7, n_threads_batch = 7, total_threads = 16

system_info: n_threads = 7 (n_threads_batch = 7) / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 

[New Thread 0x7fffc1400000 (LWP 138680)]
[New Thread 0x7fffc0a00000 (LWP 138681)]
[New Thread 0x7fffbbe00000 (LWP 138682)]
[New Thread 0x7fffbb400000 (LWP 138683)]
[New Thread 0x7fffbaa00000 (LWP 138684)]
[New Thread 0x7fffba000000 (LWP 138685)]
[New Thread 0x7fffb9600000 (LWP 138686)]
[New Thread 0x7fffb8c00000 (LWP 138687)]
main: HTTP server is listening, hostname: 127.0.0.1, port: 8888, http threads: 15
main: loading model
[New Thread 0x7fffb3e00000 (LWP 138688)]
[New Thread 0x7fffb3400000 (LWP 138689)]
[New Thread 0x7fffb2a00000 (LWP 138690)]
[New Thread 0x7fffb2000000 (LWP 138691)]
[New Thread 0x7fffb1600000 (LWP 138692)]
[New Thread 0x7fffb0c00000 (LWP 138693)]
[New Thread 0x7fffb0200000 (LWP 138694)]
[New Thread 0x7fffaf800000 (LWP 138695)]
llama_model_loader: loaded meta data with 30 key-value pairs and 322 tensors from /home/fitz/ai/text-generation-webui/models/35b-beta-long-gguf/35b-beta-long-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = command-r
llama_model_loader: - kv   1:                               general.name str              = 35b-beta-long
llama_model_loader: - kv   2:                      command-r.block_count u32              = 40
llama_model_loader: - kv   3:                   command-r.context_length u32              = 128000
llama_model_loader: - kv   4:                 command-r.embedding_length u32              = 8192
llama_model_loader: - kv   5:              command-r.feed_forward_length u32              = 22528
llama_model_loader: - kv   6:             command-r.attention.head_count u32              = 64
llama_model_loader: - kv   7:          command-r.attention.head_count_kv u32              = 64
llama_model_loader: - kv   8:                   command-r.rope.freq_base f32              = 8000000.000000
llama_model_loader: - kv   9: command-r.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:     command-r.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                      command-r.logit_scale f32              = 0.062500
llama_model_loader: - kv  13:                command-r.rope.scaling.type str              = none
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = command-r
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,256000]  = ["<PAD>", "<UNK>", "<CLS>", "<SEP>", ...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,253333]  = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ a...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 5
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 6
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% for message in messages %}{{'<|im_...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                      quantize.imatrix.file str              = /models/35b-beta-long-GGUF/35b-beta-l...
llama_model_loader: - kv  27:                   quantize.imatrix.dataset str              = /training_data/groups_merged.txt
llama_model_loader: - kv  28:             quantize.imatrix.entries_count i32              = 280
llama_model_loader: - kv  29:              quantize.imatrix.chunks_count i32              = 95
llama_model_loader: - type  f32:   41 tensors
llama_model_loader: - type q5_K:  240 tensors
llama_model_loader: - type q6_K:   41 tensors
llm_load_vocab: control-looking token: '<|im_end|>' was not control-type; this is probably a bug in the model. its type will be overridden
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 1008
llm_load_vocab: token to piece cache size = 1.8528 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = command-r
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 253333
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 128000
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 64
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 8192
llm_load_print_meta: n_embd_v_gqa     = 8192
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 6.2e-02
llm_load_print_meta: n_ff             = 22528
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = none
llm_load_print_meta: freq_base_train  = 8000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 128000
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 35B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 34.98 B
llm_load_print_meta: model size       = 23.28 GiB (5.72 BPW) 
llm_load_print_meta: general.name     = 35b-beta-long
llm_load_print_meta: BOS token        = 5 '<s>'
llm_load_print_meta: EOS token        = 6 '</s>'
llm_load_print_meta: PAD token        = 0 '<PAD>'
llm_load_print_meta: LF token         = 136 'Ä'
llm_load_print_meta: EOT token        = 255001 '<|im_end|>'
llm_load_print_meta: EOG token        = 6 '</s>'
llm_load_print_meta: EOG token        = 255001 '<|im_end|>'
llm_load_print_meta: max token length = 1024
[New Thread 0x7fffad800000 (LWP 138696)]
[New Thread 0x7ffface00000 (LWP 138697)]
llm_load_tensors: ggml ctx size =    0.31 MiB
llm_load_tensors: offloading 10 repeating layers to GPU
llm_load_tensors: offloaded 10/41 layers to GPU
llm_load_tensors:        CPU buffer size = 23839.41 MiB
llm_load_tensors:      CUDA0 buffer size =  5613.44 MiB
..........................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 8000000.0
llama_new_context_with_model: freq_scale = 1

slaren (Collaborator) commented Oct 11, 2024

This log seems incomplete; it should break into gdb when it crashes and tell you where it happened. Then you can use the bt command to get the full call stack.
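
If the interactive session keeps dying, gdb's batch mode can dump the backtrace straight to a file instead (a sketch; the model path and options are placeholders):

$ gdb -batch -ex run -ex "thread apply all bt" --args \
    ./llama-server -m models/35b-beta-long-Q5_K_M.gguf -c 8192 -ngl 10 > gdb.log 2>&1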

FitzWM (Author) commented Oct 11, 2024

It crashes any terminal I try to run it in under gdb, as well. Which, yeah, makes no sense, given that's kind of the point of using gdb here. I've tried Konsole, Yakuake, GNOME Terminal, and XFCE Terminal. It even crashed me out of a separate TTY, forcing me to hard restart. Given that, I'm going to try downgrading my driver, although I swear it worked on this one before.

FitzWM (Author) commented Oct 11, 2024

Hm, rolling back to nvidia-555.58 seems to let me use my models like I used to. No idea what about 560 breaks things, but it's let me generate a dozen or so responses so far.

Scratch that. It still crashes, just not quite as often. I'm honestly at a loss at this point. It's basically unpredictable whether it will load or, if it does, whether it will generate. It's especially baffling because I'm certain it worked on this driver version before.

FitzWM (Author) commented Oct 12, 2024

Neither downgrading to NVIDIA 555.58 nor upgrading to a newer version of Arch's 560.35 package fixed the issue. Neither did upgrading to CUDA 12.6.2. I tried updating llama.cpp, KoboldCPP, and Ooba as well, but got the same issue. I can fill my VRAM with ExLlamav2 and generate without issue, so I don't think it's a CUDA or driver problem, but I don't really have any idea.
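
In case anyone wants headroom numbers, a quick way to watch VRAM during a load attempt (a sketch):

$ watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv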
