latest ggml version sync #1

Open
wants to merge 201 commits into base: tortoise

Conversation


@dridri dridri commented Aug 10, 2024

This is an attempt to rebase onto the latest commit of the ggml master branch.

Looking at how many commits there are, it may be easier to just overwrite the tortoise branch instead of merging.

This breaks compatibility with the current tortoise.cpp version, which needs the following PR: balisujohn/tortoise.cpp#20

paperManu and others added 30 commits June 5, 2024 11:41
* Update HIPBLAS CMake

* Fix HIPBlas for Debian

* Add hipBLAS build instructions

* Fix Clang detection

* Set ROCM_PATH correctly
* ggml : use atomic_flag for critical section

* add windows shims
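
The atomic_flag commit above swaps the critical-section primitive for a C11 spinlock; the "windows shims" presumably cover gaps in MSVC's C11 atomics support. A minimal generic sketch of the pattern (not ggml's actual code):

```c
#include <stdatomic.h>

// C11 atomic_flag used as a spinlock around a critical section.
static atomic_flag g_lock = ATOMIC_FLAG_INIT;

static void critical_section_enter(void) {
    // test_and_set returns the previous value; spin while another thread holds the lock
    while (atomic_flag_test_and_set_explicit(&g_lock, memory_order_acquire)) {
        // busy-wait
    }
}

static void critical_section_leave(void) {
    atomic_flag_clear_explicit(&g_lock, memory_order_release);
}
```
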
* tests : add non-cont concat tests

* cuda : non-cont concat support

ggml-ci
* tests : add rope tests

ggml-ci

* ggml : fixes (hopefully)

ggml-ci

* tests : add non-cont tests

ggml-ci

* cuda : add asserts for rope/norm + fix DS2

ggml-ci

* ggml : assert contiguousness

* tests : reduce RoPE tests

ggml-ci
* faster avx512 exp implementation

* x->r

* improve accuracy, handle special cases

* remove `e`
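
For context on the exp commits above: the usual vectorizable recipe is range reduction plus a short polynomial. Below is a scalar sketch of that idea with plain Taylor coefficients for illustration only; the actual AVX-512 kernel uses tuned coefficients and handles overflow/underflow and other special cases that this sketch omits.

```c
#include <math.h>

// Scalar sketch of the exp(x) recipe that SIMD implementations vectorize:
// write x = n*ln2 + r with |r| <= ln2/2, approximate exp(r) with a short
// polynomial, then scale by 2^n.
static float exp_sketch(float x) {
    const float LOG2E = 1.4426950408889634f; // 1/ln(2)
    const float LN2   = 0.6931471805599453f; // ln(2)

    float n = nearbyintf(x * LOG2E);
    float r = x - n * LN2;

    // exp(r) ~ 1 + r + r^2/2 + r^3/6 + r^4/24  (Taylor terms, illustration only)
    float p = 1.0f + r*(1.0f + r*(0.5f + r*((1.0f/6.0f) + r*(1.0f/24.0f))));

    return ldexpf(p, (int) n); // 2^n * exp(r)
}
```
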
* ggml : fix loongson compile warnings

ggml-ci

* Fix loongarch quantize test failure.

Fix an unexpected error introduced while rebasing the code.

* tests : disable json test due to lack of python on the CI node

ggml-ci

---------

Co-authored-by: junchao-loongson <[email protected]>
* CUDA: quantized KV support for FA vec

* try CI fix

* fix commented-out kernel variants

* add q8_0 q4_0 tests

* fix nwarps > batch size

* split fattn compile via extern templates

* fix flake8

* fix metal tests

* fix cmake

* make generate_cu_files.py executable

* add autogenerated .cu files

* fix AMD

* error if type_v != FP16 and not flash_attn

* remove obsolete code
compilade pointed this out on the previous MR:
op_getrows_f32 has been required since ggerganov/llama.cpp#6122
for the Vulkan w/ Kompute backend to be functional.

As such, implement this op to make this backend functional again.
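
For readers unfamiliar with the op: GET_ROWS is a gather over whole rows, where dst row i is the src row selected by index i. A plain C sketch of that semantics for f32 data (names here are illustrative, not the Kompute backend's actual interface):

```c
#include <stdint.h>
#include <string.h>

// Gather n_rows rows of `row_size` floats from src into dst,
// using rows[i] as the source row index for destination row i.
void get_rows_f32(const float * src, int64_t row_size,
                  const int32_t * rows, int64_t n_rows,
                  float * dst) {
    for (int64_t i = 0; i < n_rows; ++i) {
        memcpy(dst + i * row_size,
               src + (int64_t) rows[i] * row_size,
               row_size * sizeof(float));
    }
}
```
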
* Finish Vulkan mul_mat_id implementation

* Add Vulkan sum_rows and div ops

* Fix MUL_MAT_ID matrix matrix shader

* Fix MUL_MAT_ID matrix vector shader dispatch size

* Fix MUL_MAT_ID matrix vector shader and dispatch code

* Update Vulkan CPU offload for MUL_MAT_ID

* Fix crash when using split mode none and setting a main GPU
* ggml: Added OpenMP for multi-threaded processing

* ggml : Limit the number of threads used to avoid deadlock

* update shared state n_threads in parallel region

* clear numa affinity for main thread even with openmp

* enable openmp by default

* fix msvc build

* disable openmp on macos

* ci : disable openmp with thread sanitizer

* Update ggml.c

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: slaren <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
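
The OpenMP commits above describe the general shape: a parallel region whose thread count is capped explicitly so it matches the number of compute threads requested, which is also what the "limit the number of threads" bullet refers to. A generic sketch, not ggml's actual code (n_threads is an illustrative parameter name):

```c
#include <omp.h>
#include <stdio.h>

// Run a parallel region with at most n_threads threads and let each
// thread identify its slice of the work by its index.
void compute_parallel(int n_threads) {
    #pragma omp parallel num_threads(n_threads)
    {
        int ith = omp_get_thread_num();   // this thread's index
        int nth = omp_get_num_threads();  // threads actually granted
        printf("thread %d of %d working on its slice\n", ith, nth);
    }
}
```
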
* llama : offload to RPC in addition to other backends

* - fix copy_tensor being called on the src buffer instead of the dst buffer

- always initialize views in the view_src buffer

- add RPC backend to Makefile build

- add endpoint to all RPC object names

* add rpc-server to Makefile

* Update llama.cpp

Co-authored-by: slaren <[email protected]>

---------

Co-authored-by: slaren <[email protected]>
This enforces a check that -fno-finite-math-only was set and that the compiler
is not operating in finite-math mode. This is because during the rewrite of
silu and softmax for CPU in #7154, @JohannesGaessler found that the observed
results were nondeterministic when more than one slot was used.

@LostRuins narrowed the problem down to -ffinite-math-only, which was theorised
to cause SiLU to return NaN or some other garbage instead of flushing small
values to 0. @jart proposed a fix that @ggerganov then implemented in this change

ref ggerganov/llama.cpp#7154 (comment)
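
GCC and Clang predefine __FINITE_MATH_ONLY__ to 1 under -ffinite-math-only, so a compile-time guard can enforce the requirement described above. A sketch of the idea (illustrative; the exact guard in the referenced fix may differ), together with the SiLU expression whose inf handling is at stake:

```c
#include <math.h>

// Refuse to build in finite-math mode: SiLU/softmax rely on inf/NaN semantics.
#if defined(__FINITE_MATH_ONLY__) && __FINITE_MATH_ONLY__ == 1
#error "compile with -fno-finite-math-only"
#endif

// SiLU: for large negative x, expf(-x) overflows to +inf and the result must
// flush to 0; -ffinite-math-only lets the compiler assume that never happens.
static float silu(float x) {
    return x / (1.0f + expf(-x));
}
```
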
Previously the code would have failed to cope if the number of nodes changed
in an existing CUDA graph. This fixes the issue by removing an unnecessary
conditional.
* ggml : unify rope norm/neox (CPU)

* ggml : fix compile warning

* ggml : remove GLM rope mode

ggml-ci

* metal : better rope implementation

ggml-ci

* cuda : better rope implementation

ggml-ci

* naming : n_orig_ctx -> n_ctx_orig

ggml-ci

* dev : add reminders to update backends

ggml-ci

* vulkan : fix ggml_rope_ext() usage

* cuda : fix array size + indents

ggml-ci
* CUDA: refactor mmq, dmmv, mmvq

* fix out-of-bounds write

* struct for qk, qr, qi

* fix cmake build

* mmq_type_traits
* vulkan : reuse parent extra for views

* Fix validation error when multiple compute contexts are used in a graph

---------

Co-authored-by: 0cc4m <[email protected]>
* CUDA: int8 tensor cores for MMQ (legacy quants)

* fix out-of-bounds writes

* __builtin_assume -> GGML_CUDA_ASSUME

* fix writeback returning too early
slaren and others added 29 commits August 8, 2024 13:45
* ggml-backend : fix async copy from CPU

* cuda : more reliable async copy, fix stream used when the devices are the same
…d prefix and postfix padding with cuda backend