sync : llama.cpp #1184
Merged
Conversation
* llama : add option to override tensor buffers
* ggml : fix possible underflow in ggml_nbytes
…llama/12559)

When adjacent batches of Q share the same batches of K/V, batch them into the same workgroup. For example, when dst(128,32,1,1) = FA(q(128,1,32,1), k(128,16640,8,1), v(128,16640,8,1)), previously we would run 32 workgroups computing 1 result each; now we run 8 workgroups computing 4 results each. This doesn't directly translate to better performance (at least when you have >= 32 SMs), but in a subsequent change I'll enable split_k, which will scale much better with 4x fewer workgroups.
When using group query attention, we have one workgroup per KV batch and this can be very few workgroups (e.g. just 8 in some models). Enable split_k to spread the work across SMs. This helps a lot when the KV cache is large.
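For context on how split_k keeps the SMs busy despite the smaller workgroup count, here is a minimal C++ sketch (not the actual Vulkan shader; the struct and function names are made up for illustration) of the standard way split-k flash-attention partials are merged: each KV slice reports a partial output together with its local max logit and softmax denominator, and the reduction rescales them into a single result.

```cpp
// Hedged sketch: merging split_k partial results for one query row.
// Each split covers a slice of the KV cache and returns a partial weighted
// sum of V plus its local running max m and softmax denominator l.
#include <algorithm>
#include <cmath>
#include <vector>

struct SplitResult {
    std::vector<float> out; // partial sum of softmax(QK^T) * V, length = head_dim
    float m;                // max logit seen in this KV slice
    float l;                // sum of exp(logit - m) over this slice
};

std::vector<float> merge_splits(const std::vector<SplitResult> & splits, int head_dim) {
    float m_global = -INFINITY;
    for (const auto & s : splits) {
        m_global = std::max(m_global, s.m);
    }

    std::vector<float> out(head_dim, 0.0f);
    float l_global = 0.0f;
    for (const auto & s : splits) {
        const float scale = std::exp(s.m - m_global); // rescale to the global max
        l_global += s.l * scale;
        for (int d = 0; d < head_dim; ++d) {
            out[d] += s.out[d] * scale;
        }
    }
    for (int d = 0; d < head_dim; ++d) {
        out[d] /= l_global; // final softmax normalization
    }
    return out;
}
```

Because the merge is cheap relative to the per-slice work, splitting the KV dimension lets even a handful of workgroups per head fan out across the available SMs.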
* CANN: Fix memory waste in aclnn_tensor
* CANN: fix backend ops fail
* CANN: fix acl_tensor memory alloc.
* CANN: format
* CANN: remove trailing whitespace
…s (llama/9017)

* CUDA: Simplify and improve CUDA graphs through use of indirect copy pointers

  Previously there was complexity in the CUDA graphs implementation due to frequently changing parameters to the copy kernels associated with the K and V cache pointers. This patch simplifies things by using indirection so those parameters no longer change, avoiding the need for frequent graph updates.

  Fixes #12152
* Addressed comments
* fix HIP builds
* properly sync to stream
* removed ggml_cuda_cpy_fn_ptrs
* move stream sync before free
* guard to only use indirection with graphs
* style fixes
* check for errors

---------

Co-authored-by: slaren <[email protected]>
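A minimal sketch of the indirection idea in plain C++ (not the actual CUDA implementation; `CopyNode` and its fields are hypothetical): the captured "node" stores only the address of a fixed slot, so the pointer the copy reads from can change between launches without touching the node's parameters or the graph.

```cpp
// Conceptual sketch of indirect copy pointers. The node captures &current_src
// once; each step only the slot's contents change, so the captured parameter
// stays constant and the (hypothetical) graph never needs rebuilding.
#include <cstring>
#include <vector>

struct CopyNode {
    const float * const * src_slot; // fixed address, captured when the node is built
    float *               dst;
    size_t                n;
    void run() const {
        // dereference at "launch" time to pick up the current source
        std::memcpy(dst, *src_slot, n * sizeof(float));
    }
};

int main() {
    std::vector<float> cache_a(16, 1.0f), cache_b(16, 2.0f), dst(16);

    const float * current_src = cache_a.data(); // the only thing that changes per step
    CopyNode node{&current_src, dst.data(), dst.size()};

    node.run();                   // copies from cache_a
    current_src = cache_b.data(); // update the slot, not the node
    node.run();                   // copies from cache_b
}
```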
* [CANN]support sin cos argmax

  Signed-off-by: noemotiovon <[email protected]>
* [CANN]codestyle adjustment

  Signed-off-by: noemotiovon <[email protected]>
* [CANN]Remove redundant code

  Signed-off-by: noemotiovon <[email protected]>

---------

Signed-off-by: noemotiovon <[email protected]>
Co-authored-by: noemotiovon <[email protected]>
* fix MUSA compiler warning
* replace (void) with GGML_UNUSED
* Prefer vector flash decoding kernel for Gemma models

  The vector flash decoding kernel was not being picked for models with head dimension 256, and Gemma models are in this category. Removing this limit improves e2e performance by up to 12% in gen-phase throughput for Gemma models.
* Update ggml/src/ggml-cuda/fattn.cu

  Co-authored-by: Johannes Gäßler <[email protected]>

---------

Co-authored-by: Johannes Gäßler <[email protected]>
…llama/12630)

There seems to be a bubble waking up from waitForFences, which costs a few percent of performance and also increases variance in performance. This change inserts an "almost_ready" fence when the graph is about 80% complete; we waitForFences on the almost_ready fence and then spin (with _mm_pause) waiting for the final fence to be signaled.
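A rough sketch of the two-stage wait in C++, with std::atomic flags standing in for the Metal fences (the flag names and signalling are illustrative only, not the actual Metal backend code):

```cpp
// Hedged sketch of the two-stage wait: a cheap blocking wait covers most of
// the graph, then a short _mm_pause spin catches the final signal without
// paying the wake-up bubble of a second blocking wait.
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <immintrin.h> // _mm_pause (x86; other targets would use a different spin hint)

std::atomic<bool> almost_ready{false}; // stands in for the fence signaled at ~80% of the graph
std::atomic<bool> done{false};         // stands in for the final fence

std::mutex m;
std::condition_variable cv;

void wait_for_completion() {
    // blocking wait until the graph is almost finished
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return almost_ready.load(std::memory_order_acquire); });
    lock.unlock();

    // short busy-wait for the final signal; _mm_pause keeps the spin polite
    while (!done.load(std::memory_order_acquire)) {
        _mm_pause();
    }
}
```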
…2747)

Fixes an error for compiler paths with spaces.
…io project/solution (llama/12625)
nem1 must be a multiple of GGML_KQ_MASK_PAD, and GGML_KQ_MASK_PAD is a multiple of the number of rows in the matrix. The KV dim is a multiple of the number of columns for the aligned shader.
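As a hedged illustration of the padding constraint (the constant below is a placeholder, not necessarily the value of GGML_KQ_MASK_PAD in ggml), the mask's row count would be rounded up to a whole multiple before the aligned shader runs:

```cpp
// Illustrative only: rounding a mask dimension up to a multiple of the pad so
// the aligned shader can process whole blocks.
#include <cstdio>

constexpr int PAD_EXAMPLE = 32; // placeholder, not necessarily ggml's GGML_KQ_MASK_PAD

constexpr int round_up(int x, int multiple) {
    return ((x + multiple - 1) / multiple) * multiple;
}

int main() {
    const int n_tokens = 70; // hypothetical batch size
    std::printf("mask rows: %d -> padded to %d\n", n_tokens, round_up(n_tokens, PAD_EXAMPLE));
}
```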
Use -FLT_MAX/2 rather than -inf as the initial value for computing the maximum.
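A small C++ illustration of why a finite initial value helps (assuming the usual max-then-exp softmax pattern; row_sum_exp is a made-up helper, not shader code): with an initial max of -inf, a fully masked row produces exp(-inf - (-inf)) = exp(NaN), while -FLT_MAX/2 keeps the shift finite.

```cpp
// Hedged illustration: starting the running max at -FLT_MAX/2 instead of -inf
// avoids NaN when every logit in a row is masked to -inf.
#include <cfloat>
#include <cmath>
#include <cstdio>

float row_sum_exp(const float * logits, int n, float init_max) {
    float m = init_max;
    for (int i = 0; i < n; ++i) {
        m = std::fmaxf(m, logits[i]);
    }
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += std::expf(logits[i] - m); // -inf - (-inf) would be NaN here
    }
    return sum;
}

int main() {
    const float masked[4] = {-INFINITY, -INFINITY, -INFINITY, -INFINITY};
    std::printf("init = -inf       : %f\n", row_sum_exp(masked, 4, -INFINITY));    // NaN
    std::printf("init = -FLT_MAX/2 : %f\n", row_sum_exp(masked, 4, -FLT_MAX / 2)); // finite (0)
}
```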
Signed-off-by: Xiaodong Ye <[email protected]>
* CANN: Refactor to reduce duplicate code
* CANN: fix review comment
…et_tensor (llama/12734)
…uffer_set_tensor" (llama/12812)

* Revert "sycl: remove redundant memcopy in function ggml_backend_sycl_buffer_s…"

  This reverts commit 518a01480eb3a7c80a4951b430db9dee55428310.
* Update ggml/src/ggml-sycl/ggml-sycl.cpp
* Update ggml/src/ggml-sycl/ggml-sycl.cpp
* rm tail space
ggml-ci