llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2025-02-06 16:40:34 +01:00

Author	SHA1	Message	Date
Jeff Bolz	5245729e33	vulkan: fix diag_mask_inf (#11323 ) With robustbufferaccess disabled, this shader was showing OOB stores. There is a bounds check in the code, but the workgroup dimensions were reversed vs CUDA and it was running the wrong number of threads. So fix the workgroup dimensions and disable robustness for this pipeline.	2025-01-23 08:01:17 +01:00
Jeff Bolz	aea8ddd516	vulkan: fix coopmat2 validation failures (#11284 ) mul mat and flash attention shaders were loading f32 types directly into A/B matrices, which happens to work but is technically invalid usage. For FA, we can load it as an Accumulator matrix and convert and this is not in the inner loop and is cheap enough. For mul mat, it's more efficient to do this conversion in a separate pass and have the input(s) be f16. coopmat2 requires SPIR-V 1.6 (related to using LocalSizeId). LocalSizeId requires maintenance4 be enabled, and SPIR-V 1.6 requires Vulkan 1.3.	2025-01-20 10:38:32 -06:00
Jeff Bolz	44e18ef939	vulkan: fix coopmat2 flash attention for non-contiguous inputs (#11281 ) Add code similar to mul_mm_cm2 to force alignment of strides, to avoid a performance regression. Add noncontiguous FA tests in test-backend-ops. Fixes #11268.	2025-01-18 09:26:50 +01:00
Jeff Bolz	bd38ddea01	vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl (#11166 ) * vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl Shaders are based on cpy.cu. * vulkan: support copy from q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl to f32 * ggml: copy q->f32 assumes some contiguity in the destination	2025-01-16 22:47:10 +01:00
Jeff Bolz	466300fe14	vulkan: optimize coopmat2 q4_k/q5_k dequant functions. (#11206 ) Do masking on whole dwords, fetch all scales at once.	2025-01-16 22:23:49 +01:00
Jeff Bolz	206bc53422	vulkan: optimize coopmat2 q2_k dequant function (#11130 )	2025-01-16 22:16:39 +01:00
Eve	adc5dd92e8	vulkan: scale caching for k quants + misc fixes (#11081 ) * q6_k scale caching * 16 bit unpack * q4_k test (slow) * revert it * q3_k * q2_k * little stuff * try precalculating products of a and q2_k scales * Revert "try precalculating products of a and q2_k scales" This reverts commit 65110b81f23f66331a50c6e889a7c1ab9470a86b. * unpack should be u16, add vim swap to gitignore (about time) * better q4_k scales * q5_k * better q6_k with separate paths for all threads and partial threads in use, plus some more optimizations * q2_k better dequant * q3_k optimizations * q3_k use hmask simd from cpu avx version * make the caches happy * q3_k separate out calculation * q2_k separate out * little stuff * use calc_superblock everywhere * q2_k optimize scale calculation * more barriers	2025-01-15 19:50:13 +00:00
Junil Kim	1d8504338e	fix: ggml: fix vulkan-shaders-gen build (#10448 ) * fix: ggml: fix vulkan-shaders-gen build The vulkan-shaders-gen target was not being built correctly in case of cross-compilation. Other outputs need to be built for the cross compile target, but vulkan-shaders-gen needs to be built for the host. * refactor: ggml: Improve vulkan-shaders-gen toolchain setup - Add GGML_SHADERS_GEN_TOOLCHAIN CMake option. - Auto-detect host toolchain if not set. * refactor: ggml: Improve vulkan-shaders-gen toolchain setup Use configure_file to generate host_toolchain.cmake from template * fix: ggml: Fix compile error Fix compile error not finding vulkan-shaders-gen * fix: vulkan-shaders-gen build and path handling Fix build issues with vulkan-shaders-gen: - Add target dependency for correct build order - Use CMAKE_HOST_SYSTEM_NAME for executable suffix - Fix MSVC output directory in host toolchain - Normalize path handling for cross-compilation * fix: improve host compiler detection in vulkan shader build Improve host compiler detection for vulkan shader generation: - Add NO_CMAKE_FIND_ROOT_PATH to all compiler searches - Consolidate compiler detection logic - Fix Windows-specific MSVC detection - Ensure correct compiler search in cross-compilation * refactor: Simplify CMake function for detecting host compiler Simplified the CMake function to improve the process of detecting the host compiler. * fix: Remove unnecessary Vulkan library linkage in CMakeLists.txt Since `vulkan-shader-gen.cpp` only requires the `glslc` executable and not the Vulkan headers or libraries, CMakeLists.txt needs to be corrected. (See: `ecc93d0558`) * refactor: Rename host_toolchain.cmake.in - Rename host_toolchain.cmake.in to cmake/host-toolchain.cmake.in * refactor: GGML_VULKAN_SHADERS_GEN_TOOLCHAIN Rename the macro GGML_SHADERS_GEN_TOOLCHAIN to GGML_VULKAN_SHADERS_GEN_TOOLCHAIN	2025-01-15 14:17:42 +01:00
0cc4m	c3f9d25706	Vulkan: Fix float16 use on devices without float16 support + fix subgroup_size_control validation error (#11161 ) * Vulkan: Remove float16 use in shaders * Fix validation error about subgroup_size_control extension	2025-01-10 06:39:33 +01:00
Molly Sophia	ee7136c6d1	llama: add support for QRWKV6 model architecture (#11001 ) llama: add support for QRWKV6 model architecture (#11001) * WIP: Add support for RWKV6Qwen2 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV: Some graph simplification Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Add support for RWKV6Qwen2 with cpu and cuda GLA Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV6[QWEN2]: Concat lerp weights together to reduce cpu overhead Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix some typos Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * code format changes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix wkv test & add gla test Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix cuda warning Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update README.md Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update ggml/src/ggml-cuda/gla.cu Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Fix fused lerp weights loading with RWKV6 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * better sanity check skipping for QRWKV6 in llama-quant thanks @compilade Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: compilade <git@compilade.net> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: compilade <git@compilade.net>	2025-01-10 09:58:08 +08:00
Mathieu Baudier	02f0430141	Disable GL_KHR_cooperative_matrix Vulkan extension if not available. (#11117 ) * Disable GL_KHR_cooperative_matrix Vulkan extension if not available. * Perform Vulkan extensions checks in a more sensible order * Remove unnecessary #ifdef directive	2025-01-08 09:18:13 +01:00
ag2s20150909	bec2183f2c	fix: Vulkan shader gen binary path when Cross-compiling (#11096 ) * fix: Vulkan shader gen binary path when cross compiling	2025-01-08 09:17:29 +01:00
0cc4m	b56f079e28	Vulkan: Add device-specific blacklist for coopmat for the AMD proprietary driver (#11074 ) * Vulkan: Add device-specific blacklist for coopmat for the AMD proprietary driver * Add (TM) to AMD name check	2025-01-04 21:09:59 +01:00
Gilad S.	c31fc8b966	fix: Vulkan shader gen binary path (#11037 )	2025-01-04 09:17:31 +01:00
Jeff Bolz	716bd6dec3	vulkan: optimize mul_mat for small values of N (#10991 ) Make the mul_mat_vec shaders support N>1 (as a spec constant, NUM_COLS) where the batch_strides are overloaded to hold the row strides. Put the loads from the B matrix in the innermost loop because it should cache better. Share some code for reducing the result values to memory in mul_mat_vec_base.	2024-12-30 18:27:11 +01:00
Jeff Bolz	a813badbbd	vulkan: im2col and matmul optimizations for stable diffusion (#10942 ) * tests: Add im2col perf tests * vulkan: optimize im2col, more elements per thread * vulkan: increase small tile size for NV_coopmat2 * vulkan: change im2col to 512 elements per workgroup	2024-12-29 10:16:34 +01:00
Jeff Bolz	fdd2188912	vulkan: Use push constant offset to handle misaligned descriptors (#10987 )	2024-12-29 09:35:11 +01:00
Eve	d79d8f39b4	vulkan: multi-row k quants (#10846 ) * multi row k quant shaders! * better row selection * more row choices * readjust row selection * rm_kq=2 by default	2024-12-26 16:54:44 +01:00
Peter	d283d02bf2	examples, ggml : fix GCC compiler warnings (#10983 ) Warning types fixed (observed under MSYS2 GCC 14.2.0): * format '%ld' expects argument of type 'long int', but argument has type 'size_t' * llama.cpp/ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp:81:46: warning: missing initializer for member '_STARTUPINFOA::lpDesktop' [-Wmissing-field-initializers] (emitted for all struct field except first)	2024-12-26 14:59:11 +01:00
Jeff Bolz	ebdee9478c	vulkan: build fixes for 32b (#10927 ) * vulkan: build fixes for 32b Should fix #10923 * vulkan: initialize some buffer/offset variables	2024-12-22 10:44:01 +01:00
Jeff Bolz	a91a41364b	vulkan: optimize coopmat2 dequant functions (#10855 ) Change the code to do 16b loads when possible and extract the appropriate component late, so the code is effectively decoding a pair of elements and then selecting one. This can allow more commoning to happen in the compiler when neighboring elements are loaded.	2024-12-21 08:04:45 +01:00
gn64	8dd19a4812	vulkan : fix soft_max.comp division by zero (whisper/2633) This change prevents a division by zero error when p.KY is 0.	2024-12-17 18:35:49 +02:00
Eve	7b1ec53f56	vulkan: bugfixes for small subgroup size systems + llvmpipe test (#10809 ) * ensure mul mat shaders work on systems with subgroup size less than 32 more fixes add test * only s_warptile_mmq needs to be run with 32 threads or more	2024-12-17 06:52:55 +01:00
Zhiyuan Li	160bc039c8	rwkv6: add wkv6 support for Vulkan backend (#10829 ) * rwkv_wkv6 vulkan shader * RWKV_WKV6 Vulkan op tests passed Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Apply code format changes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * add [[unroll]] and remove unnecessary conditions * add uma support * fix errors in EditorConfig Checker --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: Molly Sophia <mollysophia379@gmail.com>	2024-12-16 22:00:46 +01:00
HimariO	ba1cb19cdd	llama : add Qwen2VL support + multimodal RoPE (#10361 ) * Barebone Qwen2VL LLM convertor * Add Qwen2VL cli entrypoint * [WIP] add qwen2vl arch * Verify m-rope output * Add vl-rope/2d-rope support for qwen2vl ViT * update qwen2vl cli tool * update 5D tensor op workaround * [WIP] qwen2vl vision model * make batch and clip utils compatible with qwen2vl * [WIP] create inference workflow, gguf convert script but fix * correcting vision-rope behavior, add the missing last layer back to ViT * add arg parser to qwen2vl_surgery * replace variable size array with vector * cuda-gdb cmake preset * add fp32 mrope, vision rope kernel * add fp16 support for qwen2vl and m-rope * add `GGML_ROPE_TYPE_MROPE`, `GGML_ROPE_TYPE_VISION` * fix rope op mode switching, outdated func args * update `llama_hparams` * update to keep up stream changes * resolve linter, test errors * add makefile entry, update special image padding token * add mrope unit test, fix few compiler warnings * rename `mrope` related function, params * minor updates on debug util, bug fixes * add `m-rope` testcase to `test-backend-ops` * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix trailing whitespace * store `llama_hparams.rope_sections` with fixed size array * update position id tensor size check in GGML_OP_ROPE * minor updates * update `ggml_backend__supports_op` of unsupported backends remove old `rope_section` compare operator --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-12-14 14:43:46 +02:00
Eve	64ae065511	vulkan: small mul_mat_vec optimizations (#10665 ) * double the number of rows per workgroup * Update ggml-vulkan.cpp * Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats * only increase the number of rows for amd and subgroup size 64 * fix missing NUM_ROWS for mul_mat_vec_iq4_nl_f16_f32, untested * use subgroup min and max to check for gcn (requires https://github.com/ggerganov/llama.cpp/pull/10721) * manual merge ggml-vulkan.cpp * set min and max subgroup size in any case * Also double the number of rows for Intel GPUs	2024-12-13 09:42:04 +01:00
0cc4m	4064c0e3b6	Vulkan: Use improved q4_k and q5_k dequant code in dequant shaders (#10798 )	2024-12-12 18:36:00 +01:00
0cc4m	dc5301d565	Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats (#10721 ) * Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats * Fix subgroup size control extension support check Add accf32 and accf16 checks for coopmats * Also disable coopmats on amdvlk	2024-12-12 18:35:37 +01:00
Jeff Bolz	b685daf386	vulkan: request round-to-even for fp16 in im2col/rope_head (#10767 ) Vulkan doesn't mandate a specific rounding mode, but the shader_float_controls feature allows rounding mode to be requested if the implementation supports it.	2024-12-10 21:23:17 +01:00
Eve	dafae66cc2	vulkan: dynamic subgroup size for the remaining k quants (#10745 ) * q5_k q4_k q3_k q2_k q6_k multi row example * revert as multi row isnt faster for k quants	2024-12-10 20:33:23 +01:00
Jeff Bolz	a05e2afcc2	vulkan: disable spirv-opt for coopmat shaders (#10763 ) There are some bugs in the 1.3.296 SDK, so disable this. It isn't strictly necessary anyway. Add missing dependency on vulkan-shaders-gen, so shaders get recompiled when it changes. Fix coopmat support reporting when glslc doesn't support NV_coopmat2.	2024-12-10 18:22:20 +01:00
Jeff Bolz	3d98b4cb22	vulkan: fix compile warnings (#10731 )	2024-12-09 08:24:01 +01:00
stduhpf	06d70147e6	Vulkan: fix NaN in tanh.comp with AMD proprietary driver on Windows (#10723 ) * Vulkan: fix NaN in tanh.comp * Faster NaN-free tanh	2024-12-08 19:19:19 +01:00
Jeff Bolz	ecc93d0558	vulkan: compile a test shader in cmake to check for coopmat2 support (#10713 )	2024-12-08 09:05:55 +01:00
0cc4m	3df784b305	Vulkan: VK_KHR_cooperative_matrix support to speed up prompt processing (#10597 ) * Vulkan: Implement VK_KHR_cooperative_matrix support in the matrix matrix multiplication shader * Improve performance with better q4_k and q5_k dequant and store unrolling * Add Vulkan MUL_MAT and MUL_MAT_ID accumulator precision selection * Rework mulmat shader selection and compilation logic, avoid compiling shaders that won't get used by device * Vulkan: Implement accumulator switch for specific mul mat mat shaders * Vulkan: Unroll more loops for more mul mat mat performance * Vulkan: Add VK_AMD_shader_core_properties2 support to read Compute Unit count for split_k logic * Disable coopmat support on AMD proprietary driver * Remove redundant checks * Add environment variable GGML_VK_DISABLE_COOPMAT to disable VK_KHR_cooperative_matrix support * Fix rebase typo * Fix coopmat2 MUL_MAT_ID pipeline selection	2024-12-07 10:24:15 +01:00
Jeff Bolz	c9c6e01dae	vulkan: Add VK_NV_cooperative_matrix2 support for mul_mat and flash attention (#10206 )	2024-12-05 20:15:05 +01:00
Jeff Bolz	2759916d86	vulkan: Implement "fast divide" (mul+shift) for unary ops like copy (#10642 )	2024-12-04 08:28:59 +01:00
Jeff Bolz	cc98896db8	vulkan: optimize and reenable split_k (#10637 ) Use vector loads when possible in mul_mat_split_k_reduce. Use split_k when there aren't enough workgroups to fill the shaders.	2024-12-03 20:29:54 +01:00
Eve	0533e7fb38	vulkan: Dynamic subgroup size support for Q6_K mat_vec (#10536 ) * subgroup 64 version with subgroup add. 15% faster scalable version tested for subgroup sizes 16-128 * check for subgroup multiple of 16 and greater than 16 * subgroup sizes are always a power of 2 (https://github.com/KhronosGroup/GLSL/issues/45) * force 16 sequential threads per block * make 16 subgroup size a constant	2024-11-30 08:00:02 +01:00
Diego Devesa	7cc2d2c889	ggml : move AMX to the CPU backend (#10570 ) * ggml : move AMX to the CPU backend --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-11-29 21:54:58 +01:00
Jeff Bolz	f095a649ec	vulkan: get the first command buffer submitted sooner (#10499 ) This is an incremental improvement over #9118 to get work to the GPU a bit sooner. The first part is to start with a smaller number of nodes before the first submit, and ramp it up to the current 100 nodes/submit. The second part is to reduce the dryrun overhead for all the nodes that just need to request descriptor space. With these changes I get around 1-2% speedup on RTX 4070 combined with my old Haswell-era CPU.	2024-11-29 07:18:02 +01:00
Jeff Bolz	c31ed2abfc	vulkan: define all quant data structures in types.comp (#10440 )	2024-11-27 08:32:54 +01:00
Jeff Bolz	5b3466bedf	vulkan: Handle GPUs with less shared memory (#10468 ) There have been reports of failure to compile on systems with <= 32KB of shared memory (e.g. #10037). This change makes the large tile size fall back to a smaller size if necessary, and makes mul_mat_id fall back to CPU if there's only 16KB of shared memory.	2024-11-27 08:30:27 +01:00
Jeff Bolz	249a7902ec	vulkan: further optimize q5_k mul_mat_vec (#10479 )	2024-11-27 08:21:59 +01:00
Jeff Bolz	71a64989a5	vulkan: skip integer div/mod in get_offsets for batch_idx==0 (#10506 )	2024-11-27 08:08:54 +01:00
Jeff Bolz	4a57d362e1	vulkan: optimize Q2_K and Q3_K mul_mat_vec (#10459 )	2024-11-27 08:00:50 +01:00
Jeff Bolz	904109ed0d	vulkan: fix group_norm (#10496 ) Fix bad calculation of the end of the range. Add a backend test that covers the bad case (taken from stable diffusion). Fixes https://github.com/leejet/stable-diffusion.cpp/issues/439.	2024-11-26 16:45:05 +01:00
Junil Kim	0eb4e12bee	vulkan: Fix a vulkan-shaders-gen argument parsing error (#10484 ) The vulkan-shaders-gen was not parsing the --no-clean argument correctly. Because the previous code was parsing the arguments which have a value only and the --no-clean argument does not have a value, it was not being parsed correctly. This commit can now correctly parse arguments that don't have values.	2024-11-26 01:47:20 +00:00
Diego Devesa	5931c1f233	ggml : add support for dynamic loading of backends (#10469 ) * ggml : add support for dynamic loading of backends --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-11-25 15:13:39 +01:00
Jeff Bolz	9abe9eeae9	vulkan: predicate max operation in soft_max shaders/soft_max (#10437 ) Fixes #10434	2024-11-20 20:47:36 +01:00
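
Note on commit 2759916d86 ("fast divide", mul+shift): the log entry gives only the name of the technique, not its details. The following is a minimal C++ sketch of one standard way to replace division by a fixed 32-bit divisor with a multiply-high, an add, and a shift. It is an illustration under stated assumptions, not the code from that commit: the names `FastDiv`, `make_fastdiv`, `fastdiv`, and `fast_divide` are hypothetical, and the sketch uses 64-bit intermediates on the CPU, whereas a shader would typically work in 32-bit registers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the constants precomputed for one divisor d.
struct FastDiv {
    uint32_t mp; // "magic" multiplier
    uint32_t L;  // shift amount, ceil(log2(d))
};

// Setup, done once per divisor (d must be >= 1).
FastDiv make_fastdiv(uint32_t d) {
    assert(d != 0);
    // L = ceil(log2(d)): smallest L with 2^L >= d.
    uint32_t L = 0;
    while (L < 32 && (uint64_t{1} << L) < d) {
        ++L;
    }
    // mp = floor(2^32 * (2^L - d) / d) + 1; this always fits in 32 bits.
    uint64_t mp = (uint64_t{1} << 32) * ((uint64_t{1} << L) - d) / d + 1;
    return FastDiv{static_cast<uint32_t>(mp), L};
}

// Per-element work: one multiply-high, one add, one shift, no division.
uint32_t fastdiv(uint32_t n, FastDiv fd) {
    // High 32 bits of the 64-bit product n * mp.
    uint64_t hi = (static_cast<uint64_t>(n) * fd.mp) >> 32;
    // The sum is kept in 64 bits so it cannot wrap before the shift.
    return static_cast<uint32_t>((hi + n) >> fd.L);
}

// Convenience wrapper for checking results against ordinary division.
uint32_t fast_divide(uint32_t n, uint32_t d) {
    return fastdiv(n, make_fastdiv(d));
}
```

In practice the pair (`mp`, `L`) would be computed once on the host per divisor and passed to the kernel, so that each element costs only the multiply, add, and shift in `fastdiv`.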


58 Commits