llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2025-01-16 15:18:26 +01:00

Author	SHA1	Message	Date
Jeff Bolz	b685daf386	vulkan: request round-to-even for fp16 in im2col/rope_head (#10767) Vulkan doesn't mandate a specific rounding mode, but the shader_float_controls feature allows a rounding mode to be requested if the implementation supports it.	2024-12-10 21:23:17 +01:00
Eve	dafae66cc2	vulkan: dynamic subgroup size for the remaining k quants (#10745) * q5_k q4_k q3_k q2_k q6_k multi row example * revert as multi row isn't faster for k quants	2024-12-10 20:33:23 +01:00
Jeff Bolz	a05e2afcc2	vulkan: disable spirv-opt for coopmat shaders (#10763) There are some bugs in the 1.3.296 SDK, so disable this. It isn't strictly necessary anyway. Add missing dependency on vulkan-shaders-gen, so shaders get recompiled when it changes. Fix coopmat support reporting when glslc doesn't support NV_coopmat2.	2024-12-10 18:22:20 +01:00
Jeff Bolz	3d98b4cb22	vulkan: fix compile warnings (#10731)	2024-12-09 08:24:01 +01:00
Jeff Bolz	ecc93d0558	vulkan: compile a test shader in cmake to check for coopmat2 support (#10713)	2024-12-08 09:05:55 +01:00
0cc4m	3df784b305	Vulkan: VK_KHR_cooperative_matrix support to speed up prompt processing (#10597) * Vulkan: Implement VK_KHR_cooperative_matrix support in the matrix matrix multiplication shader * Improve performance with better q4_k and q5_k dequant and store unrolling * Add Vulkan MUL_MAT and MUL_MAT_ID accumulator precision selection * Rework mulmat shader selection and compilation logic, avoid compiling shaders that won't get used by device * Vulkan: Implement accumulator switch for specific mul mat mat shaders * Vulkan: Unroll more loops for more mul mat mat performance * Vulkan: Add VK_AMD_shader_core_properties2 support to read Compute Unit count for split_k logic * Disable coopmat support on AMD proprietary driver * Remove redundant checks * Add environment variable GGML_VK_DISABLE_COOPMAT to disable VK_KHR_cooperative_matrix support * Fix rebase typo * Fix coopmat2 MUL_MAT_ID pipeline selection	2024-12-07 10:24:15 +01:00
Jeff Bolz	c9c6e01dae	vulkan: Add VK_NV_cooperative_matrix2 support for mul_mat and flash attention (#10206)	2024-12-05 20:15:05 +01:00
Jeff Bolz	2759916d86	vulkan: Implement "fast divide" (mul+shift) for unary ops like copy (#10642)	2024-12-04 08:28:59 +01:00
Jeff Bolz	cc98896db8	vulkan: optimize and reenable split_k (#10637) Use vector loads when possible in mul_mat_split_k_reduce. Use split_k when there aren't enough workgroups to fill the shaders.	2024-12-03 20:29:54 +01:00
Eve	0533e7fb38	vulkan: Dynamic subgroup size support for Q6_K mat_vec (#10536) * subgroup 64 version with subgroup add. 15% faster scalable version tested for subgroup sizes 16-128 * check for subgroup multiple of 16 and greater than 16 * subgroup sizes are always a power of 2 (https://github.com/KhronosGroup/GLSL/issues/45) * force 16 sequential threads per block * make 16 subgroup size a constant	2024-11-30 08:00:02 +01:00
Jeff Bolz	f095a649ec	vulkan: get the first command buffer submitted sooner (#10499) This is an incremental improvement over #9118 to get work to the GPU a bit sooner. The first part is to start with a smaller number of nodes before the first submit, and ramp it up to the current 100 nodes/submit. The second part is to reduce the dryrun overhead for all the nodes that just need to request descriptor space. With these changes I get around 1-2% speedup on RTX 4070 combined with my old Haswell-era CPU.	2024-11-29 07:18:02 +01:00
Jeff Bolz	5b3466bedf	vulkan: Handle GPUs with less shared memory (#10468) There have been reports of failure to compile on systems with <= 32KB of shared memory (e.g. #10037). This change makes the large tile size fall back to a smaller size if necessary, and makes mul_mat_id fall back to CPU if there's only 16KB of shared memory.	2024-11-27 08:30:27 +01:00
Jeff Bolz	904109ed0d	vulkan: fix group_norm (#10496) Fix bad calculation of the end of the range. Add a backend test that covers the bad case (taken from stable diffusion). Fixes https://github.com/leejet/stable-diffusion.cpp/issues/439.	2024-11-26 16:45:05 +01:00
Diego Devesa	5931c1f233	ggml : add support for dynamic loading of backends (#10469) Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-11-25 15:13:39 +01:00
Jeff Bolz	1bacb9f625	vulkan: further optimize mul_mat_vec using larger loads (#10387) * vulkan: Use pipeline_robustness to disable robustness in mul_mat_vec. Add some early returns for nonexistent rows in mul_mat_vec shaders. These can only be hit when dispatching a 2D grid of workgroups. Fix the logic for the 2D grid of workgroups to round up. Enable the pipeline robustness extension if it's available, and use it to disable robustness for these pipelines. The instructions to do the bounds checking contend for the same ALU resources as the bit twiddling dequant instructions. * vulkan: Add GLSL structure aliases for quant types to allow larger loads. In Vulkan it's not possible to cast pointer types, so instead you have to declare an aliased binding for the memory with a different type. This commit adds aliases for the quant formats using 16b ints, and in a few places where the struct size is a multiple of 4 also using 32b ints. Currently only q4_k's aliases are used, but others will be used in subsequent commits. * vulkan: use larger loads in q5_k and q6_k shaders. Similar to the optimization I did in q4_k recently, this vectorizes some loads and reduces the number of bit twiddling instructions. * vulkan: use larger K step per iteration in mul_mat_vec. Add vec4 dequantization functions, and use them to do K=8 per iteration in mul_mat_vec. This uses 16b loads for the quant values and 128b loads for B which helps reduce the load on the memory system. The K_PER_ITER==2 logic is still there, just for F16/F32, and really only because they support unaligned sizes. Tweak the num_iters/unrolling logic to be simpler and catch a couple missed unrolling opportunities.	2024-11-20 08:11:00 +01:00
Jeff Bolz	b3e585988f	vulkan: Optimize soft_max (#10301) * vulkan: Optimize soft_max. Large soft_max could already saturate memory, but small/medium sizes were pretty slow. The bulk of the gains for them comes from using a smaller workgroup size, and making the workgroup size match the subgroup size also makes the barriers much cheaper. Cache some values in locals to avoid refetching/recomputing. And stamp out a few "template instantiations" so smaller cases will fully unroll. Add a missing early return for OOB rows. This happens when there are more than 512 rows and the dispatch is 512 x H. * vulkan: Further soft_max optimizations. Restore the workgroup size of 512 case, use it for >1024. Use unrollable loops for more iteration counts.	2024-11-19 08:25:17 +01:00
0cc4m	9b75f03cd2	Vulkan: Fix device info output format specifiers (#10366) * Vulkan: Fix device info output format specifiers * Vulkan: Use zu printf specifier for size_t instead of ld	2024-11-18 11:02:43 +01:00
Jeff Bolz	772703c8ff	vulkan: Optimize some mat-vec mul quant shaders (#10296) Compute two result elements per workgroup (for Q{4,5}_{0,1}). This reuses the B loads across the rows and also reuses some addressing calculations. This required manually partially unrolling the loop, since the compiler is less willing to unroll outer loops. Add bounds-checking on the last iteration of the loop. I think this was at least partly broken before. Optimize the Q4_K shader to vectorize most loads and reduce the number of bit twiddling instructions.	2024-11-16 07:26:57 +01:00
thewh1teagle	3225008973	ggml : vulkan logs (whisper/2547)	2024-11-15 15:44:06 +02:00
Diego Devesa	ae8de6d50a	ggml : build backends as libraries (#10256) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>	2024-11-14 18:04:35 +01:00
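
Commit 2759916d86 mentions a "fast divide" (mul+shift) for unary ops. The log does not show the shader code, so the following is only a minimal host-side sketch of the general multiply-and-shift technique for dividing by a fixed 32-bit divisor. The names `FastDiv`, `make_fastdiv`, and `fastdiv` are illustrative assumptions, not identifiers confirmed by the commit, and the 64-bit intermediate add is a choice made here to avoid overflow.

```cpp
#include <cassert>
#include <cstdint>

// Precomputed constants for dividing by a fixed divisor d.
// (Hypothetical struct; the actual commit may pack these differently.)
struct FastDiv {
    uint32_t mp; // "magic" multiplier
    uint32_t L;  // shift amount, ceil(log2(d))
};

// One-time setup per divisor, typically done on the host.
inline FastDiv make_fastdiv(uint32_t d) {
    assert(d != 0);
    uint32_t L = 0;
    while (L < 32 && (uint64_t{1} << L) < d) {
        ++L; // smallest L with 2^L >= d
    }
    // mp = floor(2^32 * (2^L - d) / d) + 1, which fits in 32 bits.
    const uint64_t mp = ((uint64_t{1} << 32) * ((uint64_t{1} << L) - d)) / d + 1;
    return FastDiv{static_cast<uint32_t>(mp), L};
}

// Per-element division: one multiply-high, one add, one shift.
inline uint32_t fastdiv(uint32_t n, FastDiv fd) {
    const uint64_t hi = (static_cast<uint64_t>(n) * fd.mp) >> 32; // high 32 bits of n * mp
    return static_cast<uint32_t>((hi + n) >> fd.L);               // 64-bit add avoids overflow
}
```

The point of the technique is that the divisor is known before dispatch, so the expensive integer division is replaced by constants computed once and a cheap multiply and shift per element.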

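
Commit 1bacb9f625 describes rounding up the 2D grid of workgroups and adding early returns for nonexistent rows. A rough sketch of that idea, under stated assumptions, is below: the helper names `split_rows_2d` and `row_exists` are hypothetical, and the per-dimension limit `max_x` stands in for whatever device limit the backend actually queries.

```cpp
#include <cassert>
#include <cstdint>

// Workgroup counts in two dimensions (hypothetical helper type).
struct Grid2D {
    uint32_t x;
    uint32_t y;
};

// Spread `rows` workgroups over a 2D grid when one dimension is capped at max_x.
// The y count is rounded up, so x * y may exceed rows.
inline Grid2D split_rows_2d(uint32_t rows, uint32_t max_x) {
    assert(max_x != 0);
    if (rows <= max_x) {
        return Grid2D{rows, 1};
    }
    const uint64_t y = (static_cast<uint64_t>(rows) + max_x - 1) / max_x; // ceiling division
    return Grid2D{max_x, static_cast<uint32_t>(y)};
}

// Mirrors the shader-side early return: workgroups past the last row do nothing.
inline bool row_exists(uint32_t gx, uint32_t gy, Grid2D g, uint32_t rows) {
    const uint64_t row = static_cast<uint64_t>(gy) * g.x + gx;
    return row < rows;
}
```

Rounding down instead would silently drop the trailing rows, which is why the round-up and the bounds check go together.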
20 Commits
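
Commit 9b75f03cd2 switches the printf specifier for `size_t` from `ld` to `zu`. As a small illustration (the function name `format_bytes` is made up for this sketch and does not come from the commit), `%zu` matches `size_t` on every platform, whereas `%ld` expects `long`, which has a different width from `size_t` on some targets such as 64-bit Windows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Format a byte count portably: %zu is the specifier defined for size_t.
inline std::string format_bytes(size_t n) {
    char buf[64];
    const int written = std::snprintf(buf, sizeof(buf), "%zu bytes", n);
    assert(written >= 0 && static_cast<size_t>(written) < sizeof(buf));
    return std::string(buf);
}
```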