With robustbufferaccess disabled, this shader was showing OOB stores. There
is a bounds check in the code, but the workgrouop dimensions were reversed vs
CUDA and it was running the wrong number of threads. So fix the workgroup
dimensions and disable robustness for this pipeline.
mul mat and flash attention shaders were loading f32 types directly into
A/B matrices, which happens to work but is technically invalid usage.
For FA, we can load it as an Accumulator matrix and convert and this
is not in the inner loop and is cheap enough. For mul mat, it's more
efficient to do this conversion in a separate pass and have the input(s)
be f16.
coopmat2 requires SPIR-V 1.6 (related using to LocalSizeId). LocalSizeId
requires maintenance4 be enabled, and SPIR-V 1.6 requires Vulkan 1.3.
Add code similar to mul_mm_cm2 to force alignment of strides, to avoid
a performance regression.
Add noncontiguous FA tests in test-backend-ops.
Fixes#11268.
* vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl
Shaders are based on cpy.cu.
* vulkan: support copy from q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl to f32
* ggml: copy q->f32 assumes some contiguity in the destination
* q6_k scale caching
* 16 bit unpack
* q4_k test (slow)
* revert it
* q3_k
* q2_k
* little stuff
* try precalculating products of a and q2_k scales
* Revert "try precalculating products of a and q2_k scales"
This reverts commit 65110b81f23f66331a50c6e889a7c1ab9470a86b.
* unpack should be u16, add vim swap to gitignore (about time)
* better q4_k scales
* q5_k
* better q6_k with separate paths for all threads and partial threads in use, plus some more optimizations
* q2_k better dequant
* q3_k optimizations
* q3_k use hmask simd from cpu avx version
* make the caches happy
* q3_k separate out calculation
* q2_k separate out
* little stuff
* use calc_superblock everywhere
* q2_k optimize scale calculation
* more barriers
* fix: ggml: fix vulkan-shaders-gen build
The vulkan-shaders-gen target was not being built correctly
in case of cross-compilation.
Other outputs need to be built for the cross compile target,
but vulkan-shaders-gen needs to be built for the host.
* refactor: ggml: Improve vulkan-shaders-gen toolchain setup
- Add GGML_SHADERS_GEN_TOOLCHAIN CMake option.
- Auto-detect host toolchain if not set.
* refactor: ggml: Improve vulkan-shaders-gen toolchain setup
Use configure_file to generate host_toolchain.cmake from template
* fix: ggml: Fix compile error
Fix compile error not finding vulkan-shaders-gen
* fix: vulkan-shaders-gen build and path handling
Fix build issues with vulkan-shaders-gen:
- Add target dependency for correct build order
- Use CMAKE_HOST_SYSTEM_NAME for executable suffix
- Fix MSVC output directory in host toolchain
- Normalize path handling for cross-compilation
* fix: improve host compiler detection in vulkan shader build
Improve host compiler detection for vulkan shader generation:
- Add NO_CMAKE_FIND_ROOT_PATH to all compiler searches
- Consolidate compiler detection logic
- Fix Windows-specific MSVC detection
- Ensure correct compiler search in cross-compilation
* refactor: Simplify CMake function for detecting host compiler
Simplified the CMake function to improve the process of detecting the host compiler.
* fix: Remove unnecessary Vulkan library linkage in CMakeLists.txt
Since `vulkan-shader-gen.cpp` only requires the `glslc` executable
and not the Vulkan headers or libraries, CMakeLists.txt needs to
be corrected.
(See: ecc93d0558)
* refactor: Rename host_toolchain.cmake.in
- Rename host_toolchain.cmake.in to cmake/host-toolchain.cmake.in
* refactor: GGML_VULKAN_SHADERS_GEN_TOOLCHAIN
Rename the macro GGML_SHADERS_GEN_TOOLCHAIN to GGML_VULKAN_SHADERS_GEN_TOOLCHAIN
* Disable GL_KHR_cooperative_matrix Vulkan extension if not available.
* Perform Vulkan extensions checks in a more sensible order
* Remove unnecessary #ifdef directive
Make the mul_mat_vec shaders support N>1 (as a spec constant, NUM_COLS) where
the batch_strides are overloaded to hold the row strides. Put the loads from the
B matrix in the innermost loop because it should cache better.
Share some code for reducing the result values to memory in mul_mat_vec_base.
* tests: Add im2col perf tests
* vulkan: optimize im2col, more elements per thread
* vulkan: increase small tile size for NV_coopmat2
* vulkan: change im2col to 512 elements per workgroup
Warning types fixed (observed under MSYS2 GCC 14.2.0):
* format '%ld' expects argument of type 'long int', but argument has type 'size_t'
* llama.cpp/ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp:81:46: warning: missing initializer for member '_STARTUPINFOA::lpDesktop' [-Wmissing-field-initializers] (emitted for all struct field except first)
Change the code to do 16b loads when possible and extract the appropriate
component late, so the code is effectively decoding a pair of elements and
then selecting one. This can allow more commoning to happen in the compiler
when neighboring elements are loaded.
* ensure mul mat shaders work on systems with subgroup size less than 32
more fixes
add test
* only s_warptile_mmq needs to be run with 32 threads or more
* double the number of rows per workgroup
* Update ggml-vulkan.cpp
* Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats
* only increase the number of rows for amd and subgroup size 64
* fix missing NUM_ROWS for mul_mat_vec_iq4_nl_f16_f32, untested
* use subgroup min and max to check for gcn (requires https://github.com/ggerganov/llama.cpp/pull/10721)
* manual merge ggml-vulkan.cpp
* set min and max subgroup size in any case
* Also double the number of rows for Intel GPUs
* Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats
* Fix subgroup size control extension support check
Add accf32 and accf16 checks for coopmats
* Also disable coopmats on amdvlk
Vulkan doesn't mandate a specific rounding mode, but the shader_float_controls
feature allows rounding mode to be requested if the implementation supports it.
There are some bugs in the 1.3.296 SDK, so disable this. It isn't strictly
necessary anyway.
Add missing dependency on vulkan-shaders-gen, so shaders get recompiled when it
changes.
Fix coopmat support reporting when glslc doesn't support NV_coopmat2.
* Vulkan: Implement VK_KHR_cooperative_matrix support in the matrix matrix multiplication shader
* Improve performance with better q4_k and q5_k dequant and store unrolling
* Add Vulkan MUL_MAT and MUL_MAT_ID accumulator precision selection
* Rework mulmat shader selection and compilation logic, avoid compiling shaders that won't get used by device
* Vulkan: Implement accumulator switch for specific mul mat mat shaders
* Vulkan: Unroll more loops for more mul mat mat performance
* Vulkan: Add VK_AMD_shader_core_properties2 support to read Compute Unit count for split_k logic
* Disable coopmat support on AMD proprietary driver
* Remove redundant checks
* Add environment variable GGML_VK_DISABLE_COOPMAT to disable VK_KHR_cooperative_matrix support
* Fix rebase typo
* Fix coopmat2 MUL_MAT_ID pipeline selection
* subgroup 64 version with subgroup add. 15% faster
scalable version
tested for subgroup sizes 16-128
* check for subgroup multiple of 16 and greater than 16
* subgroup sizes are always a power of 2 (https://github.com/KhronosGroup/GLSL/issues/45)
* force 16 sequential threads per block
* make 16 subgroup size a constant
This is an incremental improvement over #9118 to get work to the GPU a bit
sooner. The first part is to start with a smaller number of nodes before
the first submit, and ramp it up to the current 100 nodes/submit. The
second part is to reduce the dryrun overhead for all the nodes that just
need to request descriptor space.
With these changes I get around 1-2% speedup on RTX 4070 combined with my
old Haswell-era CPU.
There have been reports of failure to compile on systems with <= 32KB
of shared memory (e.g. #10037). This change makes the large tile size
fall back to a smaller size if necessary, and makes mul_mat_id fall
back to CPU if there's only 16KB of shared memory.
The vulkan-shaders-gen was not parsing the --no-clean argument correctly.
Because the previous code was parsing the arguments which have a value only
and the --no-clean argument does not have a value, it was not being parsed
correctly. This commit can now correctly parse arguments that don't have values.