* feat: load all backends from a user-provided search path
* fix: Windows search path
* refactor: rename `ggml_backend_load_all_in_search_path` to `ggml_backend_load_all_from_path`
* refactor: rename `search_path` to `dir_path`
* fix: change `NULL` to `nullptr`
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* fix: change `NULL` to `nullptr`
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
Vulkan doesn't mandate a specific rounding mode, but the shader_float_controls
feature allows rounding mode to be requested if the implementation supports it.
* Renames NVIDIA GPU-architecture flags to avoid name clashes with WinAPI. (e.g. CC_PASCAL, GPU architecture or WinAPI pascal compiler flag?)
* Reverts erroneous rename in SYCL-code.
* Renames GGML_CUDA_MIN_CC_DP4A to GGML_CUDA_CC_DP4A.
* Renames the rest of the compute capability macros for consistency.
There are some bugs in the 1.3.296 SDK, so disable this. It isn't strictly
necessary anyway.
Add missing dependency on vulkan-shaders-gen, so shaders get recompiled when it
changes.
Fix coopmat support reporting when glslc doesn't support NV_coopmat2.
* rename ggml-cpu-aarch64.c to .cpp
* reformat extra cpu backend.
- clean Q4_0_N_M and IQ4_0_N_M
- remove from "file" tensor type
- allow only with dynamic repack
- extract cpu extra bufts and convert to C++
- hbm
- "aarch64"
- more generic use of extra buffer
- generalise extra_supports_op
- new API for "cpu-accel":
- amx
- aarch64
* clang-format
* Clean Q4_0_N_M ref
Enable restrict on C++
* add op GGML_OP_MUL_MAT_ID for Q4_0_N_M with runtime repack
* added/corrected control on tensor size for Q4 repacking.
* Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add debug logs on repacks.
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Vulkan: Implement VK_KHR_cooperative_matrix support in the matrix matrix multiplication shader
* Improve performance with better q4_k and q5_k dequant and store unrolling
* Add Vulkan MUL_MAT and MUL_MAT_ID accumulator precision selection
* Rework mulmat shader selection and compilation logic, avoid compiling shaders that won't get used by device
* Vulkan: Implement accumulator switch for specific mul mat mat shaders
* Vulkan: Unroll more loops for more mul mat mat performance
* Vulkan: Add VK_AMD_shader_core_properties2 support to read Compute Unit count for split_k logic
* Disable coopmat support on AMD proprietary driver
* Remove redundant checks
* Add environment variable GGML_VK_DISABLE_COOPMAT to disable VK_KHR_cooperative_matrix support
* Fix rebase typo
* Fix coopmat2 MUL_MAT_ID pipeline selection
* metal : Extend how Llama.cpp locates metal resources (#10675)
* It searches the resource file in the directory where the current
binary is located as well.
* Resolves symbolic links.
Rationale:
When we plug this dependency into a Bazel build and run it in the
context of Bazel (e.g. testing):
* the execution directory is often very different from where the files
are located and no direct control over this (Bazel sandboxing),
* the Bazel sandbox often use symbolic links to make files available.
With this patch, we can have the resource file added to the target,
can build and run tests in the context of Bazel.
* Update ggml/src/ggml-metal/ggml-metal.m
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml/src/ggml-metal/ggml-metal.m
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* ggml_pad_reflect_1d defined in header
* implemented on CPU
* called the forward pass
* impl Metal kernel
* added Metal kernel
* added OP_PAD_REFLECT_1D in test-backend-ops.cpp
* add test-pad-reflect-1d test case
* test case support multiple backend
* [SYCL] Move to Compile Time backend selection on oneMKL Interface for NVIDIA backend
Move to compile time selection to backend to avoid latency at run time.
Add it to all mkl gemm calls and only for NVIDIA backend.
Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
* Formatting
* Address PR comments to increase readibility
---------
Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
* kqmax_new_j in every thread within warp is same after operate at line 199,this reduce can be omit
* same problem in vec32
---------
Co-authored-by: ZhaoXiaoYu <zhao.xiaoyu@zte.com.cn>
* subgroup 64 version with subgroup add. 15% faster
scalable version
tested for subgroup sizes 16-128
* check for subgroup multiple of 16 and greater than 16
* subgroup sizes are always a power of 2 (https://github.com/KhronosGroup/GLSL/issues/45)
* force 16 sequential threads per block
* make 16 subgroup size a constant
This is an incremental improvement over #9118 to get work to the GPU a bit
sooner. The first part is to start with a smaller number of nodes before
the first submit, and ramp it up to the current 100 nodes/submit. The
second part is to reduce the dryrun overhead for all the nodes that just
need to request descriptor space.
With these changes I get around 1-2% speedup on RTX 4070 combined with my
old Haswell-era CPU.
* CANN: Fix the bug build fail on Ascend310P under two cases:
1) Manual specify SOC_TYPE
2) Under some unusual compile environment
* Update the cann backend News content: Support F16 and F32 data type model for Ascend 310P NPU.
* fix CANN compile fail bug: the assert in ascend kernel function doesn't supportted on some CANN version
There have been reports of failure to compile on systems with <= 32KB
of shared memory (e.g. #10037). This change makes the large tile size
fall back to a smaller size if necessary, and makes mul_mat_id fall
back to CPU if there's only 16KB of shared memory.