This change upstreams llamafile's cpu matrix
multiplication kernels for ppc64le using MMA
builtins for FP32 datatype.
This change results in a consistent 90%
improvement in input processing time, and 20%
to 80% improvement in output processing time,
across various batch sizes.
The patch is tested with Meta-Lllama-3-8B,
Mistral-7B, Llama-2-7B-chat-hf models on a
IBM POWER10 machine.
Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
* metal : opt-in compile flag for BF16
ggml-ci
* ci : use BF16
ggml-ci
* swift : switch back to v12
* metal : has_float -> use_float
ggml-ci
* metal : fix BF16 check in MSL
ggml-ci
* ggml : add ggml_flash_attn_ext_get_prec
* metal : use F16 precision in FA kernels
ggml-ci
* metal : minor clean-up
* metal : compile-guard bf16 FA kernels
ggml-ci
* build : remove obsolete compile flag [no ci]
* metal : prevent int overflows [no ci]
* cuda : disable BF16 FA
ggml-ci
* metal : fix BF16 requirement for FA kernels
ggml-ci
* make : clean-up [no ci]
* rwkv6: rename to wkv6
* rwkv6: support avx2 avx512 armv8 armv9
* rwkv6: update cuda file name
* rwkv6: rename params
* wkv on sycl
* sycl: add some ops
* sycl: Enhance OP support judgment
* wkv6: drop armv9 and tranfer to GGML style
ggml-ci
* sync : ggml
* update the function to use appropriate types
* fix define error
* Update ggml/src/ggml-cpu.c
* add appropriate asserts
* move element-wise functions outside
* put the declaration outside the loop
* rewrite to be more inline with the common pattern for distributing threads
* use recommended way GGML_TENSOR_LOCALS
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
Co-authored-by: Plamen Minev <pacominev@gmail.com>
Co-authored-by: Yuri Khrustalev <ykhrustalev@users.noreply.github.com>
Co-authored-by: Meng, Hengyu <airdldl@163.com>
* ggml : add initial BF16 support
ggml-ci
* metal : add mul_mat_id BF16 support
ggml-ci
* metal : check for bfloat support on the Metal device
ggml-ci
* metal : better var names [no ci]
* metal : do not build bfloat kernels when not supported
ggml-ci
* metal : try to fix BF16 support check
ggml-ci
* metal : this should correctly check bfloat support
* metal : add quantized FA (vec) support
ggml-ci
* metal : add quantized FA (non-vec) support
* metal : fix support check
ggml-ci
* metal : clean-up
* metal : clean-up (cont)
* metal : fix shared memory calc + reduce smem + comments
* metal : float-correctness
* metal : minor [no ci]
* q6_k instruction reordering attempt
* better subtract method
* should be theoretically faster
small improvement with shuffle lut, likely because all loads are already done at that stage
* optimize bit fiddling
* handle -32 offset separately. bsums exists for a reason!
* use shift
* Update ggml-quants.c
* have to update ci macos version to 13 as 12 doesnt work now. 13 is still x86
* llama : fix buffer checks for mamba and rwk
* llama : fix missing worst case flag during reserve
* cuda : fix supports_op for norm
* disable sched SET_CAUSE
* ggml : fix gguf string leak when reading kv pairs fails
* ggml : avoid crashing with GGML_ABORT when the KV has an invalid type
* ggml : avoid crashing on failed memory allocations when loading a gguf file
* ggml: Add POOL2D OP for GPU ACC to the Vulkan.
- The MobileVLM model now supports inference acceleration through GPU by utilizing the Vulkan backend.
- A GGML_OP_POOL_2D shader has been added. (Pooling)
- The encoding performance of the CLIP model improved from 2.8s on the CPU to 0.7s on the GPU.
Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
* [fix] Correct the incorrect order of the parameters.
fix casting to int.
Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
---------
Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
* metal : support permuted matrix multiplicaions
ggml-ci
* cont : use nb01 directly for row steps
ggml-ci
* cont : add comments [no ci]
* metal : minor refactor
* metal : minor
This commit removes the setting of the `used` field of the contexts in
the global state (g_state) in `ggml_init`.
The motivation for this change is that I believe that this additional
initialization might not be required after the changes in Commit
45fc4fed0b9fb5b1af4a8525cbebb95e11208732 ("sync : latest changes from
whisper.cpp"), which changed the initialization of the contexts field
from `{ 0 }` to `{ { 0 } }`:
```console
g_state = (struct ggml_state) {
- /*.contexts =*/ { 0 },
+ /*.contexts =*/ { { 0 } },
};
```
My understanding is that the `{0}` initialization might not have
zero-initialized all the nested fields in every array element because of
compiler differences, and might have been the reason for having the
explicit setting of the `used` fields to false.
* [CANN] Adapt to dynamically loadable backends mechanism
* Fix the Bug: inference running result is garbled in debug running model for LM models who's type is Q4_0 class
* Handle the review comments of this pull request
add intel amx isa detection
add vnni kernel for gemv cases
add vnni and amx kernel support for block_q8_0
code cleanup
fix packing B issue
enable openmp
fine tune amx kernel
switch to aten parallel pattern
add error message for nested parallelism
code cleanup
add f16 support in ggml-amx
add amx kernels for QK_K quant formats: Q4_K, Q5_K, Q6_K and IQ4_XS
update CMakeList
update README
fix some compilation warning
fix compiler warning when amx is not enabled
minor change
ggml-ci
move ggml_amx_init from ggml.c to ggml-amx/mmq.cpp
ggml-ci
update CMakeLists with -mamx-tile, -mamx-int8 and -mamx-bf16
ggml-ci
add amx as an ggml-backend
update header file, the old path for immintrin.h has changed to ggml-cpu-impl.h
minor change
update CMakeLists.txt
minor change
apply weight prepacking in set_tensor method in ggml-backend
fix compile error
ggml-ci
minor change
ggml-ci
update CMakeLists.txt
ggml-ci
add march dependency
minor change
ggml-ci
change ggml_backend_buffer_is_host to return false for amx backend
ggml-ci
fix supports_op
use device reg for AMX backend
ggml-ci
minor change
ggml-ci
minor change
fix rebase
set .buffer_from_host_ptr to be false for AMX backend
* fix: use `vm_allocate` to allocate CPU backend buffer on macOS
* fix: switch to `posix_memalign` to keep existing `free()` usages work
* feat: move `GGML_ALIGNED_MALLOC` to `ggml-backend-impl.h`, add support for `vm_allocate` on macOS
* style: formatting
* fix: move const outside of `#ifndef`
* style: formatting
* fix: unused var
* fix: transform `GGML_ALIGNED_MALLOC` and `GGML_ALIGNED_FREE` into functions and add them to `ggml-impl.h`
* fix: unused var
* fix: page align to `GGUF_DEFAULT_ALIGNMENT`
* fix: page align to `TENSOR_ALIGNMENT`
* fix: convert `TENSOR_ALIGNMENT` to a macro
* fix: increase page size to `32` on iOS
* fix: iOS page size
* fix: `hbw_posix_memalign` alignment
This commit removes the buffer_id field from the leaf_alloc struct.
The motivation for is that this field is only written to and never
read/used as far as I can tell. Each tensor_alloc has a buffer_id field
and this is what caused me to look into this more closely, to
understand what the buffer_id in leaf_alloc was used for.
* Vectorize load instructions in dmmv f16 CUDA kernel
Replaces scalar with vector load instructions, which substantially
improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall
speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on
H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup.
* addressed comment
* Update ggml/src/ggml-cuda/dmmv.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* ggml : do not use BLAS with types without to_float
* ggml : return pointer from ggml_internal_get_type_traits to avoid unnecessary copies
* ggml : rename ggml_internal_get_type_traits -> ggml_get_type_traits
it's not really internal if everybody uses it
* docs : clarify building Android on Termux
* docs : update building Android on Termux
* docs : add cross-compiling for Android
* cmake : link dl explicitly for Android
* ggml : add metal backend registry / device
ggml-ci
* metal : fix names [no ci]
* metal : global registry and device instances
ggml-ci
* cont : alternative initialization of global objects
ggml-ci
* llama : adapt to backend changes
ggml-ci
* fixes
* metal : fix indent
* metal : fix build when MTLGPUFamilyApple3 is not available
ggml-ci
* fix merge
* metal : avoid unnecessary singleton accesses
ggml-ci
* metal : minor fix [no ci]
* metal : g_state -> g_ggml_ctx_dev_main [no ci]
* metal : avoid reference of device context in the backend context
ggml-ci
* metal : minor [no ci]
* metal : fix maxTransferRate check
* metal : remove transfer rate stuff
---------
Co-authored-by: slaren <slarengh@gmail.com>
* Single allocation of encode_async block with non-ARC capture in ggml-metal.m
* Moving Block_release to the deallocation code
* Release encode block when re-setting encoding buffer count if needed
* Update ggml/src/ggml-metal.m
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Add scaffolding for ggml logging macros
* Metal backend now uses GGML logging
* Cuda backend now uses GGML logging
* Cann backend now uses GGML logging
* Add enum tag to parameters
* Use C memory allocation funcs
* Fix compile error
* Use GGML_LOG instead of GGML_PRINT
* Rename llama_state to llama_logger_state
* Prevent null format string
* Fix whitespace
* Remove log callbacks from ggml backends
* Remove cuda log statement
* vulkan : do not use tensor->extra
This patch allows using the Vulkan backend with the RPC backend as
tensor->extra is no longer used.
Ref: #8536
* Adapt GGML_VULKAN_CHECK_RESULTS to extra removal (#2)
---------
Co-authored-by: 0cc4m <picard12@live.de>
When the device's warp size is less than 16,
it is possible for loadstride_a (mul_mm.comp:114)
and loadstride_b (mul_mm.comp:115) to be set to 0.
Because they are calculated as: the workgroup size,
multiplied by LOAD_VEC_* (which can be 1) and divided by 16.
And the workgroup size is set to be the same as the
warp/subgroup size.
The loadstride_* variables are used as increments in the
loops that populate the buffers used for the multiplication.
When they are 0 they cause an infinite loop.
But infinite loops without side-effects are UB and the
values of loadstride_* are known at compile time.
So, the compiler quietly optimizes all the loops away.
As a consequence, the buffers are not populated and
the multiplication result is just a matrix with all elements
set to 0.
We prevent the UB by making sure that the workgroup size
will never be less than 16, even if our device has a
smaller warp size (e.g. 8).
Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
a return before a barrier (that happens only in some threads in
a workgroup) leads to UB.
While the old code actually works on some devices,
it fails on some others (i.e. "smaller" GPUs).
BTW, I think it would be better to set specialization constants
when the graph is built, in that way the local workgroup
could be sized appropriately.
But it would take a lot of work.
Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
* ggml: Added run-time detection of neon, i8mm and sve
Adds run-time detection of the Arm instructions set features
neon, i8mm and sve for Linux and Apple build targets.
* ggml: Extend feature detection to include non aarch64 Arm arch
* ggml: Move definition of ggml_arm_arch_features to the global data section
* ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels
* added fallback mechanism when the offline re-quantized model is not
optimized for the underlying target.
* fix for build errors
* remove prints from the low-level code
* Rebase to the latest upstream