llama.cpp/ggml
Eve 18429220bd
AVX BF16 and single scale quant optimizations (#10212)
* use 128-bit loads (I've tried 256->128 to death and it's slower)

* double accumulator

* avx bf16 vec dot (see the sketch after this entry)

* +3% q4_0 inference

* +7% tg (text generation), +5% pp (prompt processing) compared to master

* slower f16c version, kept for reference

* 256-bit version, also slow. I tried :)

* revert f16

* faster with madd

* split into functions

* Q8_0 and IQ4_NL, 5-7% faster

* fix potential overflow (performance reduced)

* 16-bit add for q4_0 only

* merge
2024-11-15 12:47:58 +01:00
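The bullets above name the main ingredients of the speedup: bf16 values are loaded in 128-bit chunks and widened to 256-bit floats, two independent accumulators hide FMA latency, and the single-scale quant formats (Q4_0, Q8_0, IQ4_NL) lean on madd-style integer instructions with cheap 16-bit adds. What follows is a minimal sketch of the bf16 idea in plain AVX2/FMA intrinsics, not the actual ggml kernel; the function names are invented here and n is assumed to be a multiple of 16.

#include <immintrin.h>
#include <stdint.h>

/* Widen 8 bf16 values to 8 floats: bf16 is the top half of an f32 bit
 * pattern, so zero-extend the 16-bit values to 32 bits and shift left 16. */
static inline __m256 bf16x8_to_f32x8(const uint16_t *p) {
    __m128i h = _mm_loadu_si128((const __m128i *) p);       /* 128-bit load of 8 bf16 */
    __m256i w = _mm256_cvtepu16_epi32(h);                    /* zero-extend to 32-bit  */
    return _mm256_castsi256_ps(_mm256_slli_epi32(w, 16));    /* place in f32 high bits */
}

/* Illustrative bf16 dot product (assumes n is a multiple of 16). */
static float vec_dot_bf16_sketch(int n, const uint16_t *x, const uint16_t *y) {
    __m256 acc0 = _mm256_setzero_ps();   /* double accumulator: two chains */
    __m256 acc1 = _mm256_setzero_ps();   /* to hide FMA latency            */
    for (int i = 0; i < n; i += 16) {
        acc0 = _mm256_fmadd_ps(bf16x8_to_f32x8(x + i),     bf16x8_to_f32x8(y + i),     acc0);
        acc1 = _mm256_fmadd_ps(bf16x8_to_f32x8(x + i + 8), bf16x8_to_f32x8(y + i + 8), acc1);
    }
    __m256 acc = _mm256_add_ps(acc0, acc1);
    /* horizontal sum of the 8 lanes */
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}

For the integer formats, the "faster with madd" and "16-bit add" bullets point at the usual maddubs/madd pattern: pairwise u8*s8 products land in 16-bit lanes, and because Q4_0 nibbles are at most 15, two blocks can be combined with a 16-bit add before a single widening madd. Again only a hedged sketch, assuming the 4-bit weights arrive as unsigned bytes in [0, 15] with scales and zero-points handled elsewhere:

/* Combine two quant blocks with a 16-bit add, then widen once to 32-bit. */
static inline __m256i dot_two_blocks_sketch(__m256i q0, __m256i y0,
                                            __m256i q1, __m256i y1) {
    __m256i p0  = _mm256_maddubs_epi16(q0, y0);           /* u8*s8 pair sums, 16-bit   */
    __m256i p1  = _mm256_maddubs_epi16(q1, y1);
    __m256i s16 = _mm256_add_epi16(p0, p1);               /* lanes stay well below 32767 */
    return _mm256_madd_epi16(s16, _mm256_set1_epi16(1));  /* widen to 32-bit sums       */
}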
include          backend cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels (#9921)   2024-11-15 01:28:50 +01:00
src              AVX BF16 and single scale quant optimizations (#10212)                    2024-11-15 12:47:58 +01:00
.gitignore       vulkan : cmake integration (#8119)                                        2024-07-13 18:12:39 +02:00
CMakeLists.txt   backend cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels (#9921)   2024-11-15 01:28:50 +01:00