llama.cpp

Mirror of https://github.com/ggerganov/llama.cpp.git (synced 2025-01-10 12:30:50 +01:00)

Vectorize load instructions in dmmv f16 CUDA kernel (#9816)

* Vectorize load instructions in dmmv f16 CUDA kernel

Replaces scalar with vector load instructions, which substantially
improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall
speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on
H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup.

* addressed comment

* Update ggml/src/ggml-cuda/dmmv.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

2024-10-14 02:49:08 +02:00
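The scalar-to-vector load change described in the commit message above can be illustrated with a small, generic CUDA sketch. This is not the code from ggml/src/ggml-cuda/dmmv.cu: the kernel names `dot_f16_scalar` and `dot_f16_vec8` and the host helper `run_dot` are hypothetical, and the sketch assumes the element count is a multiple of 8 and that the f16 buffer is 16-byte aligned (which holds for a pointer returned by `cudaMalloc`).

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <vector>

// Scalar variant: one 16-bit load of x per multiply-add.
__global__ void dot_f16_scalar(const half * x, const float * y, float * out, int n) {
    float sum = 0.0f;
    for (int i = blockIdx.x*blockDim.x + threadIdx.x; i < n; i += gridDim.x*blockDim.x) {
        sum += __half2float(x[i]) * y[i];
    }
    atomicAdd(out, sum);
}

// Vectorized variant: one 128-bit load fetches 8 f16 values at once.
// Assumption: n % 8 == 0 and x is 16-byte aligned.
__global__ void dot_f16_vec8(const half * x, const float * y, float * out, int n) {
    const float4 * x4 = reinterpret_cast<const float4 *>(x);
    float sum = 0.0f;
    for (int i = blockIdx.x*blockDim.x + threadIdx.x; i < n/8; i += gridDim.x*blockDim.x) {
        const float4 packed = x4[i]; // single 128-bit load
        const half2 * h2 = reinterpret_cast<const half2 *>(&packed);
        for (int j = 0; j < 4; ++j) {
            const float2 f = __half22float2(h2[j]); // unpack two f16 values
            sum += f.x * y[8*i + 2*j] + f.y * y[8*i + 2*j + 1];
        }
    }
    atomicAdd(out, sum);
}

// Hypothetical host helper: computes dot(x, y) with x[i] = i % 4 and y[i] = 1.
// Error checking is omitted to keep the sketch short.
float run_dot(int n, bool vectorized) {
    std::vector<half>  hx(n);
    std::vector<float> hy(n, 1.0f);
    for (int i = 0; i < n; ++i) {
        hx[i] = __float2half(float(i % 4)); // small integers are exact in f16
    }

    half  * dx   = nullptr;
    float * dy   = nullptr;
    float * dout = nullptr;
    cudaMalloc(&dx,   n*sizeof(half));
    cudaMalloc(&dy,   n*sizeof(float));
    cudaMalloc(&dout, sizeof(float));
    cudaMemcpy(dx, hx.data(), n*sizeof(half),  cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), n*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dout, 0, sizeof(float));

    if (vectorized) {
        dot_f16_vec8<<<4, 64>>>(dx, dy, dout, n);
    } else {
        dot_f16_scalar<<<4, 64>>>(dx, dy, dout, n);
    }

    float result = 0.0f;
    cudaMemcpy(&result, dout, sizeof(float), cudaMemcpyDeviceToHost); // waits for the kernel
    cudaFree(dx);
    cudaFree(dy);
    cudaFree(dout);
    return result;
}
```

Calling `run_dot(1024, false)` and `run_dot(1024, true)` should return the same value, since both kernels sum the same integer-valued products. Only the number of load instructions issued for `x` differs, and that reduction is what the commit credits for the speedup on HBM GPUs.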

cmake

llama : reorganize source code + improve CMake (#8006)

2024-06-26 18:33:02 +03:00

include

rpc : add backend registry / device interfaces (#9812)

2024-10-10 20:14:55 +02:00

src

Vectorize load instructions in dmmv f16 CUDA kernel (#9816)

2024-10-14 02:49:08 +02:00

.gitignore

vulkan : cmake integration (#8119)

2024-07-13 18:12:39 +02:00

CMakeLists.txt

cmake : do not hide GGML options + rename option (#9465)

2024-09-16 10:27:50 +03:00