llama.cpp/tests at b2303

mirror of https://github.com/ggerganov/llama.cpp.git synced 2025-01-10 04:20:24 +01:00

Kawrakow 0becb22ac0

IQ4_XS: a 4.25 bpw quantization (#5747)

* Try IQ4_NL with blocks of 64 - does not look good

* iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32

* iq4_xs: CUDA works - 133.2 t/s

* iq4_xs: AVX2 dot product

* iq4_xs: ARM_NEON dot product

* iq4_nl: Metal implementation

As usual, Metal / Apple Silicon don't like my quants.

* iq3_xs: minor fix

* iq4_xs: shrink by using IQ3_S for attn_k and attn_q

* iq4_xs: revert using IQ3_S for attn_k and attn_v

PPL vs size is good, but CPU performance suffers: on M2 Max
TG-128 drops to 21.7 t/s from 28.8, and on a Ryzen-7950X
to 14.5 t/s from 15.8 t/s. On CUDA we have 135 t/s when
using IQ3_S vs 133 t/s with pure IQ4_XS.

* Fix CI

* iq4_xs: Added forgotten check for 256 divisibility

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-02-27 16:34:24 +02:00
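
The 4.25 bpw figure in the commit title can be checked from the layout the message describes: super-blocks of 256 weights, 4-bit quants, and a 6-bit scale per block of 32. The sketch below is only an arithmetic illustration, not the ggml implementation. It assumes one additional 16-bit (fp16) scale per super-block, which the commit message does not state, and the function names are hypothetical. It also mirrors the "256 divisibility" check mentioned in the last bullet.

```python
SUPER_BLOCK = 256      # weights per super-block (from the commit message)
BLOCK = 32             # weights per block (from the commit message)
QUANT_BITS = 4         # bits per quantized weight (4-bit quants)
SCALE_BITS = 6         # bits per block scale (from the commit message)
SUPER_SCALE_BITS = 16  # ASSUMPTION: one fp16 scale per super-block


def bits_per_super_block() -> int:
    """Total storage, in bits, for one super-block under the assumptions above."""
    n_blocks = SUPER_BLOCK // BLOCK
    return SUPER_BLOCK * QUANT_BITS + n_blocks * SCALE_BITS + SUPER_SCALE_BITS


def bits_per_weight() -> float:
    """Average storage cost per weight."""
    return bits_per_super_block() / SUPER_BLOCK


def row_size_bytes(n_per_row: int) -> int:
    """Bytes needed for one row; the row must be a whole number of super-blocks."""
    if n_per_row % SUPER_BLOCK != 0:
        raise ValueError(f"row length {n_per_row} is not divisible by {SUPER_BLOCK}")
    return (n_per_row // SUPER_BLOCK) * (bits_per_super_block() // 8)


if __name__ == "__main__":
    print(bits_per_weight())     # 4.25
    print(row_size_bytes(4096))  # 2176
```

Under these assumptions a super-block takes 1024 + 48 + 16 = 1088 bits (136 bytes), which gives 1088 / 256 = 4.25 bits per weight.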

.gitignore

tests : .gitignore obj files

2024-02-08 09:46:47 +02:00

CMakeLists.txt

llama : add llama_chat_apply_template() (#5538)

2024-02-19 10:23:37 +02:00

get-model.cpp

ci : add model tests + script wrapper (#4586)

2024-01-26 14:18:00 +02:00

get-model.h

ci : add model tests + script wrapper (#4586)

2024-01-26 14:18:00 +02:00

test-autorelease.cpp

ggml : add numa options (#5377)

2024-02-16 11:31:07 +02:00

test-backend-ops.cpp

IQ4_XS: a 4.25 bpw quantization (#5747)

2024-02-27 16:34:24 +02:00

test-c.c

Nomic Vulkan backend (#4456)

2024-01-29 15:50:50 -05:00

test-chat-template.cpp

Add Gemma chat template (#5665)

2024-02-22 19:10:21 +01:00

test-double-float.cpp

ggml : move FP16 <-> FP32 code to ggml-impl.h (#3861)

2023-10-30 19:19:15 +02:00

test-grad0.cpp

cuda : improve cuda pool efficiency using virtual memory (#4606)

2023-12-24 14:34:22 +01:00

test-grammar-parser.cpp

ggml, common, examples, tests : fixed type arguments in printf (#5528)

2024-02-18 18:20:12 +02:00

test-llama-grammar.cpp

ggml, common, examples, tests : fixed type arguments in printf (#5528)

2024-02-18 18:20:12 +02:00

test-model-load-cancel.cpp

ggml : add numa options (#5377)

2024-02-16 11:31:07 +02:00

test-opt.cpp

code : normalize enum names (#5697)

2024-02-25 12:09:09 +02:00

test-quantize-fns.cpp

Adding IQ2_S and IQ2_M to complete coverage of the 2-3 bit quantization range (#5721)

2024-02-26 18:28:38 +02:00

test-quantize-perf.cpp

ggml : add mmla kernels for quantized GEMM (#4966)

2024-02-11 15:22:33 +02:00

test-rope.cpp

llama : custom attention mask + parallel decoding + no context swaps (#3228)

2023-09-28 19:04:36 +03:00

test-sampling.cpp

sampling: fix top_k <= 0 (#5388)

2024-02-08 09:46:30 +01:00

test-tokenizer-0-falcon.cpp

ggml : add numa options (#5377)

2024-02-16 11:31:07 +02:00

test-tokenizer-0-falcon.py

ci : add flake8 to github actions (python linting) (#4129)

2023-11-20 11:35:47 +01:00

test-tokenizer-0-llama.cpp

ggml : add numa options (#5377)

2024-02-16 11:31:07 +02:00

test-tokenizer-0-llama.py

ci : add flake8 to github actions (python linting) (#4129)

2023-11-20 11:35:47 +01:00

test-tokenizer-1-bpe.cpp

ggml : add numa options (#5377)

2024-02-16 11:31:07 +02:00

test-tokenizer-1-llama.cpp

ggml : add numa options (#5377)

2024-02-16 11:31:07 +02:00