mirror of
https://github.com/ggerganov/llama.cpp.git
quantize
TODO
Llama 2 7B
Quantization | Bits per Weight (BPW) |
---|---|
Q2_K | 3.35 |
Q3_K_S | 3.50 |
Q3_K_M | 3.91 |
Q3_K_L | 4.27 |
Q4_K_S | 4.58 |
Q4_K_M | 4.84 |
Q5_K_S | 5.52 |
Q5_K_M | 5.68 |
Q6_K | 6.56 |
Llama 2 13B
Quantization | Bits per Weight (BPW) |
---|---|
Q2_K | 3.34 |
Q3_K_S | 3.48 |
Q3_K_M | 3.89 |
Q3_K_L | 4.26 |
Q4_K_S | 4.56 |
Q4_K_M | 4.83 |
Q5_K_S | 5.51 |
Q5_K_M | 5.67 |
Q6_K | 6.56 |
Llama 2 70B
Quantization | Bits per Weight (BPW) |
---|---|
Q2_K | 3.40 |
Q3_K_S | 3.47 |
Q3_K_M | 3.85 |
Q3_K_L | 4.19 |
Q4_K_S | 4.53 |
Q4_K_M | 4.80 |
Q5_K_S | 5.50 |
Q5_K_M | 5.65 |
Q6_K | 6.56 |
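
The bits-per-weight figures in the tables above can be turned into a rough file-size estimate with `size ≈ parameters × BPW / 8` bytes. The sketch below is illustrative only: the helper name `estimated_size_gib` is hypothetical, and the parameter count is an assumed nominal 7e9 rather than the exact tensor element count, so real files will differ somewhat (metadata and non-quantized tensors are also ignored).

```python
def estimated_size_gib(n_params: float, bpw: float) -> float:
    """Approximate quantized model size in GiB.

    n_params: number of weights in the model
    bpw:      average bits per weight for the chosen quantization
    """
    size_bytes = n_params * bpw / 8  # 8 bits per byte
    return size_bytes / 2**30        # bytes -> GiB


# BPW values copied from the Llama 2 7B table.
LLAMA2_7B_BPW = {
    "Q2_K": 3.35,
    "Q3_K_M": 3.91,
    "Q4_K_M": 4.84,
    "Q5_K_M": 5.68,
    "Q6_K": 6.56,
}

# Assumption: nominal parameter count, not the exact number of weights.
N_PARAMS_7B = 7e9

if __name__ == "__main__":
    for name, bpw in LLAMA2_7B_BPW.items():
        size = estimated_size_gib(N_PARAMS_7B, bpw)
        print(f"{name}: ~{size:.2f} GiB")
```

Because the estimate is linear in BPW, the ratio between two rows of a table is also the approximate ratio between the resulting file sizes for that model.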