From cbababee1f5ba8cd64eef187bffa07881d7e0cd8 Mon Sep 17 00:00:00 2001
From: Brian
Date: Sat, 25 May 2024 11:45:12 +1000
Subject: [PATCH] Updated Tensor Encoding Schemes (markdown)

---
 Tensor-Encoding-Schemes.md | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/Tensor-Encoding-Schemes.md b/Tensor-Encoding-Schemes.md
index 52395e6..e353927 100644
--- a/Tensor-Encoding-Schemes.md
+++ b/Tensor-Encoding-Schemes.md
@@ -20,7 +20,12 @@ This is not definitive, but is helpful when reading source code or console output
 - `IQ`: i-quant based models. X bits per weight, where `X` could be `4` (for 4 bits) or `8` (for 8 bits) etc...
 - ``: This represents different strategies of packing quantized weights into a gguf file. This is because we may want a mix of different bit sizes for weights of varying importance, or we may be encoding a general offset to a block or super-block. This may be omitted if trivial or an initial attempt; refer to the encoding scheme name table for details.
 
-## Tensor Scheme Mapping
+## Tensor Encoding Scheme PRs
+
+* [PR filter of all Tensor Encoding Scheme related Pull Requests](https://github.com/ggerganov/llama.cpp/issues?q=label%3A%22Tensor+Encoding+Scheme%22)
+  - This is useful for locating Pull Requests where a new encoding scheme was merged in (or an encoding bug was fixed, etc...)
+
+## Tensor Encoding Scheme Mapping
 
 | Scheme | `ggml_ftype` C enumeration name | `ggml_type` C enum name | Bits/Weight | Data Type | Block Configuration | Quantized Weight Formula | Initial Commits Or Pull Request Sources (of `ggml_type`) |
 | -------- | ------------------------------- | ----------------------- | ----------- | ----------------------------- | ---------------------------------------------------------------------- | ----------------------------------------------- | ------------------------------------------------------------------------ |
@@ -32,12 +37,12 @@
 | Q8_1 | - | GGML_TYPE_Q8_1 | 8 | round to nearest quantization | Each block has 32 weights | w = q * block_scale + block_minimum | [llama.cpp PR: Add Q8_0 quantization for intermediate results #951 (Note: Renamed to Q8_1 in later commit)](https://github.com/ggerganov/llama.cpp/pull/951) |
 | Q5_0 | GGML_FTYPE_MOSTLY_Q5_0 | GGML_TYPE_Q5_0 | 5 | round to nearest quantization | Each block has 32 weights | w = q * block_scale | [llama.cpp PR: Add Q5_0 and Q5_1 quantization #1187](https://github.com/ggerganov/llama.cpp/pull/1187) |
 | Q5_1 | GGML_FTYPE_MOSTLY_Q5_1 | GGML_TYPE_Q5_1 | 5 | round to nearest quantization | Each block has 32 weights | w = q * block_scale + block_minimum | [llama.cpp PR: Add Q5_0 and Q5_1 quantization #1187](https://github.com/ggerganov/llama.cpp/pull/1187) |
-| KQ2 | GGML_FTYPE_MOSTLY_Q2_K | GGML_TYPE_Q2_K | 2.5625 | k-quantization | Superblocks has 16 blocks ( 16 weights per block) | w = q * block_scale (4-bit) + block_min (4-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
-| KQ3 | GGML_FTYPE_MOSTLY_Q3_K | GGML_TYPE_Q3_K | 3.4375 | k-quantization | Superblocks has 16 blocks ( 16 weights per block) | w = q * block_scale (6-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
-| KQ4 | GGML_FTYPE_MOSTLY_Q4_K | GGML_TYPE_Q4_K | 4.5 | k-quantization | Superblocks has 8 blocks ( 32 weights per block) | w = q * block_scale (6-bit) + block_min (6-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
-| KQ5 | GGML_FTYPE_MOSTLY_Q5_K | GGML_TYPE_Q5_K | 5.5 | k-quantization | Superblocks has 8 blocks ( 32 weights per block) | w = q * block_scale (6-bit) + block_min (6-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
-| KQ6 | GGML_FTYPE_MOSTLY_Q6_K | GGML_TYPE_Q6_K | 6.5625 | k-quantization | Superblocks has 16 blocks ( 16 weights per block) | w = q * block_scale (8-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
-| KQ8 | - | GGML_TYPE_Q8_K | 8.0 | k-quantization | Superblocks has 1 blocks (256 weights per block) (Only used for intermediate quants) | w = q * block_scale (8-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
+| Q2_K | GGML_FTYPE_MOSTLY_Q2_K | GGML_TYPE_Q2_K | 2.5625 | k-quantization | Each superblock has 16 blocks (16 weights per block) | w = q * block_scale (4-bit) + block_min (4-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
+| Q3_K | GGML_FTYPE_MOSTLY_Q3_K | GGML_TYPE_Q3_K | 3.4375 | k-quantization | Each superblock has 16 blocks (16 weights per block) | w = q * block_scale (6-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
+| Q4_K | GGML_FTYPE_MOSTLY_Q4_K | GGML_TYPE_Q4_K | 4.5 | k-quantization | Each superblock has 8 blocks (32 weights per block) | w = q * block_scale (6-bit) + block_min (6-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
+| Q5_K | GGML_FTYPE_MOSTLY_Q5_K | GGML_TYPE_Q5_K | 5.5 | k-quantization | Each superblock has 8 blocks (32 weights per block) | w = q * block_scale (6-bit) + block_min (6-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
+| Q6_K | GGML_FTYPE_MOSTLY_Q6_K | GGML_TYPE_Q6_K | 6.5625 | k-quantization | Each superblock has 16 blocks (16 weights per block) | w = q * block_scale (8-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
+| Q8_K | - | GGML_TYPE_Q8_K | 8.0 | k-quantization | Each superblock has 1 block (256 weights per block) (only used for intermediate quants) | w = q * block_scale (8-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
 | IQ1_S | GGML_FTYPE_MOSTLY_IQ1_S | GGML_TYPE_IQ1_S | 1.5 | i-quantization | Each superblock has 8 blocks (32 weights per block) | w = func(superblock_scale, importance_matrix) | [llama.cpp PR: 1.5 bit quantization #5453](https://github.com/ggerganov/llama.cpp/pull/5453) |
 | IQ1_M | GGML_FTYPE_MOSTLY_IQ1_M | GGML_TYPE_IQ1_M | 1.75 | i-quantization | Each superblock has 16 blocks (16 weights per block) | w = func(superblock_scale, importance_matrix) | [llama.cpp PR: IQ1_M: 1.75 bpw quantization #6302](https://github.com/ggerganov/llama.cpp/pull/6302) |
 | IQ2_XXS | GGML_FTYPE_MOSTLY_IQ2_XXS | GGML_TYPE_IQ2_XXS | 2.0625 | i-quantization | Each superblock has 8 blocks (32 weights per block) | w = func(superblock_scale, importance_matrix) | [llama.cpp PR: SOTA 2-bit quants #4773](https://github.com/ggerganov/llama.cpp/pull/4773) |
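
To make the `w = q * block_scale + block_minimum` column in the table above concrete, here is a minimal C sketch of block-wise round-to-nearest quantization over 32-weight blocks. It is illustrative only, not the actual ggml implementation: real ggml blocks (e.g. `block_q5_1`) pack quantized values into bit fields and store scales as fp16, and the `block_t`, `quantize_block`, and `dequantize_block` names below are hypothetical.

```c
#include <stdio.h>
#include <math.h>
#include <float.h>

#define BLOCK_SIZE 32  /* legacy quant types use 32 weights per block */

/* Hypothetical unpacked block: real ggml blocks pack q into bit fields
 * and store the scale/min as fp16 (ggml_half). */
typedef struct {
    float scale;                  /* block_scale                       */
    float min;                    /* block_minimum                     */
    unsigned char q[BLOCK_SIZE];  /* quantized weights, here 0..15     */
} block_t;

/* Round-to-nearest quantization of one block into a 4-bit [0,15] range. */
static void quantize_block(const float *w, block_t *b) {
    float lo = FLT_MAX, hi = -FLT_MAX;
    for (int i = 0; i < BLOCK_SIZE; i++) {
        if (w[i] < lo) lo = w[i];
        if (w[i] > hi) hi = w[i];
    }
    b->min   = lo;
    b->scale = (hi > lo) ? (hi - lo) / 15.0f : 1.0f;
    for (int i = 0; i < BLOCK_SIZE; i++) {
        b->q[i] = (unsigned char) roundf((w[i] - lo) / b->scale);
    }
}

/* Reconstruction: w = q * block_scale + block_minimum */
static void dequantize_block(const block_t *b, float *w) {
    for (int i = 0; i < BLOCK_SIZE; i++) {
        w[i] = b->q[i] * b->scale + b->min;
    }
}

int main(void) {
    float w[BLOCK_SIZE], out[BLOCK_SIZE];
    for (int i = 0; i < BLOCK_SIZE; i++) w[i] = sinf((float) i); /* dummy weights */

    block_t b;
    quantize_block(w, &b);
    dequantize_block(&b, out);

    for (int i = 0; i < 4; i++)
        printf("w=% .4f  reconstructed=% .4f\n", w[i], out[i]);
    return 0;
}
```

Schemes whose formula is just `w = q * block_scale` (e.g. Q5_0) omit the minimum term and instead quantize to a signed range centred on zero.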
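The fractional Bits/Weight figures for the k-quants fall out of amortizing the block and superblock metadata over all 256 weights of a superblock. As a worked example for Q4_K, assuming the superblock layout described in the k-quants PR #1684 (4-bit quants; one 6-bit scale and one 6-bit min for each of the 8 blocks; plus one fp16 superblock scale and one fp16 superblock min):

```math
\frac{256 \times 4 \;+\; 8 \times (6 + 6) \;+\; 2 \times 16}{256} \;=\; \frac{1024 + 96 + 32}{256} \;=\; \frac{1152}{256} \;=\; 4.5 \text{ bits per weight}
```

The same accounting reproduces the other fractional figures: for instance, under the analogous assumption of a single fp16 superblock scale, Q6_K works out to (256 × 6 + 16 × 8 + 16) / 256 = 6.5625 bits per weight.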