Updated Tensor Encoding Schemes (markdown)

2024-11-22 08:17:58 +01:00 · 2024-05-25 11:45:12 +10:00 · cbababee1f
commit cbababee1f
parent c02e3689b0
1 changed file with 12 additions and 7 deletions
--- a/Tensor-Encoding-Schemes.md
+++ b/Tensor-Encoding-Schemes.md
@@ -20,7 +20,12 @@ This is not definitive, but is helpful when reading sourcecode or console output
 	  - `IQ<X>`: i-quant based models. X bits per weight, where `X` could be `4` (for 4 bits) or `8` (for 8 bits) etc...
  - `<Variants>`: This represents different strategies of packing quantized weights into a gguf file. This is because we may want a mix of different bit sizes for weights of varying importance, or we may be encoding a general offset to a block or super-block. This may be omitted if trivial or initial attempt, refer to encoding scheme name table for details.
-## Tensor Scheme Mapping
+## Tensor Encoding Scheme PR
+* [PR Filter of all Tensor Encoding Scheme Related Pull Requests](https://github.com/ggerganov/llama.cpp/issues?q=label%3A%22Tensor+Encoding+Scheme%22)
+   - This is useful for locating Pull Requests where we merged in a new encoding scheme (or fix a bug with encoding etc...)
+## Tensor Encoding Scheme Mapping
 | Scheme   | `ggml_ftype` C enumeration name | `ggml_type` C enum name | Bits/Weight | Data Type                     | Block Configuration                                                    | Quantized Weight Formula                        | Initial Commits Or Pull Request Sources (of `ggml_type`)                 |
 | -------- | ------------------------------- | ----------------------- | ----------- | ----------------------------- | ---------------------------------------------------------------------- | ----------------------------------------------- | ------------------------------------------------------------------------ |
@@ -39,12 +44,12 @@ This is not definitive, but is helpful when reading sourcecode or console output
 | Q8_1     | -                               | GGML_TYPE_Q8_1          | 8           | round to nearest quantization | Each block has 32 weights                                              | w = q * block_scale + block_minimum             | [llama.cpp PR: Add Q8_0 quantization for intermediate results #951 (Note: Renamed to Q8_1 in later commit)](https://github.com/ggerganov/llama.cpp/pull/951) |
 | Q5_0     | GGML_FTYPE_MOSTLY_Q5_0          | GGML_TYPE_Q5_0          | 5           | round to nearest quantization | Each block has 32 weights                                              | w = q * block_scale                             | [llama.cpp PR: Add Q5_0 and Q5_1 quantization #1187](https://github.com/ggerganov/llama.cpp/pull/1187) |
 | Q5_1     | GGML_FTYPE_MOSTLY_Q5_1          | GGML_TYPE_Q5_1          | 5           | round to nearest quantization | Each block has 32 weights                                              | w = q * block_scale + block_minimum             | [llama.cpp PR: Add Q5_0 and Q5_1 quantization #1187](https://github.com/ggerganov/llama.cpp/pull/1187) |
-| KQ2      | GGML_FTYPE_MOSTLY_Q2_K          | GGML_TYPE_Q2_K          | 2.5625      | k-quantization                | Superblocks has 16 blocks ( 16 weights per block)                      | w = q * block_scale (4-bit) + block_min (4-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
+| Q2_K     | GGML_FTYPE_MOSTLY_Q2_K          | GGML_TYPE_Q2_K          | 2.5625      | k-quantization                | Superblocks has 16 blocks ( 16 weights per block)                      | w = q * block_scale (4-bit) + block_min (4-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
-| KQ3      | GGML_FTYPE_MOSTLY_Q3_K          | GGML_TYPE_Q3_K          | 3.4375      | k-quantization                | Superblocks has 16 blocks ( 16 weights per block)                      | w = q * block_scale (6-bit)                     | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
+| Q3_K     | GGML_FTYPE_MOSTLY_Q3_K          | GGML_TYPE_Q3_K          | 3.4375      | k-quantization                | Superblocks has 16 blocks ( 16 weights per block)                      | w = q * block_scale (6-bit)                     | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
-| KQ4      | GGML_FTYPE_MOSTLY_Q4_K          | GGML_TYPE_Q4_K          | 4.5         | k-quantization                | Superblocks has  8 blocks ( 32 weights per block)                      | w = q * block_scale (6-bit) + block_min (6-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
+| Q4_K     | GGML_FTYPE_MOSTLY_Q4_K          | GGML_TYPE_Q4_K          | 4.5         | k-quantization                | Superblocks has  8 blocks ( 32 weights per block)                      | w = q * block_scale (6-bit) + block_min (6-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
-| KQ5      | GGML_FTYPE_MOSTLY_Q5_K          | GGML_TYPE_Q5_K          | 5.5         | k-quantization                | Superblocks has  8 blocks ( 32 weights per block)                      | w = q * block_scale (6-bit) + block_min (6-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
+| Q5_K     | GGML_FTYPE_MOSTLY_Q5_K          | GGML_TYPE_Q5_K          | 5.5         | k-quantization                | Superblocks has  8 blocks ( 32 weights per block)                      | w = q * block_scale (6-bit) + block_min (6-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
-| KQ6      | GGML_FTYPE_MOSTLY_Q6_K          | GGML_TYPE_Q6_K          | 6.5625      | k-quantization                | Superblocks has 16 blocks ( 16 weights per block)                      | w = q * block_scale (8-bit)                     | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
+| Q6_K     | GGML_FTYPE_MOSTLY_Q6_K          | GGML_TYPE_Q6_K          | 6.5625      | k-quantization                | Superblocks has 16 blocks ( 16 weights per block)                      | w = q * block_scale (8-bit)                     | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
-| KQ8      | -                               | GGML_TYPE_Q8_K          | 8.0         | k-quantization                | Superblocks has  1 blocks (256 weights per block) (Only used for intermediate quants) | w = q * block_scale (8-bit)      | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
+| Q8_K     | -                               | GGML_TYPE_Q8_K          | 8.0         | k-quantization                | Superblocks has  1 blocks (256 weights per block) (Only used for intermediate quants) | w = q * block_scale (8-bit)      | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
 | IQ1_S    | GGML_FTYPE_MOSTLY_IQ1_S         | GGML_TYPE_IQ1_S         | 1.5         | i-quantization                | Superblocks has  8 blocks ( 32 weights per block)                      | w = func(superblock_scale, importance_matrix)   | [llama.cpp PR: 1.5 bit quantization #5453](https://github.com/ggerganov/llama.cpp/pull/5453) |
 | IQ1_M    | GGML_FTYPE_MOSTLY_IQ1_M         | GGML_TYPE_IQ1_M         | 1.75        | i-quantization                | Superblocks has 16 blocks ( 16 weights per block)                      | w = func(superblock_scale, importance_matrix)   | [llama.cpp PR: IQ1_M: 1.75 bpw quantization #6302](https://github.com/ggerganov/llama.cpp/pull/6302) |
 | IQ2_XXS  | GGML_FTYPE_MOSTLY_IQ2_XXS       | GGML_TYPE_IQ2_XXS       | 2.0625      | i-quantization                | Superblocks has  8 blocks ( 32 weights per block)                      | w = func(superblock_scale, importance_matrix)   | [llama.cpp PR: SOTA 2-bit quants #4773](https://github.com/ggerganov/llama.cpp/pull/4773) |
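
Note on the "Quantized Weight Formula" column: the two round-to-nearest formulas in the table, `w = q * block_scale` and `w = q * block_scale + block_minimum`, can be illustrated with a minimal Python sketch. The helper names `quantize_block` and `dequantize_block` are hypothetical, and the way the scale and minimum are chosen here is an assumption made for illustration. It is not ggml's actual bit packing or scale selection.

```python
def quantize_block(weights, bits, use_minimum):
    """Round-to-nearest quantization of one block (hypothetical helper).

    use_minimum=False sketches the `w = q * block_scale` rows,
    use_minimum=True sketches the `w = q * block_scale + block_minimum` rows.
    """
    levels = (1 << bits) - 1  # number of quantization steps available
    if use_minimum:
        # Assumption: q is unsigned in [0, levels], offset by the block minimum.
        block_minimum = min(weights)
        span = max(weights) - block_minimum
        block_scale = span / levels if span else 1.0
        q = [round((w - block_minimum) / block_scale) for w in weights]
    else:
        # Assumption: q is signed and symmetric in [-half, half].
        block_minimum = 0.0
        half = levels // 2
        amax = max(abs(w) for w in weights)
        block_scale = amax / half if amax else 1.0
        q = [round(w / block_scale) for w in weights]
    return q, block_scale, block_minimum


def dequantize_block(q, block_scale, block_minimum=0.0):
    """Apply the table's formula: w = q * block_scale (+ block_minimum)."""
    return [qi * block_scale + block_minimum for qi in q]


# One block of 32 weights, as in the Q5_0 / Q5_1 rows.
weights = [i / 31 - 0.5 for i in range(32)]

q1, scale1, min1 = quantize_block(weights, bits=5, use_minimum=True)
restored1 = dequantize_block(q1, scale1, min1)

q0, scale0, _ = quantize_block(weights, bits=5, use_minimum=False)
restored0 = dequantize_block(q0, scale0)
```

In both variants the reconstruction error per weight is bounded by half a quantization step (`block_scale / 2`). The k-quant rows follow the same shape of formula, except that the per-block scale and minimum are themselves stored in a few bits inside a superblock.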