Updated Tensor Encoding Schemes (markdown)

Brian 2024-05-25 11:45:12 +10:00
parent c02e3689b0
commit cbababee1f

@@ -20,7 +20,12 @@ This is not definitive, but is helpful when reading sourcecode or console output
- `IQ<X>`: i-quant based models. X bits per weight, where `X` could be `4` (for 4 bits) or `8` (for 8 bits) etc...
- `<Variants>`: Different strategies for packing quantized weights into a gguf file. Variants exist because we may want a mix of bit sizes for weights of varying importance, or we may be encoding a general offset for a block or super-block. The variant may be omitted if the scheme is trivial or an initial attempt; refer to the encoding scheme name table for details.
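The naming convention above can be illustrated with a small parser. This is only a sketch: the grammar `(IQ|Q)<X>[_<Variants>]` is an assumption inferred from names such as `Q5_1`, `Q4_K` and `IQ2_XXS` in the table, and `SchemeName` and `parse_scheme_name` are hypothetical helpers, not part of llama.cpp or ggml. Everything after the first underscore is treated as the variant, so the `K` in `Q4_K` lands in that field too.

```python
import re
from typing import NamedTuple, Optional


class SchemeName(NamedTuple):
    family: str             # "Q" or "IQ" (i-quant)
    bits: int               # the <X> part: nominal bits per weight
    variant: Optional[str]  # text after the first underscore, if any


# ASSUMPTION: names look like Q<X> or IQ<X>, optionally followed by _<Variants>.
# Float types such as F16 are deliberately not covered by this sketch.
_SCHEME_RE = re.compile(r"^(IQ|Q)(\d+)(?:_(\w+))?$")


def parse_scheme_name(name: str) -> SchemeName:
    """Split an encoding scheme name into family, bit count and variant."""
    match = _SCHEME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised scheme name: {name!r}")
    family, bits, variant = match.groups()
    return SchemeName(family, int(bits), variant)


print(parse_scheme_name("IQ2_XXS"))
print(parse_scheme_name("Q5_1"))
```

A name that does not fit the assumed pattern raises `ValueError` rather than being guessed at, which seems the safer default when reading console output.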
## Tensor Scheme Mapping
## Tensor Encoding Scheme PR
* [PR Filter of all Tensor Encoding Scheme Related Pull Requests](https://github.com/ggerganov/llama.cpp/issues?q=label%3A%22Tensor+Encoding+Scheme%22)
- This is useful for locating Pull Requests where we merged in a new encoding scheme (or fixed a bug with encoding, etc.)
## Tensor Encoding Scheme Mapping
| Scheme | `ggml_ftype` C enumeration name | `ggml_type` C enum name | Bits/Weight | Data Type | Block Configuration | Quantized Weight Formula | Initial Commits Or Pull Request Sources (of `ggml_type`) |
| -------- | ------------------------------- | ----------------------- | ----------- | ----------------------------- | ---------------------------------------------------------------------- | ----------------------------------------------- | ------------------------------------------------------------------------ |
@@ -39,12 +44,12 @@ This is not definitive, but is helpful when reading sourcecode or console output
| Q8_1 | - | GGML_TYPE_Q8_1 | 8 | round to nearest quantization | Each block has 32 weights | w = q * block_scale + block_minimum | [llama.cpp PR: Add Q8_0 quantization for intermediate results #951 (Note: Renamed to Q8_1 in later commit)](https://github.com/ggerganov/llama.cpp/pull/951) |
| Q5_0 | GGML_FTYPE_MOSTLY_Q5_0 | GGML_TYPE_Q5_0 | 5 | round to nearest quantization | Each block has 32 weights | w = q * block_scale | [llama.cpp PR: Add Q5_0 and Q5_1 quantization #1187](https://github.com/ggerganov/llama.cpp/pull/1187) |
| Q5_1 | GGML_FTYPE_MOSTLY_Q5_1 | GGML_TYPE_Q5_1 | 5 | round to nearest quantization | Each block has 32 weights | w = q * block_scale + block_minimum | [llama.cpp PR: Add Q5_0 and Q5_1 quantization #1187](https://github.com/ggerganov/llama.cpp/pull/1187) |
| KQ2 | GGML_FTYPE_MOSTLY_Q2_K | GGML_TYPE_Q2_K | 2.5625 | k-quantization | Each superblock has 16 blocks (16 weights per block) | w = q * block_scale (4-bit) + block_min (4-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
| KQ3 | GGML_FTYPE_MOSTLY_Q3_K | GGML_TYPE_Q3_K | 3.4375 | k-quantization | Each superblock has 16 blocks (16 weights per block) | w = q * block_scale (6-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
| KQ4 | GGML_FTYPE_MOSTLY_Q4_K | GGML_TYPE_Q4_K | 4.5 | k-quantization | Each superblock has 8 blocks (32 weights per block) | w = q * block_scale (6-bit) + block_min (6-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
| KQ5 | GGML_FTYPE_MOSTLY_Q5_K | GGML_TYPE_Q5_K | 5.5 | k-quantization | Each superblock has 8 blocks (32 weights per block) | w = q * block_scale (6-bit) + block_min (6-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
| KQ6 | GGML_FTYPE_MOSTLY_Q6_K | GGML_TYPE_Q6_K | 6.5625 | k-quantization | Each superblock has 16 blocks (16 weights per block) | w = q * block_scale (8-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
| KQ8 | - | GGML_TYPE_Q8_K | 8.0 | k-quantization | Each superblock has 1 block (256 weights per block); only used for intermediate quants | w = q * block_scale (8-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
| Q2_K | GGML_FTYPE_MOSTLY_Q2_K | GGML_TYPE_Q2_K | 2.5625 | k-quantization | Each superblock has 16 blocks (16 weights per block) | w = q * block_scale (4-bit) + block_min (4-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
| Q3_K | GGML_FTYPE_MOSTLY_Q3_K | GGML_TYPE_Q3_K | 3.4375 | k-quantization | Each superblock has 16 blocks (16 weights per block) | w = q * block_scale (6-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
| Q4_K | GGML_FTYPE_MOSTLY_Q4_K | GGML_TYPE_Q4_K | 4.5 | k-quantization | Each superblock has 8 blocks (32 weights per block) | w = q * block_scale (6-bit) + block_min (6-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
| Q5_K | GGML_FTYPE_MOSTLY_Q5_K | GGML_TYPE_Q5_K | 5.5 | k-quantization | Each superblock has 8 blocks (32 weights per block) | w = q * block_scale (6-bit) + block_min (6-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
| Q6_K | GGML_FTYPE_MOSTLY_Q6_K | GGML_TYPE_Q6_K | 6.5625 | k-quantization | Each superblock has 16 blocks (16 weights per block) | w = q * block_scale (8-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
| Q8_K | - | GGML_TYPE_Q8_K | 8.0 | k-quantization | Each superblock has 1 block (256 weights per block); only used for intermediate quants | w = q * block_scale (8-bit) | [llama.cpp PR: k-quants #1684](https://github.com/ggerganov/llama.cpp/pull/1684) |
| IQ1_S | GGML_FTYPE_MOSTLY_IQ1_S | GGML_TYPE_IQ1_S | 1.5 | i-quantization | Each superblock has 8 blocks (32 weights per block) | w = func(superblock_scale, importance_matrix) | [llama.cpp PR: 1.5 bit quantization #5453](https://github.com/ggerganov/llama.cpp/pull/5453) |
| IQ1_M | GGML_FTYPE_MOSTLY_IQ1_M | GGML_TYPE_IQ1_M | 1.75 | i-quantization | Each superblock has 16 blocks (16 weights per block) | w = func(superblock_scale, importance_matrix) | [llama.cpp PR: IQ1_M: 1.75 bpw quantization #6302](https://github.com/ggerganov/llama.cpp/pull/6302) |
| IQ2_XXS | GGML_FTYPE_MOSTLY_IQ2_XXS | GGML_TYPE_IQ2_XXS | 2.0625 | i-quantization | Each superblock has 8 blocks (32 weights per block) | w = func(superblock_scale, importance_matrix) | [llama.cpp PR: SOTA 2-bit quants #4773](https://github.com/ggerganov/llama.cpp/pull/4773) |
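
The "Quantized Weight Formula" and "Bits/Weight" columns for the round-to-nearest and k-quant rows can be sanity-checked with a short sketch. It applies the table formula literally to plain Python numbers; the real ggml kernels work on packed bit fields and may add per-scheme offsets, which this does not model. `dequantize_block` and `bits_per_weight` are hypothetical helper names, and `superblock_header_bits` is an assumption: the table does not list the fp16 super-block values, so they are supplied explicitly here.

```python
from typing import List, Sequence


def dequantize_block(q: Sequence[int], block_scale: float,
                     block_minimum: float = 0.0) -> List[float]:
    """Apply w = q * block_scale + block_minimum to every quant in one block.

    Leave block_minimum at 0.0 for schemes whose formula is w = q * block_scale.
    """
    return [qi * block_scale + block_minimum for qi in q]


def bits_per_weight(blocks: int, weights_per_block: int, quant_bits: int,
                    scale_bits: int, min_bits: int = 0,
                    superblock_header_bits: int = 0) -> float:
    """Average storage cost per weight for one super-block."""
    n_weights = blocks * weights_per_block
    total_bits = (n_weights * quant_bits                # the quants themselves
                  + blocks * (scale_bits + min_bits)    # per-block scale/min
                  + superblock_header_bits)             # ASSUMED fp16 extras
    return total_bits / n_weights


# Q4_K row: 8 blocks x 32 weights, 4-bit quants, 6-bit scale + 6-bit min.
# ASSUMPTION: 32 header bits, i.e. two fp16 values per super-block.
print(bits_per_weight(8, 32, 4, 6, 6, superblock_header_bits=32))  # 4.5

# Q6_K row: 16 blocks x 16 weights, 6-bit quants, 8-bit scale.
# ASSUMPTION: 16 header bits, i.e. one fp16 value per super-block.
print(bits_per_weight(16, 16, 6, 8, superblock_header_bits=16))    # 6.5625

# Formula with a minimum, as in the Q5_1 and Q8_1 rows.
print(dequantize_block([0, 1, 2], block_scale=0.5, block_minimum=-1.0))
```

Under those header assumptions the results match the 4.5 and 6.5625 figures in the table, which suggests the fractional part of Bits/Weight is the per-block and per-super-block metadata amortised over 256 weights.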