Updated Tensor Encoding Schemes (markdown)
```
    // ... (start of the block_q2_K definition omitted in this excerpt)
    };
} block_q2_K;
static_assert(sizeof(block_q2_K) == 2*sizeof(ggml_half) + QK_K/16 + QK_K/4, "wrong q2_K block size/padding");
```

## How are these tensors packed?
The following is explained by [compilade](https://github.com/compilade) in [this thread](https://github.com/ggerganov/llama.cpp/pull/8151#issuecomment-2256706172).
Regarding how to find the bit pattern structure of a packed tensor block in a GGUF file: there isn't a single consistent encoding scheme across block types, because a single field in the structs sometimes stores multiple kinds of values. In `Q4_K`, for example, `block_q4_K.scales` stores both the 6-bit scales and the 6-bit mins in one packed pattern. The easiest way to understand what the bits mean is to look at the respective `dequantize_row` function of each type.
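
For reference, the `Q4_K` block layout being discussed looks roughly like this. This is a sketch based on the definition in `ggml-common.h`; the exact field order, the `GGML_COMMON_AGGR` union, and the macro names may differ between versions.

```
#define QK_K 256
#define K_SCALE_SIZE 12

typedef struct {
    union {
        struct {
            ggml_half d;    // super-block scale for the quantized scales
            ggml_half dmin; // super-block scale for the quantized mins
        } GGML_COMMON_AGGR;
        ggml_half dm;
    };
    uint8_t scales[K_SCALE_SIZE]; // 12 bytes of packed 6-bit scales and mins (see below)
    uint8_t qs[QK_K/2];           // 4-bit quants for the 256 weights in the super-block
} block_q4_K;
```

The 4-bit quants themselves live in `qs`; the 12-byte `scales` field only carries the per-sub-block scales and mins, whose packing is described next.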
### `block_q4_K.scales` packing example
The 12 bytes in the Q4_K `.scales` field are packed as shown below, where the uppercase letters are bits of the scales and the lowercase letters are bits of the mins. This layout corresponds to the unpacking function [shown here](https://github.com/ggerganov/llama.cpp/blob/75af08c475e285888f66556d0f459c533b7deb95/ggml/src/ggml-quants.c#L1891-L1898):

```
 0: EEAAAAAA
 1: FFBBBBBB
 2: GGCCCCCC
 3: HHDDDDDD
 4: eeaaaaaa
 5: ffbbbbbb
 6: ggcccccc
 7: hhdddddd
 8: eeeeEEEE
 9: ffffFFFF
10: ggggGGGG
11: hhhhHHHH
```

Note that each scale and min is a 6-bit value, but the last four of each (`E`..`H` and `e`..`h`) are split across two bytes. This use of fixed byte offsets and bitwise operations is likely done to be friendlier to parallel processing.
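
As a concrete illustration, here is a minimal C sketch of how one scale/min pair can be extracted from those 12 bytes, mirroring the logic of the helper linked above in `ggml-quants.c`. The function name and signature used here are illustrative, not the library's actual API.

```
#include <stdint.h>

// Extract the j-th (j = 0..7) 6-bit scale and min from the 12 packed bytes q[0..11],
// following the bit layout shown above.
static inline void unpack_scale_min_q4_K(int j, const uint8_t * q, uint8_t * sc, uint8_t * mn) {
    if (j < 4) {
        // scales A..D: low 6 bits of bytes 0..3
        // mins   a..d: low 6 bits of bytes 4..7
        *sc = q[j]     & 63;
        *mn = q[j + 4] & 63;
    } else {
        // scales E..H: bits 0-3 from the low nibble of bytes 8..11,
        //              bits 4-5 from the top two bits of bytes 0..3
        // mins   e..h: bits 0-3 from the high nibble of bytes 8..11,
        //              bits 4-5 from the top two bits of bytes 4..7
        *sc = (q[j + 4] & 0xF) | ((q[j - 4] >> 6) << 4);
        *mn = (q[j + 4] >>  4) | ((q[j    ] >> 6) << 4);
    }
}
```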