From c5ee522ea7c2528bb94a0d8f6a8728abffcf0c03 Mon Sep 17 00:00:00 2001
From: Brian <mofosyne@gmail.com>
Date: Tue, 30 Jul 2024 19:36:43 +1000
Subject: [PATCH] Updated Tensor Encoding Schemes (markdown)

---
 Tensor-Encoding-Schemes.md | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/Tensor-Encoding-Schemes.md b/Tensor-Encoding-Schemes.md
index a3ca17a..ab8822b 100644
--- a/Tensor-Encoding-Schemes.md
+++ b/Tensor-Encoding-Schemes.md
@@ -100,4 +100,34 @@ typedef struct {
     };
 } block_q2_K;
 static_assert(sizeof(block_q2_K) == 2*sizeof(ggml_half) + QK_K/16 + QK_K/4, "wrong q2_K block size/padding");
-```
\ No newline at end of file
+```
+
+## How are these tensors packed?
+
+The following explanation comes from [compilade](https://github.com/compilade) in [this thread](https://github.com/ggerganov/llama.cpp/pull/8151#issuecomment-2256706172).
+
+There is no single, consistent bit layout for the packed tensor blocks in a GGUF file, because a single struct field sometimes stores more than one kind of value. In `Q4_K`, for example, `block_q4_K.scales` stores both the 6-bit scales and the 6-bit mins, interleaved in a fixed pattern. The easiest way to understand what the bits mean is to read the respective `dequantize_row` function of each type.
+
+### block_q4_K.scales packing example
+
+The 12 bytes of the Q4_K `.scales` field are packed roughly as shown below, where uppercase letters are bits of the scales and lowercase letters are bits of the mins. The layout corresponds to [this function](https://github.com/ggerganov/llama.cpp/blob/75af08c475e285888f66556d0f459c533b7deb95/ggml/src/ggml-quants.c#L1891-L1898):
+
+```
+0: EEAAAAAA
+1: FFBBBBBB
+2: GGCCCCCC
+3: HHDDDDDD
+4: eeaaaaaa
+5: ffbbbbbb
+6: ggcccccc
+7: hhdddddd
+8: eeeeEEEE
+9: ffffFFFF
+10: ggggGGGG
+11: hhhhHHHH
+```
+
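+Note that this packs eight 6-bit scales and eight 6-bit mins (96 bits) into 12 bytes, and that the values for the last four sub-blocks (`E` to `H` and `e` to `h`) are each split across two bytes. This use of byte offsets and bitwise operations is likely intended to be friendlier to parallel processing.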
+
+
+
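The byte layout shown for `block_q4_K.scales` can be decoded with a few masks and shifts. The following is a minimal C sketch modeled on the linked function, not a copy of it. The names `unpack_scale_min` and `pack_scales_mins` are hypothetical, and the packing helper is an assumption added only to illustrate the round trip.

```c
#include <assert.h>
#include <stdint.h>

/* Decode the 6-bit scale and 6-bit min of sub-block j (0..7) from the
 * 12 packed bytes q[0..11], following the layout diagram:
 *   bytes 0..3  : top 2 bits of scales E..H | 6-bit scales A..D
 *   bytes 4..7  : top 2 bits of mins   e..h | 6-bit mins   a..d
 *   bytes 8..11 : low 4 bits of mins e..h   | low 4 bits of scales E..H */
static void unpack_scale_min(int j, const uint8_t *q,
                             uint8_t *scale, uint8_t *min) {
    assert(j >= 0 && j < 8);
    if (j < 4) {
        /* Sub-blocks A..D: the value sits in the low 6 bits of one byte. */
        *scale = q[j] & 0x3F;
        *min   = q[j + 4] & 0x3F;
    } else {
        /* Sub-blocks E..H: low 4 bits come from bytes 8..11,
         * high 2 bits come from the top of bytes 0..3 (scale) or 4..7 (min). */
        *scale = (uint8_t)((q[j + 4] & 0x0F) | ((q[j - 4] >> 6) << 4));
        *min   = (uint8_t)((q[j + 4] >> 4)   | ((q[j]     >> 6) << 4));
    }
}

/* Hypothetical inverse of unpack_scale_min: pack eight 6-bit scales and
 * eight 6-bit mins into 12 bytes. Inputs are masked to 6 bits. */
static void pack_scales_mins(const uint8_t scales[8], const uint8_t mins[8],
                             uint8_t q[12]) {
    for (int j = 0; j < 4; ++j) {
        uint8_t hi_s = (uint8_t)((scales[j + 4] >> 4) & 0x03);
        uint8_t hi_m = (uint8_t)((mins[j + 4]   >> 4) & 0x03);
        q[j]     = (uint8_t)((scales[j] & 0x3F) | (hi_s << 6));
        q[j + 4] = (uint8_t)((mins[j]   & 0x3F) | (hi_m << 6));
        q[j + 8] = (uint8_t)((scales[j + 4] & 0x0F) |
                             ((mins[j + 4] & 0x0F) << 4));
    }
}
```

Calling `unpack_scale_min(j, q, &s, &m)` for `j` from 0 to 7 recovers each pair that `pack_scales_mins` stored, which is a quick way to check one's reading of the diagram against real data.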