Merge remote-tracking branch 'origin/master' into bins

This commit is contained in:
Olivier Chafik 2024-06-10 15:38:41 +01:00
commit daeaeb1222
30 changed files with 1233 additions and 607 deletions

View File

@ -0,0 +1,5 @@
- Self Reported Review Complexity:
- [ ] Review Complexity : Low
- [ ] Review Complexity : Medium
- [ ] Review Complexity : High
- [ ] I have read the [contributing guidelines](CONTRIBUTING.md)

View File

@ -16,11 +16,9 @@ on:
branches: branches:
- master - master
paths: ['.github/workflows/server.yml', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m', 'examples/server/**.*'] paths: ['.github/workflows/server.yml', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m', 'examples/server/**.*']
pull_request_target: pull_request:
types: [opened, synchronize, reopened] types: [opened, synchronize, reopened]
paths: ['.github/workflows/server.yml', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m', 'examples/server/**.*'] paths: ['.github/workflows/server.yml', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m', 'examples/server/**.*']
schedule:
- cron: '2 4 * * *'
concurrency: concurrency:
group: ${{ github.workflow }}-${{ github.ref }}-${{ github.head_ref || github.run_id }} group: ${{ github.workflow }}-${{ github.ref }}-${{ github.head_ref || github.run_id }}
@ -115,7 +113,7 @@ jobs:
server-windows: server-windows:
runs-on: windows-latest runs-on: windows-2019
steps: steps:
- name: Clone - name: Clone

14
CONTRIBUTING.md Normal file
View File

@ -0,0 +1,14 @@
# Contributing Guidelines
## Checklist
* Make sure your PR follows the [coding guidelines](https://github.com/ggerganov/llama.cpp/blob/master/README.md#coding-guidelines)
* Test your changes using the commands in the [`tests`](tests) folder. For instance, running the `./tests/test-backend-ops` command tests different backend implementations of the GGML library
* Execute [the full CI locally on your machine](ci/README.md) before publishing
## PR formatting
* Please rate the complexity of your PR (i.e. `Review Complexity : Low`, `Review Complexity : Medium`, `Review Complexity : High`). This makes it easier for maintainers to triage the PRs.
- The PR template has a series of review complexity checkboxes `[ ]` that you can mark as `[X]` for your conveience. Refer to [About task lists](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/about-task-lists) for more information.
* If the pull request only contains documentation changes (e.g., updating READMEs, adding new wiki pages), please add `[no ci]` to the commit title. This will skip unnecessary CI checks and help reduce build times.
* When squashing multiple commits on merge, use the following format for your commit title: `<module> : <commit title> (#<issue_number>)`. For example: `utils : Fix typo in utils.py (#1234)`

View File

@ -53,7 +53,6 @@ Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others)
<li><a href="#quantization">Quantization</a></li> <li><a href="#quantization">Quantization</a></li>
<li><a href="#interactive-mode">Interactive mode</a></li> <li><a href="#interactive-mode">Interactive mode</a></li>
<li><a href="#constrained-output-with-grammars">Constrained output with grammars</a></li> <li><a href="#constrained-output-with-grammars">Constrained output with grammars</a></li>
<li><a href="#instruct-mode">Instruct mode</a></li>
<li><a href="#obtaining-and-using-the-facebook-llama-2-model">Obtaining and using the Facebook LLaMA 2 model</a></li> <li><a href="#obtaining-and-using-the-facebook-llama-2-model">Obtaining and using the Facebook LLaMA 2 model</a></li>
<li><a href="#seminal-papers-and-background-on-the-models">Seminal papers and background on the models</a></li> <li><a href="#seminal-papers-and-background-on-the-models">Seminal papers and background on the models</a></li>
<li><a href="#perplexity-measuring-model-quality">Perplexity (measuring model quality)</a></li> <li><a href="#perplexity-measuring-model-quality">Perplexity (measuring model quality)</a></li>
@ -769,34 +768,6 @@ The `grammars/` folder contains a handful of sample grammars. To write your own,
For authoring more complex JSON grammars, you can also check out https://grammar.intrinsiclabs.ai/, a browser app that lets you write TypeScript interfaces which it compiles to GBNF grammars that you can save for local use. Note that the app is built and maintained by members of the community, please file any issues or FRs on [its repo](http://github.com/intrinsiclabsai/gbnfgen) and not this one. For authoring more complex JSON grammars, you can also check out https://grammar.intrinsiclabs.ai/, a browser app that lets you write TypeScript interfaces which it compiles to GBNF grammars that you can save for local use. Note that the app is built and maintained by members of the community, please file any issues or FRs on [its repo](http://github.com/intrinsiclabsai/gbnfgen) and not this one.
### Instruct mode
1. First, download and place the `ggml` model into the `./models` folder
2. Run the `main` tool like this:
```
./examples/alpaca.sh
```
Sample run:
```
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMA.
- If you want to submit another line, end your input in '\'.
Below is an instruction that describes a task. Write a response that appropriately completes the request.
> How many letters are there in the English alphabet?
There 26 letters in the English Alphabet
> What is the most common way of transportation in Amsterdam?
The majority (54%) are using public transit. This includes buses, trams and metros with over 100 lines throughout the city which make it very accessible for tourists to navigate around town as well as locals who commute by tram or metro on a daily basis
> List 5 words that start with "ca".
cadaver, cauliflower, cabbage (vegetable), catalpa (tree) and Cailleach.
>
```
### Obtaining and using the Facebook LLaMA 2 model ### Obtaining and using the Facebook LLaMA 2 model
- Refer to [Facebook's LLaMA download page](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) if you want to access the model data. - Refer to [Facebook's LLaMA download page](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) if you want to access the model data.

View File

@ -200,19 +200,13 @@ void gpt_params_handle_model_default(gpt_params & params) {
} }
params.hf_file = params.model; params.hf_file = params.model;
} else if (params.model.empty()) { } else if (params.model.empty()) {
std::string cache_directory = fs_get_cache_directory(); params.model = fs_get_cache_file(string_split(params.hf_file, '/').back());
const bool success = fs_create_directory_with_parents(cache_directory);
if (!success) {
throw std::runtime_error("failed to create cache directory: " + cache_directory);
}
params.model = cache_directory + string_split(params.hf_file, '/').back();
} }
} else if (!params.model_url.empty()) { } else if (!params.model_url.empty()) {
if (params.model.empty()) { if (params.model.empty()) {
auto f = string_split(params.model_url, '#').front(); auto f = string_split(params.model_url, '#').front();
f = string_split(f, '?').front(); f = string_split(f, '?').front();
f = string_split(f, '/').back(); params.model = fs_get_cache_file(string_split(f, '/').back());
params.model = "models/" + f;
} }
} else if (params.model.empty()) { } else if (params.model.empty()) {
params.model = DEFAULT_MODEL_PATH; params.model = DEFAULT_MODEL_PATH;
@ -2279,6 +2273,16 @@ std::string fs_get_cache_directory() {
return ensure_trailing_slash(cache_directory); return ensure_trailing_slash(cache_directory);
} }
std::string fs_get_cache_file(const std::string & filename) {
GGML_ASSERT(filename.find(DIRECTORY_SEPARATOR) == std::string::npos);
std::string cache_directory = fs_get_cache_directory();
const bool success = fs_create_directory_with_parents(cache_directory);
if (!success) {
throw std::runtime_error("failed to create cache directory: " + cache_directory);
}
return cache_directory + filename;
}
// //
// Model utils // Model utils

View File

@ -277,6 +277,7 @@ bool fs_validate_filename(const std::string & filename);
bool fs_create_directory_with_parents(const std::string & path); bool fs_create_directory_with_parents(const std::string & path);
std::string fs_get_cache_directory(); std::string fs_get_cache_directory();
std::string fs_get_cache_file(const std::string & filename);
// //
// Model utils // Model utils

View File

@ -47,11 +47,12 @@ class Model:
_model_classes: dict[str, type[Model]] = {} _model_classes: dict[str, type[Model]] = {}
dir_model: Path dir_model: Path
ftype: int ftype: gguf.LlamaFileType
is_big_endian: bool is_big_endian: bool
endianess: gguf.GGUFEndian endianess: gguf.GGUFEndian
use_temp_file: bool use_temp_file: bool
lazy: bool lazy: bool
model_name: str | None
part_names: list[str] part_names: list[str]
is_safetensors: bool is_safetensors: bool
hparams: dict[str, Any] hparams: dict[str, Any]
@ -64,7 +65,7 @@ class Model:
# subclasses should define this! # subclasses should define this!
model_arch: gguf.MODEL_ARCH model_arch: gguf.MODEL_ARCH
def __init__(self, dir_model: Path, ftype: gguf.LlamaFileType, fname_out: Path, is_big_endian: bool, use_temp_file: bool, eager: bool): def __init__(self, dir_model: Path, ftype: gguf.LlamaFileType, fname_out: Path, is_big_endian: bool, use_temp_file: bool, eager: bool, model_name: str | None):
if type(self) is Model: if type(self) is Model:
raise TypeError(f"{type(self).__name__!r} should not be directly instantiated") raise TypeError(f"{type(self).__name__!r} should not be directly instantiated")
self.dir_model = dir_model self.dir_model = dir_model
@ -73,10 +74,11 @@ class Model:
self.endianess = gguf.GGUFEndian.BIG if is_big_endian else gguf.GGUFEndian.LITTLE self.endianess = gguf.GGUFEndian.BIG if is_big_endian else gguf.GGUFEndian.LITTLE
self.use_temp_file = use_temp_file self.use_temp_file = use_temp_file
self.lazy = not eager self.lazy = not eager
self.part_names = Model.get_model_part_names(self.dir_model, ".safetensors") self.model_name = model_name
self.part_names = Model.get_model_part_names(self.dir_model, "model", ".safetensors")
self.is_safetensors = len(self.part_names) > 0 self.is_safetensors = len(self.part_names) > 0
if not self.is_safetensors: if not self.is_safetensors:
self.part_names = Model.get_model_part_names(self.dir_model, ".bin") self.part_names = Model.get_model_part_names(self.dir_model, "pytorch_model", ".bin")
self.hparams = Model.load_hparams(self.dir_model) self.hparams = Model.load_hparams(self.dir_model)
self.block_count = self.find_hparam(["n_layers", "num_hidden_layers", "n_layer"]) self.block_count = self.find_hparam(["n_layers", "num_hidden_layers", "n_layer"])
self.tensor_map = gguf.get_tensor_name_map(self.model_arch, self.block_count) self.tensor_map = gguf.get_tensor_name_map(self.model_arch, self.block_count)
@ -94,7 +96,7 @@ class Model:
ftype_lw: str = ftype_up.lower() ftype_lw: str = ftype_up.lower()
# allow templating the file name with the output ftype, useful with the "auto" ftype # allow templating the file name with the output ftype, useful with the "auto" ftype
self.fname_out = fname_out.parent / fname_out.name.format(ftype_lw, outtype=ftype_lw, ftype=ftype_lw, OUTTYPE=ftype_up, FTYPE=ftype_up) self.fname_out = fname_out.parent / fname_out.name.format(ftype_lw, outtype=ftype_lw, ftype=ftype_lw, OUTTYPE=ftype_up, FTYPE=ftype_up)
self.gguf_writer = gguf.GGUFWriter(self.fname_out, gguf.MODEL_ARCH_NAMES[self.model_arch], endianess=self.endianess, use_temp_file=self.use_temp_file) self.gguf_writer = gguf.GGUFWriter(path=None, arch=gguf.MODEL_ARCH_NAMES[self.model_arch], endianess=self.endianess, use_temp_file=self.use_temp_file)
@classmethod @classmethod
def __init_subclass__(cls): def __init_subclass__(cls):
@ -182,7 +184,7 @@ class Model:
return new_name return new_name
def set_gguf_parameters(self): def set_gguf_parameters(self):
self.gguf_writer.add_name(self.dir_model.name) self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
self.gguf_writer.add_block_count(self.block_count) self.gguf_writer.add_block_count(self.block_count)
if (n_ctx := self.find_hparam(["max_position_embeddings", "n_ctx"], optional=True)) is not None: if (n_ctx := self.find_hparam(["max_position_embeddings", "n_ctx"], optional=True)) is not None:
@ -324,21 +326,21 @@ class Model:
def write(self): def write(self):
self.write_tensors() self.write_tensors()
self.gguf_writer.write_header_to_file() self.gguf_writer.write_header_to_file(self.fname_out)
self.gguf_writer.write_kv_data_to_file() self.gguf_writer.write_kv_data_to_file()
self.gguf_writer.write_tensors_to_file(progress=True) self.gguf_writer.write_tensors_to_file(progress=True)
self.gguf_writer.close() self.gguf_writer.close()
def write_vocab(self): def write_vocab(self):
self.gguf_writer.write_header_to_file() self.gguf_writer.write_header_to_file(self.fname_out)
self.gguf_writer.write_kv_data_to_file() self.gguf_writer.write_kv_data_to_file()
self.gguf_writer.close() self.gguf_writer.close()
@staticmethod @staticmethod
def get_model_part_names(dir_model: Path, suffix: str) -> list[str]: def get_model_part_names(dir_model: Path, prefix: str, suffix: str) -> list[str]:
part_names: list[str] = [] part_names: list[str] = []
for filename in os.listdir(dir_model): for filename in os.listdir(dir_model):
if filename.endswith(suffix): if filename.startswith(prefix) and filename.endswith(suffix):
part_names.append(filename) part_names.append(filename)
part_names.sort() part_names.sort()
@ -665,7 +667,7 @@ class GPTNeoXModel(Model):
def set_gguf_parameters(self): def set_gguf_parameters(self):
block_count = self.hparams["num_hidden_layers"] block_count = self.hparams["num_hidden_layers"]
self.gguf_writer.add_name(self.dir_model.name) self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
self.gguf_writer.add_context_length(self.hparams["max_position_embeddings"]) self.gguf_writer.add_context_length(self.hparams["max_position_embeddings"])
self.gguf_writer.add_embedding_length(self.hparams["hidden_size"]) self.gguf_writer.add_embedding_length(self.hparams["hidden_size"])
self.gguf_writer.add_block_count(block_count) self.gguf_writer.add_block_count(block_count)
@ -798,7 +800,7 @@ class MPTModel(Model):
def set_gguf_parameters(self): def set_gguf_parameters(self):
block_count = self.hparams["n_layers"] block_count = self.hparams["n_layers"]
self.gguf_writer.add_name(self.dir_model.name) self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
self.gguf_writer.add_context_length(self.hparams["max_seq_len"]) self.gguf_writer.add_context_length(self.hparams["max_seq_len"])
self.gguf_writer.add_embedding_length(self.hparams["d_model"]) self.gguf_writer.add_embedding_length(self.hparams["d_model"])
self.gguf_writer.add_block_count(block_count) self.gguf_writer.add_block_count(block_count)
@ -850,7 +852,7 @@ class OrionModel(Model):
raise ValueError("gguf: can not find ctx length parameter.") raise ValueError("gguf: can not find ctx length parameter.")
self.gguf_writer.add_file_type(self.ftype) self.gguf_writer.add_file_type(self.ftype)
self.gguf_writer.add_name(self.dir_model.name) self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
self.gguf_writer.add_source_hf_repo(hf_repo) self.gguf_writer.add_source_hf_repo(hf_repo)
self.gguf_writer.add_tensor_data_layout("Meta AI original pth") self.gguf_writer.add_tensor_data_layout("Meta AI original pth")
self.gguf_writer.add_context_length(ctx_length) self.gguf_writer.add_context_length(ctx_length)
@ -887,7 +889,7 @@ class BaichuanModel(Model):
else: else:
raise ValueError("gguf: can not find ctx length parameter.") raise ValueError("gguf: can not find ctx length parameter.")
self.gguf_writer.add_name(self.dir_model.name) self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
self.gguf_writer.add_source_hf_repo(hf_repo) self.gguf_writer.add_source_hf_repo(hf_repo)
self.gguf_writer.add_tensor_data_layout("Meta AI original pth") self.gguf_writer.add_tensor_data_layout("Meta AI original pth")
self.gguf_writer.add_context_length(ctx_length) self.gguf_writer.add_context_length(ctx_length)
@ -1010,7 +1012,7 @@ class XverseModel(Model):
else: else:
raise ValueError("gguf: can not find ctx length parameter.") raise ValueError("gguf: can not find ctx length parameter.")
self.gguf_writer.add_name(self.dir_model.name) self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
self.gguf_writer.add_source_hf_repo(hf_repo) self.gguf_writer.add_source_hf_repo(hf_repo)
self.gguf_writer.add_tensor_data_layout("Meta AI original pth") self.gguf_writer.add_tensor_data_layout("Meta AI original pth")
self.gguf_writer.add_context_length(ctx_length) self.gguf_writer.add_context_length(ctx_length)
@ -1206,7 +1208,7 @@ class StableLMModel(Model):
hparams = self.hparams hparams = self.hparams
block_count = hparams["num_hidden_layers"] block_count = hparams["num_hidden_layers"]
self.gguf_writer.add_name(self.dir_model.name) self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
self.gguf_writer.add_context_length(hparams["max_position_embeddings"]) self.gguf_writer.add_context_length(hparams["max_position_embeddings"])
self.gguf_writer.add_embedding_length(hparams["hidden_size"]) self.gguf_writer.add_embedding_length(hparams["hidden_size"])
self.gguf_writer.add_block_count(block_count) self.gguf_writer.add_block_count(block_count)
@ -1681,7 +1683,7 @@ class GPT2Model(Model):
model_arch = gguf.MODEL_ARCH.GPT2 model_arch = gguf.MODEL_ARCH.GPT2
def set_gguf_parameters(self): def set_gguf_parameters(self):
self.gguf_writer.add_name(self.dir_model.name) self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
self.gguf_writer.add_block_count(self.hparams["n_layer"]) self.gguf_writer.add_block_count(self.hparams["n_layer"])
self.gguf_writer.add_context_length(self.hparams["n_ctx"]) self.gguf_writer.add_context_length(self.hparams["n_ctx"])
self.gguf_writer.add_embedding_length(self.hparams["n_embd"]) self.gguf_writer.add_embedding_length(self.hparams["n_embd"])
@ -2248,7 +2250,7 @@ class GemmaModel(Model):
hparams = self.hparams hparams = self.hparams
block_count = hparams["num_hidden_layers"] block_count = hparams["num_hidden_layers"]
self.gguf_writer.add_name(self.dir_model.name) self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
self.gguf_writer.add_context_length(hparams["max_position_embeddings"]) self.gguf_writer.add_context_length(hparams["max_position_embeddings"])
self.gguf_writer.add_embedding_length(hparams["hidden_size"]) self.gguf_writer.add_embedding_length(hparams["hidden_size"])
self.gguf_writer.add_block_count(block_count) self.gguf_writer.add_block_count(block_count)
@ -2348,7 +2350,7 @@ class MambaModel(Model):
# Fail early for models which don't have a block expansion factor of 2 # Fail early for models which don't have a block expansion factor of 2
assert d_inner == 2 * d_model assert d_inner == 2 * d_model
self.gguf_writer.add_name(self.dir_model.name) self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
self.gguf_writer.add_context_length(2**20) # arbitrary value; for those who use the default self.gguf_writer.add_context_length(2**20) # arbitrary value; for those who use the default
self.gguf_writer.add_embedding_length(d_model) self.gguf_writer.add_embedding_length(d_model)
self.gguf_writer.add_feed_forward_length(0) # unused, but seemingly required when loading self.gguf_writer.add_feed_forward_length(0) # unused, but seemingly required when loading
@ -2852,7 +2854,7 @@ def main() -> None:
logger.error(f"Model {hparams['architectures'][0]} is not supported") logger.error(f"Model {hparams['architectures'][0]} is not supported")
sys.exit(1) sys.exit(1)
model_instance = model_class(dir_model, ftype_map[args.outtype], fname_out, args.bigendian, args.use_temp_file, args.no_lazy) model_instance = model_class(dir_model, ftype_map[args.outtype], fname_out, args.bigendian, args.use_temp_file, args.no_lazy, args.model_name)
logger.info("Set model parameters") logger.info("Set model parameters")
model_instance.set_gguf_parameters() model_instance.set_gguf_parameters()

View File

@ -1,19 +0,0 @@
#!/bin/bash
#
# Temporary script - will be removed in the future
#
cd `dirname $0`
cd ..
./llama-cli -m ./models/alpaca.13b.ggmlv3.q8_0.bin \
--color \
-f ./prompts/alpaca.txt \
--ctx_size 2048 \
-n -1 \
-ins -b 256 \
--top_k 10000 \
--temp 0.2 \
--repeat_penalty 1.1 \
-t 7

View File

@ -1,15 +0,0 @@
#!/bin/bash
#
# Temporary script - will be removed in the future
#
cd `dirname $0`
cd ..
./llama-cli --color --instruct --threads 4 \
--model ./models/gpt4all-7B/gpt4all-lora-quantized.bin \
--file ./prompts/alpaca.txt \
--batch_size 8 --ctx_size 2048 -n -1 \
--repeat_last_n 64 --repeat_penalty 1.3 \
--n_predict 128 --temp 0.1 --top_k 40 --top_p 0.95

View File

@ -218,20 +218,64 @@ void IMatrixCollector::save_imatrix(int ncall) const {
fname += std::to_string(ncall); fname += std::to_string(ncall);
} }
// avoid writing imatrix entries that do not have full data
// this can happen with MoE models where some of the experts end up not being exercised by the provided training data
int n_entries = 0;
std::vector<std::string> to_store;
bool is_first = true; // for printing
for (const auto & kv : m_stats) {
const int n_all = kv.second.counts.size();
if (n_all == 0) {
continue;
}
int n_zeros = 0;
for (const int c : kv.second.counts) {
if (c == 0) {
n_zeros++;
}
}
if (n_zeros != 0 && is_first) {
fprintf(stderr, "\n");
is_first = false;
}
if (n_zeros == n_all) {
fprintf(stderr, "%s: entry '%40s' has no data - skipping\n", __func__, kv.first.c_str());
continue;
}
if (n_zeros > 0) {
fprintf(stderr, "%s: entry '%40s' has partial data (%.2f%%) - skipping\n", __func__, kv.first.c_str(), 100.0f * (n_all - n_zeros) / n_all);
continue;
}
n_entries++;
to_store.push_back(kv.first);
}
if (to_store.size() < m_stats.size()) {
fprintf(stderr, "%s: warning: storing only %zu out of %zu entries\n", __func__, to_store.size(), m_stats.size());
}
std::ofstream out(fname, std::ios::binary); std::ofstream out(fname, std::ios::binary);
int n_entries = m_stats.size();
out.write((const char *) &n_entries, sizeof(n_entries)); out.write((const char *) &n_entries, sizeof(n_entries));
for (const auto & p : m_stats) { for (const auto & name : to_store) {
int len = p.first.size(); const auto & stat = m_stats.at(name);
int len = name.size();
out.write((const char *) &len, sizeof(len)); out.write((const char *) &len, sizeof(len));
out.write(p.first.c_str(), len); out.write(name.c_str(), len);
out.write((const char *) &p.second.ncall, sizeof(p.second.ncall)); out.write((const char *) &stat.ncall, sizeof(stat.ncall));
int nval = p.second.values.size(); int nval = stat.values.size();
out.write((const char *) &nval, sizeof(nval)); out.write((const char *) &nval, sizeof(nval));
if (nval > 0) { if (nval > 0) {
std::vector<float> tmp(nval); std::vector<float> tmp(nval);
for (int i = 0; i < nval; i++) { for (int i = 0; i < nval; i++) {
tmp[i] = (p.second.values[i] / static_cast<float>(p.second.counts[i])) * static_cast<float>(p.second.ncall); tmp[i] = (stat.values[i] / static_cast<float>(stat.counts[i])) * static_cast<float>(stat.ncall);
} }
out.write((const char*)tmp.data(), nval*sizeof(float)); out.write((const char*)tmp.data(), nval*sizeof(float));
} }

View File

@ -1,18 +0,0 @@
#!/bin/bash
#
# Temporary script - will be removed in the future
#
cd `dirname $0`
cd ..
./llama-cli -m models/available/Llama2/13B/llama-2-13b.ggmlv3.q4_0.bin \
--color \
--ctx_size 2048 \
-n -1 \
-ins -b 256 \
--top_k 10000 \
--temp 0.2 \
--repeat_penalty 1.1 \
-t 8

View File

@ -1,18 +0,0 @@
#!/bin/bash
#
# Temporary script - will be removed in the future
#
cd `dirname $0`
cd ..
./llama-cli -m models/available/Llama2/7B/llama-2-7b.ggmlv3.q4_0.bin \
--color \
--ctx_size 2048 \
-n -1 \
-ins -b 256 \
--top_k 10000 \
--temp 0.2 \
--repeat_penalty 1.1 \
-t 8

View File

@ -6,10 +6,6 @@
#include "ggml-metal.h" #include "ggml-metal.h"
#endif #endif
#ifdef GGML_USE_SYCL
#include "ggml-sycl.h"
#endif
#include "ggml-rpc.h" #include "ggml-rpc.h"
#ifdef _WIN32 #ifdef _WIN32
# include <windows.h> # include <windows.h>
@ -83,12 +79,6 @@ static ggml_backend_t create_backend() {
if (!backend) { if (!backend) {
fprintf(stderr, "%s: ggml_backend_metal_init() failed\n", __func__); fprintf(stderr, "%s: ggml_backend_metal_init() failed\n", __func__);
} }
#elif GGML_USE_SYCL
fprintf(stderr, "%s: using SYCL backend\n", __func__);
backend = ggml_backend_sycl_init(0); // init device 0
if (!backend) {
fprintf(stderr, "%s: ggml_backend_sycl_init() failed\n", __func__);
}
#endif #endif
// if there aren't GPU Backends fallback to CPU backend // if there aren't GPU Backends fallback to CPU backend

View File

@ -416,7 +416,7 @@
message = html`<${Probabilities} data=${data} />` message = html`<${Probabilities} data=${data} />`
} else { } else {
const text = isArrayMessage ? const text = isArrayMessage ?
data.map(msg => msg.content).join('').replace(/^\s+/, '') : data.map(msg => msg.content).join('') :
data; data;
message = isCompletionMode ? message = isCompletionMode ?
text : text :

View File

@ -147,7 +147,7 @@ struct server_slot {
int32_t n_prompt_tokens = 0; int32_t n_prompt_tokens = 0;
int32_t n_prompt_tokens_processed = 0; int32_t n_prompt_tokens_processed = 0;
json prompt; std::string prompt;
// when a task is submitted, we first tokenize the prompt and store it here // when a task is submitted, we first tokenize the prompt and store it here
std::vector<llama_token> prompt_tokens; std::vector<llama_token> prompt_tokens;
@ -822,13 +822,8 @@ struct server_context {
continue; continue;
} }
// skip the slot if it does not contains prompt
if (!slot.prompt.is_string()) {
continue;
}
// current slot's prompt // current slot's prompt
std::string slot_prompt = slot.prompt.get<std::string>(); std::string slot_prompt = slot.prompt;
// length of the current slot's prompt // length of the current slot's prompt
int slot_prompt_len = slot_prompt.size(); int slot_prompt_len = slot_prompt.size();
@ -958,13 +953,16 @@ struct server_context {
if (!task.infill) { if (!task.infill) {
const auto & prompt = data.find("prompt"); const auto & prompt = data.find("prompt");
if (prompt == data.end()) { if (prompt == data.end()) {
send_error(task, "Either \"prompt\" or \"messages\" must be provided", ERROR_TYPE_INVALID_REQUEST); send_error(task, "\"prompt\" must be provided", ERROR_TYPE_INVALID_REQUEST);
return false; return false;
} else {
slot.prompt = *prompt;
} }
if (slot.prompt.is_array() && slot.prompt.size() == 0) {
send_error(task, "\"prompt\" cannot be an empty array", ERROR_TYPE_INVALID_REQUEST); if (prompt->is_string()) {
slot.prompt = prompt->get<std::string>();
} else if (prompt->is_array() && prompt->size() == 1 && prompt->at(0).is_string()) {
slot.prompt = prompt->at(0).get<std::string>();
} else {
send_error(task, "\"prompt\" must be a string or an array of strings", ERROR_TYPE_INVALID_REQUEST);
return false; return false;
} }
} }
@ -1582,14 +1580,18 @@ struct server_context {
switch (task.type) { switch (task.type) {
case SERVER_TASK_TYPE_COMPLETION: case SERVER_TASK_TYPE_COMPLETION:
{ {
int id_slot = json_value(task.data, "id_slot", -1); const int id_slot = json_value(task.data, "id_slot", -1);
std::string prompt = json_value(task.data, "prompt", std::string());
server_slot * slot; server_slot * slot;
if (id_slot != -1) { if (id_slot != -1) {
slot = get_slot_by_id(id_slot); slot = get_slot_by_id(id_slot);
} else { } else {
std::string prompt;
if (task.data.contains("prompt") && task.data.at("prompt").is_string()) {
json_value(task.data, "prompt", std::string());
}
slot = get_available_slot(prompt); slot = get_available_slot(prompt);
} }

6
flake.lock generated
View File

@ -20,11 +20,11 @@
}, },
"nixpkgs": { "nixpkgs": {
"locked": { "locked": {
"lastModified": 1716948383, "lastModified": 1717786204,
"narHash": "sha256-SzDKxseEcHR5KzPXLwsemyTR/kaM9whxeiJohbL04rs=", "narHash": "sha256-4q0s6m0GUcN7q+Y2DqD27iLvbcd1G50T2lv08kKxkSI=",
"owner": "NixOS", "owner": "NixOS",
"repo": "nixpkgs", "repo": "nixpkgs",
"rev": "ad57eef4ef0659193044870c731987a6df5cf56b", "rev": "051f920625ab5aabe37c920346e3e69d7d34400e",
"type": "github" "type": "github"
}, },
"original": { "original": {

View File

@ -1347,10 +1347,30 @@ static void ggml_cuda_set_peer_access(const int n_tokens, int main_device) {
GGML_UNUSED(main_device); GGML_UNUSED(main_device);
} }
static cudaError_t ggml_cuda_Memcpy2DPeerAsync(
void * dst, int dstDevice, size_t dpitch, void * src, int srcDevice, size_t spitch, size_t width, size_t height, cudaStream_t stream) {
#if !defined(GGML_USE_HIPBLAS)
// cudaMemcpy2DAsync may fail with copies between vmm pools of different devices
cudaMemcpy3DPeerParms p = {};
p.dstDevice = dstDevice;
p.dstPtr = make_cudaPitchedPtr(dst, dpitch, dpitch, height);
p.srcDevice = srcDevice;
p.srcPtr = make_cudaPitchedPtr(src, spitch, spitch, height);
p.extent = make_cudaExtent(width, height, 1);
return cudaMemcpy3DPeerAsync(&p, stream);
#else
// HIP does not support cudaMemcpy3DPeerAsync or vmm pools
GGML_UNUSED(dstDevice);
GGML_UNUSED(srcDevice);
return cudaMemcpy2DAsync(dst, dpitch, src, spitch, width, height, cudaMemcpyDeviceToDevice, stream);
#endif // !defined(GGML_USE_HIPBLAS)
}
static void ggml_cuda_op_mul_mat( static void ggml_cuda_op_mul_mat(
ggml_backend_cuda_context & ctx, ggml_backend_cuda_context & ctx,
const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst, ggml_cuda_op_mul_mat_t op, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst, ggml_cuda_op_mul_mat_t op,
const bool convert_src1_to_q8_1) { quantize_cuda_t quantize_src1) {
const int64_t ne00 = src0->ne[0]; const int64_t ne00 = src0->ne[0];
const int64_t ne01 = src0->ne[1]; const int64_t ne01 = src0->ne[1];
@ -1407,7 +1427,9 @@ static void ggml_cuda_op_mul_mat(
} }
struct dev_data { struct dev_data {
ggml_cuda_pool_alloc<char> src0_dd_alloc; int cc;
ggml_cuda_pool_alloc<char> src0_dd_alloc;
ggml_cuda_pool_alloc<float> src1_ddf_alloc; ggml_cuda_pool_alloc<float> src1_ddf_alloc;
ggml_cuda_pool_alloc<char> src1_ddq_alloc; ggml_cuda_pool_alloc<char> src1_ddq_alloc;
ggml_cuda_pool_alloc<float> dst_dd_alloc; ggml_cuda_pool_alloc<float> dst_dd_alloc;
@ -1426,6 +1448,8 @@ static void ggml_cuda_op_mul_mat(
int used_devices = 0; int used_devices = 0;
for (int id = 0; id < ggml_backend_cuda_get_device_count(); ++id) { for (int id = 0; id < ggml_backend_cuda_get_device_count(); ++id) {
dev[id].cc = ggml_cuda_info().devices[id].cc;
// by default, use all rows // by default, use all rows
dev[id].row_low = 0; dev[id].row_low = 0;
dev[id].row_high = ne01; dev[id].row_high = ne01;
@ -1476,11 +1500,15 @@ static void ggml_cuda_op_mul_mat(
dev[id].src1_ddf = dev[id].src1_ddf_alloc.alloc(ctx.pool(id), ggml_nelements(src1)); dev[id].src1_ddf = dev[id].src1_ddf_alloc.alloc(ctx.pool(id), ggml_nelements(src1));
} }
if (convert_src1_to_q8_1) { if (quantize_src1) {
dev[id].src1_ddq = dev[id].src1_ddq_alloc.alloc(ctx.pool(id), nrows1*src1_padded_col_size*q8_1_ts/q8_1_bs); size_t src_1_ddq_size = nrows1*src1_padded_col_size*q8_1_ts/q8_1_bs;
if (quantize_src1 == quantize_mmq_q8_1_cuda) {
src_1_ddq_size += get_mmq_x_max_host(dev[id].cc)*sizeof(block_q8_1_mmq);
}
dev[id].src1_ddq = dev[id].src1_ddq_alloc.alloc(ctx.pool(id), src_1_ddq_size);
if (src1_on_device && src1_is_contiguous) { if (src1_on_device && src1_is_contiguous) {
quantize_row_q8_1_cuda(dev[id].src1_ddf, dev[id].src1_ddq, ne10, nrows1, src1_padded_col_size, stream); quantize_src1(dev[id].src1_ddf, dev[id].src1_ddq, ne10, ne11, ne12*ne13, src1_padded_col_size, src0->type, stream);
CUDA_CHECK(cudaGetLastError()); CUDA_CHECK(cudaGetLastError());
} }
} }
@ -1526,7 +1554,12 @@ static void ggml_cuda_op_mul_mat(
const int64_t i03 = i0 / ne12; const int64_t i03 = i0 / ne12;
const int64_t i02 = i0 % ne12; const int64_t i02 = i0 % ne12;
const size_t src1_ddq_i_offset = (i0*ne11 + src1_col_0) * src1_padded_col_size*q8_1_ts/q8_1_bs; size_t src1_ddq_i_offset = i0*ne11 * src1_padded_col_size*q8_1_ts/q8_1_bs;
if (quantize_src1 == quantize_mmq_q8_1_cuda) {
src1_ddq_i_offset += src1_col_0 * sizeof(block_q8_1_mmq);
} else {
src1_ddq_i_offset += src1_col_0 * src1_padded_col_size*q8_1_ts/q8_1_bs;
}
// for split tensors the data begins at i0 == i0_offset_low // for split tensors the data begins at i0 == i0_offset_low
char * src0_dd_i = dev[id].src0_dd + (i0/i02_divisor) * (ne01*ne00*src0_ts)/src0_bs; char * src0_dd_i = dev[id].src0_dd + (i0/i02_divisor) * (ne01*ne00*src0_ts)/src0_bs;
@ -1543,10 +1576,17 @@ static void ggml_cuda_op_mul_mat(
// copy src0, src1 to device if necessary // copy src0, src1 to device if necessary
if (src1_is_contiguous) { if (src1_is_contiguous) {
if (id != ctx.device) { if (id != ctx.device) {
if (convert_src1_to_q8_1) { if (quantize_src1) {
char * src1_ddq_i_source = dev[ctx.device].src1_ddq + src1_ddq_i_offset; char * src1_ddq_i_source = dev[ctx.device].src1_ddq + src1_ddq_i_offset;
CUDA_CHECK(cudaMemcpyPeerAsync(src1_ddq_i, id, src1_ddq_i_source, ctx.device, if (quantize_src1 == quantize_mmq_q8_1_cuda) {
src1_ncols*src1_padded_col_size*q8_1_ts/q8_1_bs, stream)); const size_t pitch = ne11*sizeof(block_q8_1_mmq);
const size_t width = src1_ncols*sizeof(block_q8_1_mmq);
const size_t height = src1_padded_col_size/(4*QK8_1);
CUDA_CHECK(ggml_cuda_Memcpy2DPeerAsync(src1_ddq_i, id, pitch, src1_ddq_i_source, ctx.device, pitch, width, height, stream));
} else {
CUDA_CHECK(cudaMemcpyPeerAsync(
src1_ddq_i, id, src1_ddq_i_source, ctx.device, src1_ncols*src1_padded_col_size*q8_1_ts/q8_1_bs, stream));
}
} else { } else {
float * src1_ddf_i_source = (float *) src1->data; float * src1_ddf_i_source = (float *) src1->data;
src1_ddf_i_source += (i0*ne11 + src1_col_0) * ne10; src1_ddf_i_source += (i0*ne11 + src1_col_0) * ne10;
@ -1561,8 +1601,8 @@ static void ggml_cuda_op_mul_mat(
GGML_ASSERT(false); GGML_ASSERT(false);
} }
if (convert_src1_to_q8_1 && !src1_is_contiguous) { if (quantize_src1 && !src1_is_contiguous) {
quantize_row_q8_1_cuda(src1_ddf_i, src1_ddq_i, ne10, src1_ncols, src1_padded_col_size, stream); quantize_src1(src1_ddf_i, src1_ddq_i, ne10, src1_ncols, 1, src1_padded_col_size, src0->type, stream);
CUDA_CHECK(cudaGetLastError()); CUDA_CHECK(cudaGetLastError());
} }
@ -1587,22 +1627,8 @@ static void ggml_cuda_op_mul_mat(
float * dhf_dst_i = (float *) ((char *) dst_off_device + i02*nb2 + i03*nb3); float * dhf_dst_i = (float *) ((char *) dst_off_device + i02*nb2 + i03*nb3);
GGML_ASSERT(dst->nb[1] == ne0*sizeof(float)); GGML_ASSERT(dst->nb[1] == ne0*sizeof(float));
dhf_dst_i += src1_col_0*ne0 + dev[id].row_low; dhf_dst_i += src1_col_0*ne0 + dev[id].row_low;
#if !defined(GGML_USE_HIPBLAS) CUDA_CHECK(ggml_cuda_Memcpy2DPeerAsync(
// cudaMemcpy2DAsync may fail with copies between vmm pools of different devices dhf_dst_i, ctx.device, ne0*sizeof(float), dst_dd_i, id, row_diff*sizeof(float), row_diff*sizeof(float), src1_ncols, stream));
cudaMemcpy3DPeerParms p = {};
p.dstDevice = ctx.device;
p.dstPtr = make_cudaPitchedPtr(dhf_dst_i, ne0*sizeof(float), row_diff, src1_ncols);
p.srcDevice = id;
p.srcPtr = make_cudaPitchedPtr(dst_dd_i, row_diff*sizeof(float), row_diff, src1_ncols);
p.extent = make_cudaExtent(row_diff*sizeof(float), src1_ncols, 1);
CUDA_CHECK(cudaMemcpy3DPeerAsync(&p, stream));
#else
// HIP does not support cudaMemcpy3DPeerAsync or vmm pools
CUDA_CHECK(cudaMemcpy2DAsync(dhf_dst_i, ne0*sizeof(float),
dst_dd_i, row_diff*sizeof(float),
row_diff*sizeof(float), src1_ncols,
cudaMemcpyDeviceToDevice, stream));
#endif
} else { } else {
float * dhf_dst_i = (float *) ((char *) dst_off_device + i02*nb2 + i03*nb3); float * dhf_dst_i = (float *) ((char *) dst_off_device + i02*nb2 + i03*nb3);
GGML_ASSERT(dst->nb[1] == ne0*sizeof(float)); GGML_ASSERT(dst->nb[1] == ne0*sizeof(float));
@ -1941,13 +1967,13 @@ static void ggml_cuda_mul_mat(ggml_backend_cuda_context & ctx, const ggml_tensor
// KQ + KQV multi-batch // KQ + KQV multi-batch
ggml_cuda_mul_mat_batched_cublas(ctx, src0, src1, dst); ggml_cuda_mul_mat_batched_cublas(ctx, src0, src1, dst);
} else if (use_dequantize_mul_mat_vec) { } else if (use_dequantize_mul_mat_vec) {
ggml_cuda_op_mul_mat(ctx, src0, src1, dst, ggml_cuda_op_dequantize_mul_mat_vec, false); ggml_cuda_op_mul_mat(ctx, src0, src1, dst, ggml_cuda_op_dequantize_mul_mat_vec, nullptr);
} else if (use_mul_mat_vec_q) { } else if (use_mul_mat_vec_q) {
ggml_cuda_op_mul_mat(ctx, src0, src1, dst, ggml_cuda_op_mul_mat_vec_q, true); ggml_cuda_op_mul_mat(ctx, src0, src1, dst, ggml_cuda_op_mul_mat_vec_q, quantize_row_q8_1_cuda);
} else if (use_mul_mat_q) { } else if (use_mul_mat_q) {
ggml_cuda_op_mul_mat(ctx, src0, src1, dst, ggml_cuda_op_mul_mat_q, true); ggml_cuda_op_mul_mat(ctx, src0, src1, dst, ggml_cuda_op_mul_mat_q, quantize_mmq_q8_1_cuda);
} else { } else {
ggml_cuda_op_mul_mat(ctx, src0, src1, dst, ggml_cuda_op_mul_mat_cublas, false); ggml_cuda_op_mul_mat(ctx, src0, src1, dst, ggml_cuda_op_mul_mat_cublas, nullptr);
} }
} }

View File

@ -139,6 +139,7 @@
#define CC_PASCAL 600 #define CC_PASCAL 600
#define MIN_CC_DP4A 610 // minimum compute capability for __dp4a, an intrinsic for byte-wise dot products #define MIN_CC_DP4A 610 // minimum compute capability for __dp4a, an intrinsic for byte-wise dot products
#define CC_VOLTA 700 #define CC_VOLTA 700
#define CC_TURING 750
#define CC_AMPERE 800 #define CC_AMPERE 800
#define CC_OFFSET_AMD 1000000 #define CC_OFFSET_AMD 1000000
#define CC_RDNA1 (CC_OFFSET_AMD + 1010) #define CC_RDNA1 (CC_OFFSET_AMD + 1010)
@ -326,9 +327,17 @@ static __device__ __forceinline__ half2 __shfl_xor(half2 var, int laneMask, int
#endif // defined(__HIP_PLATFORM_AMD__) && HIP_VERSION < 50600000 #endif // defined(__HIP_PLATFORM_AMD__) && HIP_VERSION < 50600000
#endif // defined(GGML_USE_HIPBLAS) #endif // defined(GGML_USE_HIPBLAS)
#define FP16_AVAILABLE (defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) || __CUDA_ARCH__ >= CC_PASCAL #if (defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) || __CUDA_ARCH__ >= CC_PASCAL
#define FP16_AVAILABLE
#endif // (defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) || __CUDA_ARCH__ >= CC_PASCAL
#define FP16_MMA_AVAILABLE !(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) && __CUDA_ARCH__ >= CC_VOLTA #if !(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) && __CUDA_ARCH__ >= CC_VOLTA
#define FP16_MMA_AVAILABLE
#endif // !(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) && __CUDA_ARCH__ >= CC_VOLTA
#if !(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) && __CUDA_ARCH__ >= CC_TURING
#define INT8_MMA_AVAILABLE
#endif // !(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) && __CUDA_ARCH__ >= CC_TURING
static bool fast_fp16_available(const int cc) { static bool fast_fp16_available(const int cc) {
return cc >= CC_PASCAL && cc != 610; return cc >= CC_PASCAL && cc != 610;
@ -338,6 +347,10 @@ static bool fp16_mma_available(const int cc) {
return cc < CC_OFFSET_AMD && cc >= CC_VOLTA; return cc < CC_OFFSET_AMD && cc >= CC_VOLTA;
} }
static bool int8_mma_available(const int cc) {
return cc < CC_OFFSET_AMD && cc >= CC_TURING;
}
[[noreturn]] [[noreturn]]
static __device__ void no_device_code( static __device__ void no_device_code(
const char * file_name, const int line, const char * function_name, const int arch, const char * arch_list) { const char * file_name, const int line, const char * function_name, const int arch, const char * arch_list) {
@ -379,7 +392,7 @@ static __device__ __forceinline__ float2 warp_reduce_sum(float2 a) {
} }
static __device__ __forceinline__ half2 warp_reduce_sum(half2 a) { static __device__ __forceinline__ half2 warp_reduce_sum(half2 a) {
#if FP16_AVAILABLE #ifdef FP16_AVAILABLE
#if defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__) #if defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)
#pragma unroll #pragma unroll
@ -412,7 +425,7 @@ static __device__ __forceinline__ float warp_reduce_max(float x) {
} }
static __device__ __forceinline__ half ggml_cuda_hmax(const half a, const half b) { static __device__ __forceinline__ half ggml_cuda_hmax(const half a, const half b) {
#if FP16_AVAILABLE #ifdef FP16_AVAILABLE
#if !(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) && CUDART_VERSION < CUDART_HMAX #if !(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) && CUDART_VERSION < CUDART_HMAX
return __float2half(fmaxf(__half2float(a), __half2float(b))); return __float2half(fmaxf(__half2float(a), __half2float(b)));

View File

@ -74,7 +74,7 @@ static __device__ __forceinline__ T vec_dot_fattn_vec_KQ_q4_0(
const int sumi = __dp4a(v, u, 0); const int sumi = __dp4a(v, u, 0);
#if FP16_AVAILABLE #ifdef FP16_AVAILABLE
if (std::is_same<T, half>::value) { if (std::is_same<T, half>::value) {
const half2 * Q_ds = (const half2 *) Q_ds_v; const half2 * Q_ds = (const half2 *) Q_ds_v;
@ -122,7 +122,7 @@ static __device__ __forceinline__ T vec_dot_fattn_vec_KQ_q4_1(
const int sumi = __dp4a(v, u, 0); const int sumi = __dp4a(v, u, 0);
#if FP16_AVAILABLE #ifdef FP16_AVAILABLE
if (std::is_same<T, half>::value) { if (std::is_same<T, half>::value) {
const half2 * Q_ds = (const half2 *) Q_ds_v; const half2 * Q_ds = (const half2 *) Q_ds_v;
@ -181,7 +181,7 @@ static __device__ __forceinline__ T vec_dot_fattn_vec_KQ_q5_0(
const int sumi = __dp4a(v, u, 0); const int sumi = __dp4a(v, u, 0);
#if FP16_AVAILABLE #ifdef FP16_AVAILABLE
if (std::is_same<T, half>::value) { if (std::is_same<T, half>::value) {
const half2 * Q_ds = (const half2 *) Q_ds_v; const half2 * Q_ds = (const half2 *) Q_ds_v;
@ -236,7 +236,7 @@ static __device__ __forceinline__ T vec_dot_fattn_vec_KQ_q5_1(
const int sumi = __dp4a(v, u, 0); const int sumi = __dp4a(v, u, 0);
#if FP16_AVAILABLE #ifdef FP16_AVAILABLE
if (std::is_same<T, half>::value) { if (std::is_same<T, half>::value) {
const half2 * Q_ds = (const half2 *) Q_ds_v; const half2 * Q_ds = (const half2 *) Q_ds_v;
@ -314,7 +314,7 @@ static __device__ __forceinline__ T vec_dot_fattn_vec_KQ_f16(
GGML_UNUSED(Q_q8); GGML_UNUSED(Q_q8);
GGML_UNUSED(Q_ds_v); GGML_UNUSED(Q_ds_v);
#if FP16_AVAILABLE #ifdef FP16_AVAILABLE
if (std::is_same<T, half>::value) { if (std::is_same<T, half>::value) {
const half2 * Q_h2 = (const half2 *) Q_v; const half2 * Q_h2 = (const half2 *) Q_v;
@ -407,7 +407,7 @@ static __device__ __forceinline__ T dequantize_1_q4_0(const void * __restrict__
const int q0 = x[ib].qs[iqs]; const int q0 = x[ib].qs[iqs];
const int q = ((q0 >> (4*shift)) & 0x0F) - 8; const int q = ((q0 >> (4*shift)) & 0x0F) - 8;
#if FP16_AVAILABLE #ifdef FP16_AVAILABLE
if (std::is_same<T, half>::value) { if (std::is_same<T, half>::value) {
return ((half) d)*((half) q); return ((half) d)*((half) q);
} }
@ -428,7 +428,7 @@ static __device__ __forceinline__ T dequantize_1_q4_1(const void * __restrict__
const int q0 = x[ib].qs[iqs]; const int q0 = x[ib].qs[iqs];
const int q = ((q0 >> (4*shift)) & 0x0F); const int q = ((q0 >> (4*shift)) & 0x0F);
#if FP16_AVAILABLE #ifdef FP16_AVAILABLE
if (std::is_same<T, half>::value) { if (std::is_same<T, half>::value) {
return __low2half(dm)*((half) q) + __high2half(dm); return __low2half(dm)*((half) q) + __high2half(dm);
} }
@ -453,7 +453,7 @@ static __device__ __forceinline__ T dequantize_1_q5_0(const void * __restrict__
const int qh = ((qh0 >> idq) << 4) & 0x10; const int qh = ((qh0 >> idq) << 4) & 0x10;
const int q = (ql | qh) - 16; const int q = (ql | qh) - 16;
#if FP16_AVAILABLE #ifdef FP16_AVAILABLE
if (std::is_same<T, half>::value) { if (std::is_same<T, half>::value) {
return ((half) d)*((half) q); return ((half) d)*((half) q);
} }
@ -478,7 +478,7 @@ static __device__ __forceinline__ T dequantize_1_q5_1(const void * __restrict__
const int qh = ((qh0 >> idq) << 4) & 0x10; const int qh = ((qh0 >> idq) << 4) & 0x10;
const int q = (ql | qh); const int q = (ql | qh);
#if FP16_AVAILABLE #ifdef FP16_AVAILABLE
if (std::is_same<T, half>::value) { if (std::is_same<T, half>::value) {
return __low2half(dm)*((half) q) + __high2half(dm); return __low2half(dm)*((half) q) + __high2half(dm);
} }
@ -497,7 +497,7 @@ static __device__ __forceinline__ T dequantize_1_q8_0(const void * __restrict__
const T d = x[ib].d; const T d = x[ib].d;
const int q = x[ib].qs[iqs]; const int q = x[ib].qs[iqs];
#if FP16_AVAILABLE #ifdef FP16_AVAILABLE
if (std::is_same<T, half>::value) { if (std::is_same<T, half>::value) {
return ((half) d)*((half) q); return ((half) d)*((half) q);
} }

View File

@ -43,7 +43,7 @@ static __global__ void flash_attn_tile_ext_f16(
const int ne1, const int ne1,
const int ne2, const int ne2,
const int ne3) { const int ne3) {
#if FP16_AVAILABLE #ifdef FP16_AVAILABLE
//In this kernel Q, K, V are matrices while i, j, k are matrix indices. //In this kernel Q, K, V are matrices while i, j, k are matrix indices.
const int ic0 = (blockIdx.x / parallel_blocks) * ncols; // Index of the Q/QKV column to work on. const int ic0 = (blockIdx.x / parallel_blocks) * ncols; // Index of the Q/QKV column to work on.

View File

@ -40,7 +40,7 @@ static __global__ void flash_attn_vec_ext_f16(
const int ne1, const int ne1,
const int ne2, const int ne2,
const int ne3) { const int ne3) {
#if FP16_AVAILABLE #ifdef FP16_AVAILABLE
//In this kernel Q, K, V are matrices while i, j, k are matrix indices. //In this kernel Q, K, V are matrices while i, j, k are matrix indices.
constexpr vec_dot_KQ_f16_t vec_dot_KQ = get_vec_dot_KQ_f16<D>(type_K); constexpr vec_dot_KQ_f16_t vec_dot_KQ = get_vec_dot_KQ_f16<D>(type_K);

View File

@ -1,9 +1,9 @@
#include "common.cuh" #include "common.cuh"
#include "fattn-common.cuh" #include "fattn-common.cuh"
#if FP16_MMA_AVAILABLE #ifdef FP16_MMA_AVAILABLE
#include <mma.h> #include <mma.h>
#endif #endif // FP16_MMA_AVAILABLE
// D == head size, VKQ_stride == num VKQ rows calculated in parallel: // D == head size, VKQ_stride == num VKQ rows calculated in parallel:
template<int D, int ncols, int nwarps, int VKQ_stride, int parallel_blocks, typename KQ_acc_t> template<int D, int ncols, int nwarps, int VKQ_stride, int parallel_blocks, typename KQ_acc_t>
@ -45,7 +45,7 @@ static __global__ void flash_attn_ext_f16(
const int ne1, const int ne1,
const int ne2, const int ne2,
const int ne3) { const int ne3) {
#if FP16_MMA_AVAILABLE #ifdef FP16_MMA_AVAILABLE
//In this kernel Q, K, V are matrices while i, j, k are matrix indices. //In this kernel Q, K, V are matrices while i, j, k are matrix indices.
const int ic0 = ncols*(blockIdx.x / parallel_blocks); // Index of the first Q/QKV column to work on. const int ic0 = ncols*(blockIdx.x / parallel_blocks); // Index of the first Q/QKV column to work on.

95
ggml-cuda/mma.cuh Normal file
View File

@ -0,0 +1,95 @@
#include "common.cuh"
struct mma_int_A_I16K8 {
static constexpr int I = 16;
static constexpr int K = 8;
static constexpr int ne = 4;
int x[ne] = {0};
static __device__ __forceinline__ int get_i(const int l) {
const int ret = (l%2) * (I/2) + threadIdx.x / (K/2);
GGML_CUDA_ASSUME(ret >= 0);
GGML_CUDA_ASSUME(ret < I);
return ret;
}
static __device__ __forceinline__ int get_k(const int l) {
const int ret = (l/2) * (K/2) + threadIdx.x % (K/2);
GGML_CUDA_ASSUME(ret >= 0);
GGML_CUDA_ASSUME(ret < K);
return ret;
}
};
struct mma_int_B_J8K8 {
static constexpr int J = 8;
static constexpr int K = 8;
static constexpr int ne = 2;
int x[ne] = {0};
static __device__ __forceinline__ int get_j(const int /* l */) {
const int ret = threadIdx.x / (K/2);
GGML_CUDA_ASSUME(ret >= 0);
GGML_CUDA_ASSUME(ret < J);
return ret;
}
static __device__ __forceinline__ int get_k(const int l) {
const int ret = l * (K/2) + threadIdx.x % (K/2);
GGML_CUDA_ASSUME(ret >= 0);
GGML_CUDA_ASSUME(ret < K);
return ret;
}
};
struct mma_int_C_I16J8 {
static constexpr int I = 16;
static constexpr int J = 8;
static constexpr int ne = 4;
int x[ne] = {0};
static __device__ __forceinline__ int get_i(const int l) {
const int ret = (l/2) * (I/2) + threadIdx.x / (J/2);
GGML_CUDA_ASSUME(ret >= 0);
GGML_CUDA_ASSUME(ret < I);
return ret;
}
static __device__ __forceinline__ int get_j(const int l) {
const int ret = 2 * (threadIdx.x % (J/2)) + l%2;
GGML_CUDA_ASSUME(ret >= 0);
GGML_CUDA_ASSUME(ret < J);
return ret;
}
__device__ __forceinline__ void mma_K8(const mma_int_A_I16K8 & mma_A, const mma_int_B_J8K8 & mma_B) {
#ifdef INT8_MMA_AVAILABLE
#if __CUDA_ARCH__ >= CC_AMPERE
asm("mma.sync.aligned.m16n8k32.row.col.s32.s8.s8.s32 {%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9}, {%0, %1, %2, %3};"
: "+r"(x[0]), "+r"(x[1]), "+r"(x[2]), "+r"(x[3])
: "r"(mma_A.x[0]), "r"(mma_A.x[1]), "r"(mma_A.x[2]), "r"(mma_A.x[3]), "r"(mma_B.x[0]), "r"(mma_B.x[1]));
#else
// On Turing m16n8k32 mma is not available, use 4x m8n8k16 mma instead:
asm("mma.sync.aligned.m8n8k16.row.col.s32.s8.s8.s32 {%0, %1}, {%2}, {%3}, {%0, %1};"
: "+r"(x[0]), "+r"(x[1])
: "r"(mma_A.x[0]), "r"(mma_B.x[0]));
asm("mma.sync.aligned.m8n8k16.row.col.s32.s8.s8.s32 {%0, %1}, {%2}, {%3}, {%0, %1};"
: "+r"(x[2]), "+r"(x[3])
: "r"(mma_A.x[1]), "r"(mma_B.x[0]));
asm("mma.sync.aligned.m8n8k16.row.col.s32.s8.s8.s32 {%0, %1}, {%2}, {%3}, {%0, %1};"
: "+r"(x[0]), "+r"(x[1])
: "r"(mma_A.x[2]), "r"(mma_B.x[1]));
asm("mma.sync.aligned.m8n8k16.row.col.s32.s8.s8.s32 {%0, %1}, {%2}, {%3}, {%0, %1};"
: "+r"(x[2]), "+r"(x[3])
: "r"(mma_A.x[3]), "r"(mma_B.x[1]));
#endif // __CUDA_ARCH__ >= CC_AMPERE
#else
GGML_UNUSED(mma_A);
GGML_UNUSED(mma_B);
NO_DEVICE_CODE;
#endif // INT8_MMA_AVAILABLE
}
};

View File

@ -11,6 +11,7 @@ void ggml_cuda_op_mul_mat_q(
const int64_t nb01 = src0->nb[1]; const int64_t nb01 = src0->nb[1];
const int64_t ne10 = src1->ne[0]; const int64_t ne10 = src1->ne[0];
const int64_t ne11 = src1->ne[1];
GGML_ASSERT(ne10 % QK8_1 == 0); GGML_ASSERT(ne10 % QK8_1 == 0);
const int64_t ne0 = dst->ne[0]; const int64_t ne0 = dst->ne[0];
@ -25,7 +26,7 @@ void ggml_cuda_op_mul_mat_q(
// nrows_dst == nrows of the matrix that the kernel writes into // nrows_dst == nrows of the matrix that the kernel writes into
const int64_t nrows_dst = id == ctx.device ? ne0 : row_diff; const int64_t nrows_dst = id == ctx.device ? ne0 : row_diff;
const mmq_args args = {src0_dd_i, src1_ddq_i, dst_dd_i, ne00, row_diff, stride00, src1_padded_row_size, src1_ncols, nrows_dst}; const mmq_args args = {src0_dd_i, src1_ddq_i, dst_dd_i, ne00, row_diff, stride00, src1_padded_row_size, src1_ncols, ne11, nrows_dst};
switch (src0->type) { switch (src0->type) {
case GGML_TYPE_Q4_0: case GGML_TYPE_Q4_0:

File diff suppressed because it is too large Load Diff

View File

@ -1,22 +1,23 @@
#include "quantize.cuh" #include "quantize.cuh"
#include <cstdint>
static __global__ void quantize_q8_1(const float * __restrict__ x, void * __restrict__ vy, const int64_t kx, const int64_t kx_padded) { static __global__ void quantize_q8_1(const float * __restrict__ x, void * __restrict__ vy, const int64_t kx, const int64_t kx0_padded) {
const int64_t ix = (int64_t)blockDim.x*blockIdx.x + threadIdx.x; const int64_t ix0 = (int64_t)blockDim.x*blockIdx.x + threadIdx.x;
if (ix >= kx_padded) { if (ix0 >= kx0_padded) {
return; return;
} }
const int64_t iy = (int64_t)blockDim.y*blockIdx.y + threadIdx.y; const int64_t ix1 = blockIdx.y;
const int64_t i_padded = (int64_t)iy*kx_padded + ix; const int64_t i_padded = ix1*kx0_padded + ix0;
block_q8_1 * y = (block_q8_1 *) vy; block_q8_1 * y = (block_q8_1 *) vy;
const int64_t ib = i_padded / QK8_1; // block index const int64_t ib = i_padded / QK8_1; // block index
const int64_t iqs = i_padded % QK8_1; // quant index const int64_t iqs = i_padded % QK8_1; // quant index
const float xi = ix < kx ? x[iy*kx + ix] : 0.0f; const float xi = ix0 < kx ? x[ix1*kx + ix0] : 0.0f;
float amax = fabsf(xi); float amax = fabsf(xi);
float sum = xi; float sum = xi;
@ -36,10 +37,76 @@ static __global__ void quantize_q8_1(const float * __restrict__ x, void * __rest
reinterpret_cast<half&>(y[ib].ds.y) = sum; reinterpret_cast<half&>(y[ib].ds.y) = sum;
} }
void quantize_row_q8_1_cuda(const float * x, void * vy, const int64_t kx, const int64_t ky, const int64_t kx_padded, cudaStream_t stream) { template <bool need_sum>
const int64_t block_num_x = (kx_padded + CUDA_QUANTIZE_BLOCK_SIZE - 1) / CUDA_QUANTIZE_BLOCK_SIZE; static __global__ void quantize_mmq_q8_1(
const dim3 num_blocks(block_num_x, ky, 1); const float * __restrict__ x, void * __restrict__ vy, const int64_t kx0, const int64_t kx1, const int64_t kx0_padded) {
const dim3 block_size(CUDA_QUANTIZE_BLOCK_SIZE, 1, 1);
quantize_q8_1<<<num_blocks, block_size, 0, stream>>>(x, vy, kx, kx_padded); const int64_t ix0 = (int64_t)blockDim.x*blockIdx.x + threadIdx.x;
if (ix0 >= kx0_padded) {
return;
}
const int64_t ix1 = kx1*blockIdx.z + blockIdx.y;
block_q8_1_mmq * y = (block_q8_1_mmq *) vy;
const int64_t ib0 = blockIdx.z*(gridDim.y*gridDim.x*blockDim.x/(4*QK8_1)); // first block of channel
const int64_t ib = ib0 + (ix0 / (4*QK8_1))*kx1 + blockIdx.y; // block index in channel
const int64_t iqs = ix0 % (4*QK8_1); // quant index in block
const float xi = ix0 < kx0 ? x[ix1*kx0 + ix0] : 0.0f;
float amax = fabsf(xi);
amax = warp_reduce_max(amax);
float sum;
if (need_sum) {
sum = warp_reduce_sum(xi);
}
const float d = amax / 127;
const int8_t q = amax == 0.0f ? 0 : roundf(xi / d);
y[ib].qs[iqs] = q;
if (iqs % QK8_1 != 0) {
return;
}
if (need_sum) {
y[ib].ds[iqs/QK8_1] = make_half2(d, sum);
} else {
((float *) y[ib].ds)[iqs/QK8_1] = d;
}
} }
void quantize_row_q8_1_cuda(
const float * x, void * vy, const int64_t kx0, const int64_t kx1, const int64_t channels,
const int64_t kx0_padded, const ggml_type type_x, cudaStream_t stream) {
GGML_ASSERT(kx0_padded % QK8_1 == 0);
const int64_t block_num_x = (kx0_padded + CUDA_QUANTIZE_BLOCK_SIZE - 1) / CUDA_QUANTIZE_BLOCK_SIZE;
const dim3 num_blocks(block_num_x, kx1*channels, 1);
const dim3 block_size(CUDA_QUANTIZE_BLOCK_SIZE, 1, 1);
quantize_q8_1<<<num_blocks, block_size, 0, stream>>>(x, vy, kx0, kx0_padded);
GGML_UNUSED(type_x);
}
void quantize_mmq_q8_1_cuda(
const float * x, void * vy, const int64_t kx0, const int64_t kx1, const int64_t channels,
const int64_t kx0_padded, const ggml_type type_x, cudaStream_t stream) {
GGML_ASSERT(kx0_padded % (4*QK8_1) == 0);
const int64_t block_num_x = (kx0_padded + CUDA_QUANTIZE_BLOCK_SIZE - 1) / CUDA_QUANTIZE_BLOCK_SIZE;
const dim3 num_blocks(block_num_x, kx1, channels);
const dim3 block_size(CUDA_QUANTIZE_BLOCK_SIZE, 1, 1);
if (mmq_need_sum(type_x)) {
quantize_mmq_q8_1<true><<<num_blocks, block_size, 0, stream>>>(x, vy, kx0, kx1, kx0_padded);
} else {
quantize_mmq_q8_1<false><<<num_blocks, block_size, 0, stream>>>(x, vy, kx0, kx1, kx0_padded);
}
}

View File

@ -1,5 +1,20 @@
#pragma once
#include "common.cuh" #include "common.cuh"
#include "mmq.cuh"
#include <cstdint>
#define CUDA_QUANTIZE_BLOCK_SIZE 256 #define CUDA_QUANTIZE_BLOCK_SIZE 256
void quantize_row_q8_1_cuda(const float * x, void * vy, const int64_t kx, const int64_t ky, const int64_t kx_padded, cudaStream_t stream); typedef void (*quantize_cuda_t)(
const float * x, void * vy, const int64_t kx0, const int64_t kx1, const int64_t channels, const int64_t kx0_padded,
const ggml_type type_x, cudaStream_t stream);
void quantize_row_q8_1_cuda(
const float * x, void * vy, const int64_t kx0, const int64_t kx1, const int64_t channels, const int64_t kx0_padded,
const ggml_type type_x, cudaStream_t stream);
void quantize_mmq_q8_1_cuda(
const float * x, void * vy, const int64_t kx0, const int64_t kx1, const int64_t channels, const int64_t kx0_padded,
const ggml_type type_x, cudaStream_t stream);

View File

@ -13089,10 +13089,12 @@ void *ggml_sycl_host_malloc(size_t size) try {
return nullptr; return nullptr;
} }
ggml_sycl_set_device(g_main_device);
dpct::queue_ptr main_stream = g_syclStreams[g_main_device][0];
void * ptr = nullptr; void * ptr = nullptr;
//allow to use dpct::get_in_order_queue() for host malloc
dpct::err0 err = CHECK_TRY_ERROR( dpct::err0 err = CHECK_TRY_ERROR(
ptr = (void *)sycl::malloc_host(size, dpct::get_in_order_queue())); ptr = (void *)sycl::malloc_host(size, *main_stream));
if (err != 0) { if (err != 0) {
// clear the error // clear the error
@ -13113,8 +13115,9 @@ catch (sycl::exception const &exc) {
} }
void ggml_sycl_host_free(void *ptr) try { void ggml_sycl_host_free(void *ptr) try {
//allow to use dpct::get_in_order_queue() for host malloc ggml_sycl_set_device(g_main_device);
SYCL_CHECK(CHECK_TRY_ERROR(sycl::free(ptr, dpct::get_in_order_queue()))); dpct::queue_ptr main_stream = g_syclStreams[g_main_device][0];
SYCL_CHECK(CHECK_TRY_ERROR(sycl::free(ptr, *main_stream)));
} }
catch (sycl::exception const &exc) { catch (sycl::exception const &exc) {
std::cerr << exc.what() << "Exception caught at file:" << __FILE__ std::cerr << exc.what() << "Exception caught at file:" << __FILE__

View File

@ -5,6 +5,7 @@ import os
import shutil import shutil
import struct import struct
import tempfile import tempfile
from dataclasses import dataclass
from enum import Enum, auto from enum import Enum, auto
from io import BufferedWriter from io import BufferedWriter
from typing import IO, Any, Sequence, Mapping from typing import IO, Any, Sequence, Mapping
@ -30,17 +31,36 @@ from .quants import quant_shape_from_byte_shape
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
@dataclass
class TensorInfo:
shape: Sequence[int]
dtype: GGMLQuantizationType
nbytes: int
tensor: np.ndarray[Any, Any] | None = None
@dataclass
class GGUFValue:
value: Any
type: GGUFValueType
class WriterState(Enum): class WriterState(Enum):
NO_FILE = auto()
EMPTY = auto() EMPTY = auto()
HEADER = auto() HEADER = auto()
KV_DATA = auto() KV_DATA = auto()
TI_DATA = auto() TI_DATA = auto()
WEIGHTS = auto()
class GGUFWriter: class GGUFWriter:
fout: BufferedWriter fout: BufferedWriter | None
path: os.PathLike[str] | str | None
temp_file: tempfile.SpooledTemporaryFile[bytes] | None temp_file: tempfile.SpooledTemporaryFile[bytes] | None
tensors: list[np.ndarray[Any, Any]] tensors: dict[str, TensorInfo]
kv_data: dict[str, GGUFValue]
state: WriterState
_simple_value_packing = { _simple_value_packing = {
GGUFValueType.UINT8: "B", GGUFValueType.UINT8: "B",
GGUFValueType.INT8: "b", GGUFValueType.INT8: "b",
@ -56,141 +76,140 @@ class GGUFWriter:
} }
def __init__( def __init__(
self, path: os.PathLike[str] | str, arch: str, use_temp_file: bool = True, self, path: os.PathLike[str] | str | None, arch: str, use_temp_file: bool = False,
endianess: GGUFEndian = GGUFEndian.LITTLE, endianess: GGUFEndian = GGUFEndian.LITTLE,
): ):
self.fout = open(path, "wb") self.fout = None
self.path = path
self.arch = arch self.arch = arch
self.endianess = endianess self.endianess = endianess
self.offset_tensor = 0
self.data_alignment = GGUF_DEFAULT_ALIGNMENT self.data_alignment = GGUF_DEFAULT_ALIGNMENT
self.kv_data = bytearray()
self.kv_data_count = 0
self.ti_data = bytearray()
self.ti_data_count = 0
self.ti_names = set()
self.use_temp_file = use_temp_file self.use_temp_file = use_temp_file
self.temp_file = None self.temp_file = None
self.tensors = [] self.tensors = dict()
self.kv_data = dict()
logger.info("gguf: This GGUF file is for {0} Endian only".format( logger.info("gguf: This GGUF file is for {0} Endian only".format(
"Big" if self.endianess == GGUFEndian.BIG else "Little", "Big" if self.endianess == GGUFEndian.BIG else "Little",
)) ))
self.state = WriterState.EMPTY self.state = WriterState.NO_FILE
self.add_architecture() self.add_architecture()
def write_header_to_file(self) -> None: def open_output_file(self, path: os.PathLike[str] | str | None = None) -> None:
if self.state is WriterState.EMPTY and self.fout is not None and (path is None or path == self.path):
# allow calling this multiple times as long as the path is the same
return
if self.state is not WriterState.NO_FILE:
raise ValueError(f'Expected output file to be not yet opened, got {self.state}')
if path is not None:
self.path = path
if self.path is not None:
if self.fout is not None:
self.fout.close()
self.fout = open(self.path, "wb")
self.state = WriterState.EMPTY
def write_header_to_file(self, path: os.PathLike[str] | str | None = None) -> None:
self.open_output_file(path)
if self.state is not WriterState.EMPTY: if self.state is not WriterState.EMPTY:
raise ValueError(f'Expected output file to be empty, got {self.state}') raise ValueError(f'Expected output file to be empty, got {self.state}')
self._write_packed("<I", GGUF_MAGIC, skip_pack_prefix = True) self._write_packed("<I", GGUF_MAGIC, skip_pack_prefix = True)
self._write_packed("I", GGUF_VERSION) self._write_packed("I", GGUF_VERSION)
self._write_packed("Q", self.ti_data_count) self._write_packed("Q", len(self.tensors))
self._write_packed("Q", self.kv_data_count) self._write_packed("Q", len(self.kv_data))
self.flush() self.flush()
self.state = WriterState.HEADER self.state = WriterState.HEADER
def write_kv_data_to_file(self) -> None: def write_kv_data_to_file(self) -> None:
if self.state is not WriterState.HEADER: if self.state is not WriterState.HEADER:
raise ValueError(f'Expected output file to contain the header, got {self.state}') raise ValueError(f'Expected output file to contain the header, got {self.state}')
assert self.fout is not None
self.fout.write(self.kv_data) kv_data = bytearray()
for key, val in self.kv_data.items():
kv_data += self._pack_val(key, GGUFValueType.STRING, add_vtype=False)
kv_data += self._pack_val(val.value, val.type, add_vtype=True)
self.fout.write(kv_data)
self.flush() self.flush()
self.state = WriterState.KV_DATA self.state = WriterState.KV_DATA
def write_ti_data_to_file(self) -> None: def write_ti_data_to_file(self) -> None:
if self.state is not WriterState.KV_DATA: if self.state is not WriterState.KV_DATA:
raise ValueError(f'Expected output file to contain KV data, got {self.state}') raise ValueError(f'Expected output file to contain KV data, got {self.state}')
assert self.fout is not None
self.fout.write(self.ti_data) ti_data = bytearray()
offset_tensor = 0
for name, ti in self.tensors.items():
ti_data += self._pack_val(name, GGUFValueType.STRING, add_vtype=False)
n_dims = len(ti.shape)
ti_data += self._pack("I", n_dims)
for i in range(n_dims):
ti_data += self._pack("Q", ti.shape[n_dims - 1 - i])
ti_data += self._pack("I", ti.dtype)
ti_data += self._pack("Q", offset_tensor)
offset_tensor += GGUFWriter.ggml_pad(ti.nbytes, self.data_alignment)
self.fout.write(ti_data)
self.flush() self.flush()
self.state = WriterState.TI_DATA self.state = WriterState.TI_DATA
def add_key(self, key: str) -> None: def add_key_value(self, key: str, val: Any, vtype: GGUFValueType) -> None:
self.add_val(key, GGUFValueType.STRING, add_vtype=False) if key in self.kv_data:
raise ValueError(f'Duplicated key name {key!r}')
self.kv_data[key] = GGUFValue(value=val, type=vtype)
def add_uint8(self, key: str, val: int) -> None: def add_uint8(self, key: str, val: int) -> None:
self.add_key(key) self.add_key_value(key,val, GGUFValueType.UINT8)
self.add_val(val, GGUFValueType.UINT8)
def add_int8(self, key: str, val: int) -> None: def add_int8(self, key: str, val: int) -> None:
self.add_key(key) self.add_key_value(key, val, GGUFValueType.INT8)
self.add_val(val, GGUFValueType.INT8)
def add_uint16(self, key: str, val: int) -> None: def add_uint16(self, key: str, val: int) -> None:
self.add_key(key) self.add_key_value(key, val, GGUFValueType.UINT16)
self.add_val(val, GGUFValueType.UINT16)
def add_int16(self, key: str, val: int) -> None: def add_int16(self, key: str, val: int) -> None:
self.add_key(key) self.add_key_value(key, val, GGUFValueType.INT16)
self.add_val(val, GGUFValueType.INT16)
def add_uint32(self, key: str, val: int) -> None: def add_uint32(self, key: str, val: int) -> None:
self.add_key(key) self.add_key_value(key, val, GGUFValueType.UINT32)
self.add_val(val, GGUFValueType.UINT32)
def add_int32(self, key: str, val: int) -> None: def add_int32(self, key: str, val: int) -> None:
self.add_key(key) self.add_key_value(key, val, GGUFValueType.INT32)
self.add_val(val, GGUFValueType.INT32)
def add_float32(self, key: str, val: float) -> None: def add_float32(self, key: str, val: float) -> None:
self.add_key(key) self.add_key_value(key, val, GGUFValueType.FLOAT32)
self.add_val(val, GGUFValueType.FLOAT32)
def add_uint64(self, key: str, val: int) -> None: def add_uint64(self, key: str, val: int) -> None:
self.add_key(key) self.add_key_value(key, val, GGUFValueType.UINT64)
self.add_val(val, GGUFValueType.UINT64)
def add_int64(self, key: str, val: int) -> None: def add_int64(self, key: str, val: int) -> None:
self.add_key(key) self.add_key_value(key, val, GGUFValueType.INT64)
self.add_val(val, GGUFValueType.INT64)
def add_float64(self, key: str, val: float) -> None: def add_float64(self, key: str, val: float) -> None:
self.add_key(key) self.add_key_value(key, val, GGUFValueType.FLOAT64)
self.add_val(val, GGUFValueType.FLOAT64)
def add_bool(self, key: str, val: bool) -> None: def add_bool(self, key: str, val: bool) -> None:
self.add_key(key) self.add_key_value(key, val, GGUFValueType.BOOL)
self.add_val(val, GGUFValueType.BOOL)
def add_string(self, key: str, val: str) -> None: def add_string(self, key: str, val: str) -> None:
if not val: if not val:
return return
self.add_key(key) self.add_key_value(key, val, GGUFValueType.STRING)
self.add_val(val, GGUFValueType.STRING)
def add_array(self, key: str, val: Sequence[Any]) -> None: def add_array(self, key: str, val: Sequence[Any]) -> None:
if not isinstance(val, Sequence): if not isinstance(val, Sequence):
raise ValueError("Value must be a sequence for array type") raise ValueError("Value must be a sequence for array type")
self.add_key(key) self.add_key_value(key, val, GGUFValueType.ARRAY)
self.add_val(val, GGUFValueType.ARRAY)
def add_val(self, val: Any, vtype: GGUFValueType | None = None, add_vtype: bool = True) -> None:
if vtype is None:
vtype = GGUFValueType.get_type(val)
if add_vtype:
self.kv_data += self._pack("I", vtype)
self.kv_data_count += 1
pack_fmt = self._simple_value_packing.get(vtype)
if pack_fmt is not None:
self.kv_data += self._pack(pack_fmt, val, skip_pack_prefix = vtype == GGUFValueType.BOOL)
elif vtype == GGUFValueType.STRING:
encoded_val = val.encode("utf-8") if isinstance(val, str) else val
self.kv_data += self._pack("Q", len(encoded_val))
self.kv_data += encoded_val
elif vtype == GGUFValueType.ARRAY and isinstance(val, Sequence) and val:
ltype = GGUFValueType.get_type(val[0])
if not all(GGUFValueType.get_type(i) is ltype for i in val[1:]):
raise ValueError("All items in a GGUF array should be of the same type")
self.kv_data += self._pack("I", ltype)
self.kv_data += self._pack("Q", len(val))
for item in val:
self.add_val(item, add_vtype=False)
else:
raise ValueError("Invalid GGUF metadata value type or value")
@staticmethod @staticmethod
def ggml_pad(x: int, n: int) -> int: def ggml_pad(x: int, n: int) -> int:
@ -200,16 +219,12 @@ class GGUFWriter:
self, name: str, tensor_shape: Sequence[int], tensor_dtype: np.dtype, self, name: str, tensor_shape: Sequence[int], tensor_dtype: np.dtype,
tensor_nbytes: int, raw_dtype: GGMLQuantizationType | None = None, tensor_nbytes: int, raw_dtype: GGMLQuantizationType | None = None,
) -> None: ) -> None:
if self.state is not WriterState.EMPTY: if self.state is not WriterState.NO_FILE:
raise ValueError(f'Expected output file to be empty, got {self.state}') raise ValueError(f'Expected output file to be not yet opened, got {self.state}')
if name in self.ti_names: if name in self.tensors:
raise ValueError(f'Duplicated tensor name {name}') raise ValueError(f'Duplicated tensor name {name!r}')
self.ti_names.add(name)
encoded_name = name.encode("utf-8")
self.ti_data += self._pack("Q", len(encoded_name))
self.ti_data += encoded_name
if raw_dtype is None: if raw_dtype is None:
if tensor_dtype == np.float16: if tensor_dtype == np.float16:
dtype = GGMLQuantizationType.F16 dtype = GGMLQuantizationType.F16
@ -231,14 +246,8 @@ class GGUFWriter:
dtype = raw_dtype dtype = raw_dtype
if tensor_dtype == np.uint8: if tensor_dtype == np.uint8:
tensor_shape = quant_shape_from_byte_shape(tensor_shape, raw_dtype) tensor_shape = quant_shape_from_byte_shape(tensor_shape, raw_dtype)
n_dims = len(tensor_shape)
self.ti_data += self._pack("I", n_dims) self.tensors[name] = TensorInfo(shape=tensor_shape, dtype=dtype, nbytes=tensor_nbytes)
for i in range(n_dims):
self.ti_data += self._pack("Q", tensor_shape[n_dims - 1 - i])
self.ti_data += self._pack("I", dtype)
self.ti_data += self._pack("Q", self.offset_tensor)
self.offset_tensor += GGUFWriter.ggml_pad(tensor_nbytes, self.data_alignment)
self.ti_data_count += 1
def add_tensor( def add_tensor(
self, name: str, tensor: np.ndarray[Any, Any], raw_shape: Sequence[int] | None = None, self, name: str, tensor: np.ndarray[Any, Any], raw_shape: Sequence[int] | None = None,
@ -252,10 +261,10 @@ class GGUFWriter:
self.temp_file = fp self.temp_file = fp
shape: Sequence[int] = raw_shape if raw_shape is not None else tensor.shape shape: Sequence[int] = raw_shape if raw_shape is not None else tensor.shape
self.add_tensor_info(name, shape, tensor.dtype, tensor.nbytes, raw_dtype = raw_dtype) self.add_tensor_info(name, shape, tensor.dtype, tensor.nbytes, raw_dtype=raw_dtype)
if self.temp_file is None: if self.temp_file is None:
self.tensors.append(tensor) self.tensors[name].tensor = tensor
return return
tensor.tofile(self.temp_file) tensor.tofile(self.temp_file)
@ -267,8 +276,9 @@ class GGUFWriter:
fp.write(bytes([0] * pad)) fp.write(bytes([0] * pad))
def write_tensor_data(self, tensor: np.ndarray[Any, Any]) -> None: def write_tensor_data(self, tensor: np.ndarray[Any, Any]) -> None:
if self.state is not WriterState.TI_DATA: if self.state is not WriterState.TI_DATA and self.state is not WriterState.WEIGHTS:
raise ValueError(f'Expected output file to contain tensor info, got {self.state}') raise ValueError(f'Expected output file to contain tensor info or weights, got {self.state}')
assert self.fout is not None
if self.endianess == GGUFEndian.BIG: if self.endianess == GGUFEndian.BIG:
tensor.byteswap(inplace=True) tensor.byteswap(inplace=True)
@ -276,50 +286,51 @@ class GGUFWriter:
tensor.tofile(self.fout) tensor.tofile(self.fout)
self.write_padding(self.fout, tensor.nbytes) self.write_padding(self.fout, tensor.nbytes)
self.state = WriterState.WEIGHTS
def write_tensors_to_file(self, *, progress: bool = False) -> None: def write_tensors_to_file(self, *, progress: bool = False) -> None:
self.write_ti_data_to_file() self.write_ti_data_to_file()
assert self.fout is not None
self.write_padding(self.fout, self.fout.tell()) self.write_padding(self.fout, self.fout.tell())
if self.temp_file is None: if self.temp_file is None:
self.tensors.reverse() # to pop from the "beginning" in constant time bar = None
if progress: if progress:
from tqdm import tqdm from tqdm import tqdm
total_bytes = sum(t.nbytes for t in self.tensors) total_bytes = sum(t.nbytes for t in self.tensors.values())
bar = tqdm(desc="Writing", total=total_bytes, unit="byte", unit_scale=True) bar = tqdm(desc="Writing", total=total_bytes, unit="byte", unit_scale=True)
while True: # relying on the fact that Python dicts preserve insertion order (since 3.7)
try: for ti in self.tensors.values():
tensor = self.tensors.pop() assert ti.tensor is not None # can only iterate once over the tensors
except IndexError: assert ti.tensor.nbytes == ti.nbytes
break ti.tensor.tofile(self.fout)
tensor.tofile(self.fout) if bar is not None:
bar.update(tensor.nbytes) bar.update(ti.nbytes)
self.write_padding(self.fout, tensor.nbytes) self.write_padding(self.fout, ti.nbytes)
return ti.tensor = None
while True: else:
try: self.temp_file.seek(0)
tensor = self.tensors.pop()
except IndexError:
break
tensor.tofile(self.fout)
self.write_padding(self.fout, tensor.nbytes)
return
self.temp_file.seek(0) shutil.copyfileobj(self.temp_file, self.fout)
self.flush()
self.temp_file.close()
shutil.copyfileobj(self.temp_file, self.fout) self.state = WriterState.WEIGHTS
self.flush()
self.temp_file.close()
def flush(self) -> None: def flush(self) -> None:
assert self.fout is not None
self.fout.flush() self.fout.flush()
def close(self) -> None: def close(self) -> None:
self.fout.close() if self.fout is not None:
self.fout.close()
self.fout = None
def add_architecture(self) -> None: def add_architecture(self) -> None:
self.add_string(Keys.General.ARCHITECTURE, self.arch) self.add_string(Keys.General.ARCHITECTURE, self.arch)
@ -449,7 +460,7 @@ class GGUFWriter:
def add_rope_scaling_factor(self, value: float) -> None: def add_rope_scaling_factor(self, value: float) -> None:
self.add_float32(Keys.Rope.SCALING_FACTOR.format(arch=self.arch), value) self.add_float32(Keys.Rope.SCALING_FACTOR.format(arch=self.arch), value)
def add_rope_scaling_attn_factors(self, value: Sequence[float]) -> None: def add_rope_scaling_attn_factors(self, value: float) -> None:
self.add_float32(Keys.Rope.SCALING_ATTN_FACTOR.format(arch=self.arch), value) self.add_float32(Keys.Rope.SCALING_ATTN_FACTOR.format(arch=self.arch), value)
def add_rope_scaling_orig_ctx_len(self, value: int) -> None: def add_rope_scaling_orig_ctx_len(self, value: int) -> None:
@ -571,5 +582,32 @@ class GGUFWriter:
pack_prefix = '<' if self.endianess == GGUFEndian.LITTLE else '>' pack_prefix = '<' if self.endianess == GGUFEndian.LITTLE else '>'
return struct.pack(f'{pack_prefix}{fmt}', value) return struct.pack(f'{pack_prefix}{fmt}', value)
def _pack_val(self, val: Any, vtype: GGUFValueType, add_vtype: bool) -> bytes:
kv_data = bytearray()
if add_vtype:
kv_data += self._pack("I", vtype)
pack_fmt = self._simple_value_packing.get(vtype)
if pack_fmt is not None:
kv_data += self._pack(pack_fmt, val, skip_pack_prefix = vtype == GGUFValueType.BOOL)
elif vtype == GGUFValueType.STRING:
encoded_val = val.encode("utf-8") if isinstance(val, str) else val
kv_data += self._pack("Q", len(encoded_val))
kv_data += encoded_val
elif vtype == GGUFValueType.ARRAY and isinstance(val, Sequence) and val:
ltype = GGUFValueType.get_type(val[0])
if not all(GGUFValueType.get_type(i) is ltype for i in val[1:]):
raise ValueError("All items in a GGUF array should be of the same type")
kv_data += self._pack("I", ltype)
kv_data += self._pack("Q", len(val))
for item in val:
kv_data += self._pack_val(item, ltype, add_vtype=False)
else:
raise ValueError("Invalid GGUF metadata value type or value")
return kv_data
def _write_packed(self, fmt: str, value: Any, skip_pack_prefix: bool = False) -> None: def _write_packed(self, fmt: str, value: Any, skip_pack_prefix: bool = False) -> None:
assert self.fout is not None
self.fout.write(self._pack(fmt, value, skip_pack_prefix)) self.fout.write(self._pack(fmt, value, skip_pack_prefix))

View File

@ -101,8 +101,7 @@ def copy_with_new_metadata(reader: gguf.GGUFReader, writer: gguf.GGUFWriter, new
logger.debug(f'Copying {field.name}') logger.debug(f'Copying {field.name}')
if val.value is not None: if val.value is not None:
writer.add_key(field.name) writer.add_key_value(field.name, val.value, val.type)
writer.add_val(val.value, val.type)
if gguf.Keys.Tokenizer.CHAT_TEMPLATE in new_metadata: if gguf.Keys.Tokenizer.CHAT_TEMPLATE in new_metadata:
logger.debug('Adding chat template(s)') logger.debug('Adding chat template(s)')
@ -111,8 +110,7 @@ def copy_with_new_metadata(reader: gguf.GGUFReader, writer: gguf.GGUFWriter, new
for key, val in new_metadata.items(): for key, val in new_metadata.items():
logger.debug(f'Adding {key}: "{val.value}" {val.description}') logger.debug(f'Adding {key}: "{val.value}" {val.description}')
writer.add_key(key) writer.add_key_value(key, val.value, val.type)
writer.add_val(val.value, val.type)
total_bytes = 0 total_bytes = 0