Commit `66cffa8aff`: resolve merge conflicts
**CONTRIBUTING.md** (102 changed lines)
````diff
@@ -1,10 +1,10 @@
 # Pull requests (for contributors)
 
 - Test your changes:
     - Execute [the full CI locally on your machine](ci/README.md) before publishing
     - Verify that the perplexity and the performance are not affected negatively by your changes (use `llama-perplexity` and `llama-bench`)
     - If you modified the `ggml` source, run the `test-backend-ops` tool to check whether different backend implementations of the `ggml` operators produce consistent results (this requires access to at least two different `ggml` backends)
     - If you modified a `ggml` operator or added a new one, add the corresponding test cases to `test-backend-ops`
 - Consider allowing write access to your branch for faster reviews, as reviewers can push commits directly
 - If your PR becomes stale, don't hesitate to ping the maintainers in the comments
 
@@ -20,14 +20,104 @@
 - Avoid adding third-party dependencies, extra files, extra headers, etc.
 - Always consider cross-compatibility with other operating systems and architectures
 - Avoid fancy-looking modern STL constructs, use basic `for` loops, avoid templates, keep it simple
-- There are no strict rules for the code style, but try to follow the patterns in the code (indentation, spaces, etc.). Vertical alignment makes things more readable and easier to batch edit
+- Vertical alignment makes things more readable and easier to batch edit
 - Clean-up any trailing whitespaces, use 4 spaces for indentation, brackets on the same line, `void * ptr`, `int & a`
-- Naming usually optimizes for common prefix (see https://github.com/ggerganov/ggml/pull/302#discussion_r1243240963)
+- Use sized integer types such as `int32_t` in the public API, e.g. `size_t` may also be appropriate for allocation sizes or byte offsets
+- Declare structs with `struct foo {}` instead of `typedef struct foo {} foo`
+- In C++ code omit optional `struct` and `enum` keyword whenever they are not necessary
+    ```cpp
+    // OK
+    llama_context * ctx;
+    const llama_rope_type rope_type;
+
+    // not OK
+    struct llama_context * ctx;
+    const enum llama_rope_type rope_type;
+    ```
+
+    _(NOTE: this guideline is yet to be applied to the `llama.cpp` codebase. New code should follow this guideline.)_
+
+- Try to follow the existing patterns in the code (indentation, spaces, etc.). In case of doubt use `clang-format` to format the added code
+- For anything not covered in the current guidelines, refer to the [C++ Core Guidelines](https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines)
 - Tensors store data in row-major order. We refer to dimension 0 as columns, 1 as rows, 2 as matrices
 - Matrix multiplication is unconventional: [`C = ggml_mul_mat(ctx, A, B)`](https://github.com/ggerganov/llama.cpp/blob/880e352277fc017df4d5794f0c21c44e1eae2b84/ggml.h#L1058-L1064) means $C^T = A B^T \Leftrightarrow C = B A^T.$
 
 ![matmul](media/matmul.png)
 
+# Naming guidelines
+
+- Use `snake_case` for function, variable and type names
+- Naming usually optimizes for longest common prefix (see https://github.com/ggerganov/ggml/pull/302#discussion_r1243240963)
+
+    ```cpp
+    // not OK
+    int small_number;
+    int big_number;
+
+    // OK
+    int number_small;
+    int number_big;
+    ```
+
+- Enum values are always in upper case and prefixed with the enum name
+
+    ```cpp
+    enum llama_vocab_type {
+        LLAMA_VOCAB_TYPE_NONE = 0,
+        LLAMA_VOCAB_TYPE_SPM  = 1,
+        LLAMA_VOCAB_TYPE_BPE  = 2,
+        LLAMA_VOCAB_TYPE_WPM  = 3,
+        LLAMA_VOCAB_TYPE_UGM  = 4,
+        LLAMA_VOCAB_TYPE_RWKV = 5,
+    };
+    ```
+
+- The general naming pattern is `<class>_<method>`, with `<method>` being `<action>_<noun>`
+
+    ```cpp
+    llama_model_init();           // class: "llama_model",         method: "init"
+    llama_sampler_chain_remove(); // class: "llama_sampler_chain", method: "remove"
+    llama_sampler_get_seed();     // class: "llama_sampler",       method: "get_seed"
+    llama_set_embeddings();       // class: "llama_context",       method: "set_embeddings"
+    llama_n_threads();            // class: "llama_context",       method: "n_threads"
+    llama_adapter_lora_free();    // class: "llama_adapter_lora",  method: "free"
+    ```
+
+- The `get` `<action>` can be omitted
+- The `<noun>` can be omitted if not necessary
+- The `_context` suffix of the `<class>` is optional. Use it to disambiguate symbols when needed
+- Use `init`/`free` for constructor/destructor `<action>`
+
+- Use the `_t` suffix when a type is supposed to be opaque to the user - it's not relevant to them if it is a struct or anything else
+
+    ```cpp
+    typedef struct llama_context * llama_context_t;
+
+    enum llama_pooling_type llama_pooling_type(const llama_context_t ctx);
+    ```
+
+    _(NOTE: this guideline is yet to be applied to the `llama.cpp` codebase. New code should follow this guideline)_
+
+- C/C++ filenames are all lowercase with dashes. Headers use the `.h` extension. Source files use the `.c` or `.cpp` extension
+- Python filenames are all lowercase with underscores
+
+- _(TODO: abbreviations usage)_
+
+# Preprocessor directives
+
+- _(TODO: add guidelines with examples and apply them to the codebase)_
+
+    ```cpp
+    #ifdef FOO
+    #endif // FOO
+    ```
+
+# Documentation
+
+- Documentation is a community effort
+- When you need to look into the source code to figure out how to use an API consider adding a short summary to the header file for future reference
+- When you notice incorrect or outdated documentation, please update it
+
 # Resources
 
 The Github issues, PRs and discussions contain a lot of information that can be useful to get familiar with the codebase. For convenience, some of the more important information is referenced from Github projects:
````
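As a side note on the matrix-multiplication guideline above, here is a minimal sketch (not part of this commit) of how the row-major convention looks in code; it assumes a valid `ggml_context * ctx` and integer dimensions `m`, `n`, `k` are in scope.

```cpp
// Illustrative sketch of the ggml_mul_mat shape convention described above.
// ggml tensors are row-major and ne[0] is the number of columns.
struct ggml_tensor * A = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, k, m); // m rows, k columns
struct ggml_tensor * B = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, k, n); // n rows, k columns

// ggml_mul_mat contracts over the shared first dimension (k), yielding a tensor
// with ne[0] == m and ne[1] == n, i.e. C^T = A B^T in conventional notation.
struct ggml_tensor * C = ggml_mul_mat(ctx, A, B);
```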
**README.md** (40 changed lines)
````diff
@@ -69,6 +69,7 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
 - [x] [Qwen models](https://huggingface.co/models?search=Qwen/Qwen)
 - [x] [PLaMo-13B](https://github.com/ggerganov/llama.cpp/pull/3557)
 - [x] [Phi models](https://huggingface.co/models?search=microsoft/phi)
+- [x] [PhiMoE](https://github.com/ggerganov/llama.cpp/pull/11003)
 - [x] [GPT-2](https://huggingface.co/gpt2)
 - [x] [Orion 14B](https://github.com/ggerganov/llama.cpp/pull/5118)
 - [x] [InternLM2](https://huggingface.co/models?search=internlm2)
@@ -98,6 +99,7 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
 - [x] [Jais](https://huggingface.co/inceptionai/jais-13b-chat)
 - [x] [Bielik-11B-v2.3](https://huggingface.co/collections/speakleash/bielik-11b-v23-66ee813238d9b526a072408a)
 - [x] [RWKV-6](https://github.com/BlinkDL/RWKV-LM)
+- [x] [QRWKV-6](https://huggingface.co/recursal/QRWKV6-32B-Instruct-Preview-v0.1)
 - [x] [GigaChat-20B-A3B](https://huggingface.co/ai-sage/GigaChat-20B-A3B-instruct)
 
 #### Multimodal
@@ -243,6 +245,8 @@ The [Hugging Face](https://huggingface.co) platform hosts a [number of LLMs](htt
 - [Trending](https://huggingface.co/models?library=gguf&sort=trending)
 - [LLaMA](https://huggingface.co/models?sort=trending&search=llama+gguf)
 
+You can either manually download the GGUF file or directly use any `llama.cpp`-compatible models from Hugging Face by using this CLI argument: `-hf <user>/<model>[:quant]`
+
 After downloading a model, use the CLI tools to run it locally - see below.
 
 `llama.cpp` requires the model to be stored in the [GGUF](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) file format. Models in other data formats can be converted to GGUF using the `convert_*.py` Python scripts in this repo.
@@ -261,21 +265,12 @@ To learn more about model quantization, [read this documentation](examples/quant
 #### A CLI tool for accessing and experimenting with most of `llama.cpp`'s functionality.
 
 - <details open>
-    <summary>Run simple text completion</summary>
-
-    ```bash
-    llama-cli -m model.gguf -p "I believe the meaning of life is" -n 128
-
-    # I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga – it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.
-    ```
-
-    </details>
-
-- <details>
     <summary>Run in conversation mode</summary>
 
+    Models with a built-in chat template will automatically activate conversation mode. If this doesn't occur, you can manually enable it by adding `-cnv` and specifying a suitable chat template with `--chat-template NAME`
+
     ```bash
-    llama-cli -m model.gguf -p "You are a helpful assistant" -cnv
+    llama-cli -m model.gguf
 
     # > hi, who are you?
     # Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
@@ -287,17 +282,28 @@ To learn more about model quantization, [read this documentation](examples/quant
     </details>
 
 - <details>
-    <summary>Run with custom chat template</summary>
+    <summary>Run in conversation mode with custom chat template</summary>
 
     ```bash
-    # use the "chatml" template
-    llama-cli -m model.gguf -p "You are a helpful assistant" -cnv --chat-template chatml
+    # use the "chatml" template (use -h to see the list of supported templates)
+    llama-cli -m model.gguf -cnv --chat-template chatml
 
     # use a custom template
-    llama-cli -m model.gguf -p "You are a helpful assistant" -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
+    llama-cli -m model.gguf -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
     ```
 
-    [Supported templates](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template)
+    </details>
 
+- <details>
+    <summary>Run simple text completion</summary>
+
+    To disable conversation mode explicitly, use `-no-cnv`
+
+    ```bash
+    llama-cli -m model.gguf -p "I believe the meaning of life is" -n 128 -no-cnv
+
+    # I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga – it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.
+    ```
+
     </details>
 
````
**common/arg.cpp**

```diff
@@ -130,17 +130,26 @@ std::string common_arg::to_string() {
 
 static void common_params_handle_model_default(
         std::string & model,
-        std::string & model_url,
+        const std::string & model_url,
         std::string & hf_repo,
-        std::string & hf_file) {
+        std::string & hf_file,
+        const std::string & hf_token) {
     if (!hf_repo.empty()) {
         // short-hand to avoid specifying --hf-file -> default it to --model
         if (hf_file.empty()) {
             if (model.empty()) {
-                throw std::invalid_argument("error: --hf-repo requires either --hf-file or --model\n");
+                auto auto_detected = common_get_hf_file(hf_repo, hf_token);
+                if (auto_detected.first.empty() || auto_detected.second.empty()) {
+                    exit(1); // built without CURL, error message already printed
+                }
+                hf_repo = auto_detected.first;
+                hf_file = auto_detected.second;
+            } else {
+                hf_file = model;
             }
-            hf_file = model;
-        } else if (model.empty()) {
+        }
+        // make sure model path is present (for caching purposes)
+        if (model.empty()) {
             // this is to avoid different repo having same file name, or same file name in different subdirs
             std::string filename = hf_repo + "_" + hf_file;
             // to make sure we don't have any slashes in the filename
@@ -290,8 +299,8 @@ static bool common_params_parse_ex(int argc, char ** argv, common_params_context
     }
 
     // TODO: refactor model params in a common struct
-    common_params_handle_model_default(params.model, params.model_url, params.hf_repo, params.hf_file);
-    common_params_handle_model_default(params.vocoder.model, params.vocoder.model_url, params.vocoder.hf_repo, params.vocoder.hf_file);
+    common_params_handle_model_default(params.model, params.model_url, params.hf_repo, params.hf_file, params.hf_token);
+    common_params_handle_model_default(params.vocoder.model, params.vocoder.model_url, params.vocoder.hf_repo, params.vocoder.hf_file, params.hf_token);
 
     if (params.escape) {
         string_process_escapes(params.prompt);
@@ -768,15 +777,19 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
     ).set_examples({LLAMA_EXAMPLE_MAIN, LLAMA_EXAMPLE_SERVER}));
     add_opt(common_arg(
         {"-cnv", "--conversation"},
-        string_format(
-            "run in conversation mode:\n"
-            "- does not print special tokens and suffix/prefix\n"
-            "- interactive mode is also enabled\n"
-            "(default: %s)",
-            params.conversation ? "true" : "false"
-        ),
+        "run in conversation mode:\n"
+        "- does not print special tokens and suffix/prefix\n"
+        "- interactive mode is also enabled\n"
+        "(default: auto enabled if chat template is available)",
         [](common_params & params) {
-            params.conversation = true;
+            params.conversation_mode = COMMON_CONVERSATION_MODE_ENABLED;
+        }
+    ).set_examples({LLAMA_EXAMPLE_MAIN}));
+    add_opt(common_arg(
+        {"-no-cnv", "--no-conversation"},
+        "force disable conversation mode (default: false)",
+        [](common_params & params) {
+            params.conversation_mode = COMMON_CONVERSATION_MODE_DISABLED;
         }
     ).set_examples({LLAMA_EXAMPLE_MAIN}));
     add_opt(common_arg(
@@ -1590,21 +1603,23 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
         }
     ).set_env("LLAMA_ARG_MODEL_URL"));
     add_opt(common_arg(
-        {"-hfr", "--hf-repo"}, "REPO",
-        "Hugging Face model repository (default: unused)",
+        {"-hf", "-hfr", "--hf-repo"}, "<user>/<model>[:quant]",
+        "Hugging Face model repository; quant is optional, case-insensitive, default to Q4_K_M, or falls back to the first file in the repo if Q4_K_M doesn't exist.\n"
+        "example: unsloth/phi-4-GGUF:q4_k_m\n"
+        "(default: unused)",
         [](common_params & params, const std::string & value) {
             params.hf_repo = value;
         }
     ).set_env("LLAMA_ARG_HF_REPO"));
     add_opt(common_arg(
         {"-hff", "--hf-file"}, "FILE",
-        "Hugging Face model file (default: unused)",
+        "Hugging Face model file. If specified, it will override the quant in --hf-repo (default: unused)",
         [](common_params & params, const std::string & value) {
             params.hf_file = value;
         }
     ).set_env("LLAMA_ARG_HF_FILE"));
     add_opt(common_arg(
-        {"-hfrv", "--hf-repo-v"}, "REPO",
+        {"-hfv", "-hfrv", "--hf-repo-v"}, "<user>/<model>[:quant]",
         "Hugging Face model repository for the vocoder model (default: unused)",
         [](common_params & params, const std::string & value) {
             params.vocoder.hf_repo = value;
```
**common/common.cpp**

```diff
@@ -73,6 +73,22 @@
 #include <sys/syslimits.h>
 #endif
 #define LLAMA_CURL_MAX_URL_LENGTH 2084 // Maximum URL Length in Chrome: 2083
+
+//
+// CURL utils
+//
+
+using curl_ptr = std::unique_ptr<CURL, decltype(&curl_easy_cleanup)>;
+
+// cannot use unique_ptr for curl_slist, because we cannot update without destroying the old one
+struct curl_slist_ptr {
+    struct curl_slist * ptr = nullptr;
+    ~curl_slist_ptr() {
+        if (ptr) {
+            curl_slist_free_all(ptr);
+        }
+    }
+};
 #endif // LLAMA_USE_CURL
 
 using json = nlohmann::ordered_json;
```
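The `curl_ptr`/`curl_slist_ptr` wrappers added above are used later in this commit (see the `common_download_file` and `common_get_hf_file` hunks below). A minimal usage sketch, assuming a build with libcurl available; the URL is a placeholder:

```cpp
// Sketch only: RAII-style libcurl handles as introduced in this commit.
// The CURL easy handle is released by unique_ptr's deleter; the header list
// is freed by curl_slist_ptr's destructor, even on early return.
curl_ptr       curl(curl_easy_init(), &curl_easy_cleanup);
curl_slist_ptr http_headers;

http_headers.ptr = curl_slist_append(http_headers.ptr, "Accept: application/json");
curl_easy_setopt(curl.get(), CURLOPT_URL, "https://example.com"); // placeholder URL
curl_easy_setopt(curl.get(), CURLOPT_HTTPHEADER, http_headers.ptr);

CURLcode res = curl_easy_perform(curl.get());
```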
```diff
@@ -857,21 +873,23 @@ struct common_init_result common_init_from_params(common_params & params) {
         return iparams;
     }
 
+    const llama_vocab * vocab = llama_model_get_vocab(model);
+
     if (params.reranking) {
         bool ok = true;
 
-        if (llama_token_bos(model) == LLAMA_TOKEN_NULL) {
-            LOG_WRN("%s: warning: model does not have a BOS token, reranking will not work\n", __func__);
+        if (llama_vocab_bos(vocab) == LLAMA_TOKEN_NULL) {
+            LOG_WRN("%s: warning: vocab does not have a BOS token, reranking will not work\n", __func__);
             ok = false;
         }
 
-        if (llama_token_eos(model) == LLAMA_TOKEN_NULL) {
-            LOG_WRN("%s: warning: model does not have an EOS token, reranking will not work\n", __func__);
+        if (llama_vocab_eos(vocab) == LLAMA_TOKEN_NULL) {
+            LOG_WRN("%s: warning: vocab does not have an EOS token, reranking will not work\n", __func__);
             ok = false;
         }
 
-        if (llama_token_sep(model) == LLAMA_TOKEN_NULL) {
-            LOG_WRN("%s: warning: model does not have a SEP token, reranking will not work\n", __func__);
+        if (llama_vocab_sep(vocab) == LLAMA_TOKEN_NULL) {
+            LOG_WRN("%s: warning: vocab does not have a SEP token, reranking will not work\n", __func__);
             ok = false;
         }
 
@@ -884,7 +902,7 @@ struct common_init_result common_init_from_params(common_params & params) {
 
     auto cparams = common_context_params_to_llama(params);
 
-    llama_context * lctx = llama_new_context_with_model(model, cparams);
+    llama_context * lctx = llama_init_from_model(model, cparams);
     if (lctx == NULL) {
         LOG_ERR("%s: failed to create context with model '%s'\n", __func__, params.model.c_str());
         llama_model_free(model);
@@ -898,7 +916,7 @@ struct common_init_result common_init_from_params(common_params & params) {
 
     if (!params.control_vectors.empty()) {
         if (params.control_vector_layer_start <= 0) params.control_vector_layer_start = 1;
-        if (params.control_vector_layer_end <= 0) params.control_vector_layer_end = llama_n_layer(model);
+        if (params.control_vector_layer_end <= 0) params.control_vector_layer_end = llama_model_n_layer(model);
 
         const auto cvec = common_control_vector_load(params.control_vectors);
         if (cvec.n_embd == -1) {
@@ -908,12 +926,13 @@ struct common_init_result common_init_from_params(common_params & params) {
             return iparams;
         }
 
-        int err = llama_control_vector_apply(lctx,
-                cvec.data.data(),
-                cvec.data.size(),
-                cvec.n_embd,
-                params.control_vector_layer_start,
-                params.control_vector_layer_end);
+        int err = llama_apply_adapter_cvec(
+                lctx,
+                cvec.data.data(),
+                cvec.data.size(),
+                cvec.n_embd,
+                params.control_vector_layer_start,
+                params.control_vector_layer_end);
         if (err) {
             llama_free(lctx);
             llama_model_free(model);
@@ -924,8 +943,8 @@ struct common_init_result common_init_from_params(common_params & params) {
 
     // load and optionally apply lora adapters
     for (auto & la : params.lora_adapters) {
-        llama_lora_adapter_ptr lora;
-        lora.reset(llama_lora_adapter_init(model, la.path.c_str()));
+        llama_adapter_lora_ptr lora;
+        lora.reset(llama_adapter_lora_init(model, la.path.c_str()));
         if (lora == nullptr) {
             LOG_ERR("%s: failed to apply lora adapter '%s'\n", __func__, la.path.c_str());
             llama_free(lctx);
@@ -938,17 +957,17 @@ struct common_init_result common_init_from_params(common_params & params) {
     }
 
     if (!params.lora_init_without_apply) {
-        common_lora_adapters_apply(lctx, params.lora_adapters);
+        common_set_adapter_lora(lctx, params.lora_adapters);
     }
 
-    if (params.sampling.ignore_eos && llama_token_eos(model) == LLAMA_TOKEN_NULL) {
-        LOG_WRN("%s: warning: model does not have an EOS token, ignoring --ignore-eos\n", __func__);
+    if (params.sampling.ignore_eos && llama_vocab_eos(vocab) == LLAMA_TOKEN_NULL) {
+        LOG_WRN("%s: warning: vocab does not have an EOS token, ignoring --ignore-eos\n", __func__);
         params.sampling.ignore_eos = false;
     }
 
     if (params.sampling.ignore_eos) {
-        for (llama_token i = 0; i < llama_n_vocab(model); i++) {
-            if (llama_token_is_eog(model, i)) {
+        for (llama_token i = 0; i < llama_vocab_n_tokens(vocab); i++) {
+            if (llama_vocab_is_eog(vocab, i)) {
                 LOG_INF("%s: added %s logit bias = %f\n", __func__, common_token_to_piece(lctx, i).c_str(), -INFINITY);
                 params.sampling.logit_bias.push_back({i, -INFINITY});
             }
@@ -969,8 +988,9 @@ struct common_init_result common_init_from_params(common_params & params) {
         LOG_WRN("%s: warming up the model with an empty run - please wait ... (--no-warmup to disable)\n", __func__);
 
         std::vector<llama_token> tmp;
-        llama_token bos = llama_token_bos(model);
-        llama_token eos = llama_token_eos(model);
+        llama_token bos = llama_vocab_bos(vocab);
+        llama_token eos = llama_vocab_eos(vocab);
+
         // some models (e.g. T5) don't have a BOS token
         if (bos != LLAMA_TOKEN_NULL) {
             tmp.push_back(bos);
@@ -1005,11 +1025,11 @@ struct common_init_result common_init_from_params(common_params & params) {
     return iparams;
 }
 
-void common_lora_adapters_apply(struct llama_context * ctx, std::vector<common_lora_adapter_info> & lora) {
-    llama_lora_adapter_clear(ctx);
+void common_set_adapter_lora(struct llama_context * ctx, std::vector<common_adapter_lora_info> & lora) {
+    llama_clear_adapter_lora(ctx);
     for (auto & la : lora) {
         if (la.scale != 0.0f) {
-            llama_lora_adapter_set(ctx, la.ptr, la.scale);
+            llama_set_adapter_lora(ctx, la.ptr, la.scale);
         }
     }
 }
@@ -1126,7 +1146,8 @@ static bool curl_perform_with_retry(const std::string & url, CURL * curl, int ma
 
 static bool common_download_file(const std::string & url, const std::string & path, const std::string & hf_token) {
     // Initialize libcurl
-    std::unique_ptr<CURL, decltype(&curl_easy_cleanup)> curl(curl_easy_init(), &curl_easy_cleanup);
+    curl_ptr       curl(curl_easy_init(), &curl_easy_cleanup);
+    curl_slist_ptr http_headers;
     if (!curl) {
         LOG_ERR("%s: error initializing libcurl\n", __func__);
         return false;
@@ -1140,11 +1161,9 @@ static bool common_download_file(const std::string & url, const std::string & pa
 
     // Check if hf-token or bearer-token was specified
     if (!hf_token.empty()) {
-        std::string auth_header = "Authorization: Bearer ";
-        auth_header += hf_token.c_str();
-        struct curl_slist *http_headers = NULL;
-        http_headers = curl_slist_append(http_headers, auth_header.c_str());
-        curl_easy_setopt(curl.get(), CURLOPT_HTTPHEADER, http_headers);
+        std::string auth_header = "Authorization: Bearer " + hf_token;
+        http_headers.ptr = curl_slist_append(http_headers.ptr, auth_header.c_str());
+        curl_easy_setopt(curl.get(), CURLOPT_HTTPHEADER, http_headers.ptr);
     }
 
 #if defined(_WIN32)
@@ -1440,6 +1459,80 @@ struct llama_model * common_load_model_from_hf(
     return common_load_model_from_url(model_url, local_path, hf_token, params);
 }
 
+/**
+ * Allow getting the HF file from the HF repo with tag (like ollama), for example:
+ * - bartowski/Llama-3.2-3B-Instruct-GGUF:q4
+ * - bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M
+ * - bartowski/Llama-3.2-3B-Instruct-GGUF:q5_k_s
+ * Tag is optional, default to "latest" (meaning it checks for Q4_K_M first, then Q4, then if not found, return the first GGUF file in repo)
+ *
+ * Return pair of <repo, file> (with "repo" already having tag removed)
+ *
+ * Note: we use the Ollama-compatible HF API, but not using the blobId. Instead, we use the special "ggufFile" field which returns the value for "hf_file". This is done to be backward-compatible with existing cache files.
+ */
+std::pair<std::string, std::string> common_get_hf_file(const std::string & hf_repo_with_tag, const std::string & hf_token) {
+    auto parts = string_split<std::string>(hf_repo_with_tag, ':');
+    std::string tag = parts.size() > 1 ? parts.back() : "latest";
+    std::string hf_repo = parts[0];
+    if (string_split<std::string>(hf_repo, '/').size() != 2) {
+        throw std::invalid_argument("error: invalid HF repo format, expected <user>/<model>[:quant]\n");
+    }
+
+    // fetch model info from Hugging Face Hub API
+    json model_info;
+    curl_ptr       curl(curl_easy_init(), &curl_easy_cleanup);
+    curl_slist_ptr http_headers;
+    std::string res_str;
+    std::string url = "https://huggingface.co/v2/" + hf_repo + "/manifests/" + tag;
+    curl_easy_setopt(curl.get(), CURLOPT_URL, url.c_str());
+    curl_easy_setopt(curl.get(), CURLOPT_NOPROGRESS, 1L);
+    typedef size_t(*CURLOPT_WRITEFUNCTION_PTR)(void * ptr, size_t size, size_t nmemb, void * data);
+    auto write_callback = [](void * ptr, size_t size, size_t nmemb, void * data) -> size_t {
+        static_cast<std::string *>(data)->append((char * ) ptr, size * nmemb);
+        return size * nmemb;
+    };
+    curl_easy_setopt(curl.get(), CURLOPT_WRITEFUNCTION, static_cast<CURLOPT_WRITEFUNCTION_PTR>(write_callback));
+    curl_easy_setopt(curl.get(), CURLOPT_WRITEDATA, &res_str);
+#if defined(_WIN32)
+    curl_easy_setopt(curl.get(), CURLOPT_SSL_OPTIONS, CURLSSLOPT_NATIVE_CA);
+#endif
+    if (!hf_token.empty()) {
+        std::string auth_header = "Authorization: Bearer " + hf_token;
+        http_headers.ptr = curl_slist_append(http_headers.ptr, auth_header.c_str());
+    }
+    // Important: the User-Agent must be "llama-cpp" to get the "ggufFile" field in the response
+    http_headers.ptr = curl_slist_append(http_headers.ptr, "User-Agent: llama-cpp");
+    http_headers.ptr = curl_slist_append(http_headers.ptr, "Accept: application/json");
+    curl_easy_setopt(curl.get(), CURLOPT_HTTPHEADER, http_headers.ptr);
+
+    CURLcode res = curl_easy_perform(curl.get());
+
+    if (res != CURLE_OK) {
+        throw std::runtime_error("error: cannot make GET request to HF API");
+    }
+
+    long res_code;
+    curl_easy_getinfo(curl.get(), CURLINFO_RESPONSE_CODE, &res_code);
+    if (res_code == 200) {
+        model_info = json::parse(res_str);
+    } else if (res_code == 401) {
+        throw std::runtime_error("error: model is private or does not exist; if you are accessing a gated model, please provide a valid HF token");
+    } else {
+        throw std::runtime_error(string_format("error from HF API, response code: %ld, data: %s", res_code, res_str.c_str()));
+    }
+
+    // check response
+    if (!model_info.contains("ggufFile")) {
+        throw std::runtime_error("error: model does not have ggufFile");
+    }
+    json & gguf_file = model_info.at("ggufFile");
+    if (!gguf_file.contains("rfilename")) {
+        throw std::runtime_error("error: ggufFile does not have rfilename");
+    }
+
+    return std::make_pair(hf_repo, gguf_file.at("rfilename"));
+}
+
 #else
 
 struct llama_model * common_load_model_from_url(
```
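A hedged usage sketch of the new helper (assuming a build with `LLAMA_USE_CURL`); the repository name is taken from the examples in the doc comment above:

```cpp
// Sketch only: resolve a repo-with-tag into a concrete <repo, file> pair.
// Throws on malformed input or HF API errors, as shown in the implementation above.
std::pair<std::string, std::string> rf =
        common_get_hf_file("bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M", /*hf_token=*/"");

// rf.first  == "bartowski/Llama-3.2-3B-Instruct-GGUF" (tag stripped)
// rf.second == the GGUF filename reported by the Hub's "ggufFile" field
```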
```diff
@@ -1461,6 +1554,11 @@ struct llama_model * common_load_model_from_hf(
         return nullptr;
     }
 
+std::pair<std::string, std::string> common_get_hf_file(const std::string &, const std::string &) {
+    LOG_WRN("%s: llama.cpp built without libcurl, downloading from Hugging Face not supported.\n", __func__);
+    return std::make_pair("", "");
+}
+
 #endif // LLAMA_USE_CURL
 
 //
@@ -1559,21 +1657,23 @@ std::vector<llama_token> common_tokenize(
         const std::string & text,
         bool add_special,
         bool parse_special) {
-    return common_tokenize(llama_get_model(ctx), text, add_special, parse_special);
+    const llama_model * model = llama_get_model(ctx);
+    const llama_vocab * vocab = llama_model_get_vocab(model);
+    return common_tokenize(vocab, text, add_special, parse_special);
 }
 
 std::vector<llama_token> common_tokenize(
-    const struct llama_model * model,
+    const struct llama_vocab * vocab,
         const std::string & text,
         bool add_special,
         bool parse_special) {
     // upper limit for the number of tokens
     int n_tokens = text.length() + 2 * add_special;
     std::vector<llama_token> result(n_tokens);
-    n_tokens = llama_tokenize(model, text.data(), text.length(), result.data(), result.size(), add_special, parse_special);
+    n_tokens = llama_tokenize(vocab, text.data(), text.length(), result.data(), result.size(), add_special, parse_special);
     if (n_tokens < 0) {
         result.resize(-n_tokens);
-        int check = llama_tokenize(model, text.data(), text.length(), result.data(), result.size(), add_special, parse_special);
+        int check = llama_tokenize(vocab, text.data(), text.length(), result.data(), result.size(), add_special, parse_special);
         GGML_ASSERT(check == -n_tokens);
     } else {
         result.resize(n_tokens);
@@ -1582,12 +1682,18 @@ std::vector<llama_token> common_tokenize(
 }
 
 std::string common_token_to_piece(const struct llama_context * ctx, llama_token token, bool special) {
+    const llama_model * model = llama_get_model(ctx);
+    const llama_vocab * vocab = llama_model_get_vocab(model);
+    return common_token_to_piece(vocab, token, special);
+}
+
+std::string common_token_to_piece(const struct llama_vocab * vocab, llama_token token, bool special) {
     std::string piece;
     piece.resize(piece.capacity()); // using string internal cache, 15 bytes + '\n'
-    const int n_chars = llama_token_to_piece(llama_get_model(ctx), token, &piece[0], piece.size(), 0, special);
+    const int n_chars = llama_token_to_piece(vocab, token, &piece[0], piece.size(), 0, special);
     if (n_chars < 0) {
         piece.resize(-n_chars);
-        int check = llama_token_to_piece(llama_get_model(ctx), token, &piece[0], piece.size(), 0, special);
+        int check = llama_token_to_piece(vocab, token, &piece[0], piece.size(), 0, special);
         GGML_ASSERT(check == -n_chars);
     }
     else {
@@ -1597,13 +1703,19 @@ std::string common_token_to_piece(const struct llama_context * ctx, llama_token
     return piece;
 }
 
-std::string common_detokenize(llama_context * ctx, const std::vector<llama_token> & tokens, bool special) {
+std::string common_detokenize(const struct llama_context * ctx, const std::vector<llama_token> & tokens, bool special) {
+    const llama_model * model = llama_get_model(ctx);
+    const llama_vocab * vocab = llama_model_get_vocab(model);
+    return common_detokenize(vocab, tokens, special);
+}
+
+std::string common_detokenize(const struct llama_vocab * vocab, const std::vector<llama_token> & tokens, bool special) {
     std::string text;
     text.resize(std::max(text.capacity(), tokens.size()));
-    int32_t n_chars = llama_detokenize(llama_get_model(ctx), tokens.data(), (int32_t)tokens.size(), &text[0], (int32_t)text.size(), false, special);
+    int32_t n_chars = llama_detokenize(vocab, tokens.data(), (int32_t)tokens.size(), &text[0], (int32_t)text.size(), false, special);
     if (n_chars < 0) {
         text.resize(-n_chars);
-        n_chars = llama_detokenize(llama_get_model(ctx), tokens.data(), (int32_t)tokens.size(), &text[0], (int32_t)text.size(), false, special);
+        n_chars = llama_detokenize(vocab, tokens.data(), (int32_t)tokens.size(), &text[0], (int32_t)text.size(), false, special);
         GGML_ASSERT(n_chars <= (int32_t)text.size()); // whitespace trimming is performed after per-token detokenization
     }
 
```
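To make the new `llama_vocab`-based overloads above concrete, here is a minimal round-trip sketch (not part of the commit); it assumes a loaded `llama_model * model`:

```cpp
// Sketch only: tokenize and detokenize through the vocab, as the new overloads allow.
const llama_vocab * vocab = llama_model_get_vocab(model);

std::vector<llama_token> toks = common_tokenize(vocab, "Hello, world!", /*add_special=*/true, /*parse_special=*/false);
std::string round_trip        = common_detokenize(vocab, toks, /*special=*/true);
```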
```diff
@@ -1618,20 +1730,13 @@ std::string common_detokenize(llama_context * ctx, const std::vector<llama_token
 //
 
 std::string common_get_builtin_chat_template(const struct llama_model * model) {
-    static const char * template_key = "tokenizer.chat_template";
-    // call with NULL buffer to get the total size of the string
-    int32_t res = llama_model_meta_val_str(model, template_key, NULL, 0);
-    if (res > 0) {
-        std::vector<char> model_template(res + 1, 0);
-        llama_model_meta_val_str(model, template_key, model_template.data(), model_template.size());
-        return std::string(model_template.data(), model_template.size() - 1);
-    }
-    return "";
+    const char * ptr_tmpl = llama_model_chat_template(model);
+    return ptr_tmpl == nullptr ? "" : ptr_tmpl;
 }
 
 bool common_chat_verify_template(const std::string & tmpl) {
     llama_chat_message chat[] = {{"user", "test"}};
-    int res = llama_chat_apply_template(nullptr, tmpl.c_str(), chat, 1, true, nullptr, 0);
+    const int res = llama_chat_apply_template(tmpl.c_str(), chat, 1, true, nullptr, 0);
     return res >= 0;
 }
 
@@ -1642,16 +1747,16 @@ std::string common_chat_apply_template(const struct llama_model * model,
     int alloc_size = 0;
     bool fallback = false; // indicate if we must fallback to default chatml
     std::vector<llama_chat_message> chat;
-    for (auto & msg : msgs) {
+    for (const auto & msg : msgs) {
         chat.push_back({msg.role.c_str(), msg.content.c_str()});
         alloc_size += (msg.role.size() + msg.content.size()) * 1.25;
     }
 
-    const char * ptr_tmpl = tmpl.empty() ? nullptr : tmpl.c_str();
+    const char * ptr_tmpl = tmpl.empty() ? llama_model_chat_template(model) : tmpl.c_str();
     std::vector<char> buf(alloc_size);
 
     // run the first time to get the total output length
-    int32_t res = llama_chat_apply_template(model, ptr_tmpl, chat.data(), chat.size(), add_ass, buf.data(), buf.size());
+    int32_t res = llama_chat_apply_template(ptr_tmpl, chat.data(), chat.size(), add_ass, buf.data(), buf.size());
 
     // error: chat template is not supported
     if (res < 0) {
@@ -1659,18 +1764,17 @@ std::string common_chat_apply_template(const struct llama_model * model,
             // if the custom "tmpl" is not supported, we throw an error
             // this is a bit redundant (for good), since we're not sure if user validated the custom template with llama_chat_verify_template()
             throw std::runtime_error("this custom template is not supported");
-        } else {
-            // If the built-in template is not supported, we default to chatml
-            res = llama_chat_apply_template(nullptr, "chatml", chat.data(), chat.size(), add_ass, buf.data(), buf.size());
-            fallback = true;
         }
 
+        // If the built-in template is not supported, we default to chatml
+        res = llama_chat_apply_template("chatml", chat.data(), chat.size(), add_ass, buf.data(), buf.size());
+        fallback = true;
     }
 
     // if it turns out that our buffer is too small, we resize it
     if ((size_t) res > buf.size()) {
         buf.resize(res);
         res = llama_chat_apply_template(
-            fallback ? nullptr : model,
             fallback ? "chatml" : ptr_tmpl,
             chat.data(), chat.size(), add_ass, buf.data(), buf.size());
     }
```
**common/common.h**

```diff
@@ -24,11 +24,11 @@
 
 #define DEFAULT_MODEL_PATH "models/7B/ggml-model-f16.gguf"
 
-struct common_lora_adapter_info {
+struct common_adapter_lora_info {
     std::string path;
     float scale;
 
-    struct llama_lora_adapter * ptr;
+    struct llama_adapter_lora * ptr;
 };
 
 using llama_tokens = std::vector<llama_token>;
@@ -103,6 +103,12 @@ enum dimre_method {
     DIMRE_METHOD_MEAN,
 };
 
+enum common_conversation_mode {
+    COMMON_CONVERSATION_MODE_DISABLED = 0,
+    COMMON_CONVERSATION_MODE_ENABLED  = 1,
+    COMMON_CONVERSATION_MODE_AUTO     = 2,
+};
+
 // sampling parameters
 struct common_params_sampling {
     uint32_t seed = LLAMA_DEFAULT_SEED; // the seed used to initialize llama_sampler
```
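The new tri-state `common_conversation_mode` replaces the old `bool conversation` flag (removed further down), and the `-cnv` help text in `common/arg.cpp` above states that the default is auto-enabled when a chat template is available. A hypothetical sketch, for illustration only and not part of llama.cpp, of how a caller might resolve the auto case:

```cpp
// Hypothetical helper (illustration only): resolve COMMON_CONVERSATION_MODE_AUTO
// once it is known whether the loaded model ships a chat template.
static bool conversation_enabled(const common_params & params, bool has_chat_template) {
    switch (params.conversation_mode) {
        case COMMON_CONVERSATION_MODE_ENABLED:  return true;
        case COMMON_CONVERSATION_MODE_DISABLED: return false;
        case COMMON_CONVERSATION_MODE_AUTO:     return has_chat_template; // auto: follow the model
    }
    return false;
}
```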
```diff
@@ -247,8 +253,8 @@ struct common_params {
     std::vector<std::string> antiprompt; // strings upon which more user input is prompted (a.k.a. reverse prompts)
     std::vector<llama_model_kv_override> kv_overrides;
 
-    bool lora_init_without_apply = false; // only load lora to memory, but do not apply it to ctx (user can manually apply lora later using llama_lora_adapter_apply)
-    std::vector<common_lora_adapter_info> lora_adapters; // lora adapter path with user defined scale
+    bool lora_init_without_apply = false; // only load lora to memory, but do not apply it to ctx (user can manually apply lora later using llama_adapter_lora_apply)
+    std::vector<common_adapter_lora_info> lora_adapters; // lora adapter path with user defined scale
 
     std::vector<common_control_vector_load_info> control_vectors; // control vector with user defined scale
 
@@ -276,7 +282,6 @@ struct common_params {
     bool special           = false; // enable special token output
     bool interactive       = false; // interactive mode
     bool interactive_first = false; // wait for user input immediately
-    bool conversation      = false; // conversation mode (does not print special tokens and suffix/prefix)
     bool prompt_cache_all  = false; // save user input and generations to prompt cache
     bool prompt_cache_ro   = false; // open the prompt cache read-only and do not update it
 
@@ -302,6 +307,8 @@ struct common_params {
     ggml_type cache_type_k = GGML_TYPE_F16; // KV cache data type for the K
     ggml_type cache_type_v = GGML_TYPE_F16; // KV cache data type for the V
 
+    common_conversation_mode conversation_mode = COMMON_CONVERSATION_MODE_AUTO;
+
     // multimodal models (see examples/llava)
     std::string mmproj = ""; // path to multimodal projector // NOLINT
     std::vector<std::string> image; // path to image file(s)
@@ -455,6 +462,11 @@ static bool string_starts_with(const std::string & str,
     return str.rfind(prefix, 0) == 0;
 }
 
+static bool string_ends_with(const std::string & str,
+                             const std::string & suffix) { // While we wait for C++20's std::string::ends_with...
+    return str.size() >= suffix.size() && str.compare(str.size()-suffix.size(), suffix.size(), suffix) == 0;
+}
+
 bool string_parse_kv_override(const char * data, std::vector<llama_model_kv_override> & overrides);
 void string_process_escapes(std::string & input);
 
@@ -482,7 +494,7 @@ struct common_init_result {
     llama_model_ptr model;
     llama_context_ptr context;
 
-    std::vector<llama_lora_adapter_ptr> lora;
+    std::vector<llama_adapter_lora_ptr> lora;
 };
 
 struct common_init_result common_init_from_params(common_params & params);
@@ -502,9 +514,12 @@ struct llama_model * common_load_model_from_hf(
     const std::string & local_path,
     const std::string & hf_token,
     const struct llama_model_params & params);
+std::pair<std::string, std::string> common_get_hf_file(
+    const std::string & hf_repo_with_tag,
+    const std::string & hf_token);
 
 // clear LoRA adapters from context, then apply new list of adapters
-void common_lora_adapters_apply(struct llama_context * ctx, std::vector<common_lora_adapter_info> & lora);
+void common_set_adapter_lora(struct llama_context * ctx, std::vector<common_adapter_lora_info> & lora);
 
 //
 // Batch utils
@@ -542,7 +557,7 @@ std::vector<llama_token> common_tokenize(
     bool parse_special = false);
 
 std::vector<llama_token> common_tokenize(
-    const struct llama_model * model,
+    const struct llama_vocab * vocab,
     const std::string & text,
     bool add_special,
     bool parse_special = false);
@@ -554,11 +569,21 @@ std::string common_token_to_piece(
     llama_token token,
     bool special = true);
 
+std::string common_token_to_piece(
+    const struct llama_vocab * vocab,
+    llama_token token,
+    bool special = true);
+
 // detokenizes a vector of tokens into a string
 // should work similar to Python's `tokenizer.decode`
 // optionally renders special/control tokens
 std::string common_detokenize(
-    llama_context * ctx,
+    const struct llama_context * ctx,
+    const std::vector<llama_token> & tokens,
+    bool special = true);
+
+std::string common_detokenize(
+    const struct llama_vocab * vocab,
     const std::vector<llama_token> & tokens,
     bool special = true);
 
```
**common/sampling.cpp**

```diff
@@ -113,7 +113,10 @@ struct common_sampler {
     void set_logits(struct llama_context * ctx, int idx) {
         const auto * logits = llama_get_logits_ith(ctx, idx);
 
-        const int n_vocab = llama_n_vocab(llama_get_model(ctx));
+        const llama_model * model = llama_get_model(ctx);
+        const llama_vocab * vocab = llama_model_get_vocab(model);
+
+        const int n_vocab = llama_vocab_n_tokens(vocab);
 
         cur.resize(n_vocab);
 
@@ -142,13 +145,15 @@ std::string common_params_sampling::print() const {
 }
 
 struct common_sampler * common_sampler_init(const struct llama_model * model, const struct common_params_sampling & params) {
+    const llama_vocab * vocab = llama_model_get_vocab(model);
+
     llama_sampler_chain_params lparams = llama_sampler_chain_default_params();
 
     lparams.no_perf = params.no_perf;
 
     auto * result = new common_sampler {
         /* .params = */ params,
-        /* .grmr   = */ llama_sampler_init_grammar(model, params.grammar.c_str(), "root"),
+        /* .grmr   = */ llama_sampler_init_grammar(vocab, params.grammar.c_str(), "root"),
         /* .chain  = */ llama_sampler_chain_init(lparams),
         /* .prev   = */ ring_buffer<llama_token>(std::max(32, params.n_prev)),
         /* .cur    = */ {},
@@ -157,7 +162,7 @@ struct common_sampler * common_sampler_init(const struct llama_model * model, co
 
     llama_sampler_chain_add(result->chain,
             llama_sampler_init_logit_bias(
-                llama_n_vocab(model),
+                llama_vocab_n_tokens(vocab),
                 params.logit_bias.size(),
                 params.logit_bias.data()));
 
@@ -176,32 +181,32 @@ struct common_sampler * common_sampler_init(const struct llama_model * model, co
                 c_breakers.push_back(str.c_str());
             }
 
-            llama_sampler_chain_add(result->chain, llama_sampler_init_dry (model, params.dry_multiplier, params.dry_base, params.dry_allowed_length, params.dry_penalty_last_n, c_breakers.data(), c_breakers.size()));
+            llama_sampler_chain_add(result->chain, llama_sampler_init_dry (vocab, llama_model_n_ctx_train(model), params.dry_multiplier, params.dry_base, params.dry_allowed_length, params.dry_penalty_last_n, c_breakers.data(), c_breakers.size()));
         }
             break;
         case COMMON_SAMPLER_TYPE_TOP_K:
             llama_sampler_chain_add(result->chain, llama_sampler_init_top_k (params.top_k));
             break;
        case COMMON_SAMPLER_TYPE_TOP_P:
            llama_sampler_chain_add(result->chain, llama_sampler_init_top_p (params.top_p, params.min_keep));
            break;
        case COMMON_SAMPLER_TYPE_MIN_P:
            llama_sampler_chain_add(result->chain, llama_sampler_init_min_p (params.min_p, params.min_keep));
            break;
        case COMMON_SAMPLER_TYPE_XTC:
            llama_sampler_chain_add(result->chain, llama_sampler_init_xtc (params.xtc_probability, params.xtc_threshold, params.min_keep, params.seed));
            break;
        case COMMON_SAMPLER_TYPE_TYPICAL_P:
            llama_sampler_chain_add(result->chain, llama_sampler_init_typical (params.typ_p, params.min_keep));
```
|
||||||
break;
|
break;
|
||||||
case COMMON_SAMPLER_TYPE_TEMPERATURE:
|
case COMMON_SAMPLER_TYPE_TEMPERATURE:
|
||||||
llama_sampler_chain_add(result->chain, llama_sampler_init_temp_ext (params.temp, params.dynatemp_range, params.dynatemp_exponent));
|
llama_sampler_chain_add(result->chain, llama_sampler_init_temp_ext (params.temp, params.dynatemp_range, params.dynatemp_exponent));
|
||||||
break;
|
break;
|
||||||
case COMMON_SAMPLER_TYPE_INFILL:
|
case COMMON_SAMPLER_TYPE_INFILL:
|
||||||
llama_sampler_chain_add(result->chain, llama_sampler_init_infill (model));
|
llama_sampler_chain_add(result->chain, llama_sampler_init_infill (vocab));
|
||||||
break;
|
break;
|
||||||
case COMMON_SAMPLER_TYPE_PENALTIES:
|
case COMMON_SAMPLER_TYPE_PENALTIES:
|
||||||
llama_sampler_chain_add(result->chain, llama_sampler_init_penalties (params.penalty_last_n, params.penalty_repeat, params.penalty_freq, params.penalty_present));
|
llama_sampler_chain_add(result->chain, llama_sampler_init_penalties(params.penalty_last_n, params.penalty_repeat, params.penalty_freq, params.penalty_present));
|
||||||
break;
|
break;
|
||||||
default:
|
default:
|
||||||
GGML_ASSERT(false && "unknown sampler type");
|
GGML_ASSERT(false && "unknown sampler type");
|
||||||
@ -211,7 +216,7 @@ struct common_sampler * common_sampler_init(const struct llama_model * model, co
|
|||||||
llama_sampler_chain_add(result->chain, llama_sampler_init_dist(params.seed));
|
llama_sampler_chain_add(result->chain, llama_sampler_init_dist(params.seed));
|
||||||
} else if (params.mirostat == 1) {
|
} else if (params.mirostat == 1) {
|
||||||
llama_sampler_chain_add(result->chain, llama_sampler_init_temp(params.temp));
|
llama_sampler_chain_add(result->chain, llama_sampler_init_temp(params.temp));
|
||||||
llama_sampler_chain_add(result->chain, llama_sampler_init_mirostat(llama_n_vocab(model), params.seed, params.mirostat_tau, params.mirostat_eta, 100));
|
llama_sampler_chain_add(result->chain, llama_sampler_init_mirostat(llama_vocab_n_tokens(vocab), params.seed, params.mirostat_tau, params.mirostat_eta, 100));
|
||||||
} else if (params.mirostat == 2) {
|
} else if (params.mirostat == 2) {
|
||||||
llama_sampler_chain_add(result->chain, llama_sampler_init_temp(params.temp));
|
llama_sampler_chain_add(result->chain, llama_sampler_init_temp(params.temp));
|
||||||
llama_sampler_chain_add(result->chain, llama_sampler_init_mirostat_v2(params.seed, params.mirostat_tau, params.mirostat_eta));
|
llama_sampler_chain_add(result->chain, llama_sampler_init_mirostat_v2(params.seed, params.mirostat_tau, params.mirostat_eta));
|
||||||
|
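To summarize the sampler changes above, here is a hedged sketch of a minimal chain built the same way the mirostat branch now does it; the temperature, tau and eta values are illustrative defaults, not taken from this commit.

```cpp
#include "llama.h"

// Build a temp + mirostat chain; the vocab size now comes from the vocab object.
static llama_sampler * make_mirostat_chain(const llama_model * model, uint32_t seed) {
    const llama_vocab * vocab = llama_model_get_vocab(model);

    llama_sampler_chain_params lparams = llama_sampler_chain_default_params();
    llama_sampler * chain = llama_sampler_chain_init(lparams);

    llama_sampler_chain_add(chain, llama_sampler_init_temp(0.8f));
    llama_sampler_chain_add(chain, llama_sampler_init_mirostat(
        llama_vocab_n_tokens(vocab), // was: llama_n_vocab(model)
        seed, /*tau=*/5.0f, /*eta=*/0.1f, /*m=*/100));

    return chain; // free with llama_sampler_free() when no longer needed
}
```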
@ -79,10 +79,13 @@ bool common_speculative_are_compatible(
|
|||||||
const struct llama_model * model_tgt = llama_get_model(ctx_tgt);
|
const struct llama_model * model_tgt = llama_get_model(ctx_tgt);
|
||||||
const struct llama_model * model_dft = llama_get_model(ctx_dft);
|
const struct llama_model * model_dft = llama_get_model(ctx_dft);
|
||||||
|
|
||||||
const bool vocab_type_tgt = llama_vocab_type(model_tgt);
|
const struct llama_vocab * vocab_tgt = llama_model_get_vocab(model_tgt);
|
||||||
|
const struct llama_vocab * vocab_dft = llama_model_get_vocab(model_dft);
|
||||||
|
|
||||||
|
const bool vocab_type_tgt = llama_vocab_type(vocab_tgt);
|
||||||
LOG_DBG("%s: vocab_type tgt: %d\n", __func__, vocab_type_tgt);
|
LOG_DBG("%s: vocab_type tgt: %d\n", __func__, vocab_type_tgt);
|
||||||
|
|
||||||
const bool vocab_type_dft = llama_vocab_type(model_dft);
|
const bool vocab_type_dft = llama_vocab_type(vocab_dft);
|
||||||
LOG_DBG("%s: vocab_type dft: %d\n", __func__, vocab_type_dft);
|
LOG_DBG("%s: vocab_type dft: %d\n", __func__, vocab_type_dft);
|
||||||
|
|
||||||
if (vocab_type_tgt != vocab_type_dft) {
|
if (vocab_type_tgt != vocab_type_dft) {
|
||||||
@ -91,34 +94,34 @@ bool common_speculative_are_compatible(
|
|||||||
return false;
|
return false;
|
||||||
}
|
}
|
||||||
|
|
||||||
if (llama_add_bos_token(model_tgt) != llama_add_bos_token(model_dft) ||
|
if (llama_vocab_get_add_bos(vocab_tgt) != llama_vocab_get_add_bos(vocab_dft) ||
|
||||||
llama_add_eos_token(model_tgt) != llama_add_eos_token(model_dft) ||
|
llama_vocab_get_add_eos(vocab_tgt) != llama_vocab_get_add_eos(vocab_dft) ||
|
||||||
llama_token_bos(model_tgt) != llama_token_bos(model_dft) ||
|
llama_vocab_bos(vocab_tgt) != llama_vocab_bos(vocab_dft) ||
|
||||||
llama_token_eos(model_tgt) != llama_token_eos(model_dft)) {
|
llama_vocab_eos(vocab_tgt) != llama_vocab_eos(vocab_dft)) {
|
||||||
LOG_ERR("%s: draft model special tokens must match target model to use speculation\n", __func__);
|
LOG_ERR("%s: draft vocab special tokens must match target vocab to use speculation\n", __func__);
|
||||||
LOG_ERR("%s: tgt: bos = %d (%d), eos = %d (%d)\n", __func__, llama_token_bos(model_tgt), llama_add_bos_token(model_tgt), llama_token_eos(model_tgt), llama_add_eos_token(model_tgt));
|
LOG_ERR("%s: tgt: bos = %d (%d), eos = %d (%d)\n", __func__, llama_vocab_bos(vocab_tgt), llama_vocab_get_add_bos(vocab_tgt), llama_vocab_eos(vocab_tgt), llama_vocab_get_add_eos(vocab_tgt));
|
||||||
LOG_ERR("%s: dft: bos = %d (%d), eos = %d (%d)\n", __func__, llama_token_bos(model_dft), llama_add_bos_token(model_dft), llama_token_eos(model_dft), llama_add_eos_token(model_dft));
|
LOG_ERR("%s: dft: bos = %d (%d), eos = %d (%d)\n", __func__, llama_vocab_bos(vocab_dft), llama_vocab_get_add_bos(vocab_dft), llama_vocab_eos(vocab_dft), llama_vocab_get_add_eos(vocab_dft));
|
||||||
return false;
|
return false;
|
||||||
}
|
}
|
||||||
|
|
||||||
{
|
{
|
||||||
const int n_vocab_tgt = llama_n_vocab(model_tgt);
|
const int n_vocab_tgt = llama_vocab_n_tokens(vocab_tgt);
|
||||||
const int n_vocab_dft = llama_n_vocab(model_dft);
|
const int n_vocab_dft = llama_vocab_n_tokens(vocab_dft);
|
||||||
|
|
||||||
const int vocab_diff = std::abs(n_vocab_tgt - n_vocab_dft);
|
const int vocab_diff = std::abs(n_vocab_tgt - n_vocab_dft);
|
||||||
|
|
||||||
if (vocab_diff > SPEC_VOCAB_MAX_SIZE_DIFFERENCE) {
|
if (vocab_diff > SPEC_VOCAB_MAX_SIZE_DIFFERENCE) {
|
||||||
LOG_ERR("%s: draft model vocab must closely match target model to use speculation but "
|
LOG_ERR("%s: draft model vocab must closely match target model to use speculation but "
|
||||||
"target vocab size %d does not match draft vocab size %d - difference %d, max allowed %d\n",
|
"target vocab size %d does not match draft vocab size %d - difference %d, max allowed %d\n",
|
||||||
__func__, n_vocab_tgt, llama_n_vocab(model_dft), vocab_diff, SPEC_VOCAB_MAX_SIZE_DIFFERENCE);
|
__func__, n_vocab_tgt, llama_vocab_n_tokens(vocab_dft), vocab_diff, SPEC_VOCAB_MAX_SIZE_DIFFERENCE);
|
||||||
return false;
|
return false;
|
||||||
}
|
}
|
||||||
|
|
||||||
for (int i = SPEC_VOCAB_CHECK_START_TOKEN_ID; i < std::min(n_vocab_tgt, n_vocab_dft); ++i) {
|
for (int i = SPEC_VOCAB_CHECK_START_TOKEN_ID; i < std::min(n_vocab_tgt, n_vocab_dft); ++i) {
|
||||||
const char * token_text_tgt = llama_token_get_text(model_tgt, i);
|
const char * token_text_tgt = llama_vocab_get_text(vocab_tgt, i);
|
||||||
const char * token_text_dft = llama_token_get_text(model_dft, i);
|
const char * token_text_dft = llama_vocab_get_text(vocab_dft, i);
|
||||||
if (std::strcmp(token_text_tgt, token_text_dft) != 0) {
|
if (std::strcmp(token_text_tgt, token_text_dft) != 0) {
|
||||||
LOG_ERR("%s: draft model vocab must match target model to use speculation but "
|
LOG_ERR("%s: draft vocab vocab must match target vocab to use speculation but "
|
||||||
"token %d content differs - target '%s', draft '%s'\n", __func__, i,
|
"token %d content differs - target '%s', draft '%s'\n", __func__, i,
|
||||||
common_token_to_piece(ctx_tgt, i).c_str(),
|
common_token_to_piece(ctx_tgt, i).c_str(),
|
||||||
common_token_to_piece(ctx_dft, i).c_str());
|
common_token_to_piece(ctx_dft, i).c_str());
|
||||||
|
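The speculative-decoding hunks above replace the model-scoped token accessors with vocab-scoped ones. A condensed old-to-new mapping, as an illustrative sketch rather than code from this commit:

```cpp
#include <cstdio>

#include "llama.h"

static void dump_vocab_info(const llama_model * model) {
    const llama_vocab * vocab = llama_model_get_vocab(model);

    printf("n_tokens = %d\n", llama_vocab_n_tokens(vocab));    // was llama_n_vocab(model)
    printf("add_bos  = %d\n", llama_vocab_get_add_bos(vocab)); // was llama_add_bos_token(model)
    printf("add_eos  = %d\n", llama_vocab_get_add_eos(vocab)); // was llama_add_eos_token(model)
    printf("bos      = %d\n", llama_vocab_bos(vocab));         // was llama_token_bos(model)
    printf("eos      = %d\n", llama_vocab_eos(vocab));         // was llama_token_eos(model)
}
```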
@ -326,6 +326,7 @@ class Model:
|
|||||||
gguf.MODEL_TENSOR.TIME_MIX_W2,
|
gguf.MODEL_TENSOR.TIME_MIX_W2,
|
||||||
gguf.MODEL_TENSOR.TIME_MIX_DECAY_W1,
|
gguf.MODEL_TENSOR.TIME_MIX_DECAY_W1,
|
||||||
gguf.MODEL_TENSOR.TIME_MIX_DECAY_W2,
|
gguf.MODEL_TENSOR.TIME_MIX_DECAY_W2,
|
||||||
|
gguf.MODEL_TENSOR.TIME_MIX_LERP_FUSED,
|
||||||
gguf.MODEL_TENSOR.POSNET_NORM1,
|
gguf.MODEL_TENSOR.POSNET_NORM1,
|
||||||
gguf.MODEL_TENSOR.POSNET_NORM2,
|
gguf.MODEL_TENSOR.POSNET_NORM2,
|
||||||
)
|
)
|
||||||
@ -477,6 +478,11 @@ class Model:
|
|||||||
return modelcls
|
return modelcls
|
||||||
return func
|
return func
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def print_registered_models(cls):
|
||||||
|
for name in sorted(cls._model_classes.keys()):
|
||||||
|
logger.error(f"- {name}")
|
||||||
|
|
||||||
@classmethod
|
@classmethod
|
||||||
def from_model_architecture(cls, arch: str) -> type[Model]:
|
def from_model_architecture(cls, arch: str) -> type[Model]:
|
||||||
try:
|
try:
|
||||||
@ -2562,6 +2568,63 @@ class Phi3MiniModel(Model):
|
|||||||
yield (self.format_tensor_name(gguf.MODEL_TENSOR.ROPE_FACTORS_SHORT), torch.tensor(short_factors, dtype=torch.float32))
|
yield (self.format_tensor_name(gguf.MODEL_TENSOR.ROPE_FACTORS_SHORT), torch.tensor(short_factors, dtype=torch.float32))
|
||||||
|
|
||||||
|
|
||||||
|
@Model.register("PhiMoEForCausalLM")
|
||||||
|
class PhiMoeModel(Phi3MiniModel):
|
||||||
|
model_arch = gguf.MODEL_ARCH.PHIMOE
|
||||||
|
|
||||||
|
_experts: list[dict[str, Tensor]] | None = None
|
||||||
|
|
||||||
|
def set_gguf_parameters(self):
|
||||||
|
super().set_gguf_parameters()
|
||||||
|
self.gguf_writer.add_expert_used_count(self.hparams["num_experts_per_tok"])
|
||||||
|
self.gguf_writer.add_expert_count(self.hparams["num_local_experts"])
|
||||||
|
|
||||||
|
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
|
||||||
|
# process the experts separately
|
||||||
|
if name.find("block_sparse_moe.experts") != -1:
|
||||||
|
n_experts = self.hparams["num_local_experts"]
|
||||||
|
assert bid is not None
|
||||||
|
|
||||||
|
if self._experts is None:
|
||||||
|
self._experts = [{} for _ in range(self.block_count)]
|
||||||
|
|
||||||
|
self._experts[bid][name] = data_torch
|
||||||
|
|
||||||
|
if len(self._experts[bid]) >= n_experts * 3:
|
||||||
|
tensors: list[tuple[str, Tensor]] = []
|
||||||
|
|
||||||
|
# merge the experts into a single 3d tensor
|
||||||
|
for w_name in ["w1", "w2", "w3"]:
|
||||||
|
datas: list[Tensor] = []
|
||||||
|
|
||||||
|
for xid in range(n_experts):
|
||||||
|
ename = f"model.layers.{bid}.block_sparse_moe.experts.{xid}.{w_name}.weight"
|
||||||
|
datas.append(self._experts[bid][ename])
|
||||||
|
del self._experts[bid][ename]
|
||||||
|
|
||||||
|
data_torch = torch.stack(datas, dim=0)
|
||||||
|
|
||||||
|
merged_name = f"model.layers.{bid}.block_sparse_moe.experts.{w_name}.weight"
|
||||||
|
|
||||||
|
new_name = self.map_tensor_name(merged_name)
|
||||||
|
|
||||||
|
tensors.append((new_name, data_torch))
|
||||||
|
return tensors
|
||||||
|
else:
|
||||||
|
return []
|
||||||
|
|
||||||
|
return [(self.map_tensor_name(name), data_torch)]
|
||||||
|
|
||||||
|
def prepare_tensors(self):
|
||||||
|
super().prepare_tensors()
|
||||||
|
|
||||||
|
if self._experts is not None:
|
||||||
|
# flatten `list[dict[str, Tensor]]` into `list[str]`
|
||||||
|
experts = [k for d in self._experts for k in d.keys()]
|
||||||
|
if len(experts) > 0:
|
||||||
|
raise ValueError(f"Unprocessed experts: {experts}")
|
||||||
|
|
||||||
|
|
||||||
@Model.register("PlamoForCausalLM")
|
@Model.register("PlamoForCausalLM")
|
||||||
class PlamoModel(Model):
|
class PlamoModel(Model):
|
||||||
model_arch = gguf.MODEL_ARCH.PLAMO
|
model_arch = gguf.MODEL_ARCH.PLAMO
|
||||||
@ -3259,6 +3322,8 @@ class Rwkv6Model(Model):
|
|||||||
# required by llama.cpp, unused
|
# required by llama.cpp, unused
|
||||||
self.gguf_writer.add_head_count(0)
|
self.gguf_writer.add_head_count(0)
|
||||||
|
|
||||||
|
lerp_weights: dict[int, dict[str, Tensor]] = {}
|
||||||
|
|
||||||
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
|
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
|
||||||
new_name = self.map_tensor_name(name)
|
new_name = self.map_tensor_name(name)
|
||||||
|
|
||||||
@ -3274,14 +3339,84 @@ class Rwkv6Model(Model):
|
|||||||
if new_name.endswith("time_mix_decay.weight") or "lerp" in new_name:
|
if new_name.endswith("time_mix_decay.weight") or "lerp" in new_name:
|
||||||
data_torch = data_torch.squeeze()
|
data_torch = data_torch.squeeze()
|
||||||
|
|
||||||
rescale_every_n_layers = self.hparams["rescale_every"]
|
try:
|
||||||
if rescale_every_n_layers > 0:
|
rescale_every_n_layers = self.hparams["rescale_every"]
|
||||||
if new_name.endswith("time_mix_output.weight") or new_name.endswith("channel_mix_value.weight"):
|
if rescale_every_n_layers > 0:
|
||||||
data_torch = data_torch.div_(2 ** int(bid // rescale_every_n_layers))
|
if new_name.endswith("time_mix_output.weight") or new_name.endswith("channel_mix_value.weight"):
|
||||||
|
data_torch = data_torch.div_(2 ** int(bid // rescale_every_n_layers))
|
||||||
|
except KeyError:
|
||||||
|
pass
|
||||||
|
|
||||||
|
# concat time_mix_lerp weights to reduce some cpu overhead
|
||||||
|
# also reduces the number of tensors in the model
|
||||||
|
if bid is not None and "time_mix_lerp" in new_name and "time_mix_lerp_x" not in new_name:
|
||||||
|
try:
|
||||||
|
self.lerp_weights[bid][new_name] = data_torch
|
||||||
|
except KeyError:
|
||||||
|
self.lerp_weights[bid] = {new_name: data_torch}
|
||||||
|
if all(f"blk.{bid}.time_mix_lerp_{i}.weight" in self.lerp_weights[bid].keys() for i in ["w", "k", "v", "r", "g"]):
|
||||||
|
new_name = f"blk.{bid}.time_mix_lerp_fused.weight"
|
||||||
|
data = torch.stack([self.lerp_weights[bid][f"blk.{bid}.time_mix_lerp_{i}.weight"].unsqueeze(0) for i in ["w", "k", "v", "r", "g"]], dim=0).unsqueeze(1)
|
||||||
|
yield (new_name, data)
|
||||||
|
return
|
||||||
|
|
||||||
yield (new_name, data_torch)
|
yield (new_name, data_torch)
|
||||||
|
|
||||||
|
|
||||||
|
@Model.register("RWKV6Qwen2ForCausalLM")
|
||||||
|
class RWKV6Qwen2Model(Rwkv6Model):
|
||||||
|
model_arch = gguf.MODEL_ARCH.RWKV6QWEN2
|
||||||
|
|
||||||
|
def set_vocab(self):
|
||||||
|
try:
|
||||||
|
self._set_vocab_sentencepiece()
|
||||||
|
except FileNotFoundError:
|
||||||
|
self._set_vocab_gpt2()
|
||||||
|
|
||||||
|
def set_gguf_parameters(self):
|
||||||
|
block_count = self.hparams["num_hidden_layers"]
|
||||||
|
num_attention_heads = self.hparams["num_attention_heads"]
|
||||||
|
num_key_value_heads = self.hparams["num_key_value_heads"]
|
||||||
|
hidden_size = self.hparams["hidden_size"]
|
||||||
|
head_size = hidden_size // num_attention_heads
|
||||||
|
rms_norm_eps = self.hparams["rms_norm_eps"]
|
||||||
|
intermediate_size = self.hparams["intermediate_size"]
|
||||||
|
time_mix_extra_dim = 64 if hidden_size >= 4096 else 32
|
||||||
|
time_decay_extra_dim = 128 if hidden_size >= 4096 else 64
|
||||||
|
|
||||||
|
# RWKV isn't context limited
|
||||||
|
self.gguf_writer.add_context_length(1048576)
|
||||||
|
self.gguf_writer.add_embedding_length(hidden_size)
|
||||||
|
self.gguf_writer.add_block_count(block_count)
|
||||||
|
self.gguf_writer.add_wkv_head_size(head_size)
|
||||||
|
self.gguf_writer.add_time_mix_extra_dim(time_mix_extra_dim)
|
||||||
|
self.gguf_writer.add_time_decay_extra_dim(time_decay_extra_dim)
|
||||||
|
self.gguf_writer.add_feed_forward_length(intermediate_size)
|
||||||
|
self.gguf_writer.add_file_type(self.ftype)
|
||||||
|
|
||||||
|
# special parameters for time_mixing in RWKV6QWEN2
|
||||||
|
self.gguf_writer.add_layer_norm_rms_eps(rms_norm_eps)
|
||||||
|
self.gguf_writer.add_token_shift_count(1)
|
||||||
|
# RWKV6QWEN2 use grouped key/value like GQA
|
||||||
|
self.gguf_writer.add_head_count_kv(num_key_value_heads)
|
||||||
|
|
||||||
|
# required by llama.cpp, unused
|
||||||
|
self.gguf_writer.add_head_count(0)
|
||||||
|
|
||||||
|
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
|
||||||
|
for new_name, data in super().modify_tensors(data_torch, name, bid):
|
||||||
|
if "time_mix_w1" in new_name or "time_mix_w2" in new_name:
|
||||||
|
data = data.view(5, -1, data.shape[-1])
|
||||||
|
# rwkv6qwen2 has a different order of rkvwg instead of the original wkvrg
|
||||||
|
# permute them here to avoid code changes
|
||||||
|
data = torch.stack([data[3], data[1], data[2], data[0], data[4]], dim=0).view(-1, data.shape[-1])
|
||||||
|
if "w2" in new_name:
|
||||||
|
data = data.view(5, -1, data.shape[-1])
|
||||||
|
yield (new_name, data)
|
||||||
|
continue
|
||||||
|
yield (new_name, data)
|
||||||
|
|
||||||
|
|
||||||
@Model.register("MambaForCausalLM", "MambaLMHeadModel", "FalconMambaForCausalLM")
|
@Model.register("MambaForCausalLM", "MambaLMHeadModel", "FalconMambaForCausalLM")
|
||||||
class MambaModel(Model):
|
class MambaModel(Model):
|
||||||
model_arch = gguf.MODEL_ARCH.MAMBA
|
model_arch = gguf.MODEL_ARCH.MAMBA
|
||||||
@ -4799,6 +4934,7 @@ def parse_args() -> argparse.Namespace:
|
|||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
"model", type=Path,
|
"model", type=Path,
|
||||||
help="directory containing model file",
|
help="directory containing model file",
|
||||||
|
nargs="?",
|
||||||
)
|
)
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
"--use-temp-file", action="store_true",
|
"--use-temp-file", action="store_true",
|
||||||
@ -4836,8 +4972,15 @@ def parse_args() -> argparse.Namespace:
|
|||||||
"--metadata", type=Path,
|
"--metadata", type=Path,
|
||||||
help="Specify the path for an authorship metadata override file"
|
help="Specify the path for an authorship metadata override file"
|
||||||
)
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--print-supported-models", action="store_true",
|
||||||
|
help="Print the supported models"
|
||||||
|
)
|
||||||
|
|
||||||
return parser.parse_args()
|
args = parser.parse_args()
|
||||||
|
if not args.print_supported_models and args.model is None:
|
||||||
|
parser.error("the following arguments are required: model")
|
||||||
|
return args
|
||||||
|
|
||||||
|
|
||||||
def split_str_to_n_bytes(split_str: str) -> int:
|
def split_str_to_n_bytes(split_str: str) -> int:
|
||||||
@ -4861,6 +5004,11 @@ def split_str_to_n_bytes(split_str: str) -> int:
|
|||||||
def main() -> None:
|
def main() -> None:
|
||||||
args = parse_args()
|
args = parse_args()
|
||||||
|
|
||||||
|
if args.print_supported_models:
|
||||||
|
logger.error("Supported models:")
|
||||||
|
Model.print_registered_models()
|
||||||
|
sys.exit(0)
|
||||||
|
|
||||||
if args.verbose:
|
if args.verbose:
|
||||||
logging.basicConfig(level=logging.DEBUG)
|
logging.basicConfig(level=logging.DEBUG)
|
||||||
else:
|
else:
|
||||||
|
@ -127,6 +127,8 @@ For detailed info, please refer to [llama.cpp for SYCL](./backend/SYCL.md).
|
|||||||
|
|
||||||
This provides GPU acceleration using an NVIDIA GPU. Make sure to have the CUDA toolkit installed. You can download it from your Linux distro's package manager (e.g. `apt install nvidia-cuda-toolkit`) or from the [NVIDIA developer site](https://developer.nvidia.com/cuda-downloads).
|
This provides GPU acceleration using an NVIDIA GPU. Make sure to have the CUDA toolkit installed. You can download it from your Linux distro's package manager (e.g. `apt install nvidia-cuda-toolkit`) or from the [NVIDIA developer site](https://developer.nvidia.com/cuda-downloads).
|
||||||
|
|
||||||
|
If you are using Fedora (Fedora Workstation, or an 'Atomic' variant such as Silverblue), or would like to set up CUDA in a toolbox, please consider our [Fedora CUDA guide](./cuda-fedora.md). Unfortunately, the process is not as simple as one might expect.
|
||||||
|
|
||||||
- Using `CMake`:
|
- Using `CMake`:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
|
317
docs/cuda-fedora.md
Normal file
@ -0,0 +1,317 @@
|
|||||||
|
# Setting Up CUDA on Fedora
|
||||||
|
|
||||||
|
In this guide we set up [Nvidia CUDA](https://docs.nvidia.com/cuda/) in a toolbox container. This guide is applicable to:
|
||||||
|
- [Fedora Workstation](https://fedoraproject.org/workstation/)
|
||||||
|
- [Atomic Desktops for Fedora](https://fedoraproject.org/atomic-desktops/)
|
||||||
|
- [Fedora Spins](https://fedoraproject.org/spins)
|
||||||
|
- [Other Distributions](https://containertoolbx.org/distros/), including `Red Hat Enterprise Linux >= 8`, `Arch Linux`, and `Ubuntu`.
|
||||||
|
|
||||||
|
|
||||||
|
## Table of Contents
|
||||||
|
|
||||||
|
- [Prerequisites](#prerequisites)
|
||||||
|
- [Monitoring NVIDIA CUDA Repositories](#monitoring-nvidia-cuda-repositories)
|
||||||
|
- [Using the Fedora 39 CUDA Repository](#using-the-fedora-39-cuda-repository)
|
||||||
|
- [Creating a Fedora Toolbox Environment](#creating-a-fedora-toolbox-environment)
|
||||||
|
- [Installing Essential Development Tools](#installing-essential-development-tools)
|
||||||
|
- [Adding the CUDA Repository](#adding-the-cuda-repository)
|
||||||
|
- [Installing `nvidia-driver-libs`](#installing-nvidia-driver-libs)
|
||||||
|
- [Manually Resolving Package Conflicts](#manually-resolving-package-conflicts)
|
||||||
|
- [Finalizing the Installation of `nvidia-driver-libs`](#finalizing-the-installation-of-nvidia-driver-libs)
|
||||||
|
- [Installing the CUDA Meta-Package](#installing-the-cuda-meta-package)
|
||||||
|
- [Configuring the Environment](#configuring-the-environment)
|
||||||
|
- [Verifying the Installation](#verifying-the-installation)
|
||||||
|
- [Conclusion](#conclusion)
|
||||||
|
- [Troubleshooting](#troubleshooting)
|
||||||
|
- [Additional Notes](#additional-notes)
|
||||||
|
- [References](#references)
|
||||||
|
|
||||||
|
## Prerequisites
|
||||||
|
|
||||||
|
- **Toolbox Installed on the Host System:** `Fedora Silverblue` and `Fedora Workstation` both have toolbox by default; other distributions may need to install the [toolbox package](https://containertoolbx.org/install/).
|
||||||
|
- **NVIDIA Drivers and Graphics Card Installed on the Host System (optional):** To run CUDA programs such as `llama.cpp`, the host should be set up to access your NVIDIA hardware. Fedora hosts can use the [RPM Fusion Repository](https://rpmfusion.org/Howto/NVIDIA).
|
||||||
|
- **Internet connectivity** to download packages.
|
||||||
|
|
||||||
|
### Monitoring NVIDIA CUDA Repositories
|
||||||
|
|
||||||
|
Before proceeding, it is advisable to check if NVIDIA has updated their CUDA repositories for your Fedora version. NVIDIA's repositories can be found at:
|
||||||
|
|
||||||
|
- [Fedora 40 CUDA Repository](https://developer.download.nvidia.com/compute/cuda/repos/fedora40/x86_64/)
|
||||||
|
- [Fedora 41 CUDA Repository](https://developer.download.nvidia.com/compute/cuda/repos/fedora41/x86_64/)
|
||||||
|
|
||||||
|
As of the latest update, these repositories do not contain the `cuda` meta-package or are missing essential components.
|
||||||
|
|
||||||
|
### Using the Fedora 39 CUDA Repository
|
||||||
|
|
||||||
|
Since the newer repositories are incomplete, we'll use the Fedora 39 repository:
|
||||||
|
|
||||||
|
- [Fedora 39 CUDA Repository](https://developer.download.nvidia.com/compute/cuda/repos/fedora39/x86_64/)
|
||||||
|
|
||||||
|
**Note:** Fedora 39 is no longer maintained, so we recommend using a toolbox environment to prevent system conflicts.
|
||||||
|
|
||||||
|
## Creating a Fedora Toolbox Environment
|
||||||
|
|
||||||
|
This guide focuses on Fedora hosts, but with small adjustments, it can work for other hosts. Using a Fedora 39 toolbox allows us to install the necessary packages without affecting the host system.
|
||||||
|
|
||||||
|
**Note:** Toolbox is available for other systems, and even without Toolbox, it is possible to use Podman or Docker.
|
||||||
|
|
||||||
|
We do not recommend installing on the host system, as Fedora 39 is out of maintenance; instead, upgrade your host to a maintained version of Fedora.
|
||||||
|
|
||||||
|
1. **Create a Fedora 39 Toolbox:**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
toolbox create --image registry.fedoraproject.org/fedora-toolbox:39 --container fedora-toolbox-39-cuda
|
||||||
|
```
|
||||||
|
|
||||||
|
2. **Enter the Toolbox:**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
toolbox enter --container fedora-toolbox-39-cuda
|
||||||
|
```
|
||||||
|
|
||||||
|
Inside the toolbox, you have root privileges and can install packages without affecting the host system.
|
||||||
|
|
||||||
|
## Installing Essential Development Tools
|
||||||
|
|
||||||
|
1. **Synchronize the DNF Package Manager:**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo dnf distro-sync
|
||||||
|
```
|
||||||
|
|
||||||
|
2. **Install the Default Text Editor (Optional):**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo dnf install vim-default-editor --allowerasing
|
||||||
|
```
|
||||||
|
|
||||||
|
The `--allowerasing` flag resolves any package conflicts.
|
||||||
|
|
||||||
|
3. **Install Development Tools and Libraries:**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo dnf install @c-development @development-tools cmake
|
||||||
|
```
|
||||||
|
|
||||||
|
This installs essential packages for compiling software, including `gcc`, `make`, and other development headers.
|
||||||
|
|
||||||
|
## Adding the CUDA Repository
|
||||||
|
|
||||||
|
Add the NVIDIA CUDA repository to your DNF configuration:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/fedora39/x86_64/cuda-fedora39.repo
|
||||||
|
```
|
||||||
|
|
||||||
|
After adding the repository, synchronize the package manager again:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo dnf distro-sync
|
||||||
|
```
|
||||||
|
|
||||||
|
## Installing `nvidia-driver-libs`
|
||||||
|
|
||||||
|
Attempt to install `nvidia-driver-libs`:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo dnf install nvidia-driver-libs
|
||||||
|
```
|
||||||
|
|
||||||
|
**Explanation:**
|
||||||
|
|
||||||
|
- `nvidia-driver-libs` contains the NVIDIA driver libraries required by CUDA.
|
||||||
|
- This step might fail due to conflicts with existing NVIDIA drivers on the host system.
|
||||||
|
|
||||||
|
## Manually Resolving Package Conflicts
|
||||||
|
|
||||||
|
If the installation fails due to conflicts, we'll manually download and install the required packages, excluding conflicting files.
|
||||||
|
|
||||||
|
### 1. Download the `nvidia-driver-libs` RPM
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo dnf download --arch x86_64 nvidia-driver-libs
|
||||||
|
```
|
||||||
|
|
||||||
|
You should see a file similar to:
|
||||||
|
|
||||||
|
```
|
||||||
|
nvidia-driver-libs-560.35.05-1.fc39.x86_64.rpm
|
||||||
|
```
|
||||||
|
|
||||||
|
### 2. Attempt to Install the RPM
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo dnf install nvidia-driver-libs-560.35.05-1.fc39.x86_64.rpm
|
||||||
|
```
|
||||||
|
|
||||||
|
**Expected Error:**
|
||||||
|
|
||||||
|
Installation may fail with errors pointing to conflicts with `egl-gbm` and `egl-wayland`.
|
||||||
|
|
||||||
|
**Note: It is important to carefully read the error messages to identify the exact paths that need to be excluded.**
|
||||||
|
|
||||||
|
### 3. Download Dependencies
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo dnf download --arch x86_64 egl-gbm egl-wayland
|
||||||
|
```
|
||||||
|
|
||||||
|
### 4. Install `egl-gbm` with Excluded Paths
|
||||||
|
|
||||||
|
Exclude conflicting files during installation:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo rpm --install --verbose --hash \
|
||||||
|
--excludepath=/usr/lib64/libnvidia-egl-gbm.so.1.1.2 \
|
||||||
|
--excludepath=/usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json \
|
||||||
|
egl-gbm-1.1.2^20240919gitb24587d-3.fc39.x86_64.rpm
|
||||||
|
```
|
||||||
|
|
||||||
|
**Explanation:**
|
||||||
|
|
||||||
|
- The `--excludepath` option skips installing files that conflict with existing files.
|
||||||
|
- Adjust the paths based on the error messages you receive.
|
||||||
|
|
||||||
|
### 5. Install `egl-wayland` with Excluded Paths
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo rpm --install --verbose --hash \
|
||||||
|
--excludepath=/usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json \
|
||||||
|
egl-wayland-1.1.17^20241118giteeb29e1-5.fc39.x86_64.rpm
|
||||||
|
```
|
||||||
|
|
||||||
|
### 6. Install `nvidia-driver-libs` with Excluded Paths
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo rpm --install --verbose --hash \
|
||||||
|
--excludepath=/usr/share/glvnd/egl_vendor.d/10_nvidia.json \
|
||||||
|
--excludepath=/usr/share/nvidia/nvoptix.bin \
|
||||||
|
nvidia-driver-libs-560.35.05-1.fc39.x86_64.rpm
|
||||||
|
```
|
||||||
|
|
||||||
|
**Note:**
|
||||||
|
|
||||||
|
- Replace the paths with the ones causing conflicts in your installation if they differ.
|
||||||
|
- The `--verbose` and `--hash` options provide detailed output during installation.
|
||||||
|
|
||||||
|
## Finalizing the Installation of `nvidia-driver-libs`
|
||||||
|
|
||||||
|
After manually installing the dependencies, run:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo dnf install nvidia-driver-libs
|
||||||
|
```
|
||||||
|
|
||||||
|
You should receive a message indicating the package is already installed:
|
||||||
|
|
||||||
|
```
|
||||||
|
Package nvidia-driver-libs-3:560.35.05-1.fc39.x86_64 is already installed.
|
||||||
|
Dependencies resolved.
|
||||||
|
Nothing to do.
|
||||||
|
Complete!
|
||||||
|
```
|
||||||
|
|
||||||
|
## Installing the CUDA Meta-Package
|
||||||
|
|
||||||
|
Now that the driver libraries are installed, proceed to install CUDA:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo dnf install cuda
|
||||||
|
```
|
||||||
|
|
||||||
|
This installs the CUDA toolkit and associated packages.
|
||||||
|
|
||||||
|
## Configuring the Environment
|
||||||
|
|
||||||
|
To use CUDA, add its binary directory to your system's `PATH`.
|
||||||
|
|
||||||
|
1. **Create a Profile Script:**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo sh -c 'echo "export PATH=\$PATH:/usr/local/cuda/bin" >> /etc/profile.d/cuda.sh'
|
||||||
|
```
|
||||||
|
|
||||||
|
**Explanation:**
|
||||||
|
|
||||||
|
- We add to `/etc/profile.d/` as the `/etc/` folder is unique to this particular container, and is not shared with other containers or the host system.
|
||||||
|
- The backslash `\` before `$PATH` ensures the variable is correctly written into the script.
|
||||||
|
|
||||||
|
2. **Make the Script Executable:**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo chmod +x /etc/profile.d/cuda.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
3. **Source the Script to Update Your Environment:**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
source /etc/profile.d/cuda.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
**Note:** This command updates your current shell session with the new `PATH`. The `/etc/profile.d/cuda.sh` script ensures that the CUDA binaries are available in your `PATH` for all future sessions.
|
||||||
|
|
||||||
|
## Verifying the Installation
|
||||||
|
|
||||||
|
To confirm that CUDA is correctly installed and configured, check the version of the NVIDIA CUDA Compiler (`nvcc`):
|
||||||
|
|
||||||
|
```bash
|
||||||
|
nvcc --version
|
||||||
|
```
|
||||||
|
|
||||||
|
You should see output similar to:
|
||||||
|
|
||||||
|
```
|
||||||
|
nvcc: NVIDIA (R) Cuda compiler driver
|
||||||
|
Copyright (c) 2005-2024 NVIDIA Corporation
|
||||||
|
Built on Tue_Oct_29_23:50:19_PDT_2024
|
||||||
|
Cuda compilation tools, release 12.6, V12.6.85
|
||||||
|
Build cuda_12.6.r12.6/compiler.35059454_0
|
||||||
|
```
|
||||||
|
|
||||||
|
This output confirms that the CUDA compiler is accessible and indicates the installed version.
|
||||||
|
|
||||||
|
## Conclusion
|
||||||
|
|
||||||
|
You have successfully set up CUDA on Fedora within a toolbox environment using the Fedora 39 CUDA repository. By manually resolving package conflicts and configuring the environment, you can develop CUDA applications without affecting your host system.
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
- **Installation Failures:**
|
||||||
|
- If you encounter errors during installation, carefully read the error messages. They often indicate conflicting files or missing dependencies.
|
||||||
|
- Use the `--excludepath` option with `rpm` to exclude conflicting files during manual installations.
|
||||||
|
|
||||||
|
- **Driver Conflicts:**
|
||||||
|
- Since the host system may already have NVIDIA drivers installed, conflicts can arise. Using the toolbox environment helps isolate these issues.
|
||||||
|
|
||||||
|
- **Environment Variables Not Set:**
|
||||||
|
- If `nvcc` is not found after installation, ensure that `/usr/local/cuda/bin` is in your `PATH`.
|
||||||
|
- Run `echo $PATH` to check if the path is included.
|
||||||
|
- Re-source the profile script or open a new terminal session.
|
||||||
|
|
||||||
|
## Additional Notes
|
||||||
|
|
||||||
|
- **Updating CUDA in the Future:**
|
||||||
|
- Keep an eye on the official NVIDIA repositories for updates to your Fedora version.
|
||||||
|
- When an updated repository becomes available, adjust your `dnf` configuration accordingly.
|
||||||
|
|
||||||
|
- **Building `llama.cpp`:**
|
||||||
|
- With CUDA installed, you can follow these [build instructions for `llama.cpp`](https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md) to compile it with CUDA support.
|
||||||
|
- Ensure that any CUDA-specific build flags or paths are correctly set in your build configuration.
|
||||||
|
|
||||||
|
- **Using the Toolbox Environment:**
|
||||||
|
- The toolbox environment is isolated from your host system, which helps prevent conflicts.
|
||||||
|
- Remember that system files and configurations inside the toolbox are separate from the host. By default, the user's home directory is shared between the host and the toolbox.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Disclaimer:** Manually installing and modifying system packages can lead to instability of the container. The above steps are provided as a guideline and may need adjustments based on your specific system configuration. Always back up important data before making significant system changes, especially as your home folder is writable and shared with the toolbox.
|
||||||
|
|
||||||
|
**Acknowledgments:** Special thanks to the Fedora community and NVIDIA documentation for providing resources that assisted in creating this guide.
|
||||||
|
|
||||||
|
## References
|
||||||
|
|
||||||
|
- [Fedora Toolbox Documentation](https://docs.fedoraproject.org/en-US/fedora-silverblue/toolbox/)
|
||||||
|
- [NVIDIA CUDA Installation Guide](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html)
|
||||||
|
- [Podman Documentation](https://podman.io/get-started)
|
||||||
|
|
||||||
|
---
|
@ -28,7 +28,7 @@ The required steps to implement for an HF model are:
|
|||||||
```python
|
```python
|
||||||
@Model.register("MyModelForCausalLM")
|
@Model.register("MyModelForCausalLM")
|
||||||
class MyModel(Model):
|
class MyModel(Model):
|
||||||
model_arch = gguf.MODEL_ARCH.GROK
|
model_arch = gguf.MODEL_ARCH.MYMODEL
|
||||||
```
|
```
|
||||||
|
|
||||||
2. Define the layout of the GGUF tensors in [constants.py](/gguf-py/gguf/constants.py)
|
2. Define the layout of the GGUF tensors in [constants.py](/gguf-py/gguf/constants.py)
|
||||||
@ -79,14 +79,14 @@ Depending on the model configuration, tokenizer, code and tensors layout, you wi
|
|||||||
- `Model#set_vocab`
|
- `Model#set_vocab`
|
||||||
- `Model#write_tensors`
|
- `Model#write_tensors`
|
||||||
|
|
||||||
NOTE: Tensor names must end with `.weight` suffix, that is the convention and several tools like `quantize` expect this to proceed the weights.
|
NOTE: Tensor names must end with the `.weight` or `.bias` suffix; that is the convention, and several tools like `quantize` expect this to precede the weights.
|
||||||
|
|
||||||
### 2. Define the model architecture in `llama.cpp`
|
### 2. Define the model architecture in `llama.cpp`
|
||||||
|
|
||||||
The model params and tensors layout must be defined in `llama.cpp`:
|
The model params and tensors layout must be defined in `llama.cpp`:
|
||||||
1. Define a new `llm_arch`
|
1. Define a new `llm_arch`
|
||||||
2. Define the tensors layout in `LLM_TENSOR_NAMES`
|
2. Define the tensors layout in `LLM_TENSOR_NAMES`
|
||||||
3. Add any non standard metadata in `llm_load_hparams`
|
3. Add any non-standard metadata in `llm_load_hparams`
|
||||||
4. Create the tensors for inference in `llm_load_tensors`
|
4. Create the tensors for inference in `llm_load_tensors`
|
||||||
5. If the model has a RoPE operation, add the rope type in `llama_rope_type`
|
5. If the model has a RoPE operation, add the rope type in `llama_rope_type`
|
||||||
|
|
||||||
@ -96,9 +96,9 @@ NOTE: The dimensions in `ggml` are typically in the reverse order of the `pytorc
|
|||||||
|
|
||||||
This is the funniest part: you have to provide the inference graph implementation of the new model architecture in `llama_build_graph`.
|
This is the funniest part: you have to provide the inference graph implementation of the new model architecture in `llama_build_graph`.
|
||||||
|
|
||||||
Have a look at existing implementation like `build_llama`, `build_dbrx` or `build_bert`.
|
Have a look at existing implementations like `build_llama`, `build_dbrx` or `build_bert`.
|
||||||
|
|
||||||
When implementing a new graph, please note that the underlying `ggml` backends might not support them all, support for missing backend operations can be added in another PR.
|
Some `ggml` backends do not support all operations. Backend implementations can be added in a separate PR.
|
||||||
|
|
||||||
Note: to debug the inference graph, you can use [llama-eval-callback](/examples/eval-callback/).
|
Note: to debug the inference graph, you can use [llama-eval-callback](/examples/eval-callback/).
|
||||||
|
|
||||||
|
@ -50,7 +50,7 @@ int main(int argc, char ** argv) {
|
|||||||
// ensure enough sequences are available
|
// ensure enough sequences are available
|
||||||
ctx_params.n_seq_max = n_pl.empty() ? 1 : *std::max_element(n_pl.begin(), n_pl.end());
|
ctx_params.n_seq_max = n_pl.empty() ? 1 : *std::max_element(n_pl.begin(), n_pl.end());
|
||||||
|
|
||||||
llama_context * ctx = llama_new_context_with_model(model, ctx_params);
|
llama_context * ctx = llama_init_from_model(model, ctx_params);
|
||||||
|
|
||||||
if (ctx == NULL) {
|
if (ctx == NULL) {
|
||||||
fprintf(stderr , "%s: error: failed to create the llama_context\n" , __func__);
|
fprintf(stderr , "%s: error: failed to create the llama_context\n" , __func__);
|
||||||
|
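The hunk above switches context creation to `llama_init_from_model`, and the Swift example that follows picks up the matching `llama_model_load_from_file`/`llama_model_free` renames. A minimal end-to-end sketch of the new lifecycle (the `"model.gguf"` path is a placeholder):

```cpp
#include "llama.h"

int main() {
    llama_model_params mparams = llama_model_default_params();

    // was: llama_load_model_from_file(...)
    llama_model * model = llama_model_load_from_file("model.gguf", mparams);
    if (model == NULL) {
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();

    // was: llama_new_context_with_model(model, cparams)
    llama_context * ctx = llama_init_from_model(model, cparams);
    if (ctx == NULL) {
        llama_model_free(model); // was: llama_free_model(model)
        return 1;
    }

    // ... tokenize, decode and sample here ...

    llama_free(ctx);
    llama_model_free(model);
    return 0;
}
```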
@ -23,12 +23,12 @@ defer {
|
|||||||
}
|
}
|
||||||
|
|
||||||
let model_params = llama_model_default_params()
|
let model_params = llama_model_default_params()
|
||||||
guard let model = llama_load_model_from_file(modelPath.cString(using: .utf8), model_params) else {
|
guard let model = llama_model_load_from_file(modelPath.cString(using: .utf8), model_params) else {
|
||||||
print("Failed to load model")
|
print("Failed to load model")
|
||||||
exit(1)
|
exit(1)
|
||||||
}
|
}
|
||||||
defer {
|
defer {
|
||||||
llama_free_model(model)
|
llama_model_free(model)
|
||||||
}
|
}
|
||||||
|
|
||||||
var tokens = tokenize(text: prompt, add_bos: true)
|
var tokens = tokenize(text: prompt, add_bos: true)
|
||||||
@ -141,7 +141,7 @@ while n_cur <= n_len {
|
|||||||
let new_token_id = llama_sampler_sample(smpl, context, i_batch[i])
|
let new_token_id = llama_sampler_sample(smpl, context, i_batch[i])
|
||||||
|
|
||||||
// is it an end of stream? -> mark the stream as finished
|
// is it an end of stream? -> mark the stream as finished
|
||||||
if llama_token_is_eog(model, new_token_id) || n_cur == n_len {
|
if llama_vocab_is_eog(model, new_token_id) || n_cur == n_len {
|
||||||
i_batch[i] = -1
|
i_batch[i] = -1
|
||||||
// print("")
|
// print("")
|
||||||
if n_parallel > 1 {
|
if n_parallel > 1 {
|
||||||
|
@ -48,10 +48,12 @@ int main(int argc, char ** argv) {
|
|||||||
return 1;
|
return 1;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
const llama_vocab * vocab = llama_model_get_vocab(model);
|
||||||
|
|
||||||
// tokenize the prompt
|
// tokenize the prompt
|
||||||
|
|
||||||
std::vector<llama_token> tokens_list;
|
std::vector<llama_token> tokens_list;
|
||||||
tokens_list = common_tokenize(model, params.prompt, true);
|
tokens_list = common_tokenize(vocab, params.prompt, true);
|
||||||
|
|
||||||
const int n_kv_req = tokens_list.size() + (n_predict - tokens_list.size())*n_parallel;
|
const int n_kv_req = tokens_list.size() + (n_predict - tokens_list.size())*n_parallel;
|
||||||
|
|
||||||
@ -62,7 +64,7 @@ int main(int argc, char ** argv) {
|
|||||||
ctx_params.n_ctx = n_kv_req;
|
ctx_params.n_ctx = n_kv_req;
|
||||||
ctx_params.n_batch = std::max(n_predict, n_parallel);
|
ctx_params.n_batch = std::max(n_predict, n_parallel);
|
||||||
|
|
||||||
llama_context * ctx = llama_new_context_with_model(model, ctx_params);
|
llama_context * ctx = llama_init_from_model(model, ctx_params);
|
||||||
|
|
||||||
auto sparams = llama_sampler_chain_default_params();
|
auto sparams = llama_sampler_chain_default_params();
|
||||||
sparams.no_perf = false;
|
sparams.no_perf = false;
|
||||||
@ -121,7 +123,7 @@ int main(int argc, char ** argv) {
|
|||||||
|
|
||||||
llama_token decoder_start_token_id = llama_model_decoder_start_token(model);
|
llama_token decoder_start_token_id = llama_model_decoder_start_token(model);
|
||||||
if (decoder_start_token_id == LLAMA_TOKEN_NULL) {
|
if (decoder_start_token_id == LLAMA_TOKEN_NULL) {
|
||||||
decoder_start_token_id = llama_token_bos(model);
|
decoder_start_token_id = llama_vocab_bos(vocab);
|
||||||
}
|
}
|
||||||
|
|
||||||
common_batch_clear(batch);
|
common_batch_clear(batch);
|
||||||
@ -174,7 +176,7 @@ int main(int argc, char ** argv) {
|
|||||||
const llama_token new_token_id = llama_sampler_sample(smpl, ctx, i_batch[i]);
|
const llama_token new_token_id = llama_sampler_sample(smpl, ctx, i_batch[i]);
|
||||||
|
|
||||||
// is it an end of generation? -> mark the stream as finished
|
// is it an end of generation? -> mark the stream as finished
|
||||||
if (llama_token_is_eog(model, new_token_id) || n_cur == n_predict) {
|
if (llama_vocab_is_eog(vocab, new_token_id) || n_cur == n_predict) {
|
||||||
i_batch[i] = -1;
|
i_batch[i] = -1;
|
||||||
LOG("\n");
|
LOG("\n");
|
||||||
if (n_parallel > 1) {
|
if (n_parallel > 1) {
|
||||||
|
@ -911,7 +911,7 @@ int main(int argc, char ** argv) {
|
|||||||
load_vocab(params.fn_vocab_model, &config, &vocab);
|
load_vocab(params.fn_vocab_model, &config, &vocab);
|
||||||
|
|
||||||
struct my_llama_model model;
|
struct my_llama_model model;
|
||||||
model.hparams.n_vocab = config.vocab_size; //llama_n_vocab(lctx);
|
model.hparams.n_vocab = config.vocab_size; //llama_vocab_n_vocab(lctx);
|
||||||
model.hparams.n_ctx = params.n_ctx;
|
model.hparams.n_ctx = params.n_ctx;
|
||||||
model.hparams.n_embd = config.dim; //params.n_embd;
|
model.hparams.n_embd = config.dim; //params.n_embd;
|
||||||
model.hparams.n_ff = config.hidden_dim;
|
model.hparams.n_ff = config.hidden_dim;
|
||||||
|
@ -273,7 +273,9 @@ struct tokenized_prompt {
|
|||||||
size_t max_seq_len;
|
size_t max_seq_len;
|
||||||
|
|
||||||
tokenized_prompt(llama_context * ctx, std::string pos, std::string neg) {
|
tokenized_prompt(llama_context * ctx, std::string pos, std::string neg) {
|
||||||
const bool add_bos = llama_add_bos_token(llama_get_model(ctx));
|
const llama_model * model = llama_get_model(ctx);
|
||||||
|
const llama_vocab * vocab = llama_model_get_vocab(model);
|
||||||
|
const bool add_bos = llama_vocab_get_add_bos(vocab);
|
||||||
tokens_pos = common_tokenize(ctx, pos, add_bos, true);
|
tokens_pos = common_tokenize(ctx, pos, add_bos, true);
|
||||||
tokens_neg = common_tokenize(ctx, neg, add_bos, true);
|
tokens_neg = common_tokenize(ctx, neg, add_bos, true);
|
||||||
max_seq_len = std::max(tokens_pos.size(), tokens_neg.size());
|
max_seq_len = std::max(tokens_pos.size(), tokens_neg.size());
|
||||||
@ -421,8 +423,8 @@ int main(int argc, char ** argv) {
|
|||||||
llama_context * ctx = llama_init.context.get();
|
llama_context * ctx = llama_init.context.get();
|
||||||
|
|
||||||
// int n_ctx = llama_n_ctx(ctx);
|
// int n_ctx = llama_n_ctx(ctx);
|
||||||
int n_layers = llama_n_layer(model);
|
int n_layers = llama_model_n_layer(model);
|
||||||
int n_embd = llama_n_embd(model);
|
int n_embd = llama_model_n_embd(model);
|
||||||
|
|
||||||
// get model hint param (a.k.a model arch name)
|
// get model hint param (a.k.a model arch name)
|
||||||
char model_hint[128];
|
char model_hint[128];
|
||||||
|
@ -105,7 +105,9 @@ int main(int argc, char ** argv) {
|
|||||||
return 1;
|
return 1;
|
||||||
}
|
}
|
||||||
|
|
||||||
const int n_ctx_train = llama_n_ctx_train(model);
|
const llama_vocab * vocab = llama_model_get_vocab(model);
|
||||||
|
|
||||||
|
const int n_ctx_train = llama_model_n_ctx_train(model);
|
||||||
const int n_ctx = llama_n_ctx(ctx);
|
const int n_ctx = llama_n_ctx(ctx);
|
||||||
|
|
||||||
const enum llama_pooling_type pooling_type = llama_pooling_type(ctx);
|
const enum llama_pooling_type pooling_type = llama_pooling_type(ctx);
|
||||||
@ -148,7 +150,7 @@ int main(int argc, char ** argv) {
|
|||||||
// check if the last token is SEP
|
// check if the last token is SEP
|
||||||
// it should be automatically added by the tokenizer when 'tokenizer.ggml.add_eos_token' is set to 'true'
|
// it should be automatically added by the tokenizer when 'tokenizer.ggml.add_eos_token' is set to 'true'
|
||||||
for (auto & inp : inputs) {
|
for (auto & inp : inputs) {
|
||||||
if (inp.empty() || inp.back() != llama_token_sep(model)) {
|
if (inp.empty() || inp.back() != llama_vocab_sep(vocab)) {
|
||||||
LOG_WRN("%s: last token in the prompt is not SEP\n", __func__);
|
LOG_WRN("%s: last token in the prompt is not SEP\n", __func__);
|
||||||
LOG_WRN("%s: 'tokenizer.ggml.add_eos_token' should be set to 'true' in the GGUF header\n", __func__);
|
LOG_WRN("%s: 'tokenizer.ggml.add_eos_token' should be set to 'true' in the GGUF header\n", __func__);
|
||||||
}
|
}
|
||||||
@ -181,7 +183,7 @@ int main(int argc, char ** argv) {
|
|||||||
}
|
}
|
||||||
|
|
||||||
// allocate output
|
// allocate output
|
||||||
const int n_embd = llama_n_embd(model);
|
const int n_embd = llama_model_n_embd(model);
|
||||||
std::vector<float> embeddings(n_embd_count * n_embd, 0);
|
std::vector<float> embeddings(n_embd_count * n_embd, 0);
|
||||||
float * emb = embeddings.data();
|
float * emb = embeddings.data();
|
||||||
|
|
||||||
|
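The cvector and embedding hunks above also rename the model hyperparameter getters. A small illustrative sketch of the new model-scoped accessors (not taken from this commit):

```cpp
#include <cstdio>

#include "llama.h"

static void print_model_dims(const llama_model * model) {
    printf("n_ctx_train = %d\n", llama_model_n_ctx_train(model)); // was llama_n_ctx_train(model)
    printf("n_embd      = %d\n", llama_model_n_embd(model));      // was llama_n_embd(model)
    printf("n_layer     = %d\n", llama_model_n_layer(model));     // was llama_n_layer(model)
}
```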
@ -127,7 +127,10 @@ static bool ggml_debug(struct ggml_tensor * t, bool ask, void * user_data) {
|
|||||||
}
|
}
|
||||||
|
|
||||||
static bool run(llama_context * ctx, const common_params & params) {
|
static bool run(llama_context * ctx, const common_params & params) {
|
||||||
const bool add_bos = llama_add_bos_token(llama_get_model(ctx));
|
const llama_model * model = llama_get_model(ctx);
|
||||||
|
const llama_vocab * vocab = llama_model_get_vocab(model);
|
||||||
|
|
||||||
|
const bool add_bos = llama_vocab_get_add_bos(vocab);
|
||||||
|
|
||||||
std::vector<llama_token> tokens = common_tokenize(ctx, params.prompt, add_bos);
|
std::vector<llama_token> tokens = common_tokenize(ctx, params.prompt, add_bos);
|
||||||
|
|
||||||
|
@ -8,7 +8,6 @@
|
|||||||
#include <map>
|
#include <map>
|
||||||
#include <vector>
|
#include <vector>
|
||||||
#include <string>
|
#include <string>
|
||||||
#include <thread>
|
|
||||||
#include <fstream>
|
#include <fstream>
|
||||||
|
|
||||||
static bool g_verbose = false;
|
static bool g_verbose = false;
|
||||||
@ -130,7 +129,7 @@ struct lora_merge_ctx {
|
|||||||
|
|
||||||
lora_merge_ctx(
|
lora_merge_ctx(
|
||||||
std::string & base_fname,
|
std::string & base_fname,
|
||||||
std::vector<common_lora_adapter_info> & lora_files,
|
std::vector<common_adapter_lora_info> & lora_files,
|
||||||
std::string & outfile,
|
std::string & outfile,
|
||||||
int n_threads) : base_model(base_fname, 0), n_threads(n_threads), fout(outfile, std::ios::binary) {
|
int n_threads) : base_model(base_fname, 0), n_threads(n_threads), fout(outfile, std::ios::binary) {
|
||||||
fout.exceptions(std::ofstream::failbit); // fail fast on write errors
|
fout.exceptions(std::ofstream::failbit); // fail fast on write errors
|
||||||
|
@ -11,6 +11,7 @@ static std::vector<std::vector<float>> encode(llama_context * ctx, const std::ve
|
|||||||
std::vector<std::vector<float>> result;
|
std::vector<std::vector<float>> result;
|
||||||
|
|
||||||
const llama_model * model = llama_get_model(ctx);
|
const llama_model * model = llama_get_model(ctx);
|
||||||
|
const llama_vocab * vocab = llama_model_get_vocab(model);
|
||||||
|
|
||||||
llama_batch batch = llama_batch_init(llama_n_batch(ctx), 0, 1);
|
llama_batch batch = llama_batch_init(llama_n_batch(ctx), 0, 1);
|
||||||
|
|
||||||
@ -19,16 +20,16 @@ static std::vector<std::vector<float>> encode(llama_context * ctx, const std::ve
|
|||||||
|
|
||||||
const std::string input_string = instruction + sentences[i];
|
const std::string input_string = instruction + sentences[i];
|
||||||
|
|
||||||
std::vector<llama_token> inputs = common_tokenize(model, input_string, true, false);
|
std::vector<llama_token> inputs = common_tokenize(vocab, input_string, true, false);
|
||||||
|
|
||||||
const int32_t n_toks = inputs.size();
|
const int32_t n_toks = inputs.size();
|
||||||
|
|
||||||
// GritLM seems to have EOS = ""
|
 // GritLM seems to have EOS = ""
 // https://github.com/ContextualAI/gritlm/blob/92025b16534712b31b3c4aaaf069350e222bd5f8/gritlm/gritlm.py#L18
-// inputs.push_back(llama_token_eos(model));
+// inputs.push_back(llama_vocab_eos(vocab));

 // we want to ignore instruction tokens for mean pooling
-const int32_t n_inst = common_tokenize(model, instruction, true, false).size();
+const int32_t n_inst = common_tokenize(vocab, instruction, true, false).size();

 #ifdef GRIT_DEBUG
 // debug tokens - should be matching as referenced in the GritLM sample
@@ -52,7 +53,7 @@ static std::vector<std::vector<float>> encode(llama_context * ctx, const std::ve
 llama_decode(ctx, batch);

 // get embedding dimensions
-uint64_t n_embd = llama_n_embd(model);
+uint64_t n_embd = llama_model_n_embd(model);

 // allocate embedding output
 std::vector<float> emb_unorm(n_embd, 0.0f);
@@ -97,7 +98,9 @@ static std::string generate(llama_context * ctx, llama_sampler * smpl, const std
 std::string result;

 const llama_model * model = llama_get_model(ctx);
-llama_token eos_token = llama_token_eos(model);
+const llama_vocab * vocab = llama_model_get_vocab(model);
+
+llama_token eos_token = llama_vocab_eos(vocab);

 llama_kv_cache_clear(ctx);
 llama_set_embeddings(ctx, false);
@@ -105,7 +108,7 @@ static std::string generate(llama_context * ctx, llama_sampler * smpl, const std

 llama_batch bat = llama_batch_init(llama_n_batch(ctx), 0, 1);

-std::vector<llama_token> inputs = common_tokenize(model, prompt, false, true);
+std::vector<llama_token> inputs = common_tokenize(vocab, prompt, false, true);
 int32_t i_current_token = 0;

 while (true) {
@@ -168,7 +171,7 @@ int main(int argc, char * argv[]) {
 llama_model * model = llama_model_load_from_file(params.model.c_str(), mparams);

 // create generation context
-llama_context * ctx = llama_new_context_with_model(model, cparams);
+llama_context * ctx = llama_init_from_model(model, cparams);

 auto sparams = llama_sampler_chain_default_params();

@@ -197,7 +200,7 @@ int main(int argc, char * argv[]) {
 const std::vector<std::vector<float>> d_rep = encode(ctx, documents, gritlm_instruction(""));
 const std::vector<std::vector<float>> q_rep = encode(ctx, queries, gritlm_instruction(instruction));

-const int n_embd = llama_n_embd(model);
+const int n_embd = llama_model_n_embd(model);

 const float cosine_sim_q0_d0 = common_embd_similarity_cos(q_rep[0].data(), d_rep[0].data(), n_embd);
 const float cosine_sim_q0_d1 = common_embd_similarity_cos(q_rep[0].data(), d_rep[1].data(), n_embd);
@@ -7,7 +7,6 @@
 #include <cstdio>
 #include <cstring>
 #include <ctime>
-#include <sstream>
 #include <thread>
 #include <mutex>
 #include <vector>
@@ -40,7 +39,7 @@ public:
 void set_params(common_params params) { m_params = std::move(params); }
 bool collect_imatrix(struct ggml_tensor * t, bool ask, void * user_data);
 void save_imatrix(int ncall = -1) const;
-bool load_imatrix(const char * file_name);
+bool load_imatrix(const char * fname);
 private:
 std::unordered_map<std::string, Stats> m_stats;
 common_params m_params;
@@ -429,10 +428,13 @@ static void process_logits(
 }

 static bool compute_imatrix(llama_context * ctx, const common_params & params) {
-const bool add_bos = llama_add_bos_token(llama_get_model(ctx));
+const llama_model * model = llama_get_model(ctx);
+const llama_vocab * vocab = llama_model_get_vocab(model);
+
+const bool add_bos = llama_vocab_get_add_bos(vocab);
 const int n_ctx = llama_n_ctx(ctx);

-GGML_ASSERT(!llama_add_eos_token(llama_get_model(ctx)));
+GGML_ASSERT(!llama_vocab_get_add_eos(vocab));

 auto tim1 = std::chrono::high_resolution_clock::now();
 LOG_INF("%s: tokenizing the input ..\n", __func__);
@@ -468,7 +470,7 @@ static bool compute_imatrix(llama_context * ctx, const common_params & params) {
 const int n_chunk_max = tokens.size() / n_ctx;

 const int n_chunk = params.n_chunks < 0 ? n_chunk_max : std::min(params.n_chunks, n_chunk_max);
-const int n_vocab = llama_n_vocab(llama_get_model(ctx));
+const int n_vocab = llama_vocab_n_tokens(vocab);
 const int n_batch = params.n_batch;

 int count = 0;
@@ -508,7 +510,7 @@ static bool compute_imatrix(llama_context * ctx, const common_params & params) {

 // add BOS token for the first batch of each chunk
 if (add_bos && j == 0) {
-tokens[batch_start] = llama_token_bos(llama_get_model(ctx));
+tokens[batch_start] = llama_vocab_bos(vocab);
 }

 common_batch_clear(batch);
@@ -627,7 +629,7 @@ int main(int argc, char ** argv) {
 return 1;
 }

-const int n_ctx_train = llama_n_ctx_train(model);
+const int n_ctx_train = llama_model_n_ctx_train(model);
 if (params.n_ctx > n_ctx_train) {
 LOG_WRN("%s: model was trained on only %d context tokens (%d specified)\n",
 __func__, n_ctx_train, params.n_ctx);
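For reference, the hunks above all apply the same pattern: token and vocabulary metadata is no longer queried from the `llama_model` directly but from its `llama_vocab`, obtained via `llama_model_get_vocab`. A minimal sketch of that pattern, using only the `llama.h` functions shown in the diff (the helper itself is illustrative, not part of the commit):

```cpp
#include "llama.h"
#include <cstdio>

// print a few vocab properties the examples above now read through the vocab API
static void print_vocab_info(llama_context * ctx) {
    const llama_model * model = llama_get_model(ctx);
    const llama_vocab * vocab = llama_model_get_vocab(model);

    const bool        add_bos = llama_vocab_get_add_bos(vocab);
    const int32_t     n_vocab = llama_vocab_n_tokens(vocab);
    const llama_token bos     = llama_vocab_bos(vocab);
    const llama_token eos     = llama_vocab_eos(vocab);

    printf("n_vocab = %d, add_bos = %d, bos = %d, eos = %d, n_embd = %d\n",
           n_vocab, add_bos, bos, eos, llama_model_n_embd(model));
}
```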
@@ -139,7 +139,9 @@ int main(int argc, char ** argv) {
 return 1;
 }

-const int n_ctx_train = llama_n_ctx_train(model);
+const llama_vocab * vocab = llama_model_get_vocab(model);
+
+const int n_ctx_train = llama_model_n_ctx_train(model);
 const int n_ctx = llama_n_ctx(ctx);
 LOG_DBG("n_ctx: %d\n", n_ctx);

@@ -152,28 +154,28 @@ int main(int argc, char ** argv) {
 LOG_INF("\n");
 LOG_INF("%s\n", common_params_get_system_info(params).c_str());
 }
-const bool add_bos = llama_add_bos_token(model);
-GGML_ASSERT(!llama_add_eos_token(model));
+const bool add_bos = llama_vocab_get_add_bos(vocab);
+GGML_ASSERT(!llama_vocab_get_add_eos(vocab));

 std::vector<llama_token> embd_inp;
 std::vector<llama_token> embd_end;
 std::vector<llama_token> inp_pfx = common_tokenize(ctx, params.input_prefix, false);
 std::vector<llama_token> inp_sfx = common_tokenize(ctx, params.input_suffix, false);

-GGML_ASSERT(llama_token_fim_pre(model) >= 0);
-GGML_ASSERT(llama_token_fim_suf(model) >= 0);
+GGML_ASSERT(llama_vocab_fim_pre(vocab) >= 0);
+GGML_ASSERT(llama_vocab_fim_suf(vocab) >= 0);

-inp_pfx.insert(inp_pfx.begin(), llama_token_fim_pre(model));
-inp_sfx.insert(inp_sfx.begin(), llama_token_fim_suf(model));
+inp_pfx.insert(inp_pfx.begin(), llama_vocab_fim_pre(vocab));
+inp_sfx.insert(inp_sfx.begin(), llama_vocab_fim_suf(vocab));

 embd_inp = params.spm_infill ? inp_sfx : inp_pfx;
 embd_end = params.spm_infill ? inp_pfx : inp_sfx;
 if (add_bos) {
-embd_inp.insert(embd_inp.begin(), llama_token_bos(model));
+embd_inp.insert(embd_inp.begin(), llama_vocab_bos(vocab));
 }
 embd_inp.insert(embd_inp.end(), embd_end.begin(), embd_end.end());

-const llama_token middle_token = llama_token_fim_mid(model);
+const llama_token middle_token = llama_vocab_fim_mid(vocab);
 if (middle_token >= 0) {
 embd_inp.push_back(middle_token);
 }
@@ -185,7 +187,7 @@ int main(int argc, char ** argv) {

 // Should not run without any tokens
 if (embd_inp.empty()) {
-embd_inp.push_back(llama_token_bos(model));
+embd_inp.push_back(llama_vocab_bos(vocab));
 LOG_WRN("embd_inp was considered empty and bos was added: %s\n", string_from(ctx, embd_inp).c_str());
 }

@@ -420,10 +422,10 @@ int main(int argc, char ** argv) {
 // if not currently processing queued inputs;
 if ((int) embd_inp.size() <= n_consumed) {
 // deal with eot token in infill mode
-if ((common_sampler_last(smpl) == llama_token_eot(model) || is_interacting) && params.interactive){
+if ((common_sampler_last(smpl) == llama_vocab_eot(vocab) || is_interacting) && params.interactive){
 if (is_interacting && !params.interactive_first) {
 // print an eot token
-LOG("%s", common_token_to_piece(ctx, llama_token_eot(model)).c_str());
+LOG("%s", common_token_to_piece(ctx, llama_vocab_eot(vocab)).c_str());
 }
 LOG("\n");
 console::set_display(console::user_input);
@@ -463,13 +465,13 @@ int main(int argc, char ** argv) {
 std::vector<llama_token> inp_pfx = common_tokenize(ctx, params.input_prefix, false);
 std::vector<llama_token> inp_sfx = common_tokenize(ctx, params.input_suffix, false);

-inp_pfx.insert(inp_pfx.begin(), llama_token_fim_pre(model));
-inp_sfx.insert(inp_sfx.begin(), llama_token_fim_suf(model));
+inp_pfx.insert(inp_pfx.begin(), llama_vocab_fim_pre(vocab));
+inp_sfx.insert(inp_sfx.begin(), llama_vocab_fim_suf(vocab));

 embd_inp = params.spm_infill ? inp_sfx : inp_pfx;
 embd_end = params.spm_infill ? inp_pfx : inp_sfx;
 if (add_bos) {
-embd_inp.insert(embd_inp.begin(), llama_token_bos(model));
+embd_inp.insert(embd_inp.begin(), llama_vocab_bos(vocab));
 }
 embd_inp.insert(embd_inp.end(), embd_end.begin(), embd_end.end());

@@ -484,7 +486,7 @@ int main(int argc, char ** argv) {
 is_interacting = false;
 }
 // deal with end of generation tokens in interactive mode
-else if (llama_token_is_eog(model, common_sampler_last(smpl))) {
+else if (llama_vocab_is_eog(vocab, common_sampler_last(smpl))) {
 LOG_DBG("found EOS token\n");

 if (params.interactive) {
@@ -500,7 +502,7 @@ int main(int argc, char ** argv) {

 if (params.input_prefix_bos) {
 LOG_DBG("adding input prefix BOS token\n");
-embd_inp.push_back(llama_token_bos(model));
+embd_inp.push_back(llama_vocab_bos(vocab));
 }

 std::string buffer;
@@ -563,7 +565,7 @@ int main(int argc, char ** argv) {
 }

 // end of generation
-if (!embd.empty() && llama_token_is_eog(model, embd.back()) && !params.interactive) {
+if (!embd.empty() && llama_vocab_is_eog(vocab, embd.back()) && !params.interactive) {
 break;
 }

@@ -575,7 +577,7 @@ int main(int argc, char ** argv) {
 }
 }
 if (!params.interactive && n_remain <= 0) {
-LOG("%s", common_token_to_piece(ctx, llama_token_eot(model)).c_str());
+LOG("%s", common_token_to_piece(ctx, llama_vocab_eot(vocab)).c_str());
 }

 LOG("\n");
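The infill hunks above assemble a fill-in-the-middle prompt from the vocab's FIM special tokens. A condensed sketch of that assembly, assuming the same `llama_vocab_fim_*` and `llama_vocab_bos` declarations used in the diff (the helper name and signature are illustrative):

```cpp
#include "llama.h"
#include <vector>

// build <FIM_PRE>prefix / <FIM_SUF>suffix blocks, optionally SPM-ordered, then append <FIM_MID>
static std::vector<llama_token> build_fim_prompt(const llama_vocab * vocab,
                                                 std::vector<llama_token> prefix,
                                                 std::vector<llama_token> suffix,
                                                 bool spm_infill) {
    prefix.insert(prefix.begin(), llama_vocab_fim_pre(vocab));
    suffix.insert(suffix.begin(), llama_vocab_fim_suf(vocab));

    // SPM-style infill puts the suffix block first
    std::vector<llama_token>         inp = spm_infill ? suffix : prefix;
    const std::vector<llama_token> & end = spm_infill ? prefix : suffix;

    if (llama_vocab_get_add_bos(vocab)) {
        inp.insert(inp.begin(), llama_vocab_bos(vocab));
    }
    inp.insert(inp.end(), end.begin(), end.end());

    const llama_token middle = llama_vocab_fim_mid(vocab);
    if (middle >= 0) {
        inp.push_back(middle);
    }
    return inp;
}
```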
@@ -1401,7 +1401,8 @@ static void test_prompt(llama_context * ctx, int n_prompt, int n_batch, int n_th
 llama_set_n_threads(ctx, n_threads, n_threads);

 const llama_model * model = llama_get_model(ctx);
-const int32_t n_vocab = llama_n_vocab(model);
+const llama_vocab * vocab = llama_model_get_vocab(model);
+const int32_t n_vocab = llama_vocab_n_tokens(vocab);

 std::vector<llama_token> tokens(n_batch);

@@ -1409,7 +1410,7 @@ static void test_prompt(llama_context * ctx, int n_prompt, int n_batch, int n_th

 while (n_processed < n_prompt) {
 int n_tokens = std::min(n_prompt - n_processed, n_batch);
-tokens[0] = n_processed == 0 && llama_add_bos_token(model) ? llama_token_bos(model) : std::rand() % n_vocab;
+tokens[0] = n_processed == 0 && llama_vocab_get_add_bos(vocab) ? llama_vocab_bos(vocab) : std::rand() % n_vocab;
 for (int i = 1; i < n_tokens; i++) {
 tokens[i] = std::rand() % n_vocab;
 }
@@ -1424,9 +1425,10 @@ static void test_gen(llama_context * ctx, int n_gen, int n_threads) {
 llama_set_n_threads(ctx, n_threads, n_threads);

 const llama_model * model = llama_get_model(ctx);
-const int32_t n_vocab = llama_n_vocab(model);
+const llama_vocab * vocab = llama_model_get_vocab(model);
+const int32_t n_vocab = llama_vocab_n_tokens(vocab);

-llama_token token = llama_add_bos_token(model) ? llama_token_bos(model) : std::rand() % n_vocab;
+llama_token token = llama_vocab_get_add_bos(vocab) ? llama_vocab_bos(vocab) : std::rand() % n_vocab;

 for (int i = 0; i < n_gen; i++) {
 llama_decode(ctx, llama_batch_get_one(&token, 1));
@@ -1537,7 +1539,7 @@ int main(int argc, char ** argv) {
 prev_inst = &inst;
 }

-llama_context * ctx = llama_new_context_with_model(lmodel, inst.to_llama_cparams());
+llama_context * ctx = llama_init_from_model(lmodel, inst.to_llama_cparams());
 if (ctx == NULL) {
 fprintf(stderr, "%s: error: failed to create context with model '%s'\n", __func__, inst.model.c_str());
 llama_model_free(lmodel);
@@ -87,7 +87,7 @@ Java_android_llama_cpp_LLamaAndroid_load_1model(JNIEnv *env, jobject, jstring fi
 auto path_to_model = env->GetStringUTFChars(filename, 0);
 LOGi("Loading model from %s", path_to_model);

-auto model = llama_load_model_from_file(path_to_model, model_params);
+auto model = llama_model_load_from_file(path_to_model, model_params);
 env->ReleaseStringUTFChars(filename, path_to_model);

 if (!model) {
@@ -102,7 +102,7 @@ Java_android_llama_cpp_LLamaAndroid_load_1model(JNIEnv *env, jobject, jstring fi
 extern "C"
 JNIEXPORT void JNICALL
 Java_android_llama_cpp_LLamaAndroid_free_1model(JNIEnv *, jobject, jlong model) {
-llama_free_model(reinterpret_cast<llama_model *>(model));
+llama_model_free(reinterpret_cast<llama_model *>(model));
 }

 extern "C"
@@ -405,6 +405,7 @@ Java_android_llama_cpp_LLamaAndroid_completion_1loop(
 const auto batch = reinterpret_cast<llama_batch *>(batch_pointer);
 const auto sampler = reinterpret_cast<llama_sampler *>(sampler_pointer);
 const auto model = llama_get_model(context);
+const auto vocab = llama_model_get_vocab(model);

 if (!la_int_var) la_int_var = env->GetObjectClass(intvar_ncur);
 if (!la_int_var_value) la_int_var_value = env->GetMethodID(la_int_var, "getValue", "()I");
@@ -414,7 +415,7 @@ Java_android_llama_cpp_LLamaAndroid_completion_1loop(
 const auto new_token_id = llama_sampler_sample(sampler, context, -1);

 const auto n_cur = env->CallIntMethod(intvar_ncur, la_int_var_value);
-if (llama_token_is_eog(model, new_token_id) || n_cur == n_len) {
+if (llama_vocab_is_eog(vocab, new_token_id) || n_cur == n_len) {
 return nullptr;
 }

@@ -52,8 +52,8 @@ actor LlamaContext {
 deinit {
 llama_sampler_free(sampling)
 llama_batch_free(batch)
+llama_model_free(model)
 llama_free(context)
-llama_free_model(model)
 llama_backend_free()
 }

@@ -65,7 +65,7 @@ actor LlamaContext {
 model_params.n_gpu_layers = 0
 print("Running on simulator, force use n_gpu_layers = 0")
 #endif
-let model = llama_load_model_from_file(path, model_params)
+let model = llama_model_load_from_file(path, model_params)
 guard let model else {
 print("Could not load model at \(path)")
 throw LlamaError.couldNotInitializeContext
@@ -151,7 +151,7 @@ actor LlamaContext {

 new_token_id = llama_sampler_sample(sampling, context, batch.n_tokens - 1)

-if llama_token_is_eog(model, new_token_id) || n_cur == n_len {
+if llama_vocab_is_eog(model, new_token_id) || n_cur == n_len {
 print("\n")
 is_done = true
 let new_token_str = String(cString: temporary_invalid_cchars + [0])
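The bench, Android and Swift hunks above also pick up the renamed lifecycle entry points: `llama_model_load_from_file`, `llama_init_from_model` and `llama_model_free`. A rough C++ sketch of that lifecycle, with error handling reduced to a minimum and the model path left as a placeholder argument:

```cpp
#include "llama.h"
#include <cstdio>

// load a model, create a context, and tear both down with the renamed API
static int run(const char * model_path) {
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_model_load_from_file(model_path, mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    llama_context * ctx = llama_init_from_model(model, cparams); // was llama_new_context_with_model
    if (ctx == NULL) {
        llama_model_free(model);                                  // was llama_free_model
        return 1;
    }

    // ... decode / sample ...

    llama_free(ctx);
    llama_model_free(model);
    return 0;
}
```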
@@ -47,8 +47,12 @@ static const char * sample(struct common_sampler * smpl,
 int * n_past) {
 const llama_token id = common_sampler_sample(smpl, ctx_llama, -1);
 common_sampler_accept(smpl, id, true);
+
+const llama_model * model = llama_get_model(ctx_llama);
+const llama_vocab * vocab = llama_model_get_vocab(model);
+
 static std::string ret;
-if (llama_token_is_eog(llama_get_model(ctx_llama), id)) {
+if (llama_vocab_is_eog(vocab, id)) {
 ret = "</s>";
 } else {
 ret = common_token_to_piece(ctx_llama, id);
@@ -239,11 +243,10 @@ static struct llava_context * llava_init_context(common_params * params, llama_m

 auto ctx_clip = clip_model_load(clip_path, /*verbosity=*/ 1);

-
 llama_context_params ctx_params = common_context_params_to_llama(*params);
 ctx_params.n_ctx = params->n_ctx < 2048 ? 2048 : params->n_ctx; // we need a longer context size to process image embeddings

-llama_context * ctx_llama = llama_new_context_with_model(model, ctx_params);
+llama_context * ctx_llama = llama_init_from_model(model, ctx_params);

 if (ctx_llama == NULL) {
 LOG_ERR("%s: failed to create the llama_context\n" , __func__);
@@ -384,7 +384,7 @@ static bool encode_image_with_clip(clip_ctx * ctx_clip, int n_threads, const cli

 bool llava_validate_embed_size(const llama_context * ctx_llama, const clip_ctx * ctx_clip) {
 // make sure that the correct mmproj was used, i.e., compare apples to apples
-int n_llama_embd = llama_n_embd(llama_get_model(ctx_llama));
+int n_llama_embd = llama_model_n_embd(llama_get_model(ctx_llama));
 auto n_image_embd = clip_n_mmproj_embd(ctx_clip);
 if (n_image_embd != n_llama_embd) {
 LOG_ERR("%s: embedding dim of the multimodal projector (%d) is not equal to that of LLaMA (%d). Make sure that you use the correct mmproj file.\n", __func__, n_image_embd, n_llama_embd);
@@ -456,7 +456,7 @@ struct llava_embd_batch {
 };

 bool llava_eval_image_embed(llama_context * ctx_llama, const struct llava_image_embed * image_embed, int n_batch, int * n_past) {
-int n_embd = llama_n_embd(llama_get_model(ctx_llama));
+int n_embd = llama_model_n_embd(llama_get_model(ctx_llama));

 for (int i = 0; i < image_embed->n_image_pos; i += n_batch) {
 int n_eval = image_embed->n_image_pos - i;
@@ -54,7 +54,7 @@ static struct llava_context * llava_init_context(common_params * params, llama_m
 ctx_params.n_ctx = params->n_ctx;
 }

-llama_context * ctx_llama = llama_new_context_with_model(model, ctx_params);
+llama_context * ctx_llama = llama_init_from_model(model, ctx_params);

 if (ctx_llama == NULL) {
 LOG_ERR("%s: failed to create the llama_context\n" , __func__);
@@ -167,8 +167,12 @@ static const char * sample(struct common_sampler * smpl,
 int * n_past) {
 const llama_token id = common_sampler_sample(smpl, ctx_llama, -1);
 common_sampler_accept(smpl, id, true);
+
+const llama_model * model = llama_get_model(ctx_llama);
+const llama_vocab * vocab = llama_model_get_vocab(model);
+
 static std::string ret;
-if (llama_token_is_eog(llama_get_model(ctx_llama), id)) {
+if (llama_vocab_is_eog(vocab, id)) {
 ret = "</s>";
 } else {
 ret = common_token_to_piece(ctx_llama, id);
@@ -27,7 +27,7 @@

 static bool qwen2vl_eval_image_embed(llama_context * ctx_llama, const struct llava_image_embed * image_embed,
 int n_batch, int * n_past, int * st_pos_id, struct clip_image_size * image_size) {
-int n_embd = llama_n_embd(llama_get_model(ctx_llama));
+int n_embd = llama_model_n_embd(llama_get_model(ctx_llama));
 const int patch_size = 14 * 2;
 const int ph = image_size->height / patch_size + (image_size->height % patch_size > 0);
 const int pw = image_size->width / patch_size + (image_size->width % patch_size > 0);
@@ -132,8 +132,12 @@ static const char * sample(struct common_sampler * smpl,
 int * n_past, int * st_pos_id) {
 const llama_token id = common_sampler_sample(smpl, ctx_llama, -1);
 common_sampler_accept(smpl, id, true);
+
+const llama_model * model = llama_get_model(ctx_llama);
+const llama_vocab * vocab = llama_model_get_vocab(model);
+
 static std::string ret;
-if (llama_token_is_eog(llama_get_model(ctx_llama), id)) {
+if (llama_vocab_is_eog(vocab, id)) {
 ret = "</s>";
 } else {
 ret = common_token_to_piece(ctx_llama, id);
@@ -328,11 +332,10 @@ static struct llava_context * llava_init_context(common_params * params, llama_m

 auto ctx_clip = clip_model_load(clip_path, /*verbosity=*/ 1);

-
 llama_context_params ctx_params = common_context_params_to_llama(*params);
 ctx_params.n_ctx = params->n_ctx < 2048 ? 2048 : params->n_ctx; // we need a longer context size to process image embeddings

-llama_context * ctx_llama = llama_new_context_with_model(model, ctx_params);
+llama_context * ctx_llama = llama_init_from_model(model, ctx_params);

 if (ctx_llama == NULL) {
 LOG_ERR("%s: failed to create the llama_context\n" , __func__);
@@ -481,7 +484,7 @@ static void debug_test_mrope_2d() {
 }

 static void debug_dump_img_embed(struct llava_context * ctx_llava) {
-int n_embd = llama_n_embd(llama_get_model(ctx_llava->ctx_llama));
+int n_embd = llama_model_n_embd(llama_get_model(ctx_llava->ctx_llama));
 int ne = n_embd * 4;
 float vals[56 * 56 * 3];
 // float embd[ne];
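The three `sample()` helpers above change in the same way: the end-of-generation check moves from `llama_token_is_eog(model, ...)` to `llama_vocab_is_eog(vocab, ...)`. A condensed sketch of such a helper, assuming the `common_sampler_*` and `common_token_to_piece` utilities from the repo's common library (the function itself is illustrative only):

```cpp
#include "common.h"
#include "sampling.h"
#include "llama.h"
#include <string>

// sample one token, accept it, and return its text (or "</s>" on end of generation)
static std::string sample_piece(common_sampler * smpl, llama_context * ctx) {
    const llama_token id = common_sampler_sample(smpl, ctx, -1);
    common_sampler_accept(smpl, id, true);

    const llama_model * model = llama_get_model(ctx);
    const llama_vocab * vocab = llama_model_get_vocab(model);

    if (llama_vocab_is_eog(vocab, id)) {
        return "</s>";
    }
    return common_token_to_piece(ctx, id);
}
```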
@@ -61,6 +61,8 @@ int main(int argc, char ** argv) {
 llama_model * model = llama_init.model.get();
 llama_context * ctx = llama_init.context.get();

+const llama_vocab * vocab = llama_model_get_vocab(model);
+
 // Tokenize the prompt
 std::vector<llama_token> inp;
 std::vector<llama_token> all;
@@ -147,7 +149,7 @@ int main(int argc, char ** argv) {
 }

 // here we keep adding new n-grams as we go
-ngram_container ngrams_observed(llama_n_vocab(model), N, G);
+ngram_container ngrams_observed(llama_vocab_n_tokens(vocab), N, G);

 // debug
 struct llama_kv_cache_view kvc_view = llama_kv_cache_view_init(ctx, W + G + 1);
@@ -297,7 +299,7 @@ int main(int argc, char ** argv) {
 }
 fflush(stdout);

-if (llama_token_is_eog(model, id)) {
+if (llama_vocab_is_eog(vocab, id)) {
 has_eos = true;
 }

@@ -36,6 +36,8 @@ int main(int argc, char ** argv){
 llama_model * model = llama_init.model.get();
 llama_context * ctx = llama_init.context.get();

+const llama_vocab * vocab = llama_model_get_vocab(model);
+
 // tokenize the prompt
 std::vector<llama_token> inp;
 inp = common_tokenize(ctx, params.prompt, true, true);
@@ -136,7 +138,7 @@ int main(int argc, char ** argv){
 LOG("%s", token_str.c_str());
 }

-if (llama_token_is_eog(model, id)) {
+if (llama_vocab_is_eog(vocab, id)) {
 has_eos = true;
 }

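Both lookahead and lookup stop generation when an end-of-generation token is sampled, now checked against the vocab. A small sketch of that loop shape, using only `llama.h` calls shown in the diff (the greedy single-token loop is illustrative, not the examples' actual speculative logic):

```cpp
#include "llama.h"

// decode one token at a time and stop on the first end-of-generation token
static void decode_until_eog(llama_context * ctx, llama_sampler * smpl, llama_token token, int n_max) {
    const llama_vocab * vocab = llama_model_get_vocab(llama_get_model(ctx));

    for (int i = 0; i < n_max; ++i) {
        llama_decode(ctx, llama_batch_get_one(&token, 1));
        token = llama_sampler_sample(smpl, ctx, -1);
        if (llama_vocab_is_eog(vocab, token)) {
            break; // corresponds to has_eos = true above
        }
    }
}
```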
@@ -5,7 +5,6 @@
 #include "sampling.h"
 #include "llama.h"

-#include <cassert>
 #include <cstdio>
 #include <cstring>
 #include <ctime>
@@ -31,6 +30,8 @@
 #pragma warning(disable: 4244 4267) // possible loss of data
 #endif

+static const char * DEFAULT_SYSTEM_MESSAGE = "You are a helpful assistant";
+
 static llama_context ** g_ctx;
 static llama_model ** g_model;
 static common_sampler ** g_smpl;
@@ -163,6 +164,8 @@ int main(int argc, char ** argv) {
 return 1;
 }

+const llama_vocab * vocab = llama_model_get_vocab(model);
+
 LOG_INF("%s: llama threadpool init, n_threads = %d\n", __func__, (int) params.cpuparams.n_threads);

 auto * reg = ggml_backend_dev_backend_reg(ggml_backend_dev_by_type(GGML_BACKEND_DEVICE_TYPE_CPU));
@@ -196,15 +199,31 @@ int main(int argc, char ** argv) {

 llama_attach_threadpool(ctx, threadpool, threadpool_batch);

-const int n_ctx_train = llama_n_ctx_train(model);
+const int n_ctx_train = llama_model_n_ctx_train(model);
 const int n_ctx = llama_n_ctx(ctx);

 if (n_ctx > n_ctx_train) {
 LOG_WRN("%s: model was trained on only %d context tokens (%d specified)\n", __func__, n_ctx_train, n_ctx);
 }

+// auto enable conversation mode if chat template is available
+const bool has_chat_template = !common_get_builtin_chat_template(model).empty() || !params.chat_template.empty();
+if (params.conversation_mode == COMMON_CONVERSATION_MODE_AUTO) {
+if (has_chat_template) {
+LOG_INF("%s: chat template is available, enabling conversation mode (disable it with -no-cnv)\n", __func__);
+params.conversation_mode = COMMON_CONVERSATION_MODE_ENABLED;
+} else {
+params.conversation_mode = COMMON_CONVERSATION_MODE_DISABLED;
+}
+}
+
+// in case user force-activate conversation mode (via -cnv) without proper chat template, we show a warning
+if (params.conversation_mode && !has_chat_template) {
+LOG_WRN("%s: chat template is not available or is not supported. This may cause the model to output suboptimal responses\n", __func__);
+}
+
 // print chat template example in conversation mode
-if (params.conversation) {
+if (params.conversation_mode) {
 if (params.enable_chat_template) {
 LOG_INF("%s: chat template example:\n%s\n", __func__, common_chat_format_example(model, params.chat_template).c_str());
 } else {
@@ -241,9 +260,9 @@ int main(int argc, char ** argv) {
 }
 }

-const bool add_bos = llama_add_bos_token(model);
+const bool add_bos = llama_vocab_get_add_bos(vocab);
 if (!llama_model_has_encoder(model)) {
-GGML_ASSERT(!llama_add_eos_token(model));
+GGML_ASSERT(!llama_vocab_get_add_eos(vocab));
 }

 LOG_DBG("n_ctx: %d, add_bos: %d\n", n_ctx, add_bos);
@@ -251,8 +270,10 @@ int main(int argc, char ** argv) {
 std::vector<llama_token> embd_inp;

 {
-auto prompt = (params.conversation && params.enable_chat_template && !params.prompt.empty())
-? chat_add_and_format(model, chat_msgs, "system", params.prompt) // format the system prompt in conversation mode
+auto prompt = (params.conversation_mode && params.enable_chat_template)
+// format the system prompt in conversation mode (fallback to default if empty)
+? chat_add_and_format(model, chat_msgs, "system", params.prompt.empty() ? DEFAULT_SYSTEM_MESSAGE : params.prompt)
+// otherwise use the prompt as is
 : params.prompt;
 if (params.interactive_first || !params.prompt.empty() || session_tokens.empty()) {
 LOG_DBG("tokenize the prompt\n");
@@ -269,7 +290,7 @@ int main(int argc, char ** argv) {
 // Should not run without any tokens
 if (embd_inp.empty()) {
 if (add_bos) {
-embd_inp.push_back(llama_token_bos(model));
+embd_inp.push_back(llama_vocab_bos(vocab));
 LOG_WRN("embd_inp was considered empty and bos was added: %s\n", string_from(ctx, embd_inp).c_str());
 } else {
 LOG_ERR("input is empty\n");
@@ -326,7 +347,7 @@ int main(int argc, char ** argv) {
 params.n_keep += add_bos; // always keep the BOS token
 }

-if (params.conversation) {
+if (params.conversation_mode) {
 params.interactive_first = true;
 }

@@ -450,7 +471,11 @@ int main(int argc, char ** argv) {
 #if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__)) || defined (_WIN32)
 LOG_INF( " - Press Ctrl+C to interject at any time.\n");
 #endif
-LOG_INF( "%s\n", control_message);
+LOG_INF( "%s", control_message);
+if (params.conversation_mode && params.enable_chat_template && params.prompt.empty()) {
+LOG_INF( " - Using default system message. To change it, set a different value via -p PROMPT or -f FILE argument.\n");
+}
+LOG_INF("\n");

 is_interacting = params.interactive_first;
 }
@@ -495,7 +520,7 @@ int main(int argc, char ** argv) {

 llama_token decoder_start_token_id = llama_model_decoder_start_token(model);
 if (decoder_start_token_id == LLAMA_TOKEN_NULL) {
-decoder_start_token_id = llama_token_bos(model);
+decoder_start_token_id = llama_vocab_bos(vocab);
 }

 embd_inp.clear();
@@ -742,7 +767,7 @@ int main(int argc, char ** argv) {
 }

 // deal with end of generation tokens in interactive mode
-if (llama_token_is_eog(model, common_sampler_last(smpl))) {
+if (llama_vocab_is_eog(vocab, common_sampler_last(smpl))) {
 LOG_DBG("found an EOG token\n");

 if (params.interactive) {
@@ -762,7 +787,7 @@ int main(int argc, char ** argv) {
 }

 // if current token is not EOG, we add it to current assistant message
-if (params.conversation) {
+if (params.conversation_mode) {
 const auto id = common_sampler_last(smpl);
 assistant_ss << common_token_to_piece(ctx, id, false);
 }
@@ -770,17 +795,17 @@ int main(int argc, char ** argv) {
 if (n_past > 0 && is_interacting) {
 LOG_DBG("waiting for user input\n");

-if (params.conversation) {
+if (params.conversation_mode) {
 LOG("\n> ");
 }

 if (params.input_prefix_bos) {
 LOG_DBG("adding input prefix BOS token\n");
-embd_inp.push_back(llama_token_bos(model));
+embd_inp.push_back(llama_vocab_bos(vocab));
 }

 std::string buffer;
-if (!params.input_prefix.empty() && !params.conversation) {
+if (!params.input_prefix.empty() && !params.conversation_mode) {
 LOG_DBG("appending input prefix: '%s'\n", params.input_prefix.c_str());
 LOG("%s", params.input_prefix.c_str());
 }
@@ -804,7 +829,7 @@ int main(int argc, char ** argv) {
 // Entering a empty line lets the user pass control back
 if (buffer.length() > 1) {
 // append input suffix if any
-if (!params.input_suffix.empty() && !params.conversation) {
+if (!params.input_suffix.empty() && !params.conversation_mode) {
 LOG_DBG("appending input suffix: '%s'\n", params.input_suffix.c_str());
 LOG("%s", params.input_suffix.c_str());
 }
@@ -817,7 +842,7 @@ int main(int argc, char ** argv) {
 string_process_escapes(buffer);
 }

-bool format_chat = params.conversation && params.enable_chat_template;
+bool format_chat = params.conversation_mode && params.enable_chat_template;
 std::string user_inp = format_chat
 ? chat_add_and_format(model, chat_msgs, "user", std::move(buffer))
 : std::move(buffer);
@@ -830,8 +855,8 @@ int main(int argc, char ** argv) {

 // if user stop generation mid-way, we must add EOT to finish model's last response
 if (need_insert_eot && format_chat) {
-llama_token eot = llama_token_eot(model);
-embd_inp.push_back(eot == LLAMA_TOKEN_NULL ? llama_token_eos(model) : eot);
+llama_token eot = llama_vocab_eot(vocab);
+embd_inp.push_back(eot == LLAMA_TOKEN_NULL ? llama_vocab_eos(vocab) : eot);
 need_insert_eot = false;
 }

@@ -866,7 +891,7 @@ int main(int argc, char ** argv) {
 }

 // end of generation
-if (!embd.empty() && llama_token_is_eog(model, embd.back()) && !(params.interactive)) {
+if (!embd.empty() && llama_vocab_is_eog(vocab, embd.back()) && !(params.interactive)) {
 LOG(" [end of text]\n");
 break;
 }
@@ -135,6 +135,8 @@ int main(int argc, char ** argv) {
 llama_model * model = llama_init.model.get();
 llama_context * ctx = llama_init.context.get();

+const llama_vocab * vocab = llama_model_get_vocab(model);
+
 // load the prompts from an external file if there are any
 if (params.prompt.empty()) {
 LOG_INF("\033[32mNo new questions so proceed with build-in defaults.\033[0m\n");
@@ -358,7 +360,7 @@ int main(int argc, char ** argv) {
 // client.id, client.seq_id, id, client.n_decoded, client.i_batch, token_str.c_str());

 if (client.n_decoded > 2 &&
-(llama_token_is_eog(model, id) ||
+(llama_vocab_is_eog(vocab, id) ||
 (params.n_predict > 0 && client.n_decoded + client.n_prompt >= params.n_predict) ||
 client.response.find("User:") != std::string::npos ||
 client.response.find('\n') != std::string::npos)) {
@@ -70,15 +70,17 @@ int main(int argc, char ** argv) {
 return 1;
 }

+const llama_vocab * vocab = llama_model_get_vocab(model);
+
 // initialize the context

 llama_context_params ctx_params = common_context_params_to_llama(params);

-ctx_params.n_ctx = llama_n_ctx_train(model)*n_grp + n_keep;
+ctx_params.n_ctx = llama_model_n_ctx_train(model)*n_grp + n_keep;

 GGML_ASSERT(ctx_params.n_batch % n_grp == 0 && "n_batch must be divisible by n_grp");

-llama_context * ctx = llama_new_context_with_model(model, ctx_params);
+llama_context * ctx = llama_init_from_model(model, ctx_params);
 if (ctx == NULL) {
 LOG_ERR("%s: failed to create the llama_context\n" , __func__);
 return 1;
@@ -223,7 +225,7 @@ int main(int argc, char ** argv) {
 const llama_token new_token_id = llama_sampler_sample(smpl, ctx, batch.n_tokens - 1);

 // is it an end of generation?
-if (llama_token_is_eog(model, new_token_id) || n_cur == n_len) {
+if (llama_vocab_is_eog(vocab, new_token_id) || n_cur == n_len) {
 LOG("\n");

 break;
@@ -296,8 +296,11 @@ static results_perplexity perplexity_v2(llama_context * ctx, const common_params
 // Output: `perplexity: 13.5106 [114/114]`
 // BOS tokens will be added for each chunk before eval

-const bool add_bos = llama_add_bos_token(llama_get_model(ctx));
-GGML_ASSERT(!llama_add_eos_token(llama_get_model(ctx)));
+const llama_model * model = llama_get_model(ctx);
+const llama_vocab * vocab = llama_model_get_vocab(model);
+
+const bool add_bos = llama_vocab_get_add_bos(vocab);
+GGML_ASSERT(!llama_vocab_get_add_eos(vocab));

 LOG_INF("%s: tokenizing the input ..\n", __func__);

@@ -338,7 +341,7 @@ static results_perplexity perplexity_v2(llama_context * ctx, const common_params
 const int n_chunk = params.n_chunks < 0 ? n_chunk_max : std::min(params.n_chunks, n_chunk_max);
 const int n_batch = params.n_batch;

-const int n_vocab = llama_n_vocab(llama_get_model(ctx));
+const int n_vocab = llama_vocab_n_tokens(vocab);

 int count = 0;
 double nll = 0.0;
@@ -382,7 +385,7 @@ static results_perplexity perplexity_v2(llama_context * ctx, const common_params

 // add BOS token for the first batch of each chunk
 if (add_bos && j == 0) {
-tokens[batch_start] = llama_token_bos(llama_get_model(ctx));
+tokens[batch_start] = llama_vocab_bos(vocab);
 }

 const auto * batch_logits = llama_get_logits(ctx);
@@ -444,8 +447,11 @@ static results_perplexity perplexity(llama_context * ctx, const common_params &
 // Output: `perplexity: 13.5106 [114/114]`
 // BOS tokens will be added for each chunk before eval

-const bool add_bos = llama_add_bos_token(llama_get_model(ctx));
-GGML_ASSERT(!llama_add_eos_token(llama_get_model(ctx)));
+const llama_model * model = llama_get_model(ctx);
+const llama_vocab * vocab = llama_model_get_vocab(model);
+
+const bool add_bos = llama_vocab_get_add_bos(vocab);
+GGML_ASSERT(!llama_vocab_get_add_eos(vocab));

 std::ofstream logits_stream;
 if (!params.logits_file.empty()) {
@@ -485,7 +491,7 @@ static results_perplexity perplexity(llama_context * ctx, const common_params &
 const int n_chunk = params.n_chunks < 0 ? n_chunk_max : std::min(params.n_chunks, n_chunk_max);
 const int n_batch = params.n_batch;

-const int n_vocab = llama_n_vocab(llama_get_model(ctx));
+const int n_vocab = llama_vocab_n_tokens(vocab);

 int count = 0;
 double nll = 0.0;
@@ -557,7 +563,7 @@ static results_perplexity perplexity(llama_context * ctx, const common_params &

 // add BOS token for the first batch of each chunk
 if (add_bos && j == 0) {
-tokens[seq_start] = llama_token_bos(llama_get_model(ctx));
+tokens[seq_start] = llama_vocab_bos(vocab);
 }

 for (int k = 0; k < batch_size; ++k) {
@@ -732,6 +738,9 @@ static void compute_logprobs(const float * batch_logits, int n_vocab, std::vecto
 }

 static void hellaswag_score(llama_context * ctx, const common_params & params) {
+const llama_model * model = llama_get_model(ctx);
+const llama_vocab * vocab = llama_model_get_vocab(model);
+
 // Calculates hellaswag score (acc_norm) from prompt
 //
 // Data extracted from the HellaSwag validation dataset (MIT license) https://github.com/rowanz/hellaswag/blob/master/data/hellaswag_val.jsonl
@@ -765,7 +774,7 @@ static void hellaswag_score(llama_context * ctx, const common_params & params) {
 size_t hs_task_count = prompt_lines.size()/6;
 LOG_INF("%s : loaded %zu tasks from prompt.\n", __func__, hs_task_count);

-const bool is_spm = llama_vocab_type(llama_get_model(ctx)) == LLAMA_VOCAB_TYPE_SPM;
+const bool is_spm = llama_vocab_type(vocab) == LLAMA_VOCAB_TYPE_SPM;
 LOG_INF("================================= is_spm = %d\n", is_spm);

 // The tasks should be randomized so the score stabilizes quickly.
@@ -848,7 +857,7 @@ static void hellaswag_score(llama_context * ctx, const common_params & params) {
 const int n_ctx = llama_n_ctx(ctx);
 const int n_batch = params.n_batch;

-const int n_vocab = llama_n_vocab(llama_get_model(ctx));
+const int n_vocab = llama_vocab_n_tokens(vocab);

 const int max_tasks_per_batch = 32;
 const int max_seq = std::min(4*max_tasks_per_batch, (int) llama_n_seq_max(ctx));
@@ -1072,6 +1081,8 @@ static std::vector<winogrande_entry> load_winogrande_from_csv(const std::string
 *
 */
 static void winogrande_score(llama_context * ctx, const common_params & params) {
+const llama_model * model = llama_get_model(ctx);
+const llama_vocab * vocab = llama_model_get_vocab(model);

 constexpr int k_min_trailing_ctx = 3;

@@ -1130,7 +1141,7 @@ static void winogrande_score(llama_context * ctx, const common_params & params)
 const int n_ctx = llama_n_ctx(ctx);
 const int n_batch = params.n_batch;

-const int n_vocab = llama_n_vocab(llama_get_model(ctx));
+const int n_vocab = llama_vocab_n_tokens(vocab);

 const int max_tasks_per_batch = 128;
 const int max_seq = std::min(2*max_tasks_per_batch, (int) llama_n_seq_max(ctx));
@@ -1374,6 +1385,8 @@ static bool multiple_choice_prepare_one_task(llama_context * ctx, multiple_choic
 // https://huggingface.co/datasets/truthful_qa
 //
 static void multiple_choice_score(llama_context * ctx, const common_params & params) {
+const llama_model * model = llama_get_model(ctx);
+const llama_vocab * vocab = llama_model_get_vocab(model);

 std::istringstream strstream(params.prompt);
 uint32_t n_task;
@@ -1482,7 +1495,7 @@ static void multiple_choice_score(llama_context * ctx, const common_params & par
 const int n_ctx = llama_n_ctx(ctx);
 const int n_batch = params.n_batch;

-const int n_vocab = llama_n_vocab(llama_get_model(ctx));
+const int n_vocab = llama_vocab_n_tokens(vocab);

 const int max_tasks_per_batch = 32;
 const int max_seq = std::min(4*max_tasks_per_batch, (int) llama_n_seq_max(ctx));
@@ -1655,6 +1668,9 @@ static void multiple_choice_score(llama_context * ctx, const common_params & par
 }

 static void kl_divergence(llama_context * ctx, const common_params & params) {
+const llama_model * model = llama_get_model(ctx);
+const llama_vocab * vocab = llama_model_get_vocab(model);
+
 if (params.logits_file.empty()) {
 LOG_ERR("%s: you must provide a name of a file containing the log probabilities of the base model\n", __func__);
 return;
@@ -1688,8 +1704,8 @@ static void kl_divergence(llama_context * ctx, const common_params & params) {
 LOG_ERR("%s: failed reading n_vocab, n_chunk from %s\n", __func__, params.logits_file.c_str());
 return;
 }
-if (n_vocab != llama_n_vocab(llama_get_model(ctx))) {
-LOG_ERR("%s: inconsistent vocabulary (%d vs %d)\n", __func__, n_vocab, llama_n_vocab(llama_get_model(ctx)));
+if (n_vocab != llama_vocab_n_tokens(vocab)) {
+LOG_ERR("%s: inconsistent vocabulary (%d vs %d)\n", __func__, n_vocab, llama_vocab_n_tokens(vocab));
 }

 std::vector<llama_token> tokens(size_t(n_ctx) * n_chunk);
@@ -1701,8 +1717,8 @@ static void kl_divergence(llama_context * ctx, const common_params & params) {
 const int n_batch = params.n_batch;
 const int num_batches = (n_ctx + n_batch - 1)/n_batch;
 const int nv = 2*((n_vocab + 1)/2) + 4;
-const bool add_bos = llama_add_bos_token(llama_get_model(ctx));
-GGML_ASSERT(!llama_add_eos_token(llama_get_model(ctx)));
+const bool add_bos = llama_vocab_get_add_bos(vocab);
+GGML_ASSERT(!llama_vocab_get_add_eos(vocab));

 std::vector<uint16_t> log_probs_uint16(size_t(n_ctx - 1 - n_ctx/2) * nv);
 std::vector<float> kld_values(size_t(n_ctx - 1 - n_ctx/2)*n_chunk);
@@ -1761,7 +1777,7 @@ static void kl_divergence(llama_context * ctx, const common_params & params) {

 // add BOS token for the first batch of each chunk
 if (add_bos && j == 0) {
-tokens[batch_start] = llama_token_bos(llama_get_model(ctx));
+tokens[batch_start] = llama_vocab_bos(vocab);
 }

 common_batch_clear(batch);
@@ -1995,7 +2011,7 @@ int main(int argc, char ** argv) {
 return 1;
 }

-const int n_ctx_train = llama_n_ctx_train(model);
+const int n_ctx_train = llama_model_n_ctx_train(model);

 if (params.n_ctx > n_ctx_train) {
 LOG_WRN("%s: model was trained on only %d context tokens (%d specified)\n",
@@ -319,7 +319,7 @@ int main(int argc, char ** argv) {
 auto cparams = llama_context_default_params();
 cparams.n_ctx = 256;

-ctx = llama_new_context_with_model(model, cparams);
+ctx = llama_init_from_model(model, cparams);

 if (ctx == NULL) {
 fprintf(stderr, "%s: error: failed to create context with model '%s'\n", __func__, params.model.c_str());
@@ -159,7 +159,9 @@ int main(int argc, char ** argv) {
 return 1;
 }

-const int n_ctx_train = llama_n_ctx_train(model);
+const llama_vocab * vocab = llama_model_get_vocab(model);
+
+const int n_ctx_train = llama_model_n_ctx_train(model);
 const int n_ctx = llama_n_ctx(ctx);

 const enum llama_pooling_type pooling_type = llama_pooling_type(ctx);
@@ -192,8 +194,8 @@ int main(int argc, char ** argv) {
 return 1;
 }
 // add eos if not present
-if (llama_token_eos(model) >= 0 && (inp.empty() || inp.back() != llama_token_eos(model))) {
+if (llama_vocab_eos(vocab) >= 0 && (inp.empty() || inp.back() != llama_vocab_eos(vocab))) {
-inp.push_back(llama_token_eos(model));
+inp.push_back(llama_vocab_eos(vocab));
 }
 chunk.tokens = inp;
 }
@@ -215,7 +217,7 @@ int main(int argc, char ** argv) {
 struct llama_batch batch = llama_batch_init(n_batch, 0, 1);

 // allocate output
-const int n_embd = llama_n_embd(model);
+const int n_embd = llama_model_n_embd(model);
 std::vector<float> embeddings(n_chunks * n_embd, 0);
 float * emb = embeddings.data();

@@ -11,6 +11,8 @@
 # include <curl/curl.h>
 #endif

+#include <signal.h>
+
 #include <climits>
 #include <cstdarg>
 #include <cstdio>
@@ -25,6 +27,13 @@
 #include "json.hpp"
 #include "llama-cpp.h"

+#if defined(__unix__) || (defined(__APPLE__) && defined(__MACH__)) || defined(_WIN32)
+[[noreturn]] static void sigint_handler(int) {
+printf("\n\033[0m");
+exit(0); // not ideal, but it's the only way to guarantee exit in all cases
+}
+#endif
+
 GGML_ATTRIBUTE_FORMAT(1, 2)
 static std::string fmt(const char * fmt, ...) {
 va_list ap;
@@ -676,7 +685,7 @@ class LlamaData {

 // Initializes the context with the specified parameters
 llama_context_ptr initialize_context(const llama_model_ptr & model, const Opt & opt) {
-llama_context_ptr context(llama_new_context_with_model(model.get(), opt.ctx_params));
+llama_context_ptr context(llama_init_from_model(model.get(), opt.ctx_params));
 if (!context) {
 printe("%s: error: failed to create the llama_context\n", __func__);
 }
@@ -704,11 +713,11 @@ static void add_message(const char * role, const std::string & text, LlamaData &
 // Function to apply the chat template and resize `formatted` if needed
 static int apply_chat_template(LlamaData & llama_data, const bool append) {
 int result = llama_chat_apply_template(
-llama_data.model.get(), nullptr, llama_data.messages.data(), llama_data.messages.size(), append,
+llama_model_chat_template(llama_data.model.get()), llama_data.messages.data(), llama_data.messages.size(), append,
 append ? llama_data.fmtted.data() : nullptr, append ? llama_data.fmtted.size() : 0);
 if (append && result > static_cast<int>(llama_data.fmtted.size())) {
 llama_data.fmtted.resize(result);
-result = llama_chat_apply_template(llama_data.model.get(), nullptr, llama_data.messages.data(),
+result = llama_chat_apply_template(llama_model_chat_template(llama_data.model.get()), llama_data.messages.data(),
 llama_data.messages.size(), append, llama_data.fmtted.data(),
 llama_data.fmtted.size());
 }
@@ -717,11 +726,11 @@ static int apply_chat_template(LlamaData & llama_data, const bool append) {
 }

 // Function to tokenize the prompt
-static int tokenize_prompt(const llama_model_ptr & model, const std::string & prompt,
+static int tokenize_prompt(const llama_vocab * vocab, const std::string & prompt,
 std::vector<llama_token> & prompt_tokens) {
-const int n_prompt_tokens = -llama_tokenize(model.get(), prompt.c_str(), prompt.size(), NULL, 0, true, true);
+const int n_prompt_tokens = -llama_tokenize(vocab, prompt.c_str(), prompt.size(), NULL, 0, true, true);
 prompt_tokens.resize(n_prompt_tokens);
-if (llama_tokenize(model.get(), prompt.c_str(), prompt.size(), prompt_tokens.data(), prompt_tokens.size(), true,
+if (llama_tokenize(vocab, prompt.c_str(), prompt.size(), prompt_tokens.data(), prompt_tokens.size(), true,
 true) < 0) {
 printe("failed to tokenize the prompt\n");
 return -1;
@@ -744,9 +753,9 @@ static int check_context_size(const llama_context_ptr & ctx, const llama_batch &
 }

 // convert the token to a string
-static int convert_token_to_string(const llama_model_ptr & model, const llama_token token_id, std::string & piece) {
+static int convert_token_to_string(const llama_vocab * vocab, const llama_token token_id, std::string & piece) {
 char buf[256];
-int n = llama_token_to_piece(model.get(), token_id, buf, sizeof(buf), 0, true);
+int n = llama_token_to_piece(vocab, token_id, buf, sizeof(buf), 0, true);
 if (n < 0) {
 printe("failed to convert token to piece\n");
 return 1;
@@ -764,8 +773,10 @@ static void print_word_and_concatenate_to_response(const std::string & piece, st

 // helper function to evaluate a prompt and generate a response
 static int generate(LlamaData & llama_data, const std::string & prompt, std::string & response) {
+const llama_vocab * vocab = llama_model_get_vocab(llama_data.model.get());
+
 std::vector<llama_token> tokens;
-if (tokenize_prompt(llama_data.model, prompt, tokens) < 0) {
+if (tokenize_prompt(vocab, prompt, tokens) < 0) {
 return 1;
 }

@@ -781,12 +792,12 @@ static int generate(LlamaData & llama_data, const std::string & prompt, std::str

 // sample the next token, check is it an end of generation?
 new_token_id = llama_sampler_sample(llama_data.sampler.get(), llama_data.context.get(), -1);
-if (llama_token_is_eog(llama_data.model.get(), new_token_id)) {
+if (llama_vocab_is_eog(vocab, new_token_id)) {
 break;
 }

 std::string piece;
-if (convert_token_to_string(llama_data.model, new_token_id, piece)) {
+if (convert_token_to_string(vocab, new_token_id, piece)) {
 return 1;
 }

@@ -801,7 +812,20 @@ static int generate(LlamaData & llama_data, const std::string & prompt, std::str

 static int read_user_input(std::string & user) {
 std::getline(std::cin, user);
-return user.empty(); // Should have data in happy path
+if (std::cin.eof()) {
+printf("\n");
+return 1;
+}
+
+if (user == "/bye") {
+return 1;
+}
+
+if (user.empty()) {
+return 2;
+}
+
+return 0; // Should have data in happy path
 }

 // Function to generate a response based on the prompt
@@ -868,7 +892,25 @@ static bool is_stdout_a_terminal() {
 #endif
 }

-// Function to tokenize the prompt
+// Function to handle user input
+static int get_user_input(std::string & user_input, const std::string & user) {
+while (true) {
+const int ret = handle_user_input(user_input, user);
+if (ret == 1) {
+return 1;
+}
+
+if (ret == 2) {
+continue;
+}
+
+break;
+}
+
+return 0;
+}
+
+// Main chat loop function
 static int chat_loop(LlamaData & llama_data, const std::string & user) {
 int prev_len = 0;
 llama_data.fmtted.resize(llama_n_ctx(llama_data.context.get()));
@@ -876,7 +918,8 @@ static int chat_loop(LlamaData & llama_data, const std::string & user) {
 while (true) {
 // Get user input
 std::string user_input;
-while (handle_user_input(user_input, user)) {
+if (get_user_input(user_input, user) == 1) {
+return 0;
 }

 add_message("user", user.empty() ? user_input : user, llama_data);
@@ -917,7 +960,23 @@ static std::string read_pipe_data() {
 return result.str();
 }

+static void ctrl_c_handling() {
+#if defined(__unix__) || (defined(__APPLE__) && defined(__MACH__))
+struct sigaction sigint_action;
+sigint_action.sa_handler = sigint_handler;
+sigemptyset(&sigint_action.sa_mask);
+sigint_action.sa_flags = 0;
+sigaction(SIGINT, &sigint_action, NULL);
+#elif defined(_WIN32)
+auto console_ctrl_handler = +[](DWORD ctrl_type) -> BOOL {
+return (ctrl_type == CTRL_C_EVENT) ? (sigint_handler(SIGINT), true) : false;
+};
+SetConsoleCtrlHandler(reinterpret_cast<PHANDLER_ROUTINE>(console_ctrl_handler), true);
+#endif
+}
+
 int main(int argc, const char ** argv) {
+ctrl_c_handling();
 Opt opt;
 const int ret = opt.init(argc, argv);
 if (ret == 2) {
@@ -97,7 +97,7 @@ int main(int argc, char ** argv) {
 printf("\n\n");

 // make new context
-llama_context * ctx2 = llama_new_context_with_model(model, common_context_params_to_llama(params));
+llama_context * ctx2 = llama_init_from_model(model, common_context_params_to_llama(params));

 llama_sampler * smpl2 = llama_sampler_chain_init(sparams);

@@ -154,7 +154,7 @@ int main(int argc, char ** argv) {
 }

 // make new context
-llama_context * ctx3 = llama_new_context_with_model(model, common_context_params_to_llama(params));
+llama_context * ctx3 = llama_init_from_model(model, common_context_params_to_llama(params));

 llama_sampler * smpl3 = llama_sampler_chain_init(sparams);

Binary file not shown.
@@ -98,7 +98,7 @@ struct slot_params {
 int64_t t_max_prompt_ms = -1; // TODO: implement
 int64_t t_max_predict_ms = -1; // if positive, limit the generation phase to this time limit

-std::vector<common_lora_adapter_info> lora;
+std::vector<common_adapter_lora_info> lora;

 std::vector<std::string> antiprompt;
 std::vector<std::string> response_fields;
@@ -198,15 +198,17 @@ struct server_task {
 bool metrics_reset_bucket = false;

 // used by SERVER_TASK_TYPE_SET_LORA
-std::vector<common_lora_adapter_info> set_lora;
+std::vector<common_adapter_lora_info> set_lora;

 server_task(server_task_type type) : type(type) {}

 static slot_params params_from_json_cmpl(
-const llama_model * model,
 const llama_context * ctx,
 const common_params & params_base,
 const json & data) {
+const llama_model * model = llama_get_model(ctx);
+const llama_vocab * vocab = llama_model_get_vocab(model);
+
 slot_params params;

 // Sampling parameter defaults are loaded from the global server context (but individual requests can still override them)
@@ -329,7 +331,7 @@ struct server_task {

 const auto & logit_bias = data.find("logit_bias");
 if (logit_bias != data.end() && logit_bias->is_array()) {
-const int n_vocab = llama_n_vocab(model);
+const int n_vocab = llama_vocab_n_tokens(vocab);
 for (const auto & el : *logit_bias) {
 // TODO: we may want to throw errors here, in case "el" is incorrect
 if (el.is_array() && el.size() == 2) {
@@ -348,7 +350,7 @@ struct server_task {
 params.sampling.logit_bias.push_back({tok, bias});
 }
 } else if (el[0].is_string()) {
-auto toks = common_tokenize(model, el[0].get<std::string>(), false);
+auto toks = common_tokenize(vocab, el[0].get<std::string>(), false);
 for (auto tok : toks) {
 params.sampling.logit_bias.push_back({tok, bias});
 }
@@ -1131,7 +1133,7 @@ struct server_slot {

 common_speculative * spec = nullptr;

-std::vector<common_lora_adapter_info> lora;
+std::vector<common_adapter_lora_info> lora;

 // the index relative to completion multi-task request
 size_t index = 0;
@@ -1633,6 +1635,8 @@ struct server_context {
 llama_model * model = nullptr;
 llama_context * ctx = nullptr;

+const llama_vocab * vocab = nullptr;
+
 llama_model * model_dft = nullptr;

 llama_context_params cparams_dft;
@@ -1690,10 +1694,12 @@ struct server_context {
 return false;
 }

+vocab = llama_model_get_vocab(model);
+
 n_ctx = llama_n_ctx(ctx);

-add_bos_token = llama_add_bos_token(model);
+add_bos_token = llama_vocab_get_add_bos(vocab);
-has_eos_token = llama_token_eos(model) != LLAMA_TOKEN_NULL;
+has_eos_token = llama_vocab_eos(vocab) != LLAMA_TOKEN_NULL;

 if (!params_base.speculative.model.empty()) {
 SRV_INF("loading draft model '%s'\n", params_base.speculative.model.c_str());
@@ -1736,7 +1742,8 @@ struct server_context {

 bool validate_builtin_chat_template() const {
 llama_chat_message chat[] = {{"user", "test"}};
-int32_t chat_res = llama_chat_apply_template(model, nullptr, chat, 1, true, nullptr, 0);
+const char * tmpl = llama_model_chat_template(model);
+const int32_t chat_res = llama_chat_apply_template(tmpl, chat, 1, true, nullptr, 0);
 return chat_res > 0;
 }

@@ -1756,7 +1763,7 @@ struct server_context {
 if (model_dft) {
 slot.batch_spec = llama_batch_init(params_base.speculative.n_max + 1, 0, 1);

-slot.ctx_dft = llama_new_context_with_model(model_dft, cparams_dft);
+slot.ctx_dft = llama_init_from_model(model_dft, cparams_dft);
 if (slot.ctx_dft == nullptr) {
 SRV_ERR("%s", "failed to create draft context\n");
 return;
@@ -1891,7 +1898,7 @@ struct server_context {
 }

 if (slot.params.ignore_eos && has_eos_token) {
-slot.params.sampling.logit_bias.push_back({llama_token_eos(model), -INFINITY});
+slot.params.sampling.logit_bias.push_back({llama_vocab_eos(vocab), -INFINITY});
 }

 {
@@ -2047,14 +2054,14 @@ struct server_context {
 slot.n_decoded, slot.n_prompt_tokens, slot.n_past, slot.n_ctx);
 }

-if (llama_token_is_eog(model, result.tok)) {
+if (llama_vocab_is_eog(vocab, result.tok)) {
 slot.stop = STOP_TYPE_EOS;
 slot.has_next_token = false;

 SLT_DBG(slot, "%s", "stopped by EOS\n");
 }

-const auto n_ctx_train = llama_n_ctx_train(model);
+const auto n_ctx_train = llama_model_n_ctx_train(model);

 if (slot.params.n_predict < 1 && slot.n_predict < 1 && slot.n_prompt_tokens + slot.n_decoded >= n_ctx_train) {
 slot.truncated = true;
@@ -2074,7 +2081,7 @@ struct server_context {

 void populate_token_probs(const server_slot & slot, completion_token_output & result, bool post_sampling, bool special, int idx) {
 size_t n_probs = slot.params.sampling.n_probs;
-size_t n_vocab = llama_n_vocab(llama_get_model(ctx));
+size_t n_vocab = llama_vocab_n_tokens(vocab);
 if (post_sampling) {
 const auto * cur_p = common_sampler_get_candidates(slot.smpl);
 const size_t max_probs = cur_p->size;
@@ -2225,7 +2232,7 @@ struct server_context {
 res->n_tokens = slot.n_prompt_tokens;
 res->oaicompat = slot.params.oaicompat;

-const int n_embd = llama_n_embd(model);
+const int n_embd = llama_model_n_embd(model);

 std::vector<float> embd_res(n_embd, 0.0f);

@@ -2927,7 +2934,7 @@ struct server_context {
 // make sure we're in the right embedding mode
 llama_set_embeddings(ctx, slot_batched->is_non_causal());
 // apply lora, only need to do it once per batch
-common_lora_adapters_apply(ctx, slot_batched->lora);
+common_set_adapter_lora(ctx, slot_batched->lora);
 }

 // process the created batch of tokens
@@ -3129,12 +3136,12 @@ struct server_context {

 json model_meta() const {
 return json {
-{"vocab_type", llama_vocab_type (model)},
+{"vocab_type", llama_vocab_type (vocab)},
-{"n_vocab", llama_n_vocab (model)},
+{"n_vocab", llama_vocab_n_tokens (vocab)},
-{"n_ctx_train", llama_n_ctx_train (model)},
+{"n_ctx_train", llama_model_n_ctx_train(model)},
-{"n_embd", llama_n_embd (model)},
+{"n_embd", llama_model_n_embd (model)},
-{"n_params", llama_model_n_params(model)},
+{"n_params", llama_model_n_params (model)},
 {"size", llama_model_size (model)},
 };
 }
 };
@@ -3639,7 +3646,7 @@ int main(int argc, char ** argv) {
 std::vector<server_task> tasks;

 try {
-std::vector<llama_tokens> tokenized_prompts = tokenize_input_prompts(ctx_server.ctx, data.at("prompt"), true, true);
+std::vector<llama_tokens> tokenized_prompts = tokenize_input_prompts(ctx_server.vocab, data.at("prompt"), true, true);
 tasks.reserve(tokenized_prompts.size());
 for (size_t i = 0; i < tokenized_prompts.size(); i++) {
 server_task task = server_task(type);
@@ -3649,7 +3656,6 @@ int main(int argc, char ** argv) {

 task.prompt_tokens = std::move(tokenized_prompts[i]);
 task.params = server_task::params_from_json_cmpl(
-ctx_server.model,
 ctx_server.ctx,
 ctx_server.params_base,
 data);
@@ -3745,13 +3751,13 @@ int main(int argc, char ** argv) {
 const auto handle_infill = [&ctx_server, &res_error, &handle_completions_impl](const httplib::Request & req, httplib::Response & res) {
 // check model compatibility
 std::string err;
-if (llama_token_fim_pre(ctx_server.model) == LLAMA_TOKEN_NULL) {
+if (llama_vocab_fim_pre(ctx_server.vocab) == LLAMA_TOKEN_NULL) {
 err += "prefix token is missing. ";
 }
-if (llama_token_fim_suf(ctx_server.model) == LLAMA_TOKEN_NULL) {
+if (llama_vocab_fim_suf(ctx_server.vocab) == LLAMA_TOKEN_NULL) {
 err += "suffix token is missing. ";
 }
-if (llama_token_fim_mid(ctx_server.model) == LLAMA_TOKEN_NULL) {
+if (llama_vocab_fim_mid(ctx_server.vocab) == LLAMA_TOKEN_NULL) {
 err += "middle token is missing. ";
 }
 if (!err.empty()) {
@@ -3797,10 +3803,10 @@ int main(int argc, char ** argv) {
 data["input_extra"] = input_extra; // default to empty array if it's not exist

 std::string prompt = json_value(data, "prompt", std::string());
-std::vector<llama_tokens> tokenized_prompts = tokenize_input_prompts(ctx_server.ctx, prompt, false, true);
+std::vector<llama_tokens> tokenized_prompts = tokenize_input_prompts(ctx_server.vocab, prompt, false, true);
 SRV_DBG("creating infill tasks, n_prompts = %d\n", (int) tokenized_prompts.size());
 data["prompt"] = format_infill(
-ctx_server.ctx,
+ctx_server.vocab,
 data.at("input_prefix"),
 data.at("input_suffix"),
 data.at("input_extra"),
@@ -3857,7 +3863,7 @@ int main(int argc, char ** argv) {
 const bool add_special = json_value(body, "add_special", false);
 const bool with_pieces = json_value(body, "with_pieces", false);

-llama_tokens tokens = tokenize_mixed(ctx_server.ctx, body.at("content"), add_special, true);
+llama_tokens tokens = tokenize_mixed(ctx_server.vocab, body.at("content"), add_special, true);

 if (with_pieces) {
 for (const auto& token : tokens) {
@@ -3933,7 +3939,7 @@ int main(int argc, char ** argv) {
 }
 }

-std::vector<llama_tokens> tokenized_prompts = tokenize_input_prompts(ctx_server.ctx, prompt, true, true);
+std::vector<llama_tokens> tokenized_prompts = tokenize_input_prompts(ctx_server.vocab, prompt, true, true);
 for (const auto & tokens : tokenized_prompts) {
 // this check is necessary for models that do not add BOS token to the input
 if (tokens.empty()) {
@@ -4033,20 +4039,20 @@ int main(int argc, char ** argv) {
 return;
 }

-llama_tokens tokenized_query = tokenize_input_prompts(ctx_server.ctx, query, /* add_special */ false, true)[0];
+llama_tokens tokenized_query = tokenize_input_prompts(ctx_server.vocab, query, /* add_special */ false, true)[0];

 // create and queue the task
 json responses = json::array();
 bool error = false;
 {
 std::vector<server_task> tasks;
-std::vector<llama_tokens> tokenized_docs = tokenize_input_prompts(ctx_server.ctx, documents, /* add_special */ false, true);
+std::vector<llama_tokens> tokenized_docs = tokenize_input_prompts(ctx_server.vocab, documents, /* add_special */ false, true);
 tasks.reserve(tokenized_docs.size());
 for (size_t i = 0; i < tokenized_docs.size(); i++) {
 server_task task = server_task(SERVER_TASK_TYPE_RERANK);
 task.id = ctx_server.queue_tasks.get_new_id();
 task.index = i;
-task.prompt_tokens = format_rerank(ctx_server.model, tokenized_query, tokenized_docs[i]);
+task.prompt_tokens = format_rerank(ctx_server.vocab, tokenized_query, tokenized_docs[i]);
 tasks.push_back(task);
 }

@@ -118,7 +118,7 @@ static json json_get_nested_values(const std::vector<std::string> & paths, const
 * - only string, example: "string"
 * - mixed string and tokens, example: [12, 34, "string", 56, 78]
 */
-static llama_tokens tokenize_mixed(const llama_context * ctx, const json & json_prompt, bool add_special, bool parse_special) {
+static llama_tokens tokenize_mixed(const llama_vocab * vocab, const json & json_prompt, bool add_special, bool parse_special) {
 // If `add_bos` is true, we only add BOS, when json_prompt is a string,
 // or the first element of the json_prompt array is a string.
 llama_tokens prompt_tokens;
@@ -131,10 +131,10 @@ static llama_tokens tokenize_mixed(const llama_context * ctx, const json & json_

 llama_tokens p;
 if (first) {
-p = common_tokenize(ctx, s, add_special, parse_special);
+p = common_tokenize(vocab, s, add_special, parse_special);
 first = false;
 } else {
-p = common_tokenize(ctx, s, false, parse_special);
+p = common_tokenize(vocab, s, false, parse_special);
 }

 prompt_tokens.insert(prompt_tokens.end(), p.begin(), p.end());
@@ -148,7 +148,7 @@ static llama_tokens tokenize_mixed(const llama_context * ctx, const json & json_
 }
 } else {
 auto s = json_prompt.template get<std::string>();
-prompt_tokens = common_tokenize(ctx, s, add_special, parse_special);
+prompt_tokens = common_tokenize(vocab, s, add_special, parse_special);
 }

 return prompt_tokens;
@@ -166,11 +166,11 @@ static llama_tokens tokenize_mixed(const llama_context * ctx, const json & json_
 * - "prompt": [[12, 34, 56], [78, 90, 12]]
 * - "prompt": [[12, 34, "string", 56, 78], [12, 34, 56]]
 */
-static std::vector<llama_tokens> tokenize_input_prompts(llama_context * ctx, const json & json_prompt, bool add_special, bool parse_special) {
+static std::vector<llama_tokens> tokenize_input_prompts(const llama_vocab * vocab, const json & json_prompt, bool add_special, bool parse_special) {
 std::vector<llama_tokens> result;
 if (json_prompt.is_string() || json_is_array_of_mixed_numbers_strings(json_prompt)) {
 // string or mixed
-result.push_back(tokenize_mixed(ctx, json_prompt, add_special, parse_special));
+result.push_back(tokenize_mixed(vocab, json_prompt, add_special, parse_special));
 } else if (json_is_array_of_numbers(json_prompt)) {
 // array of tokens
 result.push_back(json_prompt.get<llama_tokens>());
@@ -179,7 +179,7 @@ static std::vector<llama_tokens> tokenize_input_prompts(llama_context * ctx, con
 result.reserve(json_prompt.size());
 for (const auto & p : json_prompt) {
 if (p.is_string() || json_is_array_of_mixed_numbers_strings(p)) {
-result.push_back(tokenize_mixed(ctx, p, add_special, parse_special));
+result.push_back(tokenize_mixed(vocab, p, add_special, parse_special));
 } else if (json_is_array_of_numbers(p)) {
 // array of tokens
 result.push_back(p.get<llama_tokens>());
@@ -231,21 +231,23 @@ static size_t validate_utf8(const std::string& text) {
 //

 // format rerank task: [BOS]query[EOS][SEP]doc[EOS]
-static llama_tokens format_rerank(const struct llama_model * model, const llama_tokens & query, const llama_tokens & doc) {
+static llama_tokens format_rerank(const struct llama_vocab * vocab, const llama_tokens & query, const llama_tokens & doc) {
 llama_tokens result;
+
 result.reserve(doc.size() + query.size() + 4);
-result.push_back(llama_token_bos(model));
+result.push_back(llama_vocab_bos(vocab));
 result.insert(result.end(), query.begin(), query.end());
-result.push_back(llama_token_eos(model));
+result.push_back(llama_vocab_eos(vocab));
-result.push_back(llama_token_sep(model));
+result.push_back(llama_vocab_sep(vocab));
 result.insert(result.end(), doc.begin(), doc.end());
-result.push_back(llama_token_eos(model));
+result.push_back(llama_vocab_eos(vocab));
+
 return result;
 }

 // format infill task
 static llama_tokens format_infill(
-const llama_context * ctx,
+const llama_vocab * vocab,
 const json & input_prefix,
 const json & input_suffix,
 const json & input_extra,
@@ -272,15 +274,14 @@ static llama_tokens format_infill(
 llama_tokens extra_tokens;
 extra_tokens.reserve(n_ctx);

-auto model = llama_get_model(ctx);
-auto tokens_prefix = tokenize_mixed(ctx, input_prefix, false, false);
-auto tokens_suffix = tokenize_mixed(ctx, input_suffix, false, false);
+auto tokens_prefix = tokenize_mixed(vocab, input_prefix, false, false);
+auto tokens_suffix = tokenize_mixed(vocab, input_suffix, false, false);

-if (llama_token_fim_rep(model) != LLAMA_TOKEN_NULL) {
+if (llama_vocab_fim_rep(vocab) != LLAMA_TOKEN_NULL) {
 // TODO: make project name an input
-static const auto k_fim_repo = common_tokenize(ctx, "myproject\n", false, false);
+static const auto k_fim_repo = common_tokenize(vocab, "myproject\n", false, false);

-extra_tokens.push_back(llama_token_fim_rep(model));
+extra_tokens.push_back(llama_vocab_fim_rep(vocab));
 extra_tokens.insert(extra_tokens.end(), k_fim_repo.begin(), k_fim_repo.end());
 }
 for (const auto & chunk : input_extra) {
@@ -288,28 +289,28 @@ static llama_tokens format_infill(
 const std::string text = json_value(chunk, "text", std::string());
 const std::string filename = json_value(chunk, "filename", std::string("tmp"));

-if (llama_token_fim_sep(model) != LLAMA_TOKEN_NULL) {
+if (llama_vocab_fim_sep(vocab) != LLAMA_TOKEN_NULL) {
-const auto k_fim_file = common_tokenize(ctx, filename + "\n", false, false);
+const auto k_fim_file = common_tokenize(vocab, filename + "\n", false, false);

-extra_tokens.insert(extra_tokens.end(), llama_token_fim_sep(model));
+extra_tokens.insert(extra_tokens.end(), llama_vocab_fim_sep(vocab));
 extra_tokens.insert(extra_tokens.end(), k_fim_file.begin(), k_fim_file.end());
 } else {
 // chunk separator in binary form to avoid confusing the AI
 static const char k_chunk_prefix_str[] = {0x0a, 0x0a, 0x2d, 0x2d, 0x2d, 0x20, 0x73, 0x6e, 0x69, 0x70, 0x70, 0x65, 0x74, 0x20, 0x2d, 0x2d, 0x2d, 0x0a, 0x0a, 0x00};
-static const auto k_chunk_prefix_tokens = common_tokenize(ctx, k_chunk_prefix_str, false, false);
+static const auto k_chunk_prefix_tokens = common_tokenize(vocab, k_chunk_prefix_str, false, false);

 extra_tokens.insert(extra_tokens.end(), k_chunk_prefix_tokens.begin(), k_chunk_prefix_tokens.end());
 }

-const auto chunk_tokens = common_tokenize(ctx, text, false, false);
+const auto chunk_tokens = common_tokenize(vocab, text, false, false);
 extra_tokens.insert(extra_tokens.end(), chunk_tokens.begin(), chunk_tokens.end());
 }

-if (llama_token_fim_sep(model) != LLAMA_TOKEN_NULL) {
+if (llama_vocab_fim_sep(vocab) != LLAMA_TOKEN_NULL) {
 // TODO: current filename
-static const auto k_fim_file = common_tokenize(ctx, "filename\n", false, false);
+static const auto k_fim_file = common_tokenize(vocab, "filename\n", false, false);

-extra_tokens.insert(extra_tokens.end(), llama_token_fim_sep(model));
+extra_tokens.insert(extra_tokens.end(), llama_vocab_fim_sep(vocab));
 extra_tokens.insert(extra_tokens.end(), k_fim_file.begin(), k_fim_file.end());
 }

@@ -325,15 +326,15 @@ static llama_tokens format_infill(
 tokens_prefix.erase(tokens_prefix.begin(), tokens_prefix.begin() + tokens_prefix.size() - n_prefix_take);
 tokens_suffix.resize(n_suffix_take);

-tokens_prefix.insert(tokens_prefix.begin(), llama_token_fim_pre(model));
+tokens_prefix.insert(tokens_prefix.begin(), llama_vocab_fim_pre(vocab));
 tokens_prefix.insert(tokens_prefix.end(), tokens_prompt.begin(), tokens_prompt.end());
-tokens_suffix.insert(tokens_suffix.begin(), llama_token_fim_suf(model));
+tokens_suffix.insert(tokens_suffix.begin(), llama_vocab_fim_suf(vocab));

 auto embd_inp = spm_infill ? tokens_suffix : tokens_prefix;
 auto embd_end = spm_infill ? tokens_prefix : tokens_suffix;

-if (llama_add_bos_token(model)) {
+if (llama_vocab_get_add_bos(vocab)) {
-embd_inp.insert(embd_inp.begin(), llama_token_bos(model));
+embd_inp.insert(embd_inp.begin(), llama_vocab_bos(vocab));
 }

 SRV_DBG("extra: n_ctx = %d, n_extra_take = %d, n_extra = %d\n", n_ctx, n_extra_take, (int) extra_tokens.size());
@@ -342,7 +343,7 @@ static llama_tokens format_infill(
 embd_inp.insert(embd_inp.begin(), extra_tokens.end() - n_extra_take, extra_tokens.end());

 embd_inp.insert(embd_inp.end(), embd_end.begin(), embd_end.end());
-embd_inp.push_back(llama_token_fim_mid(model));
+embd_inp.push_back(llama_vocab_fim_mid(vocab));

 return embd_inp;
 }
@@ -764,14 +765,18 @@ static json format_logit_bias(const std::vector<llama_logit_bias> & logit_bias)
 return data;
 }

-static std::string safe_json_to_str(json data) {
+static std::string safe_json_to_str(const json & data) {
 return data.dump(-1, ' ', false, json::error_handler_t::replace);
 }

 static std::vector<llama_token_data> get_token_probabilities(llama_context * ctx, int idx) {
 std::vector<llama_token_data> cur;
 const auto * logits = llama_get_logits_ith(ctx, idx);
-const int n_vocab = llama_n_vocab(llama_get_model(ctx));
+
+const llama_model * model = llama_get_model(ctx);
+const llama_vocab * vocab = llama_model_get_vocab(model);
+
+const int n_vocab = llama_vocab_n_tokens(vocab);

 cur.resize(n_vocab);
 for (llama_token token_id = 0; token_id < n_vocab; token_id++) {
@@ -799,8 +804,8 @@ static std::vector<llama_token_data> get_token_probabilities(llama_context * ctx
 }

 static bool are_lora_equal(
-const std::vector<common_lora_adapter_info> & l1,
+const std::vector<common_adapter_lora_info> & l1,
-const std::vector<common_lora_adapter_info> & l2) {
+const std::vector<common_adapter_lora_info> & l2) {
 if (l1.size() != l2.size()) {
 return false;
 }
@@ -814,10 +819,10 @@ static bool are_lora_equal(
 }

 // parse lora config from JSON request, returned a copy of lora_base with updated scale
-static std::vector<common_lora_adapter_info> parse_lora_request(
+static std::vector<common_adapter_lora_info> parse_lora_request(
-const std::vector<common_lora_adapter_info> & lora_base,
+const std::vector<common_adapter_lora_info> & lora_base,
 const json & data) {
-std::vector<common_lora_adapter_info> lora(lora_base);
+std::vector<common_adapter_lora_info> lora(lora_base);
 int max_idx = lora.size();

 // clear existing value
@ -37,7 +37,7 @@
|
|||||||
<div v-for="conv in conversations" :class="{
|
<div v-for="conv in conversations" :class="{
|
||||||
'btn btn-ghost justify-start font-normal': true,
|
'btn btn-ghost justify-start font-normal': true,
|
||||||
'btn-active': conv.id === viewingConvId,
|
'btn-active': conv.id === viewingConvId,
|
||||||
}" @click="setViewingConv(conv.id)">
|
}" @click="setViewingConv(conv.id)" dir="auto">
|
||||||
<span class="truncate">{{ conv.messages[0].content }}</span>
|
<span class="truncate">{{ conv.messages[0].content }}</span>
|
||||||
</div>
|
</div>
|
||||||
<div class="text-center text-xs opacity-40 mt-auto mx-4">
|
<div class="text-center text-xs opacity-40 mt-auto mx-4">
|
||||||
@ -62,53 +62,57 @@
|
|||||||
<!-- action buttons (top right) -->
|
<!-- action buttons (top right) -->
|
||||||
<div class="flex items-center">
|
<div class="flex items-center">
|
||||||
<div v-if="messages.length > 0" class="dropdown dropdown-end">
|
<div v-if="messages.length > 0" class="dropdown dropdown-end">
|
||||||
<!-- "more" button -->
|
<!-- "..." button -->
|
||||||
<button tabindex="0" role="button" class="btn m-1" :disabled="isGenerating">
|
<button tabindex="0" role="button" class="btn m-1" :disabled="isGenerating">
|
||||||
<svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" fill="currentColor" class="bi bi-three-dots-vertical" viewBox="0 0 16 16">
|
<svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" fill="currentColor" class="bi bi-three-dots-vertical" viewBox="0 0 16 16">
|
||||||
<path d="M9.5 13a1.5 1.5 0 1 1-3 0 1.5 1.5 0 0 1 3 0m0-5a1.5 1.5 0 1 1-3 0 1.5 1.5 0 0 1 3 0m0-5a1.5 1.5 0 1 1-3 0 1.5 1.5 0 0 1 3 0"/>
|
<path d="M9.5 13a1.5 1.5 0 1 1-3 0 1.5 1.5 0 0 1 3 0m0-5a1.5 1.5 0 1 1-3 0 1.5 1.5 0 0 1 3 0m0-5a1.5 1.5 0 1 1-3 0 1.5 1.5 0 0 1 3 0"/>
|
||||||
</svg>
|
</svg>
|
||||||
</button>
|
</button>
|
||||||
<!-- "more" dropdown menu -->
|
<!-- "delete" dropdown menu -->
|
||||||
<ul tabindex="0" class="dropdown-content menu bg-base-100 rounded-box z-[1] w-52 p-2 shadow">
|
<ul tabindex="0" class="dropdown-content menu bg-base-100 rounded-box z-[1] w-52 p-2 shadow">
|
||||||
<li @click="downloadConv(viewingConvId)"><a>Download</a></li>
|
<li @click="downloadConv(viewingConvId)"><a>Download</a></li>
|
||||||
<li class="text-error" @click="deleteConv(viewingConvId)"><a>Delete</a></li>
|
<li class="text-error" @click="deleteConv(viewingConvId)"><a>Delete</a></li>
|
||||||
</ul>
|
</ul>
|
||||||
</div>
|
</div>
|
||||||
<button class="btn" @click="showConfigDialog = true" :disabled="isGenerating">
|
<div class="tooltip tooltip-bottom" data-tip="Settings">
|
||||||
<!-- settings button -->
|
<button class="btn" @click="showConfigDialog = true" :disabled="isGenerating">
|
||||||
<svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" fill="currentColor" class="bi bi-gear" viewBox="0 0 16 16">
|
<!-- settings button -->
|
||||||
<path d="M8 4.754a3.246 3.246 0 1 0 0 6.492 3.246 3.246 0 0 0 0-6.492M5.754 8a2.246 2.246 0 1 1 4.492 0 2.246 2.246 0 0 1-4.492 0"/>
|
<svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" fill="currentColor" class="bi bi-gear" viewBox="0 0 16 16">
|
||||||
<path d="M9.796 1.343c-.527-1.79-3.065-1.79-3.592 0l-.094.319a.873.873 0 0 1-1.255.52l-.292-.16c-1.64-.892-3.433.902-2.54 2.541l.159.292a.873.873 0 0 1-.52 1.255l-.319.094c-1.79.527-1.79 3.065 0 3.592l.319.094a.873.873 0 0 1 .52 1.255l-.16.292c-.892 1.64.901 3.434 2.541 2.54l.292-.159a.873.873 0 0 1 1.255.52l.094.319c.527 1.79 3.065 1.79 3.592 0l.094-.319a.873.873 0 0 1 1.255-.52l.292.16c1.64.893 3.434-.902 2.54-2.541l-.159-.292a.873.873 0 0 1 .52-1.255l.319-.094c1.79-.527 1.79-3.065 0-3.592l-.319-.094a.873.873 0 0 1-.52-1.255l.16-.292c.893-1.64-.902-3.433-2.541-2.54l-.292.159a.873.873 0 0 1-1.255-.52zm-2.633.283c.246-.835 1.428-.835 1.674 0l.094.319a1.873 1.873 0 0 0 2.693 1.115l.291-.16c.764-.415 1.6.42 1.184 1.185l-.159.292a1.873 1.873 0 0 0 1.116 2.692l.318.094c.835.246.835 1.428 0 1.674l-.319.094a1.873 1.873 0 0 0-1.115 2.693l.16.291c.415.764-.42 1.6-1.185 1.184l-.291-.159a1.873 1.873 0 0 0-2.693 1.116l-.094.318c-.246.835-1.428.835-1.674 0l-.094-.319a1.873 1.873 0 0 0-2.692-1.115l-.292.16c-.764.415-1.6-.42-1.184-1.185l.159-.291A1.873 1.873 0 0 0 1.945 8.93l-.319-.094c-.835-.246-.835-1.428 0-1.674l.319-.094A1.873 1.873 0 0 0 3.06 4.377l-.16-.292c-.415-.764.42-1.6 1.185-1.184l.292.159a1.873 1.873 0 0 0 2.692-1.115z"/>
|
<path d="M8 4.754a3.246 3.246 0 1 0 0 6.492 3.246 3.246 0 0 0 0-6.492M5.754 8a2.246 2.246 0 1 1 4.492 0 2.246 2.246 0 0 1-4.492 0"/>
|
||||||
</svg>
|
            <path d="M9.796 1.343c-.527-1.79-3.065-1.79-3.592 0l-.094.319a.873.873 0 0 1-1.255.52l-.292-.16c-1.64-.892-3.433.902-2.54 2.541l.159.292a.873.873 0 0 1-.52 1.255l-.319.094c-1.79.527-1.79 3.065 0 3.592l.319.094a.873.873 0 0 1 .52 1.255l-.16.292c-.892 1.64.901 3.434 2.541 2.54l.292-.159a.873.873 0 0 1 1.255.52l.094.319c.527 1.79 3.065 1.79 3.592 0l.094-.319a.873.873 0 0 1 1.255-.52l.292.16c1.64.893 3.434-.902 2.54-2.541l-.159-.292a.873.873 0 0 1 .52-1.255l.319-.094c1.79-.527 1.79-3.065 0-3.592l-.319-.094a.873.873 0 0 1-.52-1.255l.16-.292c.893-1.64-.902-3.433-2.541-2.54l-.292.159a.873.873 0 0 1-1.255-.52zm-2.633.283c.246-.835 1.428-.835 1.674 0l.094.319a1.873 1.873 0 0 0 2.693 1.115l.291-.16c.764-.415 1.6.42 1.184 1.185l-.159.292a1.873 1.873 0 0 0 1.116 2.692l.318.094c.835.246.835 1.428 0 1.674l-.319.094a1.873 1.873 0 0 0-1.115 2.693l.16.291c.415.764-.42 1.6-1.185 1.184l-.291-.159a1.873 1.873 0 0 0-2.693 1.116l-.094.318c-.246.835-1.428.835-1.674 0l-.094-.319a1.873 1.873 0 0 0-2.692-1.115l-.292.16c-.764.415-1.6-.42-1.184-1.185l.159-.291A1.873 1.873 0 0 0 1.945 8.93l-.319-.094c-.835-.246-.835-1.428 0-1.674l.319-.094A1.873 1.873 0 0 0 3.06 4.377l-.16-.292c-.415-.764.42-1.6 1.185-1.184l.292.159a1.873 1.873 0 0 0 2.692-1.115z"/>
          </svg>
        </button>
      </div>

      <!-- theme controller is copied from https://daisyui.com/components/theme-controller/ -->
+     <div class="tooltip tooltip-bottom" data-tip="Themes">
        <div class="dropdown dropdown-end dropdown-bottom">
          <div tabindex="0" role="button" class="btn m-1">
            <svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" fill="currentColor" class="bi bi-palette2" viewBox="0 0 16 16">
              <path d="M0 .5A.5.5 0 0 1 .5 0h5a.5.5 0 0 1 .5.5v5.277l4.147-4.131a.5.5 0 0 1 .707 0l3.535 3.536a.5.5 0 0 1 0 .708L10.261 10H15.5a.5.5 0 0 1 .5.5v5a.5.5 0 0 1-.5.5H3a3 3 0 0 1-2.121-.879A3 3 0 0 1 0 13.044m6-.21 7.328-7.3-2.829-2.828L6 7.188zM4.5 13a1.5 1.5 0 1 0-3 0 1.5 1.5 0 0 0 3 0M15 15v-4H9.258l-4.015 4zM0 .5v12.495zm0 12.495V13z"/>
            </svg>
          </div>
          <ul tabindex="0" class="dropdown-content bg-base-300 rounded-box z-[1] w-52 p-2 shadow-2xl h-80 overflow-y-auto">
            <li>
              <button
                class="btn btn-sm btn-block btn-ghost justify-start"
                :class="{ 'btn-active': selectedTheme === 'auto' }"
                @click="setSelectedTheme('auto')">
                auto
              </button>
            </li>
            <li v-for="theme in themes">
              <input
                type="radio"
                name="theme-dropdown"
                class="theme-controller btn btn-sm btn-block btn-ghost justify-start"
                :aria-label="theme"
                :value="theme"
                :checked="selectedTheme === theme"
                @click="setSelectedTheme(theme)" />
            </li>
          </ul>
        </div>
+     </div>
      </div>
    </div>
  </div>
@@ -152,6 +156,7 @@
             @keydown.enter.shift.exact.prevent="inputMsg += '\n'"
             :disabled="isGenerating"
             id="msg-input"
+            dir="auto"
           ></textarea>
           <button v-if="!isGenerating" class="btn btn-primary ml-2" @click="sendMessage" :disabled="inputMsg.length === 0">Send</button>
           <button v-else class="btn btn-neutral ml-2" @click="stopGeneration">Stop</button>
@@ -244,6 +249,7 @@
               <!-- textarea for editing message -->
               <template v-if="editingContent !== null">
                 <textarea
+                  dir="auto"
                   class="textarea textarea-bordered bg-base-100 text-base-content w-[calc(90vw-8em)] lg:w-96"
                   v-model="editingContent"></textarea>
                 <br/>
@@ -254,7 +260,9 @@
               <!-- show loading dots for pending message -->
               <span v-if="msg.content === null" class="loading loading-dots loading-md"></span>
               <!-- render message as markdown -->
-              <vue-markdown v-else :source="msg.content"></vue-markdown>
+              <div v-else dir="auto">
+                <vue-markdown :source="msg.content"></vue-markdown>
+              </div>
               <!-- render timings if enabled -->
               <div class="dropdown dropdown-hover dropdown-top mt-2" v-if="timings && config.showTokensPerSecond">
                 <div tabindex="0" role="button" class="cursor-pointer font-semibold text-sm opacity-60">Speed: {{ timings.predicted_per_second.toFixed(1) }} t/s</div>
@@ -75,12 +75,14 @@ int main(int argc, char ** argv) {
         return 1;
     }
 
+    const llama_vocab * vocab = llama_model_get_vocab(model);
+
     // initialize the context
     llama_context_params ctx_params = llama_context_default_params();
     ctx_params.n_ctx = n_ctx;
     ctx_params.n_batch = n_ctx;
 
-    llama_context * ctx = llama_new_context_with_model(model, ctx_params);
+    llama_context * ctx = llama_init_from_model(model, ctx_params);
     if (!ctx) {
         fprintf(stderr , "%s: error: failed to create the llama_context\n" , __func__);
         return 1;
@@ -97,9 +99,9 @@ int main(int argc, char ** argv) {
         std::string response;
 
         // tokenize the prompt
-        const int n_prompt_tokens = -llama_tokenize(model, prompt.c_str(), prompt.size(), NULL, 0, true, true);
+        const int n_prompt_tokens = -llama_tokenize(vocab, prompt.c_str(), prompt.size(), NULL, 0, true, true);
         std::vector<llama_token> prompt_tokens(n_prompt_tokens);
-        if (llama_tokenize(model, prompt.c_str(), prompt.size(), prompt_tokens.data(), prompt_tokens.size(), llama_get_kv_cache_used_cells(ctx) == 0, true) < 0) {
+        if (llama_tokenize(vocab, prompt.c_str(), prompt.size(), prompt_tokens.data(), prompt_tokens.size(), llama_get_kv_cache_used_cells(ctx) == 0, true) < 0) {
             GGML_ABORT("failed to tokenize the prompt\n");
         }
 
@@ -124,13 +126,13 @@ int main(int argc, char ** argv) {
             new_token_id = llama_sampler_sample(smpl, ctx, -1);
 
             // is it an end of generation?
-            if (llama_token_is_eog(model, new_token_id)) {
+            if (llama_vocab_is_eog(vocab, new_token_id)) {
                 break;
             }
 
             // convert the token to a string, print it and add it to the response
             char buf[256];
-            int n = llama_token_to_piece(model, new_token_id, buf, sizeof(buf), 0, true);
+            int n = llama_token_to_piece(vocab, new_token_id, buf, sizeof(buf), 0, true);
             if (n < 0) {
                 GGML_ABORT("failed to convert token to piece\n");
             }
@@ -159,12 +161,14 @@ int main(int argc, char ** argv) {
             break;
         }
 
+        const char * tmpl = llama_model_chat_template(model);
+
         // add the user input to the message list and format it
         messages.push_back({"user", strdup(user.c_str())});
-        int new_len = llama_chat_apply_template(model, nullptr, messages.data(), messages.size(), true, formatted.data(), formatted.size());
+        int new_len = llama_chat_apply_template(tmpl, messages.data(), messages.size(), true, formatted.data(), formatted.size());
         if (new_len > (int)formatted.size()) {
             formatted.resize(new_len);
-            new_len = llama_chat_apply_template(model, nullptr, messages.data(), messages.size(), true, formatted.data(), formatted.size());
+            new_len = llama_chat_apply_template(tmpl, messages.data(), messages.size(), true, formatted.data(), formatted.size());
         }
         if (new_len < 0) {
             fprintf(stderr, "failed to apply the chat template\n");
@@ -181,7 +185,7 @@ int main(int argc, char ** argv) {
 
         // add the response to the messages
         messages.push_back({"assistant", strdup(response.c_str())});
-        prev_len = llama_chat_apply_template(model, nullptr, messages.data(), messages.size(), false, nullptr, 0);
+        prev_len = llama_chat_apply_template(tmpl, messages.data(), messages.size(), false, nullptr, 0);
         if (prev_len < 0) {
            fprintf(stderr, "failed to apply the chat template\n");
            return 1;
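
If it helps to see the migration in one place: the hunks above replace the `llama_model *` argument of the tokenizer helpers with a `llama_vocab *` obtained once from the model. The following is a minimal sketch assembled only from the calls that appear in these hunks; the helper name and the empty-vector error path are illustrative, not part of the commit.

```cpp
#include "llama.h"
#include <string>
#include <vector>

// Tokenize a prompt with the new vocab-based API (sketch).
static std::vector<llama_token> tokenize_prompt(llama_model * model, llama_context * ctx, const std::string & prompt) {
    const llama_vocab * vocab = llama_model_get_vocab(model);

    // first call with a NULL buffer returns the negated number of tokens needed
    const int n_prompt_tokens = -llama_tokenize(vocab, prompt.c_str(), prompt.size(), NULL, 0, true, true);

    std::vector<llama_token> prompt_tokens(n_prompt_tokens);
    // add BOS only when the KV cache is still empty, as in the example above
    if (llama_tokenize(vocab, prompt.c_str(), prompt.size(), prompt_tokens.data(), prompt_tokens.size(),
                       llama_get_kv_cache_used_cells(ctx) == 0, true) < 0) {
        return {}; // tokenization failed
    }
    return prompt_tokens;
}
```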
@@ -84,6 +84,7 @@ int main(int argc, char ** argv) {
     model_params.n_gpu_layers = ngl;
 
     llama_model * model = llama_model_load_from_file(model_path.c_str(), model_params);
+    const llama_vocab * vocab = llama_model_get_vocab(model);
 
     if (model == NULL) {
         fprintf(stderr , "%s: error: unable to load model\n" , __func__);
@@ -93,11 +94,11 @@ int main(int argc, char ** argv) {
     // tokenize the prompt
 
     // find the number of tokens in the prompt
-    const int n_prompt = -llama_tokenize(model, prompt.c_str(), prompt.size(), NULL, 0, true, true);
+    const int n_prompt = -llama_tokenize(vocab, prompt.c_str(), prompt.size(), NULL, 0, true, true);
 
     // allocate space for the tokens and tokenize the prompt
     std::vector<llama_token> prompt_tokens(n_prompt);
-    if (llama_tokenize(model, prompt.c_str(), prompt.size(), prompt_tokens.data(), prompt_tokens.size(), true, true) < 0) {
+    if (llama_tokenize(vocab, prompt.c_str(), prompt.size(), prompt_tokens.data(), prompt_tokens.size(), true, true) < 0) {
         fprintf(stderr, "%s: error: failed to tokenize the prompt\n", __func__);
         return 1;
     }
@@ -112,7 +113,7 @@ int main(int argc, char ** argv) {
     // enable performance counters
     ctx_params.no_perf = false;
 
-    llama_context * ctx = llama_new_context_with_model(model, ctx_params);
+    llama_context * ctx = llama_init_from_model(model, ctx_params);
 
     if (ctx == NULL) {
         fprintf(stderr , "%s: error: failed to create the llama_context\n" , __func__);
@@ -131,7 +132,7 @@ int main(int argc, char ** argv) {
 
     for (auto id : prompt_tokens) {
         char buf[128];
-        int n = llama_token_to_piece(model, id, buf, sizeof(buf), 0, true);
+        int n = llama_token_to_piece(vocab, id, buf, sizeof(buf), 0, true);
         if (n < 0) {
             fprintf(stderr, "%s: error: failed to convert token to piece\n", __func__);
             return 1;
@@ -164,12 +165,12 @@ int main(int argc, char ** argv) {
         new_token_id = llama_sampler_sample(smpl, ctx, -1);
 
         // is it an end of generation?
-        if (llama_token_is_eog(model, new_token_id)) {
+        if (llama_vocab_is_eog(vocab, new_token_id)) {
             break;
         }
 
         char buf[128];
-        int n = llama_token_to_piece(model, new_token_id, buf, sizeof(buf), 0, true);
+        int n = llama_token_to_piece(vocab, new_token_id, buf, sizeof(buf), 0, true);
         if (n < 0) {
             fprintf(stderr, "%s: error: failed to convert token to piece\n", __func__);
             return 1;
@@ -45,6 +45,8 @@ int main(int argc, char ** argv) {
     model_tgt = llama_init_tgt.model.get();
     ctx_tgt   = llama_init_tgt.context.get();
 
+    const llama_vocab * vocab = llama_model_get_vocab(model_tgt);
+
     // load the draft model
     params.devices = params.speculative.devices;
     params.model = params.speculative.model;
@@ -196,7 +198,7 @@ int main(int argc, char ** argv) {
 
             id_last = ids[i];
 
-            if (llama_token_is_eog(model_tgt, id_last)) {
+            if (llama_vocab_is_eog(vocab, id_last)) {
                 has_eos = true;
                 break;
             }
@@ -90,10 +90,13 @@ int main(int argc, char ** argv) {
     model_dft = llama_init_dft.model.get();
     ctx_dft   = llama_init_dft.context.get();
 
-    const bool vocab_type_tgt = llama_vocab_type(model_tgt);
+    const llama_vocab * vocab_tgt = llama_model_get_vocab(model_tgt);
+    const llama_vocab * vocab_dft = llama_model_get_vocab(model_dft);
+
+    const bool vocab_type_tgt = llama_vocab_type(vocab_tgt);
     LOG_DBG("vocab_type tgt: %d\n", vocab_type_tgt);
 
-    const bool vocab_type_dft = llama_vocab_type(model_dft);
+    const bool vocab_type_dft = llama_vocab_type(vocab_dft);
     LOG_DBG("vocab_type dft: %d\n", vocab_type_dft);
 
     if (vocab_type_tgt != vocab_type_dft) {
@@ -103,18 +106,18 @@ int main(int argc, char ** argv) {
     }
 
     if (
-        llama_add_bos_token(model_tgt) != llama_add_bos_token(model_dft) ||
-        llama_add_eos_token(model_tgt) != llama_add_eos_token(model_dft) ||
-        llama_token_bos(model_tgt) != llama_token_bos(model_dft) ||
-        llama_token_eos(model_tgt) != llama_token_eos(model_dft)
+        llama_vocab_get_add_bos(vocab_tgt) != llama_vocab_get_add_bos(vocab_dft) ||
+        llama_vocab_get_add_eos(vocab_tgt) != llama_vocab_get_add_eos(vocab_dft) ||
+        llama_vocab_bos(vocab_tgt) != llama_vocab_bos(vocab_dft) ||
+        llama_vocab_eos(vocab_tgt) != llama_vocab_eos(vocab_dft)
     ) {
         LOG_ERR("%s: draft model special tokens must match target model to use speculation\n", __func__);
         return 1;
     }
 
     {
-        const int n_vocab_tgt = llama_n_vocab(model_tgt);
-        const int n_vocab_dft = llama_n_vocab(model_dft);
+        const int n_vocab_tgt = llama_vocab_n_tokens(vocab_tgt);
+        const int n_vocab_dft = llama_vocab_n_tokens(vocab_dft);
         const int vocab_diff = n_vocab_tgt > n_vocab_dft
             ? n_vocab_tgt - n_vocab_dft
             : n_vocab_dft - n_vocab_tgt;
@@ -122,13 +125,13 @@ int main(int argc, char ** argv) {
         if (vocab_diff > SPEC_VOCAB_MAX_SIZE_DIFFERENCE) {
             LOG_ERR("%s: draft model vocab must closely match target model to use speculation but ", __func__);
             LOG_ERR("target vocab size %d does not match draft vocab size %d - difference %d, max allowed %d\n",
-                    n_vocab_tgt, llama_n_vocab(model_dft), vocab_diff, SPEC_VOCAB_MAX_SIZE_DIFFERENCE);
+                    n_vocab_tgt, llama_vocab_n_tokens(vocab_dft), vocab_diff, SPEC_VOCAB_MAX_SIZE_DIFFERENCE);
             return 1;
         }
 
         for (int i = SPEC_VOCAB_CHECK_START_TOKEN_ID; i < std::min(n_vocab_tgt, n_vocab_dft); ++i) {
-            const char * token_text_tgt = llama_token_get_text(model_tgt, i);
-            const char * token_text_dft = llama_token_get_text(model_dft, i);
+            const char * token_text_tgt = llama_vocab_get_text(vocab_tgt, i);
+            const char * token_text_dft = llama_vocab_get_text(vocab_dft, i);
             if (std::strcmp(token_text_tgt, token_text_dft) != 0) {
                 LOG_ERR("%s: draft model vocab must match target model to use speculation but ", __func__);
                 LOG_ERR("token %d content differs - target '%s', draft '%s'\n", i,
@@ -170,7 +173,7 @@ int main(int argc, char ** argv) {
     const auto t_enc_end = ggml_time_us();
 
     // the 2 models should have the same vocab
-    //GGML_ASSERT(n_vocab == llama_n_vocab(model_dft));
+    //GGML_ASSERT(n_vocab == llama_vocab_n_tokens(model_dft));
 
     // how many tokens to draft each time
     int n_draft = params.speculative.n_max;
@@ -386,7 +389,7 @@ int main(int argc, char ** argv) {
                 }
             }
 
-            if (llama_token_is_eog(model_tgt, token_id)) {
+            if (llama_vocab_is_eog(vocab_tgt, token_id)) {
                 has_eos = true;
             }
             ++n_predict;
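
The hunks above port the draft/target compatibility checks to the vocab API. As a small worked example, the same checks can be grouped into a predicate like the one below; it uses only functions that appear in these hunks, but the helper name and return convention are assumptions for illustration, not code from the commit.

```cpp
#include "llama.h"

// Sketch: do the target and draft models agree on vocab type and special tokens?
static bool spec_vocabs_compatible(const llama_model * model_tgt, const llama_model * model_dft) {
    const llama_vocab * vocab_tgt = llama_model_get_vocab(model_tgt);
    const llama_vocab * vocab_dft = llama_model_get_vocab(model_dft);

    if (llama_vocab_type(vocab_tgt) != llama_vocab_type(vocab_dft)) {
        return false; // different tokenizer families cannot be used for speculation
    }

    // the draft model's special tokens must match the target model's
    return llama_vocab_get_add_bos(vocab_tgt) == llama_vocab_get_add_bos(vocab_dft) &&
           llama_vocab_get_add_eos(vocab_tgt) == llama_vocab_get_add_eos(vocab_dft) &&
           llama_vocab_bos(vocab_tgt)         == llama_vocab_bos(vocab_dft) &&
           llama_vocab_eos(vocab_tgt)         == llama_vocab_eos(vocab_dft);
}
```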
@@ -344,8 +344,10 @@ int main(int raw_argc, char ** raw_argv) {
         return 1;
     }
 
+    const llama_vocab * vocab = llama_model_get_vocab(model);
+
     llama_context_params ctx_params = llama_context_default_params();
-    llama_context * ctx = llama_new_context_with_model(model, ctx_params);
+    llama_context * ctx = llama_init_from_model(model, ctx_params);
     if (!ctx) {
         fprintf(stderr, "Error: could not create context.\n");
         return 1;
@@ -365,7 +367,7 @@ int main(int raw_argc, char ** raw_argv) {
         prompt = stdin_buffer.str();
     }
 
-    const bool model_wants_add_bos = llama_add_bos_token(model);
+    const bool model_wants_add_bos = llama_vocab_get_add_bos(vocab);
     const bool add_bos = model_wants_add_bos && !no_bos;
     const bool parse_special = !no_parse_special;
     const bool escape = !no_escape;
@@ -375,7 +377,7 @@ int main(int raw_argc, char ** raw_argv) {
     }
 
     std::vector<llama_token> tokens;
-    tokens = common_tokenize(model, prompt, add_bos, parse_special);
+    tokens = common_tokenize(vocab, prompt, add_bos, parse_special);
 
     if (printing_ids) {
         printf("[");
 80  examples/tts/README.md  Normal file
@@ -0,0 +1,80 @@
# llama.cpp/example/tts
This example demonstrates the Text To Speech feature. It uses a
[model](https://www.outeai.com/blog/outetts-0.2-500m) from
[outeai](https://www.outeai.com/).

## Quickstart
If you have built llama.cpp with `-DLLAMA_CURL=ON` you can simply run the
following command and the required models will be downloaded automatically:
```console
$ build/bin/llama-tts --tts-oute-default -p "Hello world" && aplay output.wav
```
For details about the models and how to convert them to the required format
see the following sections.

### Model conversion
Checkout or download the model that contains the LLM model:
```console
$ pushd models
$ git clone --branch main --single-branch --depth 1 https://huggingface.co/OuteAI/OuteTTS-0.2-500M
$ cd OuteTTS-0.2-500M && git lfs install && git lfs pull
$ popd
```
Convert the model to .gguf format:
```console
(venv) python convert_hf_to_gguf.py models/OuteTTS-0.2-500M \
--outfile models/outetts-0.2-0.5B-f16.gguf --outtype f16
```
The generated model will be `models/outetts-0.2-0.5B-f16.gguf`.

We can optionally quantize this to Q8_0 using the following command:
```console
$ build/bin/llama-quantize models/outetts-0.2-0.5B-f16.gguf \
models/outetts-0.2-0.5B-q8_0.gguf q8_0
```
The quantized model will be `models/outetts-0.2-0.5B-q8_0.gguf`.

Next we do something similar for the audio decoder. First download or checkout
the model for the voice decoder:
```console
$ pushd models
$ git clone --branch main --single-branch --depth 1 https://huggingface.co/novateur/WavTokenizer-large-speech-75token
$ cd WavTokenizer-large-speech-75token && git lfs install && git lfs pull
$ popd
```
This model file is a PyTorch checkpoint (.ckpt) and we first need to convert it to
huggingface format:
```console
(venv) python examples/tts/convert_pt_to_hf.py \
models/WavTokenizer-large-speech-75token/wavtokenizer_large_speech_320_24k.ckpt
...
Model has been successfully converted and saved to models/WavTokenizer-large-speech-75token/model.safetensors
Metadata has been saved to models/WavTokenizer-large-speech-75token/index.json
Config has been saved to models/WavTokenizer-large-speech-75tokenconfig.json
```
Then we can convert the huggingface format to gguf:
```console
(venv) python convert_hf_to_gguf.py models/WavTokenizer-large-speech-75token \
--outfile models/wavtokenizer-large-75-f16.gguf --outtype f16
...
INFO:hf-to-gguf:Model successfully exported to models/wavtokenizer-large-75-f16.gguf
```

### Running the example

With both of the models generated, the LLM model and the voice decoder model,
we can run the example:
```console
$ build/bin/llama-tts -m ./models/outetts-0.2-0.5B-q8_0.gguf \
-mv ./models/wavtokenizer-large-75-f16.gguf \
-p "Hello world"
...
main: audio written to file 'output.wav'
```
The output.wav file will contain the audio of the prompt. This can be heard
by playing the file with a media player. On Linux the following command will
play the audio:
```console
$ aplay output.wav
```
@@ -414,15 +414,15 @@ static void prompt_add(llama_tokens & prompt, const llama_tokens & tokens) {
     prompt.insert(prompt.end(), tokens.begin(), tokens.end());
 }
 
-static void prompt_add(llama_tokens & prompt, const llama_model * model, const std::string & txt, bool add_special, bool parse_special) {
-    auto tmp = common_tokenize(model, txt, add_special, parse_special);
+static void prompt_add(llama_tokens & prompt, const llama_vocab * vocab, const std::string & txt, bool add_special, bool parse_special) {
+    auto tmp = common_tokenize(vocab, txt, add_special, parse_special);
     prompt_add(prompt, tmp);
 }
 
-static void prompt_init(llama_tokens & prompt, const llama_model * model) {
+static void prompt_init(llama_tokens & prompt, const llama_vocab * vocab) {
     prompt.clear();
 
-    prompt_add(prompt, model, "<|im_start|>\n", true, true);
+    prompt_add(prompt, vocab, "<|im_start|>\n", true, true);
 }
 
 int main(int argc, char ** argv) {
@@ -462,6 +462,8 @@ int main(int argc, char ** argv) {
     model_ttc = llama_init_ttc.model.get();
     ctx_ttc   = llama_init_ttc.context.get();
 
+    const llama_vocab * vocab = llama_model_get_vocab(model_ttc);
+
     // TODO: refactor in a common struct
     params.model     = params.vocoder.model;
     params.model_url = params.vocoder.model_url;
@@ -499,9 +501,9 @@ int main(int argc, char ** argv) {
 
     std::vector<llama_token> prompt_inp;
 
-    prompt_init(prompt_inp, model_ttc);
+    prompt_init(prompt_inp, vocab);
 
-    prompt_add(prompt_inp, model_ttc, "<|text_start|>the<|text_sep|>overall<|text_sep|>package<|text_sep|>from<|text_sep|>just<|text_sep|>two<|text_sep|>people<|text_sep|>is<|text_sep|>pretty<|text_sep|>remarkable<|text_sep|>sure<|text_sep|>i<|text_sep|>have<|text_sep|>some<|text_sep|>critiques<|text_sep|>about<|text_sep|>some<|text_sep|>of<|text_sep|>the<|text_sep|>gameplay<|text_sep|>aspects<|text_sep|>but<|text_sep|>its<|text_sep|>still<|text_sep|>really<|text_sep|>enjoyable<|text_sep|>and<|text_sep|>it<|text_sep|>looks<|text_sep|>lovely<|text_sep|>", false, true);
+    prompt_add(prompt_inp, vocab, "<|text_start|>the<|text_sep|>overall<|text_sep|>package<|text_sep|>from<|text_sep|>just<|text_sep|>two<|text_sep|>people<|text_sep|>is<|text_sep|>pretty<|text_sep|>remarkable<|text_sep|>sure<|text_sep|>i<|text_sep|>have<|text_sep|>some<|text_sep|>critiques<|text_sep|>about<|text_sep|>some<|text_sep|>of<|text_sep|>the<|text_sep|>gameplay<|text_sep|>aspects<|text_sep|>but<|text_sep|>its<|text_sep|>still<|text_sep|>really<|text_sep|>enjoyable<|text_sep|>and<|text_sep|>it<|text_sep|>looks<|text_sep|>lovely<|text_sep|>", false, true);
 
     // convert the input text into the necessary format expected by OuteTTS
     {
@@ -509,10 +511,10 @@ int main(int argc, char ** argv) {
 
         LOG_INF("%s: prompt: '%s'\n", __func__, prompt_clean.c_str());
 
-        prompt_add(prompt_inp, model_ttc, prompt_clean, false, true);
+        prompt_add(prompt_inp, vocab, prompt_clean, false, true);
     }
 
-    prompt_add(prompt_inp, model_ttc, "<|text_end|>\n", false, true);
+    prompt_add(prompt_inp, vocab, "<|text_end|>\n", false, true);
 
     // disabled to save time on tokenizing each time
     // TODO: load voices from the json files
@@ -549,7 +551,7 @@
 looks<|t_0.27|><|code_start|><|1281|><|1266|><|1755|><|572|><|248|><|1751|><|1257|><|695|><|1380|><|457|><|659|><|585|><|1315|><|1105|><|1776|><|736|><|24|><|736|><|654|><|1027|><|code_end|>
 lovely<|t_0.56|><|code_start|><|634|><|596|><|1766|><|1556|><|1306|><|1285|><|1481|><|1721|><|1123|><|438|><|1246|><|1251|><|795|><|659|><|1381|><|1658|><|217|><|1772|><|562|><|952|><|107|><|1129|><|1112|><|467|><|550|><|1079|><|840|><|1615|><|1469|><|1380|><|168|><|917|><|836|><|1827|><|437|><|583|><|67|><|595|><|1087|><|1646|><|1493|><|1677|><|code_end|>)";
 
-    auto tmp = common_tokenize(model_ttc, voice_data, false, true);
+    auto tmp = common_tokenize(vocab, voice_data, false, true);
     printf("\n\n");
     for (int i = 0; i < tmp.size(); ++i) {
         printf("%d, ", tmp[i]);
@@ -735,9 +737,9 @@
             const auto * cands = common_sampler_get_candidates(smpl[i]);
 
             // is it an end of generation? -> mark the stream as finished
-            if (llama_token_is_eog(model_ttc, new_token_id) || n_decode == n_predict) {
+            if (llama_vocab_is_eog(vocab, new_token_id) || n_decode == n_predict) {
                 std::string reason;
-                if (llama_token_is_eog(model_ttc, new_token_id)) {
+                if (llama_vocab_is_eog(vocab, new_token_id)) {
                     reason = "eos";
                 } else {
                     reason = "n_predict";
@@ -873,7 +875,7 @@
 
 #if 1
     // spectral operations
-    const int n_embd = llama_n_embd(model_cts);
+    const int n_embd = llama_model_n_embd(model_cts);
     const float * embd = llama_get_embeddings(ctx_cts);
 
     auto audio = embd_to_audio(embd, n_codes, n_embd, params.cpuparams.n_threads);
@@ -501,6 +501,7 @@ extern "C" {
         GGML_OP_GET_REL_POS,
         GGML_OP_ADD_REL_POS,
         GGML_OP_RWKV_WKV6,
+        GGML_OP_GATED_LINEAR_ATTN,
 
         GGML_OP_UNARY,
 
@@ -1859,6 +1860,15 @@ extern "C" {
             struct ggml_tensor  * td,
             struct ggml_tensor  * state);
 
+    GGML_API struct ggml_tensor * ggml_gated_linear_attn(
+            struct ggml_context * ctx,
+            struct ggml_tensor  * k,
+            struct ggml_tensor  * v,
+            struct ggml_tensor  * q,
+            struct ggml_tensor  * g,
+            struct ggml_tensor  * state,
+            float scale);
+
     // custom operators
 
     typedef void (*ggml_unary_op_f32_t) (const int, float *, const float *);
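
The declaration above fixes the argument order of the new operator (k, v, q, g, recurrent state, plus a scalar scale applied to q). A hypothetical usage sketch follows; the tensor shapes are not specified by this header, so treat the expected layout ([head_size, n_heads, n_tokens, n_seqs] for k/v/q/g, per-head square state) as an assumption inferred from the CPU kernel later in this commit and verify against `test-backend-ops`.

```c
#include "ggml.h"

// Illustrative wrapper only: build a gated-linear-attention node in a ggml graph.
static struct ggml_tensor * build_gla(struct ggml_context * ctx,
                                      struct ggml_tensor * k,
                                      struct ggml_tensor * v,
                                      struct ggml_tensor * q,
                                      struct ggml_tensor * g,
                                      struct ggml_tensor * state,
                                      float scale) {
    // scale is applied to q inside the op before the output accumulation
    return ggml_gated_linear_attn(ctx, k, v, q, g, state, scale);
}
```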
@@ -11803,9 +11803,9 @@ static void ggml_compute_forward_add_rel_pos(
 static void ggml_compute_forward_rwkv_wkv6_f32(
         const struct ggml_compute_params * params,
         struct ggml_tensor * dst) {
-    const int64_t T = dst->src[1]->ne[3];
+    const int64_t T = dst->src[1]->ne[2];
     const int64_t C = dst->ne[0];
-    const int64_t HEADS = dst->src[1]->ne[2];
+    const int64_t HEADS = dst->src[1]->ne[1];
     const int64_t n_seqs = dst->src[5]->ne[1];
     const int64_t head_size = C / HEADS;
 
@@ -12000,6 +12000,197 @@ static void ggml_compute_forward_rwkv_wkv6(
         }
     }
 }
 
+// ggml_compute_forward_gla
+
+static void ggml_compute_forward_gla_f32(
+        const struct ggml_compute_params * params,
+        struct ggml_tensor * dst) {
+    const int64_t T = dst->src[1]->ne[2];
+    const int64_t C = dst->ne[0];
+    const int64_t HEADS = dst->src[1]->ne[1];
+    const int64_t n_seqs = dst->src[4]->ne[1];
+    const int64_t head_size = C / HEADS;
+    const float scale = ggml_get_op_params_f32(dst, 0);
+
+    float * dst_data = (float *) dst->data;
+    float * state = ((float *) dst->data) + C * T;
+
+    const int ith = params->ith;
+    const int nth = params->nth;
+
+    if (ith >= HEADS) {
+        return;
+    }
+
+    const int h_start = (HEADS * ith) / nth;
+    const int h_end = ((HEADS * (ith + 1)) / nth < HEADS) ?
+                (HEADS * (ith + 1)) / nth : HEADS;
+
+    float * k = (float *) dst->src[0]->data;
+    float * v = (float *) dst->src[1]->data;
+    float * q = (float *) dst->src[2]->data;
+    float * g = (float *) dst->src[3]->data;
+
+    size_t t_stride = HEADS * head_size; // Same to C
+
+    size_t h_stride = C / HEADS;
+    GGML_ASSERT(C % HEADS == 0); // C must be divisible by HEADS
+    size_t h_stride_2d = head_size * head_size;
+
+    if (ith == 0) {
+        memset(dst_data, 0, T * C * sizeof(float));
+    }
+    ggml_barrier(params->threadpool);
+
+
+    #if defined(__AVX__) && !defined(__AVX512F__)
+        #define GGML_F32X GGML_F32x8
+        #define GGML_F32X_SET1 GGML_F32x8_SET1
+        #define GGML_F32X_LOAD GGML_F32x8_LOAD
+        #define GGML_F32X_STORE GGML_F32x8_STORE
+        #define GGML_F32X_MUL GGML_F32x8_MUL
+        #define GGML_F32X_FMA GGML_F32x8_FMA
+        #define GLA_VECTOR_SIZE 8
+    #elif defined(__AVX512F__)
+        #define GGML_F32X GGML_F32x16
+        #define GGML_F32X_SET1 GGML_F32x16_SET1
+        #define GGML_F32X_LOAD GGML_F32x16_LOAD
+        #define GGML_F32X_STORE GGML_F32x16_STORE
+        #define GGML_F32X_MUL GGML_F32x16_MUL
+        #define GGML_F32X_FMA GGML_F32x16_FMA
+        #define GLA_VECTOR_SIZE 16
+    #elif defined(__ARM_NEON) && defined(__aarch64__)
+        #define GGML_F32X GGML_F32x4
+        #define GGML_F32X_SET1 GGML_F32x4_SET1
+        #define GGML_F32X_LOAD GGML_F32x4_LOAD
+        #define GGML_F32X_STORE GGML_F32x4_STORE
+        #define GGML_F32X_MUL GGML_F32x4_MUL
+        #define GGML_F32X_FMA GGML_F32x4_FMA
+        #define GLA_VECTOR_SIZE 4
+    #endif
+
+    #ifdef GLA_VECTOR_SIZE
+        const int64_t vec_count = head_size / GLA_VECTOR_SIZE;
+
+        for (int64_t t = 0; t < T; t++) {
+            size_t t_offset = t * t_stride;
+            size_t state_offset = head_size * C * (t / (T / n_seqs));
+            float * state_cur = state + state_offset;
+            float * state_prev = t % (T / n_seqs) ? state_cur : (float*)dst->src[4]->data + state_offset;
+
+            for (int64_t h = h_start; h < h_end; h++) {
+                size_t h_offset = h * h_stride;
+                size_t t_h_offset = t_offset + h_offset;
+                size_t h_2d_offset = h * h_stride_2d;
+
+                for (int64_t i = 0; i < head_size; i++) {
+                    size_t t_h_i_offset = t_h_offset + i;
+                    size_t h_2d_i_offset = h_2d_offset + i * h_stride;
+
+                    float k_val = k[t_h_i_offset];
+                    float q_val = q[t_h_i_offset] * scale;
+                    float g_val = g[t_h_i_offset];
+
+                    // Broadcast scalar values to vectors
+                    GGML_F32X k_vec = GGML_F32X_SET1(k_val);
+                    GGML_F32X q_vec = GGML_F32X_SET1(q_val);
+                    GGML_F32X g_vec = GGML_F32X_SET1(g_val);
+
+                    for (int64_t j = 0; j < vec_count; j++) {
+                        size_t base_j = j * GLA_VECTOR_SIZE;
+                        size_t t_h_j_offset = t_h_offset + base_j;
+                        size_t h_2d_i_j_offset = h_2d_i_offset + base_j;
+
+                        // Load x elements at once
+                        GGML_F32X v_vec = GGML_F32X_LOAD(&v[t_h_j_offset]);
+                        GGML_F32X prev_state_vec = GGML_F32X_LOAD(&state_prev[h_2d_i_j_offset]);
+                        GGML_F32X dst_vec = GGML_F32X_LOAD(&dst_data[t_h_j_offset]);
+
+                        // Compute kv = v * k
+                        GGML_F32X kv_vec = GGML_F32X_MUL(v_vec, k_vec);
+
+                        // Compute temp = prev_state * g + kv
+                        GGML_F32X temp_vec = GGML_F32X_FMA(kv_vec, prev_state_vec, g_vec);
+
+                        // Update dst: dst += temp * q
+                        dst_vec = GGML_F32X_FMA(dst_vec, temp_vec, q_vec);
+                        GGML_F32X_STORE(&dst_data[t_h_j_offset], dst_vec);
+
+                        // Update state
+                        GGML_F32X_STORE(&state_cur[h_2d_i_j_offset], temp_vec);
+                    }
+
+                    // Handle remaining elements, this will not be used.
+                    for (int64_t j = vec_count * GLA_VECTOR_SIZE; j < head_size; j++) {
+                        size_t t_h_j_offset = t_h_offset + j;
+                        size_t h_2d_i_j_offset = h_2d_i_offset + j;
+                        float v_val = v[t_h_j_offset];
+                        float kv_val = v_val * k_val;
+                        float prev_state_val = state_prev[h_2d_i_j_offset];
+                        float temp_val = kv_val + prev_state_val * g_val;
+                        dst_data[t_h_j_offset] += temp_val * q_val;
+                        state_cur[h_2d_i_j_offset] = temp_val;
+                    }
+                }
+            }
+        }
+
+    #else
+        for (int64_t t = 0; t < T; t++) {
+            size_t t_offset = t * t_stride;
+            size_t state_offset = head_size * C * (t / (T / n_seqs));
+            float * state_cur = state + state_offset;
+            float * state_prev = t % (T / n_seqs) ? state_cur : (float*)dst->src[4]->data + state_offset;
+
+            for (int64_t h = h_start; h < h_end; h++) {
+                size_t h_offset = h * h_stride;
+                size_t t_h_offset = t_offset + h_offset;
+                size_t h_2d_offset = h * h_stride_2d;
+
+                for (int64_t i = 0; i < head_size; i++) {
+                    size_t t_h_i_offset = t_h_offset + i;
+                    size_t h_2d_i_offset = h_2d_offset + i * h_stride;
+
+                    float k_val = k[t_h_i_offset];
+                    float q_val = q[t_h_i_offset] * scale;
+                    float g_val = g[t_h_i_offset];
+
+                    for (int64_t j = 0; j < head_size; j++) {
+                        size_t t_h_j_offset = t_h_offset + j;
+                        size_t h_2d_i_j_offset = h_2d_i_offset + j;
+
+                        float v_val = v[t_h_j_offset];
+                        float kv_val = v_val * k_val;
+                        float prev_state_val = state_prev[h_2d_i_j_offset];
+                        float temp_val = prev_state_val * g_val + kv_val;
+                        dst_data[t_h_j_offset] += temp_val * q_val;
+                        state_cur[h_2d_i_j_offset] = temp_val;
+                    }
+                }
+            }
+        }
+    #endif
+}
+
+
+static void ggml_compute_forward_gla(
+        const struct ggml_compute_params * params,
+        struct ggml_tensor * dst) {
+
+    const struct ggml_tensor * src0 = dst->src[0];
+
+    switch (src0->type) {
+        case GGML_TYPE_F32:
+            {
+                ggml_compute_forward_gla_f32(params, dst);
+            } break;
+        default:
+            {
+                GGML_ABORT("fatal error");
+            }
+    }
+}
+
 // ggml_compute_forward_map_unary
 
 static void ggml_compute_forward_map_unary_f32(
@@ -12749,6 +12940,10 @@ static void ggml_compute_forward(struct ggml_compute_params * params, struct ggm
             {
                 ggml_compute_forward_rwkv_wkv6(params, tensor);
             } break;
+        case GGML_OP_GATED_LINEAR_ATTN:
+            {
+                ggml_compute_forward_gla(params, tensor);
+            } break;
         case GGML_OP_MAP_UNARY:
             {
                 ggml_unary_op_f32_t fun;
@@ -13047,6 +13242,7 @@ static int ggml_get_n_tasks(struct ggml_tensor * node, int n_threads) {
         case GGML_OP_WIN_UNPART:
         case GGML_OP_GET_REL_POS:
         case GGML_OP_RWKV_WKV6:
+        case GGML_OP_GATED_LINEAR_ATTN:
         case GGML_OP_MAP_UNARY:
         case GGML_OP_MAP_BINARY:
         case GGML_OP_MAP_CUSTOM1_F32:
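
Stripped of the SIMD plumbing, the per-head recurrence implemented by the new kernel reduces to a gated outer-product state update followed by an output accumulation. The scalar restatement below is a readability aid written for this note (it is not code from the commit); it mirrors the non-vectorized branch of the kernel for a single head of a single sequence.

```c
// Scalar reference of the gated-linear-attention recurrence for one head.
// k, v, q, g are laid out as [t][i] with t < T and i < head_size;
// S is the head_size x head_size state, updated in place;
// out is [t][j] and must be zero-initialized (the kernel memsets it).
static void gla_head_reference(int T, int head_size, float scale,
                               const float * k, const float * v,
                               const float * q, const float * g,
                               float * S, float * out) {
    for (int t = 0; t < T; t++) {
        for (int i = 0; i < head_size; i++) {
            const float k_t = k[t*head_size + i];
            const float q_t = q[t*head_size + i] * scale;
            const float g_t = g[t*head_size + i];
            for (int j = 0; j < head_size; j++) {
                // state update: S[i][j] = g * S[i][j] + k * v  (gated outer product)
                const float s = g_t * S[i*head_size + j] + k_t * v[t*head_size + j];
                S[i*head_size + j] = s;
                // output accumulation: out[t][j] += q * S[i][j]
                out[t*head_size + j] += q_t * s;
            }
        }
    }
}
```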
@@ -124,7 +124,7 @@ static __global__ void __launch_bounds__(CUDA_CONCAT_BLOCK_SIZE)
     uint64_t nb1,
     uint64_t nb2,
     uint64_t nb3){
-    static_assert(dim >= 0 && dim <= 3);
+    static_assert(dim >= 0 && dim <= 3, "dim must be in [0, 3]");
 
     const int64_t i3 = blockIdx.z;
     const int64_t i2 = blockIdx.y;
@@ -37,6 +37,7 @@
 #include "ggml-cuda/unary.cuh"
 #include "ggml-cuda/upscale.cuh"
 #include "ggml-cuda/wkv6.cuh"
+#include "ggml-cuda/gla.cuh"
 
 #include <algorithm>
 #include <array>
@@ -2167,6 +2168,9 @@ static bool ggml_cuda_compute_forward(ggml_backend_cuda_context & ctx, struct gg
         case GGML_OP_RWKV_WKV6:
             ggml_cuda_op_rwkv_wkv6(ctx, dst);
             break;
+        case GGML_OP_GATED_LINEAR_ATTN:
+            ggml_cuda_op_gated_linear_attn(ctx, dst);
+            break;
         case GGML_OP_CROSS_ENTROPY_LOSS_BACK:
             ggml_cuda_cross_entropy_loss_back(ctx, dst);
             break;
@@ -2285,6 +2289,66 @@ static void ggml_backend_cuda_synchronize(ggml_backend_t backend) {
 }
 
 #ifdef USE_CUDA_GRAPH
+static bool check_node_graph_compatibility_and_refresh_copy_ops(ggml_backend_cuda_context * cuda_ctx, ggml_cgraph * cgraph,
+    std::vector<void *> & ggml_cuda_cpy_fn_ptrs, bool use_cuda_graph) {
+
+    // Loop over nodes in GGML graph to obtain info needed for CUDA graph
+    cuda_ctx->cuda_graph->updated_kernel_arg.clear();
+    for (int i = 0; i < cgraph->n_nodes; i++) {
+        ggml_tensor * node = cgraph->nodes[i];
+
+        if (ggml_is_empty(node) || node->op == GGML_OP_RESHAPE || node->op == GGML_OP_TRANSPOSE || node->op == GGML_OP_VIEW || node->op == GGML_OP_PERMUTE || node->op == GGML_OP_NONE) {
+            continue;
+        }
+
+        if (node->src[0] && node->src[0]->buffer && ggml_backend_buft_is_cuda_split(node->src[0]->buffer->buft)) {
+            use_cuda_graph = false; // Split buffers are not supported by CUDA graph capture
+#ifndef NDEBUG
+            GGML_LOG_DEBUG("%s: disabling CUDA graphs due to split buffer\n", __func__);
+#endif
+        }
+
+        if (node->op == GGML_OP_MUL_MAT_ID) {
+            use_cuda_graph = false; // This node type is not supported by CUDA graph capture
+#ifndef NDEBUG
+            GGML_LOG_DEBUG("%s: disabling CUDA graphs due to mul_mat_id\n", __func__);
+#endif
+        }
+
+        if (node->op == GGML_OP_ADD && node->src[1] && node->src[1]->ne[1] > 1) {
+            // disable CUDA graphs for batch size > 1 for now.
+            // Changes in batch size or context size can cause changes to the grid size of some kernels.
+            use_cuda_graph = false;
+#ifndef NDEBUG
+            GGML_LOG_DEBUG("%s: disabling CUDA graphs due to batch size > 1 [%s] [%ld %ld %ld %ld]\n", __func__, node->name, node->ne[0], node->ne[1], node->ne[2], node->ne[3]);
+#endif
+        }
+
+        if (node->op == GGML_OP_CPY) {
+            // store the copy op parameter which changes with each token.
+            cuda_ctx->cuda_graph->updated_kernel_arg.push_back((char **) &(node->src[1]->data));
+            // store a pointer to each copy op CUDA kernel to identify it later
+            void * ptr = ggml_cuda_cpy_fn(node->src[0], node->src[1]);
+            if (!ptr) {
+                use_cuda_graph = false;
+#ifndef NDEBUG
+                GGML_LOG_DEBUG("%s: disabling CUDA graphs due to unsupported copy op\n", __func__);
+#endif
+            } else {
+                if (std::find(ggml_cuda_cpy_fn_ptrs.begin(), ggml_cuda_cpy_fn_ptrs.end(), ptr) == ggml_cuda_cpy_fn_ptrs.end()) {
+                    ggml_cuda_cpy_fn_ptrs.push_back(ptr);
+                }
+            }
+        }
+
+        if (!use_cuda_graph) {
+            break;
+        }
+    }
+
+    return use_cuda_graph;
+}
+
 static void set_ggml_graph_node_properties(ggml_tensor * node, ggml_graph_node_properties * graph_node_properties) {
     graph_node_properties->node_address = node->data;
     graph_node_properties->node_op = node->op;
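
The new helper folds the per-node compatibility checks and copy-kernel bookkeeping out of the graph-compute path. The wrapper below is an illustrative sketch of how a caller would presumably consume it (it is not a line from the commit); the function both refreshes the list of copy-kernel pointers and reports whether CUDA graph capture is still viable for the current ggml graph.

```cpp
// Hypothetical call-site sketch: decide whether CUDA graph capture can be used
// for this ggml graph, while refreshing the tracked copy-kernel pointers.
static bool cuda_graph_still_viable(ggml_backend_cuda_context * cuda_ctx, ggml_cgraph * cgraph,
                                    std::vector<void *> & ggml_cuda_cpy_fn_ptrs) {
    bool use_cuda_graph = true; // assume viable until a node rules it out
    use_cuda_graph = check_node_graph_compatibility_and_refresh_copy_ops(cuda_ctx, cgraph, ggml_cuda_cpy_fn_ptrs, use_cuda_graph);
    return use_cuda_graph;
}
```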
@ -2335,149 +2399,105 @@ static bool ggml_graph_node_has_matching_properties(ggml_tensor * node, ggml_gra
|
|||||||
|
|
||||||
return true;
|
return true;
|
||||||
}
|
}
|
||||||
#endif
|
|
||||||
|
|
||||||
static enum ggml_status ggml_backend_cuda_graph_compute(ggml_backend_t backend, ggml_cgraph * cgraph) {
|
static void maintain_cuda_graph(ggml_backend_cuda_context * cuda_ctx, std::vector<void *> & ggml_cuda_cpy_fn_ptrs, bool cuda_graph_update_required) {
|
||||||
ggml_backend_cuda_context * cuda_ctx = (ggml_backend_cuda_context *)backend->context;
|
|
||||||
|
|
||||||
ggml_cuda_set_device(cuda_ctx->device);
|
if (cuda_graph_update_required) {
|
||||||
|
// Extract nodes from graph
|
||||||
|
// First call with null argument gets number of nodes in graph
|
||||||
|
CUDA_CHECK(cudaGraphGetNodes(cuda_ctx->cuda_graph->graph, nullptr, &cuda_ctx->cuda_graph->num_nodes));
|
||||||
|
// Subsequent call with non-null argument gets nodes
|
||||||
|
cuda_ctx->cuda_graph->nodes.clear();
|
||||||
|
cuda_ctx->cuda_graph->nodes.resize(cuda_ctx->cuda_graph->num_nodes);
|
||||||
|
cuda_ctx->cuda_graph->params.clear();
|
||||||
|
cuda_ctx->cuda_graph->params.resize(cuda_ctx->cuda_graph->num_nodes);
|
||||||
|
if (cuda_ctx->cuda_graph->num_nodes > 0) {
|
||||||
|
CUDA_CHECK(cudaGraphGetNodes(cuda_ctx->cuda_graph->graph, cuda_ctx->cuda_graph->nodes.data(), &cuda_ctx->cuda_graph->num_nodes));
|
||||||
|
|
||||||
#ifdef USE_CUDA_GRAPH
|
// Loop over nodes, and extract kernel parameters from each node
|
||||||
static const bool disable_cuda_graphs_due_to_env = (getenv("GGML_CUDA_DISABLE_GRAPHS") != nullptr);
|
for (size_t i = 0; i < cuda_ctx->cuda_graph->num_nodes; i++) {
|
||||||
|
cudaGraphNodeType node_type;
|
||||||
// Objects required for CUDA Graph
|
CUDA_CHECK(cudaGraphNodeGetType(cuda_ctx->cuda_graph->nodes[i], &node_type));
|
||||||
if (cuda_ctx->cuda_graph == nullptr) {
|
if (node_type == cudaGraphNodeTypeKernel) {
|
||||||
cuda_ctx->cuda_graph.reset(new ggml_cuda_graph());
|
cudaError_t stat = cudaGraphKernelNodeGetParams(cuda_ctx->cuda_graph->nodes[i], &cuda_ctx->cuda_graph->params[i]); // Get params using runtime
|
||||||
}
|
if (stat == cudaErrorInvalidDeviceFunction) {
|
||||||
|
// Fails due to incorrect handling by CUDA runtime of CUDA BLAS node.
|
||||||
bool use_cuda_graph = true;
|
// We don't need to update blas nodes, so clear error and move on.
|
||||||
bool cuda_graph_update_required = false;
|
cudaGetLastError();
|
||||||
// vector of pointers to CUDA cpy kernels, which are required to identify
|
} else {
|
||||||
// kernel parameters which need updated in the graph for each token
|
GGML_ASSERT(stat == cudaSuccess);
|
||||||
std::vector<void *> ggml_cuda_cpy_fn_ptrs;
|
|
||||||
|
|
||||||
if (cuda_ctx->cuda_graph->graph == nullptr) {
|
|
||||||
if (ggml_cuda_info().devices[cuda_ctx->device].cc < GGML_CUDA_CC_AMPERE) {
|
|
||||||
cuda_ctx->cuda_graph->disable_due_to_gpu_arch = true;
|
|
||||||
#ifndef NDEBUG
|
|
||||||
GGML_LOG_DEBUG("%s: disabling CUDA graphs due to GPU architecture\n", __func__);
|
|
||||||
#endif
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// Disable CUDA graphs in presence of env var, old GPU, use-case which is changing too rapidly,
|
|
||||||
// or previous graph capture failure.
|
|
||||||
// Also disable for multi-gpu for now. TO DO investigate
|
|
||||||
if (disable_cuda_graphs_due_to_env
|
|
||||||
|| cuda_ctx->cuda_graph->disable_due_to_gpu_arch
|
|
||||||
|| cuda_ctx->cuda_graph->disable_due_to_too_many_updates
|
|
||||||
|| cuda_ctx->cuda_graph->disable_due_to_failed_graph_capture) {
|
|
||||||
use_cuda_graph = false;
|
|
||||||
}
|
|
||||||
|
|
||||||
if (use_cuda_graph) {
|
|
||||||
if (cuda_ctx->cuda_graph->instance == nullptr) {
|
|
||||||
cuda_graph_update_required = true;
|
|
||||||
}
|
|
||||||
|
|
||||||
// Check if the graph size has changed
|
|
||||||
if (cuda_ctx->cuda_graph->ggml_graph_properties.size() != (size_t)cgraph->n_nodes) {
|
|
||||||
cuda_graph_update_required = true;
|
|
||||||
cuda_ctx->cuda_graph->ggml_graph_properties.resize(cgraph->n_nodes);
|
|
||||||
}
|
|
||||||
|
|
||||||
// Loop over nodes in GGML graph to determine if CUDA graph update is required
|
|
||||||
// and store properties to allow this comparison for the next token
|
|
||||||
for (int i = 0; i < cgraph->n_nodes; i++) {
|
|
||||||
bool has_matching_properties = true;
|
|
||||||
if (!cuda_graph_update_required) {
|
|
||||||
has_matching_properties = ggml_graph_node_has_matching_properties(cgraph->nodes[i], &cuda_ctx->cuda_graph->ggml_graph_properties[i]);
|
|
||||||
}
|
|
||||||
if (!has_matching_properties) {
|
|
||||||
cuda_graph_update_required = true;
|
|
||||||
}
|
|
||||||
set_ggml_graph_node_properties(cgraph->nodes[i], &cuda_ctx->cuda_graph->ggml_graph_properties[i]);
|
|
||||||
}
|
|
||||||
|
|
||||||
// Loop over nodes in GGML graph to obtain info needed for CUDA graph
|
|
||||||
cuda_ctx->cuda_graph->updated_kernel_arg.clear();
|
|
||||||
for (int i = 0; i < cgraph->n_nodes; i++) {
|
|
||||||
ggml_tensor * node = cgraph->nodes[i];
|
|
||||||
|
|
||||||
if (ggml_is_empty(node) || node->op == GGML_OP_RESHAPE || node->op == GGML_OP_TRANSPOSE || node->op == GGML_OP_VIEW || node->op == GGML_OP_PERMUTE || node->op == GGML_OP_NONE) {
|
|
||||||
continue;
|
|
||||||
}
|
|
||||||
|
|
||||||
if (node->src[0] && node->src[0]->buffer && ggml_backend_buft_is_cuda_split(node->src[0]->buffer->buft)) {
|
|
||||||
use_cuda_graph = false; // Split buffers are not supported by CUDA graph capture
|
|
||||||
#ifndef NDEBUG
|
|
||||||
GGML_LOG_DEBUG("%s: disabling CUDA graphs due to split buffer\n", __func__);
|
|
||||||
#endif
|
|
||||||
}
|
|
||||||
|
|
||||||
if (node->op == GGML_OP_MUL_MAT_ID) {
|
|
||||||
use_cuda_graph = false; // This node type is not supported by CUDA graph capture
|
|
||||||
#ifndef NDEBUG
|
|
||||||
GGML_LOG_DEBUG("%s: disabling CUDA graphs due to mul_mat_id\n", __func__);
|
|
||||||
#endif
|
|
||||||
}
|
|
||||||
|
|
||||||
if (node->op == GGML_OP_ADD && node->src[1] && node->src[1]->ne[1] > 1) {
|
|
||||||
// disable CUDA graphs for batch size > 1 for now.
|
|
||||||
// Changes in batch size or context size can cause changes to the grid size of some kernels.
|
|
||||||
use_cuda_graph = false;
|
|
||||||
#ifndef NDEBUG
|
|
||||||
GGML_LOG_DEBUG("%s: disabling CUDA graphs due to batch size > 1 [%s] [%ld %ld %ld %ld]\n", __func__, node->name, node->ne[0], node->ne[1], node->ne[2], node->ne[3]);
|
|
||||||
#endif
|
|
||||||
}
|
|
||||||
|
|
||||||
if (node->op == GGML_OP_CPY) {
|
|
||||||
// store the copy op parameter which changes with each token.
|
|
||||||
cuda_ctx->cuda_graph->updated_kernel_arg.push_back((char **) &(node->src[1]->data));
|
|
||||||
// store a pointer to each copy op CUDA kernel to identify it later
|
|
||||||
void * ptr = ggml_cuda_cpy_fn(node->src[0], node->src[1]);
|
|
||||||
if (!ptr) {
|
|
||||||
use_cuda_graph = false;
|
|
||||||
#ifndef NDEBUG
|
|
||||||
GGML_LOG_DEBUG("%s: disabling CUDA graphs due to unsupported copy op\n", __func__);
|
|
||||||
#endif
|
|
||||||
} else {
|
|
||||||
if (std::find(ggml_cuda_cpy_fn_ptrs.begin(), ggml_cuda_cpy_fn_ptrs.end(), ptr) == ggml_cuda_cpy_fn_ptrs.end()) {
|
|
||||||
ggml_cuda_cpy_fn_ptrs.push_back(ptr);
|
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
}
|
||||||
if (!use_cuda_graph) {
|
} else {
|
||||||
break;
|
// One of the arguments to the copy kernel is updated for each token, hence we need to
|
||||||
|
// replace that argument with the updated value in the CUDA graph
|
||||||
|
// on update steps, the live parameters will already be captured
|
||||||
|
+        int k = 0;
+        for (size_t i = 0; i < cuda_ctx->cuda_graph->num_nodes; i++) {
+            if(count(ggml_cuda_cpy_fn_ptrs.begin(), ggml_cuda_cpy_fn_ptrs.end(), cuda_ctx->cuda_graph->params[i].func) > 0) {
+                char ** updated_kernel_arg_ptr = cuda_ctx->cuda_graph->updated_kernel_arg.at(k++);
+                cuda_ctx->cuda_graph->params[i].kernelParams[1] = updated_kernel_arg_ptr;
+                CUDA_CHECK(cudaGraphKernelNodeSetParams(cuda_ctx->cuda_graph->nodes[i], &cuda_ctx->cuda_graph->params[i]));
+            }
+        }
+    }
+}
-
-    // Disable CUDA graphs (from the next token) if the use-case is demanding too many consecutive graph updates.
-    if (use_cuda_graph && cuda_graph_update_required) {
-        cuda_ctx->cuda_graph->number_consecutive_updates++;
-    } else {
-        cuda_ctx->cuda_graph->number_consecutive_updates = 0;
-    }
-
-    if (cuda_ctx->cuda_graph->number_consecutive_updates >= 4) {
-        cuda_ctx->cuda_graph->disable_due_to_too_many_updates = true;
-#ifndef NDEBUG
-        GGML_LOG_DEBUG("%s: disabling CUDA graphs due to too many consecutive updates\n", __func__);
-#endif
-    }
-
-    if (use_cuda_graph && cuda_graph_update_required) { // Start CUDA graph capture
-        CUDA_CHECK(cudaStreamBeginCapture(cuda_ctx->stream(), cudaStreamCaptureModeRelaxed));
-    }
-
-#else
-    bool use_cuda_graph = false;
-    bool cuda_graph_update_required = false;
-#endif // USE_CUDA_GRAPH
-
-    bool graph_evaluated_or_captured = false;
+static bool is_cuda_graph_update_required(ggml_backend_cuda_context * cuda_ctx, ggml_cgraph * cgraph) {
+
+    bool cuda_graph_update_required = false;
+
+    if (cuda_ctx->cuda_graph->instance == nullptr) {
+        cuda_graph_update_required = true;
+    }
+
+    // Check if the graph size has changed
+    if (cuda_ctx->cuda_graph->ggml_graph_properties.size() != (size_t)cgraph->n_nodes) {
+        cuda_graph_update_required = true;
+        cuda_ctx->cuda_graph->ggml_graph_properties.resize(cgraph->n_nodes);
+    }
+
+    // Loop over nodes in GGML graph to determine if CUDA graph update is required
+    // and store properties to allow this comparison for the next token
+    for (int i = 0; i < cgraph->n_nodes; i++) {
+        bool has_matching_properties = true;
+        if (!cuda_graph_update_required) {
+            has_matching_properties = ggml_graph_node_has_matching_properties(cgraph->nodes[i], &cuda_ctx->cuda_graph->ggml_graph_properties[i]);
+        }
+        if (!has_matching_properties) {
+            cuda_graph_update_required = true;
+        }
+        set_ggml_graph_node_properties(cgraph->nodes[i], &cuda_ctx->cuda_graph->ggml_graph_properties[i]);
+    }
+
+    return cuda_graph_update_required;
+}
+
+static void update_cuda_graph_executable(ggml_backend_cuda_context * cuda_ctx) {
+
+    cudaGraphExecUpdateResultInfo result_info;
+    cudaError_t stat = cudaGraphExecUpdate(cuda_ctx->cuda_graph->instance, cuda_ctx->cuda_graph->graph, &result_info);
+    if (stat == cudaErrorGraphExecUpdateFailure) {
+#ifndef NDEBUG
+        GGML_LOG_DEBUG("%s: CUDA graph update failed\n", __func__);
+#endif
+        // The pre-existing graph exec cannot be updated due to violated constraints
+        // so instead clear error and re-instantiate
+        cudaGetLastError();
+        CUDA_CHECK(cudaGraphExecDestroy(cuda_ctx->cuda_graph->instance));
+        cuda_ctx->cuda_graph->instance = nullptr;
+        CUDA_CHECK(cudaGraphInstantiate(&cuda_ctx->cuda_graph->instance, cuda_ctx->cuda_graph->graph, NULL, NULL, 0));
+    } else {
+        GGML_ASSERT(stat == cudaSuccess);
+    }
+}
+
+#endif
+
+static void evaluate_and_capture_cuda_graph(ggml_backend_cuda_context * cuda_ctx, ggml_cgraph * cgraph,
+    [[maybe_unused]] std::vector<void *> & ggml_cuda_cpy_fn_ptrs, bool & graph_evaluated_or_captured, bool & use_cuda_graph,
+    bool & cuda_graph_update_required) {
+
    while (!graph_evaluated_or_captured) {
        // Only perform the graph execution if CUDA graphs are not enabled, or we are capturing the graph.
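The refresh path in update_cuda_graph_executable() is an instance of a general CUDA Graphs pattern: try cudaGraphExecUpdate() first, and fall back to re-instantiation only when the runtime reports cudaErrorGraphExecUpdateFailure. A minimal, self-contained sketch of that pattern is shown below; the helper name refresh_graph_exec and the check() wrapper are hypothetical and not part of the patch.

```cpp
// Illustrative sketch only (not ggml code): update an executable graph in place,
// rebuilding it only when the topology changed too much for an in-place update.
#include <cuda_runtime.h>
#include <cstdlib>

static void check(cudaError_t err) {
    if (err != cudaSuccess) {
        std::abort();
    }
}

static void refresh_graph_exec(cudaGraph_t graph, cudaGraphExec_t & exec) {
    if (exec == nullptr) {
        check(cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0));
        return;
    }
    cudaGraphExecUpdateResultInfo result_info;
    const cudaError_t stat = cudaGraphExecUpdate(exec, graph, &result_info);
    if (stat == cudaErrorGraphExecUpdateFailure) {
        cudaGetLastError(); // clear the sticky error before rebuilding
        check(cudaGraphExecDestroy(exec));
        exec = nullptr;
        check(cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0));
    } else {
        check(stat);
    }
}
```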
@@ -2515,19 +2535,8 @@ static enum ggml_status ggml_backend_cuda_graph_compute(ggml_backend_t backend,
                CUDA_CHECK(cudaGraphDestroy(cuda_ctx->cuda_graph->graph));
                cuda_ctx->cuda_graph->graph = nullptr;
            }
            CUDA_CHECK(cudaStreamEndCapture(cuda_ctx->stream(), &cuda_ctx->cuda_graph->graph));
-
-#if 0
-            if (disable_cuda_graphs_due_to_failed_capture) {
-                use_cuda_graph = false;
-                cuda_ctx->cuda_graph->disable_due_to_failed_graph_capture = true;
-#ifndef NDEBUG
-                GGML_LOG_DEBUG("%s: disabling CUDA graphs due to failed graph capture\n", __func__);
-#endif
-            } else {
-                graph_evaluated_or_captured = true; // CUDA graph has been captured
-            }
-#endif
            graph_evaluated_or_captured = true; // CUDA graph has been captured
        } else {
            graph_evaluated_or_captured = true; // ggml graph has been directly evaluated
@@ -2540,72 +2549,91 @@ static enum ggml_status ggml_backend_cuda_graph_compute(ggml_backend_t backend,
    }

    // Perform update to graph (if required for this token), and change copy parameter (required for every token)
-    if (cuda_graph_update_required) {
-        // Extract nodes from graph
-        // First call with null argument gets number of nodes in graph
-        CUDA_CHECK(cudaGraphGetNodes(cuda_ctx->cuda_graph->graph, nullptr, &cuda_ctx->cuda_graph->num_nodes));
-        // Subsequent call with non-null argument gets nodes
-        cuda_ctx->cuda_graph->nodes.clear();
-        cuda_ctx->cuda_graph->nodes.resize(cuda_ctx->cuda_graph->num_nodes);
-        cuda_ctx->cuda_graph->params.clear();
-        cuda_ctx->cuda_graph->params.resize(cuda_ctx->cuda_graph->num_nodes);
-        if (cuda_ctx->cuda_graph->num_nodes > 0) {
-            CUDA_CHECK(cudaGraphGetNodes(cuda_ctx->cuda_graph->graph, cuda_ctx->cuda_graph->nodes.data(), &cuda_ctx->cuda_graph->num_nodes));
-
-            // Loop over nodes, and extract kernel parameters from each node
-            for (size_t i = 0; i < cuda_ctx->cuda_graph->num_nodes; i++) {
-                cudaGraphNodeType node_type;
-                CUDA_CHECK(cudaGraphNodeGetType(cuda_ctx->cuda_graph->nodes[i], &node_type));
-                if (node_type == cudaGraphNodeTypeKernel) {
-                    cudaError_t stat = cudaGraphKernelNodeGetParams(cuda_ctx->cuda_graph->nodes[i], &cuda_ctx->cuda_graph->params[i]); // Get params using runtime
-                    if (stat == cudaErrorInvalidDeviceFunction) {
-                        // Fails due to incorrect handling by CUDA runtime of CUDA BLAS node.
-                        // We don't need to update blas nodes, so clear error and move on.
-                        cudaGetLastError();
-                    } else {
-                        GGML_ASSERT(stat == cudaSuccess);
-                    }
-                }
-            }
-        }
-    }
-
-    // One of the arguments to the copy kernel is updated for each token, hence we need to
-    // replace that argument with the updated value in the CUDA graph
-    if (!cuda_graph_update_required) { // on update steps, the live parameters will already be captured
-        ... (the same copy-kernel parameter refresh loop now found in maintain_cuda_graph above)
-    }
+    maintain_cuda_graph(cuda_ctx, ggml_cuda_cpy_fn_ptrs, cuda_graph_update_required);

    // Update graph executable
-    cudaGraphExecUpdateResultInfo result_info;
-    cudaError_t stat = cudaGraphExecUpdate(cuda_ctx->cuda_graph->instance, cuda_ctx->cuda_graph->graph, &result_info);
-    if (stat == cudaErrorGraphExecUpdateFailure) {
-        ... (the same clear-error / destroy / re-instantiate fallback now found in update_cuda_graph_executable above)
-    } else {
-        GGML_ASSERT(stat == cudaSuccess);
-    }
+    update_cuda_graph_executable(cuda_ctx);

    // Launch graph
    CUDA_CHECK(cudaGraphLaunch(cuda_ctx->cuda_graph->instance, cuda_ctx->stream()));
#else
    graph_evaluated_or_captured = true;
#endif // USE_CUDA_GRAPH
    }
+}
+
+static enum ggml_status ggml_backend_cuda_graph_compute(ggml_backend_t backend, ggml_cgraph * cgraph) {
+    ggml_backend_cuda_context * cuda_ctx = (ggml_backend_cuda_context *)backend->context;
+
+    ggml_cuda_set_device(cuda_ctx->device);
+
+    // vector of pointers to CUDA cpy kernels, which are required to identify
+    // kernel parameters which need updated in the graph for each token
+    std::vector<void *> ggml_cuda_cpy_fn_ptrs;
+
+#ifdef USE_CUDA_GRAPH
+    static const bool disable_cuda_graphs_due_to_env = (getenv("GGML_CUDA_DISABLE_GRAPHS") != nullptr);
+
+    // Objects required for CUDA Graph
+    if (cuda_ctx->cuda_graph == nullptr) {
+        cuda_ctx->cuda_graph.reset(new ggml_cuda_graph());
+    }
+
+    bool use_cuda_graph = true;
+    bool cuda_graph_update_required = false;
+
+    if (cuda_ctx->cuda_graph->graph == nullptr) {
+        if (ggml_cuda_info().devices[cuda_ctx->device].cc < GGML_CUDA_CC_AMPERE) {
+            cuda_ctx->cuda_graph->disable_due_to_gpu_arch = true;
+#ifndef NDEBUG
+            GGML_LOG_DEBUG("%s: disabling CUDA graphs due to GPU architecture\n", __func__);
+#endif
+        }
+    }
+
+    // Disable CUDA graphs in presence of env var, old GPU, use-case which is changing too rapidly,
+    // or previous graph capture failure.
+    // Also disable for multi-gpu for now. TO DO investigate
+    if (disable_cuda_graphs_due_to_env
+        || cuda_ctx->cuda_graph->disable_due_to_gpu_arch
+        || cuda_ctx->cuda_graph->disable_due_to_too_many_updates
+        || cuda_ctx->cuda_graph->disable_due_to_failed_graph_capture) {
+        use_cuda_graph = false;
+    }
+
+    if (use_cuda_graph) {
+        cuda_graph_update_required = is_cuda_graph_update_required(cuda_ctx, cgraph);
+
+        use_cuda_graph = check_node_graph_compatibility_and_refresh_copy_ops(cuda_ctx, cgraph,
+                             ggml_cuda_cpy_fn_ptrs, use_cuda_graph);
+
+        // Disable CUDA graphs (from the next token) if the use-case is demanding too many consecutive graph updates.
+        if (use_cuda_graph && cuda_graph_update_required) {
+            cuda_ctx->cuda_graph->number_consecutive_updates++;
+        } else {
+            cuda_ctx->cuda_graph->number_consecutive_updates = 0;
+        }
+
+        if (cuda_ctx->cuda_graph->number_consecutive_updates >= 4) {
+            cuda_ctx->cuda_graph->disable_due_to_too_many_updates = true;
+#ifndef NDEBUG
+            GGML_LOG_DEBUG("%s: disabling CUDA graphs due to too many consecutive updates\n", __func__);
+#endif
+        }
+    }
+
+    if (use_cuda_graph && cuda_graph_update_required) { // Start CUDA graph capture
+        CUDA_CHECK(cudaStreamBeginCapture(cuda_ctx->stream(), cudaStreamCaptureModeRelaxed));
+    }
+
+#else
+    bool use_cuda_graph = false;
+    bool cuda_graph_update_required = false;
+#endif // USE_CUDA_GRAPH
+
+    bool graph_evaluated_or_captured = false;
+
+    evaluate_and_capture_cuda_graph(cuda_ctx, cgraph, ggml_cuda_cpy_fn_ptrs, graph_evaluated_or_captured, use_cuda_graph, cuda_graph_update_required);

    return GGML_STATUS_SUCCESS;
}
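Taken together, the refactored ggml_backend_cuda_graph_compute() still follows the usual capture-once, replay-many CUDA Graphs lifecycle. The stand-alone sketch below shows that lifecycle in isolation; it is illustrative only, dummy_kernel and check() are made up, and none of it is ggml code.

```cpp
// Minimal capture/instantiate/launch cycle, as a self-contained CUDA program.
#include <cuda_runtime.h>
#include <cstdlib>

static void check(cudaError_t err) { if (err != cudaSuccess) std::abort(); }

__global__ void dummy_kernel(float * x) { x[threadIdx.x] += 1.0f; }

int main() {
    float * d = nullptr;
    check(cudaMalloc(&d, 32 * sizeof(float)));
    cudaStream_t stream;
    check(cudaStreamCreate(&stream));

    // capture the work once ...
    cudaGraph_t graph;
    check(cudaStreamBeginCapture(stream, cudaStreamCaptureModeRelaxed));
    dummy_kernel<<<1, 32, 0, stream>>>(d);
    check(cudaStreamEndCapture(stream, &graph));

    cudaGraphExec_t exec;
    check(cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0));

    // ... then replay it cheaply for subsequent "tokens"
    for (int token = 0; token < 8; ++token) {
        check(cudaGraphLaunch(exec, stream));
    }
    check(cudaStreamSynchronize(stream));

    check(cudaGraphExecDestroy(exec));
    check(cudaGraphDestroy(graph));
    check(cudaFree(d));
    check(cudaStreamDestroy(stream));
    return 0;
}
```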
@@ -3011,6 +3039,7 @@ static bool ggml_backend_cuda_device_supports_op(ggml_backend_dev_t dev, const g
        case GGML_OP_TIMESTEP_EMBEDDING:
        case GGML_OP_LEAKY_RELU:
        case GGML_OP_RWKV_WKV6:
+       case GGML_OP_GATED_LINEAR_ATTN:
            return true;
        case GGML_OP_FLASH_ATTN_EXT: {
#ifndef FLASH_ATTN_AVAILABLE
ggml/src/ggml-cuda/gla.cu (new file, 93 lines)
@@ -0,0 +1,93 @@
+#include "common.cuh"
+#include "gla.cuh"
+
+template<int HEAD_SIZE>
+static __global__ void gated_linear_attn_f32(const int B, const int T, const int C, const int H, const float scale,
+    const float * k, const float * v, const float * r, const float * td, const float * s, float * dst) {
+    const int tid = threadIdx.x;
+    const int bid = blockIdx.x;
+
+    const int head_size = HEAD_SIZE;
+    const int batch_i = bid / H;
+    const int head_i = bid % H;
+    const int state_size = C * head_size;
+    const int n_seq_tokens = T / B;
+
+    float state[head_size];
+    __shared__ float _k[head_size], _r[head_size], _td[head_size];
+
+#pragma unroll
+    for (int i = 0; i < head_size; i++) {
+        state[i] = s[batch_i * state_size + head_i * head_size * head_size + i * head_size + tid];
+    }
+
+    for (int t = batch_i * n_seq_tokens * C + head_i * head_size + tid; t < (batch_i + 1) * n_seq_tokens * C + head_i * head_size + tid; t += C) {
+        __syncthreads();
+        _k[tid] = k[t];
+        _r[tid] = r[t];
+        _td[tid] = td[t];
+        __syncthreads();
+
+        const float _v = v[t];
+        float y = 0;
+        for (int j = 0; j < head_size; j += 4) {
+            const float4 & k = (float4 &)(_k[j]);
+            const float4 & r = (float4 &)(_r[j]);
+            const float4 & td = (float4 &)(_td[j]);
+            float4 & s = (float4 &)(state[j]);
+            float4 kv;
+
+            kv.x = k.x * _v;
+            kv.y = k.y * _v;
+            kv.z = k.z * _v;
+            kv.w = k.w * _v;
+
+            s.x = s.x * td.x + kv.x;
+            s.y = s.y * td.y + kv.y;
+            s.z = s.z * td.z + kv.z;
+            s.w = s.w * td.w + kv.w;
+
+            y += r.x * s.x;
+            y += r.y * s.y;
+            y += r.z * s.z;
+            y += r.w * s.w;
+        }
+        dst[t] = y * scale;
+    }
+
+#pragma unroll
+    for (int i = 0; i < head_size; i++) {
+        dst[T * C + batch_i * state_size + head_i * head_size * head_size + i * head_size + tid] = state[i];
+    }
+}
+
+void ggml_cuda_op_gated_linear_attn(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
+    const float * k_d  = (const float *)dst->src[0]->data;
+    const float * v_d  = (const float *)dst->src[1]->data;
+    const float * r_d  = (const float *)dst->src[2]->data;
+    const float * td_d = (const float *)dst->src[3]->data;
+    const float * s_d  = (const float *)dst->src[4]->data;
+
+    const int64_t B = dst->src[4]->ne[1];
+    const int64_t T = dst->src[0]->ne[2];
+    const int64_t C = dst->ne[0];
+    const int64_t H = dst->src[0]->ne[1];
+
+    float scale;
+    memcpy(&scale, (float*)dst->op_params, sizeof(float));
+
+    float * dst_d = (float *)dst->data;
+
+    cudaStream_t stream = ctx.stream();
+
+    GGML_ASSERT(dst->src[4]->type == GGML_TYPE_F32);
+    GGML_ASSERT(C % H == 0);
+    GGML_ASSERT(C / H == 64 || C / H == 128);
+
+    if (C / H == 64) {
+        gated_linear_attn_f32<64><<<B * H, C / H, 0, stream>>>(B, T, C, H, scale, k_d, v_d, r_d, td_d, s_d, dst_d);
+    } else {
+        gated_linear_attn_f32<128><<<B * H, C / H, 0, stream>>>(B, T, C, H, scale, k_d, v_d, r_d, td_d, s_d, dst_d);
+    }
+}

ggml/src/ggml-cuda/gla.cuh (new file, 3 lines)
@@ -0,0 +1,3 @@
+#include "common.cuh"
+
+void ggml_cuda_op_gated_linear_attn(ggml_backend_cuda_context & ctx, ggml_tensor * dst);
@@ -73,9 +73,9 @@ void ggml_cuda_op_rwkv_wkv6(ggml_backend_cuda_context & ctx, ggml_tensor * dst)
    const float * s_d  = (const float *)dst->src[5]->data;

    const int64_t B = dst->src[5]->ne[1];
-   const int64_t T = dst->src[0]->ne[3];
+   const int64_t T = dst->src[0]->ne[2];
    const int64_t C = dst->ne[0];
-   const int64_t H = dst->src[0]->ne[2];
+   const int64_t H = dst->src[0]->ne[1];

    float * dst_d = (float *)dst->data;
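The hunk above only moves T and H to different ne[] indices, which suggests the WKV6 operands were reshaped so that heads now live in dimension 1 and tokens in dimension 2, matching the new GLA kernel. A hedged sketch of the assumed extraction plus sanity checks follows; the extra asserts are illustrative and not part of the patch.

```cpp
// Illustrative only: with the new layout, src[0] is presumably [head_size, n_heads, n_tokens].
const int64_t B = dst->src[5]->ne[1];
const int64_t T = dst->src[0]->ne[2];
const int64_t C = dst->ne[0];
const int64_t H = dst->src[0]->ne[1];

GGML_ASSERT(C % H == 0);   // C is n_heads * head_size
GGML_ASSERT(T % B == 0);   // tokens are grouped per sequence

const int64_t head_size    = C / H;
const int64_t n_seq_tokens = T / B;
```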
@@ -70,7 +70,9 @@ ggml_add_backend_library(ggml-hip
)

# TODO: do not use CUDA definitions for HIP
-target_compile_definitions(ggml PUBLIC GGML_USE_CUDA)
+if (NOT GGML_BACKEND_DL)
+    target_compile_definitions(ggml PUBLIC GGML_USE_CUDA)
+endif()

add_compile_definitions(GGML_USE_HIP)
@@ -51,6 +51,10 @@ void ggml_sycl_host_free(void* ptr) try {
    std::exit(1);
}

+bool gpu_has_xmx(sycl::device &dev) {
+    return dev.has(sycl::aspect::ext_intel_matrix);
+}
+
int64_t downsample_sycl_global_range(int64_t accumulate_block_num, int64_t block_size) {
    const int64_t max_range = std::numeric_limits<int>::max();
    int64_t sycl_down_blk_size = block_size;

@@ -662,6 +662,7 @@ inline void ggml_sycl_op_bin_bcast(ggml_backend_sycl_context & ctx, const ggml_t
    }
}

+bool gpu_has_xmx(sycl::device &dev);

void ggml_sycl_op_flatten(ggml_backend_sycl_context & ctx, const ggml_tensor *src0,
                          const ggml_tensor *src1, ggml_tensor *dst,
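gpu_has_xmx() is a thin wrapper over a standard SYCL aspect query. A hypothetical stand-alone usage example, not part of the patch, could look like this:

```cpp
// Report whether the default SYCL device exposes Intel XMX (matrix) units,
// using the same aspect check as gpu_has_xmx().
#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
    sycl::device dev{sycl::default_selector_v};
    const bool has_xmx = dev.has(sycl::aspect::ext_intel_matrix);
    std::printf("%s: XMX %s\n",
                dev.get_info<sycl::info::device::name>().c_str(),
                has_xmx ? "yes" : "no");
    return 0;
}
```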
@@ -158,8 +158,9 @@ static void concat_f32_sycl_non_cont(
      });
}

-void ggml_sycl_op_concat(ggml_backend_sycl_context & ctx, const ggml_tensor *src0,
-                         const ggml_tensor *src1, ggml_tensor *dst) {
+void ggml_sycl_op_concat(ggml_backend_sycl_context & ctx, ggml_tensor *dst) {
+    const ggml_tensor *src0 = dst->src[0];
+    const ggml_tensor *src1 = dst->src[1];
    queue_ptr stream = ctx.stream();

    const int32_t dim = ((int32_t *)dst->op_params)[0];

@@ -15,7 +15,6 @@
#include "common.hpp"

-void ggml_sycl_op_concat(ggml_backend_sycl_context & ctx, const ggml_tensor *src0,
-                         const ggml_tensor *src1, ggml_tensor *dst);
+void ggml_sycl_op_concat(ggml_backend_sycl_context & ctx, ggml_tensor *dst);

#endif // GGML_SYCL_CONCAT_HPP

@@ -71,8 +71,9 @@ static void conv_transpose_1d_f32_f32_sycl(
      });
}

-void ggml_sycl_op_conv_transpose_1d(ggml_backend_sycl_context & ctx, const ggml_tensor *src0,
-                                    const ggml_tensor *src1, ggml_tensor *dst) {
+void ggml_sycl_op_conv_transpose_1d(ggml_backend_sycl_context & ctx, ggml_tensor *dst) {
+    const ggml_tensor *src0 = dst->src[0];
+    const ggml_tensor *src1 = dst->src[1];
    const float * src0_d = (const float *)src0->data;
    const float * src1_d = (const float *)src1->data;

@@ -15,7 +15,6 @@
#include "common.hpp"

-void ggml_sycl_op_conv_transpose_1d(ggml_backend_sycl_context & ctx, const ggml_tensor *src0,
-                                    const ggml_tensor *src1, ggml_tensor *dst);
+void ggml_sycl_op_conv_transpose_1d(ggml_backend_sycl_context & ctx, ggml_tensor *dst);

#endif // GGML_SYCL_CONV_HPP
@@ -882,149 +882,149 @@ inline void ggml_sycl_op_div(ggml_backend_sycl_context & ctx, const ggml_tensor
}

-void ggml_sycl_sqrt(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
+void ggml_sycl_sqrt(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
    GGML_SYCL_DEBUG("call %s\n", __func__);
-    ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_sqrt);
+    ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_sqrt);
    GGML_SYCL_DEBUG("call %s done\n", __func__);
}

The same two-line change (drop the explicit src0/src1 parameters and read them from dst->src[0] and dst->src[1]) is applied throughout this hunk to ggml_sycl_sin, ggml_sycl_cos, ggml_sycl_acc, ggml_sycl_gelu, ggml_sycl_silu, ggml_sycl_gelu_quick, ggml_sycl_tanh, ggml_sycl_relu, ggml_sycl_sigmoid, ggml_sycl_hardsigmoid, ggml_sycl_hardswish, ggml_sycl_exp, ggml_sycl_log, ggml_sycl_neg, ggml_sycl_step, ggml_sycl_leaky_relu, ggml_sycl_sqr, ggml_sycl_upscale, ggml_sycl_pad, ggml_sycl_add, ggml_sycl_sub, ggml_sycl_mul, and ggml_sycl_div, each forwarding to its corresponding ggml_sycl_op_* kernel wrapper.
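After this conversion an elementwise wrapper receives only dst and pulls its operands from dst->src[]. As a sketch, a hypothetical new op added under this convention would look like the following; ggml_sycl_foo and ggml_sycl_op_foo are made-up names and not part of the patch.

```cpp
// Hypothetical example of a wrapper written against the dst-only convention.
void ggml_sycl_foo(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
    GGML_SYCL_DEBUG("call %s\n", __func__);
    // operands are no longer passed explicitly; they live in dst->src[]
    ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_foo);
    GGML_SYCL_DEBUG("call %s done\n", __func__);
}
```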
@@ -25,52 +25,52 @@ static __dpct_inline__ float op_div(const float a, const float b) {
}

-void ggml_sycl_sqrt(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst);
+void ggml_sycl_sqrt(ggml_backend_sycl_context & ctx, ggml_tensor * dst);

The declarations of ggml_sycl_sin, ggml_sycl_cos, ggml_sycl_acc, ggml_sycl_gelu, ggml_sycl_silu, ggml_sycl_gelu_quick, ggml_sycl_tanh, ggml_sycl_relu, ggml_sycl_sigmoid, ggml_sycl_hardsigmoid, ggml_sycl_hardswish, ggml_sycl_exp, ggml_sycl_log, ggml_sycl_neg, ggml_sycl_step, ggml_sycl_leaky_relu, ggml_sycl_sqr, ggml_sycl_upscale, ggml_sycl_pad, ggml_sycl_add, ggml_sycl_sub, ggml_sycl_mul, and ggml_sycl_div are updated to the same (ggml_backend_sycl_context & ctx, ggml_tensor * dst) signature.

#endif // GGML_SYCL_ELEMENTWISE_HPP
@@ -54,18 +54,12 @@ static ggml_sycl_device_info ggml_sycl_init() {
    GGML_ASSERT(info.device_count <= GGML_SYCL_MAX_DEVICES);

    int64_t total_vram = 0;
-#if defined(GGML_SYCL_FORCE_MMQ)
-    GGML_LOG_INFO("%s: GGML_SYCL_FORCE_MMQ: yes\n", __func__);
-#else
-    GGML_LOG_INFO("%s: GGML_SYCL_FORCE_MMQ: no\n", __func__);
-#endif
-#if defined(SYCL_USE_XMX)
-    GGML_LOG_INFO("%s: SYCL_USE_XMX: yes\n", __func__);
-#else
-    GGML_LOG_INFO("%s: SYCL_USE_XMX: no\n", __func__);
-#endif
-    GGML_LOG_INFO("%s: found %d %s devices:\n", __func__, info.device_count, GGML_SYCL_NAME);
+/* This is a bit misleading; reserved for later */
+// #if defined(SYCL_USE_XMX)
+//     GGML_LOG_INFO("%s: SYCL_USE_XMX: yes\n", __func__);
+// #else
+//     GGML_LOG_INFO("%s: SYCL_USE_XMX: no\n", __func__);
+// #endif

    for (int i = 0; i < info.device_count; ++i) {
        info.devices[i].vmm = 0;
        dpct::device_info prop;

@@ -109,11 +103,11 @@ void print_device_detail(int id, sycl::device &device, std::string device_type)
    name = std::regex_replace(name, std::regex("\\(TM\\)"), "");

    auto global_mem_size = prop.get_global_mem_size()/1000000;
+    std::string xmx = gpu_has_xmx(device) ? "yes" : "no";
-    GGML_LOG_INFO("|%2d|%19s|%39s|%7s|%7d|%8d|%5d|%6luM|%21s|\n", id, device_type.c_str(),
+    GGML_LOG_INFO("|%2d|%19s|%39s|%7s|%7d|%8d|%5d|%6luM|%21s|%14s|\n", id, device_type.c_str(),
                  name.c_str(), version.c_str(), prop.get_max_compute_units(),
                  prop.get_max_work_group_size(), prop.get_max_sub_group_size(),
-                 global_mem_size, device.get_info<sycl::info::device::driver_version>().c_str());
+                 global_mem_size, device.get_info<sycl::info::device::driver_version>().c_str(), xmx.c_str());
}

void ggml_backend_sycl_print_sycl_devices() {

@@ -124,16 +118,16 @@ void ggml_backend_sycl_print_sycl_devices() {
    GGML_LOG_INFO(
        "| | | | "
-       " |Max | |Max |Global | |\n");
+       " |Max | |Max |Global | | XMX |\n");
    GGML_LOG_INFO(
        "| | | | "
-       " |compute|Max work|sub |mem | |\n");
+       " |compute|Max work|sub |mem | | or |\n");
    GGML_LOG_INFO(
        "|ID| Device Type| "
-       "Name|Version|units |group |group|size | Driver version|\n");
+       "Name|Version|units |group |group|size | Driver version| Tensor Cores |\n");
    GGML_LOG_INFO(
        "|--|-------------------|---------------------------------------|------"
-       "-|-------|--------|-----|-------|---------------------|\n");
+       "-|-------|--------|-----|-------|---------------------|--------------|\n");

    for (int id = 0; id < device_count; ++id) {
        sycl::device device = dpct::dev_mgr::instance().get_device(id);

@@ -164,14 +158,18 @@ static void ggml_check_sycl() try {
    static bool initialized = false;

    if (!initialized) {
-        GGML_LOG_INFO("[SYCL] call ggml_check_sycl\n");
+        GGML_SYCL_DEBUG("[SYCL] call ggml_check_sycl\n");
        g_ggml_sycl_debug = get_sycl_env("GGML_SYCL_DEBUG", 0);
-        GGML_LOG_INFO("%s: GGML_SYCL_DEBUG: %d\n", __func__, g_ggml_sycl_debug);
-#if defined(GGML_SYCL_F16)
-        GGML_LOG_INFO("%s: GGML_SYCL_F16: yes\n", __func__);
-#else
-        GGML_LOG_INFO("%s: GGML_SYCL_F16: no\n", __func__);
-#endif
+        GGML_LOG_INFO("GGML_SYCL_DEBUG: %d\n", g_ggml_sycl_debug);
+#if defined(GGML_SYCL_FORCE_MMQ)
+        GGML_LOG_INFO("GGML_SYCL_FORCE_MMQ: yes\n");
+#else
+        GGML_LOG_INFO("GGML_SYCL_FORCE_MMQ: no\n");
+#endif
+#if defined(GGML_SYCL_F16)
+        GGML_LOG_INFO("GGML_SYCL_F16: yes\n");
+#else
+        GGML_LOG_INFO("GGML_SYCL_F16: no\n");
+#endif

        /* NOT REMOVE, keep it for next optimize for XMX.

@@ -1189,7 +1187,6 @@ std::unique_ptr<ggml_sycl_pool> ggml_backend_sycl_context::new_pool_for_device(q
/// kernels

typedef void (*cpy_kernel_t)(const char * cx, char * cdst);
-typedef void (*ggml_sycl_func_t)(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst);
typedef void (*ggml_sycl_op_mul_mat_t)(
    ggml_backend_sycl_context & ctx,
    const ggml_tensor *src0, const ggml_tensor *src1, ggml_tensor *dst,
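Removing the ggml_sycl_func_t typedef goes hand in hand with the dispatch rewrite in the hunks that follow: instead of storing a function pointer and invoking it once at the end, each switch case now calls its handler directly, which is what allows the handlers to take only dst. The toy program below illustrates the two styles with made-up types; it is not ggml code.

```cpp
// Toy contrast between pointer-based and direct-call dispatch.
#include <cstdio>

struct ctx_t {};
struct tensor_t { int op; };

static void op_add(ctx_t &, tensor_t *) { std::puts("add"); }
static void op_mul(ctx_t &, tensor_t *) { std::puts("mul"); }

// before: select a function pointer, call it afterwards
using func_t = void (*)(ctx_t &, tensor_t *);
static bool forward_old(ctx_t & ctx, tensor_t * t) {
    func_t func = nullptr;
    switch (t->op) {
        case 0: func = op_add; break;
        case 1: func = op_mul; break;
        default: return false;
    }
    func(ctx, t);
    return true;
}

// after: each case calls its handler directly
static bool forward_new(ctx_t & ctx, tensor_t * t) {
    switch (t->op) {
        case 0: op_add(ctx, t); break;
        case 1: op_mul(ctx, t); break;
        default: return false;
    }
    return true;
}

int main() {
    ctx_t ctx;
    tensor_t t{0};
    return forward_old(ctx, &t) && forward_new(ctx, &t) ? 0 : 1;
}
```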
@ -3171,33 +3168,33 @@ catch (sycl::exception const &exc) {
|
|||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
static void ggml_sycl_repeat(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
|
static void ggml_sycl_repeat(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
|
||||||
GGML_SYCL_DEBUG("call %s\n", __func__);
|
GGML_SYCL_DEBUG("call %s\n", __func__);
|
||||||
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_repeat);
|
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_repeat);
|
||||||
GGML_SYCL_DEBUG("call %s done\n", __func__);
|
GGML_SYCL_DEBUG("call %s done\n", __func__);
|
||||||
}
|
}
|
||||||
|
|
||||||
static void ggml_sycl_get_rows(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
|
static void ggml_sycl_get_rows(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
|
||||||
GGML_SYCL_DEBUG("call %s\n", __func__);
|
GGML_SYCL_DEBUG("call %s\n", __func__);
|
||||||
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_get_rows);
|
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_get_rows);
|
||||||
GGML_SYCL_DEBUG("call %s done\n", __func__);
|
GGML_SYCL_DEBUG("call %s done\n", __func__);
|
||||||
}
|
}
|
||||||
|
|
||||||
static void ggml_sycl_norm(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
|
static void ggml_sycl_norm(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
|
||||||
GGML_SYCL_DEBUG("call %s\n", __func__);
|
GGML_SYCL_DEBUG("call %s\n", __func__);
|
||||||
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_norm);
|
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_norm);
|
||||||
GGML_SYCL_DEBUG("call %s done\n", __func__);
|
GGML_SYCL_DEBUG("call %s done\n", __func__);
|
||||||
}
|
}
|
||||||
|
|
||||||
static void ggml_sycl_rms_norm(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
|
static void ggml_sycl_rms_norm(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
|
||||||
GGML_SYCL_DEBUG("call %s\n", __func__);
|
GGML_SYCL_DEBUG("call %s\n", __func__);
|
||||||
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_rms_norm);
|
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_rms_norm);
|
||||||
GGML_SYCL_DEBUG("call %s done\n", __func__);
|
GGML_SYCL_DEBUG("call %s done\n", __func__);
|
||||||
}
|
}
|
||||||
|
|
||||||
static void ggml_sycl_group_norm(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
|
static void ggml_sycl_group_norm(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
|
||||||
GGML_SYCL_DEBUG("call %s\n", __func__);
|
GGML_SYCL_DEBUG("call %s\n", __func__);
|
||||||
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_group_norm);
|
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_group_norm);
|
||||||
GGML_SYCL_DEBUG("call %s done\n", __func__);
|
GGML_SYCL_DEBUG("call %s done\n", __func__);
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -3572,9 +3569,10 @@ __dpct_inline__ static void k_copy_dst_from_contiguous(
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
static void ggml_sycl_mul_mat_id(ggml_backend_sycl_context & ctx, const ggml_tensor *src0,
|
static void ggml_sycl_mul_mat_id(ggml_backend_sycl_context & ctx,
|
||||||
const ggml_tensor *src1,
|
|
||||||
ggml_tensor *dst) try {
|
ggml_tensor *dst) try {
|
||||||
|
const ggml_tensor *src0 = dst->src[0];
|
||||||
|
const ggml_tensor *src1 = dst->src[1];
|
||||||
GGML_ASSERT(!ggml_backend_buffer_is_sycl_split(src0->buffer) && "mul_mat_id does not support split buffers");
|
GGML_ASSERT(!ggml_backend_buffer_is_sycl_split(src0->buffer) && "mul_mat_id does not support split buffers");
|
||||||
|
|
||||||
const ggml_tensor *ids = dst->src[2];
|
const ggml_tensor *ids = dst->src[2];
|
||||||
@ -3740,12 +3738,12 @@ catch (sycl::exception const &exc) {
|
|||||||
std::exit(1);
|
std::exit(1);
|
||||||
}
|
}
|
||||||
|
|
||||||
static void ggml_sycl_scale(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
|
static void ggml_sycl_scale(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
|
||||||
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_scale);
|
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_scale);
|
||||||
}
|
}
|
||||||
|
|
||||||
static void ggml_sycl_clamp(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
|
static void ggml_sycl_clamp(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
|
||||||
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_clamp);
|
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_clamp);
|
||||||
}
|
}
|
||||||
|
|
||||||
static void ggml_sycl_cpy(ggml_backend_sycl_context & ctx, const ggml_tensor *src0, const ggml_tensor *src1,
|
static void ggml_sycl_cpy(ggml_backend_sycl_context & ctx, const ggml_tensor *src0, const ggml_tensor *src1,
|
||||||
@ -3787,7 +3785,6 @@ static void ggml_sycl_cpy(ggml_backend_sycl_context & ctx, const ggml_tensor *sr
|
|||||||
ggml_type_name(src0->type), ggml_type_name(src1->type));
|
ggml_type_name(src0->type), ggml_type_name(src1->type));
|
||||||
GGML_ABORT("fatal error");
|
GGML_ABORT("fatal error");
|
||||||
}
|
}
|
||||||
|
|
||||||
GGML_UNUSED(dst);
|
GGML_UNUSED(dst);
|
||||||
}
|
}
|
||||||
catch (sycl::exception const &exc) {
|
catch (sycl::exception const &exc) {
|
||||||
@ -3796,59 +3793,52 @@ catch (sycl::exception const &exc) {
|
|||||||
std::exit(1);
|
std::exit(1);
|
||||||
}
|
}
|
||||||
|
|
||||||
static void ggml_sycl_dup(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
|
static void ggml_sycl_dup(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
|
||||||
// TODO: why do we pass dst as src1 here?
|
// TODO: why do we pass dst as src1 here?
|
||||||
ggml_sycl_cpy(ctx, src0, dst, nullptr);
|
ggml_sycl_cpy(ctx, dst->src[0], dst, nullptr);
|
||||||
GGML_UNUSED(src1);
|
|
||||||
}
|
}
|
||||||
|
|
||||||
static void ggml_sycl_diag_mask_inf(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
|
static void ggml_sycl_diag_mask_inf(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
|
||||||
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_diag_mask_inf);
|
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_diag_mask_inf);
|
||||||
}
|
}
|
||||||
|
|
||||||
static void ggml_sycl_soft_max(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
|
static void ggml_sycl_soft_max(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
|
||||||
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_soft_max);
|
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_soft_max);
|
||||||
}
|
}
|
||||||
|
|
||||||
static void ggml_sycl_rope(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
|
static void ggml_sycl_rope(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
|
||||||
GGML_ASSERT(ggml_is_contiguous(src0)); // TODO: this restriction is temporary until non-cont support is implemented
|
GGML_ASSERT(ggml_is_contiguous(dst->src[0])); // TODO: this restriction is temporary until non-cont support is implemented
|
||||||
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_rope);
|
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_rope);
|
||||||
}
|
}
|
||||||
|
|
||||||
static void ggml_sycl_pool2d(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
|
static void ggml_sycl_pool2d(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
|
||||||
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_pool2d);
|
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_pool2d);
|
||||||
}
|
}
|
||||||
|
|
||||||
static void ggml_sycl_im2col(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
|
static void ggml_sycl_im2col(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
|
||||||
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_im2col);
|
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_im2col);
|
||||||
}
|
}
|
||||||
|
|
||||||
static void ggml_sycl_sum(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
|
static void ggml_sycl_sum(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
|
||||||
GGML_ASSERT(ggml_is_contiguous(src0));
|
GGML_ASSERT(ggml_is_contiguous(dst->src[0]));
|
||||||
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_sum);
|
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_sum);
|
||||||
}
|
}
|
||||||
|
|
||||||
static void ggml_sycl_sum_rows(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
|
static void ggml_sycl_sum_rows(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
|
||||||
GGML_ASSERT(ggml_is_contiguous(src0));
|
GGML_ASSERT(ggml_is_contiguous(dst->src[0]));
|
||||||
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_sum_rows);
|
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_sum_rows);
|
||||||
}
|
}
|
||||||
|
|
||||||
static void ggml_sycl_argsort(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
|
static void ggml_sycl_argsort(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
|
||||||
GGML_ASSERT(ggml_is_contiguous(src0));
|
GGML_ASSERT(ggml_is_contiguous(dst->src[0]));
|
||||||
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_argsort);
|
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_argsort);
|
||||||
}
|
}
|
||||||
|
|
||||||
static void ggml_sycl_argmax(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
|
static void ggml_sycl_argmax(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
|
||||||
GGML_ASSERT(ggml_is_contiguous(src0));
|
GGML_ASSERT(ggml_is_contiguous(dst->src[0]));
|
||||||
ggml_sycl_op_flatten(ctx, src0, src1, dst, ggml_sycl_op_argmax);
|
ggml_sycl_op_flatten(ctx, dst->src[0], dst->src[1], dst, ggml_sycl_op_argmax);
|
||||||
}
|
}
|
||||||
|
|
||||||
static void ggml_sycl_nop(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
|
|
||||||
GGML_UNUSED(src0);
|
|
||||||
GGML_UNUSED(src1);
|
|
||||||
GGML_UNUSED(dst);
|
|
||||||
GGML_UNUSED(ctx);
|
|
||||||
}
|
|
||||||
|
|
||||||
void ggml_sycl_set_main_device(const int main_device) try {
|
void ggml_sycl_set_main_device(const int main_device) try {
|
||||||
if (dpct::get_current_device_id() == static_cast<unsigned int> (main_device)) {
|
if (dpct::get_current_device_id() == static_cast<unsigned int> (main_device)) {
|
||||||
@@ -3871,191 +3861,189 @@ catch (sycl::exception const &exc) {
     std::exit(1);
 }

-bool ggml_sycl_compute_forward(ggml_backend_sycl_context & ctx, struct ggml_tensor * tensor) {
+bool ggml_sycl_compute_forward(ggml_backend_sycl_context & ctx, struct ggml_tensor * dst) {
     if (!g_sycl_loaded) return false;

-    ggml_sycl_func_t func;
+    if (dst->src[0] != nullptr && ggml_backend_buffer_is_sycl_split(dst->src[0]->buffer)) {
+        ggml_sycl_set_peer_access(dst->src[1]->ne[1], ctx.device);
+    }

-    switch (tensor->op) {
+    switch (dst->op) {
         case GGML_OP_ARGMAX:
-            func = ggml_sycl_argmax;
+            ggml_sycl_argmax(ctx, dst);
             break;
         case GGML_OP_CONV_TRANSPOSE_1D:
-            func = ggml_sycl_op_conv_transpose_1d;
+            ggml_sycl_op_conv_transpose_1d(ctx, dst);
             break;
         case GGML_OP_REPEAT:
-            func = ggml_sycl_repeat;
+            ggml_sycl_repeat(ctx, dst);
             break;
         case GGML_OP_GET_ROWS:
-            func = ggml_sycl_get_rows;
+            ggml_sycl_get_rows(ctx, dst);
             break;
         case GGML_OP_DUP:
-            func = ggml_sycl_dup;
+            ggml_sycl_dup(ctx, dst);
             break;
         case GGML_OP_ADD:
         case GGML_OP_ADD1: // TODO: more efficient implementation
-            func = ggml_sycl_add;
+            ggml_sycl_add(ctx, dst);
             break;
         case GGML_OP_SUB:
-            func = ggml_sycl_sub;
+            ggml_sycl_sub(ctx, dst);
             break;
         case GGML_OP_ACC:
-            func = ggml_sycl_acc;
+            ggml_sycl_acc(ctx, dst);
             break;
         case GGML_OP_MUL:
-            func = ggml_sycl_mul;
+            ggml_sycl_mul(ctx, dst);
             break;
         case GGML_OP_LOG:
-            func = ggml_sycl_log;
+            ggml_sycl_log(ctx, dst);
             break;
         case GGML_OP_DIV:
-            func = ggml_sycl_div;
+            ggml_sycl_div(ctx, dst);
             break;
         case GGML_OP_UNARY:
-            switch (ggml_get_unary_op(tensor)) {
+            switch (ggml_get_unary_op(dst)) {
                 case GGML_UNARY_OP_NEG:
-                    func = ggml_sycl_neg;
+                    ggml_sycl_neg(ctx, dst);
                     break;
                 case GGML_UNARY_OP_STEP:
-                    func = ggml_sycl_step;
+                    ggml_sycl_step(ctx, dst);
                     break;
                 case GGML_UNARY_OP_GELU:
-                    func = ggml_sycl_gelu;
+                    ggml_sycl_gelu(ctx, dst);
                     break;
                 case GGML_UNARY_OP_SILU:
-                    func = ggml_sycl_silu;
+                    ggml_sycl_silu(ctx, dst);
                     break;
                 case GGML_UNARY_OP_GELU_QUICK:
-                    func = ggml_sycl_gelu_quick;
+                    ggml_sycl_gelu_quick(ctx, dst);
                     break;
                 case GGML_UNARY_OP_TANH:
-                    func = ggml_sycl_tanh;
+                    ggml_sycl_tanh(ctx, dst);
                     break;
                 case GGML_UNARY_OP_RELU:
-                    func = ggml_sycl_relu;
+                    ggml_sycl_relu(ctx, dst);
                     break;
                 case GGML_UNARY_OP_SIGMOID:
-                    func = ggml_sycl_sigmoid;
+                    ggml_sycl_sigmoid(ctx, dst);
                     break;
                 case GGML_UNARY_OP_HARDSIGMOID:
-                    func = ggml_sycl_hardsigmoid;
+                    ggml_sycl_hardsigmoid(ctx, dst);
                     break;
                 case GGML_UNARY_OP_HARDSWISH:
-                    func = ggml_sycl_hardswish;
+                    ggml_sycl_hardswish(ctx, dst);
                     break;
                 case GGML_UNARY_OP_EXP:
-                    func = ggml_sycl_exp;
+                    ggml_sycl_exp(ctx, dst);
                     break;
                 default:
                     return false;
             }
             break;
         case GGML_OP_NORM:
-            func = ggml_sycl_norm;
+            ggml_sycl_norm(ctx, dst);
             break;
         case GGML_OP_GROUP_NORM:
-            func = ggml_sycl_group_norm;
+            ggml_sycl_group_norm(ctx, dst);
             break;
         case GGML_OP_CONCAT:
-            func = ggml_sycl_op_concat;
+            ggml_sycl_op_concat(ctx, dst);
             break;
         case GGML_OP_UPSCALE:
-            func = ggml_sycl_upscale;
+            ggml_sycl_upscale(ctx, dst);
             break;
         case GGML_OP_PAD:
-            func = ggml_sycl_pad;
+            ggml_sycl_pad(ctx, dst);
             break;
         case GGML_OP_LEAKY_RELU:
-            func = ggml_sycl_leaky_relu;
+            ggml_sycl_leaky_relu(ctx, dst);
             break;
         case GGML_OP_RMS_NORM:
-            func = ggml_sycl_rms_norm;
+            ggml_sycl_rms_norm(ctx, dst);
             break;
         case GGML_OP_MUL_MAT:
-            if (tensor->src[0]->ne[3] != tensor->src[1]->ne[3]) {
+            if (dst->src[0]->ne[3] != dst->src[1]->ne[3]) {
                 return false;
             }
-            func = ggml_sycl_mul_mat;
+            /* ggml_sycl_mul_mat_id is dependent on ggml_sycl_mul_mat */
+            ggml_sycl_mul_mat(ctx, dst->src[0], dst->src[1], dst);
             break;
         case GGML_OP_MUL_MAT_ID:
-            if (tensor->src[0]->ne[3] != tensor->src[1]->ne[3]) {
+            if (dst->src[0]->ne[3] != dst->src[1]->ne[3]) {
                 return false;
             }
-            func = ggml_sycl_mul_mat_id;
+            ggml_sycl_mul_mat_id(ctx, dst);
             break;
         case GGML_OP_OUT_PROD:
-            func = ggml_sycl_op_out_prod;
+            ggml_sycl_op_out_prod(ctx, dst);
             break;
         case GGML_OP_SCALE:
-            func = ggml_sycl_scale;
+            ggml_sycl_scale(ctx, dst);
             break;
         case GGML_OP_SQR:
-            func = ggml_sycl_sqr;
+            ggml_sycl_sqr(ctx, dst);
             break;
         case GGML_OP_SQRT:
-            func = ggml_sycl_sqrt;
+            ggml_sycl_sqrt(ctx, dst);
             break;
         case GGML_OP_SIN:
-            func = ggml_sycl_sin;
+            ggml_sycl_sin(ctx, dst);
             break;
         case GGML_OP_COS:
-            func = ggml_sycl_cos;
+            ggml_sycl_cos(ctx, dst);
             break;
         case GGML_OP_CLAMP:
-            func = ggml_sycl_clamp;
+            ggml_sycl_clamp(ctx, dst);
             break;
         case GGML_OP_CPY:
-            func = ggml_sycl_cpy;
+            ggml_sycl_cpy(ctx, dst->src[0], dst->src[1], dst);
             break;
         case GGML_OP_CONT:
-            func = ggml_sycl_dup;
+            ggml_sycl_dup(ctx, dst);
             break;
         case GGML_OP_NONE:
         case GGML_OP_RESHAPE:
         case GGML_OP_VIEW:
         case GGML_OP_PERMUTE:
         case GGML_OP_TRANSPOSE:
-            func = ggml_sycl_nop;
+            GGML_SYCL_DEBUG("%s: Tensor NO-OP\n", __func__);
             break;
         case GGML_OP_DIAG_MASK_INF:
-            func = ggml_sycl_diag_mask_inf;
+            ggml_sycl_diag_mask_inf(ctx, dst);
             break;
         case GGML_OP_SOFT_MAX:
-            func = ggml_sycl_soft_max;
+            ggml_sycl_soft_max(ctx, dst);
             break;
         case GGML_OP_ROPE:
-            func = ggml_sycl_rope;
+            ggml_sycl_rope(ctx, dst);
             break;
         case GGML_OP_IM2COL:
-            func = ggml_sycl_im2col;
+            ggml_sycl_im2col(ctx, dst);
             break;
         case GGML_OP_POOL_2D:
-            func = ggml_sycl_pool2d;
+            ggml_sycl_pool2d(ctx, dst);
             break;
         case GGML_OP_SUM:
-            func = ggml_sycl_sum;
+            ggml_sycl_sum(ctx, dst);
             break;
         case GGML_OP_SUM_ROWS:
-            func = ggml_sycl_sum_rows;
+            ggml_sycl_sum_rows(ctx, dst);
             break;
         case GGML_OP_ARGSORT:
-            func = ggml_sycl_argsort;
+            ggml_sycl_argsort(ctx, dst);
             break;
         case GGML_OP_TIMESTEP_EMBEDDING:
-            func = ggml_sycl_op_timestep_embedding;
+            ggml_sycl_op_timestep_embedding(ctx, dst);
             break;
         case GGML_OP_RWKV_WKV6:
-            func = ggml_sycl_op_rwkv_wkv6;
+            ggml_sycl_op_rwkv_wkv6(ctx, dst);
             break;
         default:
             return false;
     }

-    if (tensor->src[0] != nullptr && ggml_backend_buffer_is_sycl_split(tensor->src[0]->buffer)) {
-        ggml_sycl_set_peer_access(tensor->src[1]->ne[1], ctx.device);
-    }

-    func(ctx, tensor->src[0], tensor->src[1], tensor);
     return true;
 }

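The hunk above removes the `ggml_sycl_func_t func` indirection: every case now calls its operator directly with the destination node, and the split-buffer peer-access setup happens once before the switch. A condensed sketch of the resulting control flow, shown for only two ops as an illustration (not the full operator list, and not the verbatim function from the patch):

```cpp
// Condensed sketch of the new dispatch shape in ggml_sycl_compute_forward.
// Assumes the ggml-sycl headers are available; only two ops are shown.
static bool dispatch_sketch(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
    if (dst->src[0] != nullptr && ggml_backend_buffer_is_sycl_split(dst->src[0]->buffer)) {
        // peer access is now configured once, before dispatch
        ggml_sycl_set_peer_access(dst->src[1]->ne[1], ctx.device);
    }

    switch (dst->op) {
        case GGML_OP_SUM:
            ggml_sycl_sum(ctx, dst);                               // sources are read from dst->src[] inside the op
            break;
        case GGML_OP_MUL_MAT:
            ggml_sycl_mul_mat(ctx, dst->src[0], dst->src[1], dst); // mul_mat still takes explicit sources
            break;
        default:
            return false;                                          // op not handled by this backend
    }
    return true;
}
```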
@@ -3,9 +3,9 @@
 #include "outprod.hpp"


-void ggml_sycl_op_out_prod(ggml_backend_sycl_context& ctx, const ggml_tensor* src0,
+void ggml_sycl_op_out_prod(ggml_backend_sycl_context& ctx, ggml_tensor* dst) {
-    const ggml_tensor* src1, ggml_tensor* dst) {
+    const ggml_tensor *src0 = dst->src[0];
+    const ggml_tensor *src1 = dst->src[1];

     GGML_ASSERT(src0->type == GGML_TYPE_F32);
     GGML_ASSERT(src1->type == GGML_TYPE_F32);

@@ -3,8 +3,7 @@

 #include "common.hpp"

-void ggml_sycl_op_out_prod(ggml_backend_sycl_context& ctx, const ggml_tensor* src0,
+void ggml_sycl_op_out_prod(ggml_backend_sycl_context& ctx, ggml_tensor* dst);
-    const ggml_tensor* src1, ggml_tensor* dst);


 #endif // GGML_SYCL_OUTPROD_HPP

@@ -55,8 +55,9 @@ static void timestep_embedding_f32_sycl(
     });
 }

-void ggml_sycl_op_timestep_embedding(ggml_backend_sycl_context & ctx, const ggml_tensor *src0,
+void ggml_sycl_op_timestep_embedding(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
-    const ggml_tensor *src1, ggml_tensor * dst) {
+    const ggml_tensor *src0 = dst->src[0];
+    const ggml_tensor *src1 = dst->src[1];
     const float * src0_d = (const float *)src0->data;
     float * dst_d = (float *)dst->data;
     dpct::queue_ptr stream = ctx.stream();

@@ -15,7 +15,6 @@

 #include "common.hpp"

-void ggml_sycl_op_timestep_embedding(ggml_backend_sycl_context & ctx, const ggml_tensor *src0,
+void ggml_sycl_op_timestep_embedding(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
-    const ggml_tensor *src1, ggml_tensor * dst);

 #endif // GGML_SYCL_TSEMBD_HPP

@@ -95,8 +95,10 @@ static void rwkv_wkv_f32_kernel(
     }
 }

-void ggml_sycl_op_rwkv_wkv6(ggml_backend_sycl_context& ctx, const ggml_tensor* src0,
+void ggml_sycl_op_rwkv_wkv6(ggml_backend_sycl_context& ctx, ggml_tensor* dst) {
-    const ggml_tensor* src1, ggml_tensor* dst) {
+    const ggml_tensor *src0 = dst->src[0];
+    const ggml_tensor *src1 = dst->src[1];

     const float* k_d = (const float*)dst->src[0]->data;
     const float* v_d = (const float*)dst->src[1]->data;
@@ -107,9 +109,9 @@ void ggml_sycl_op_rwkv_wkv6(ggml_backend_sycl_context& ctx, const ggml_tensor* s
     float* dst_d = (float*)dst->data;

     const int64_t B = dst->src[5]->ne[1];
-    const int64_t T = dst->src[0]->ne[3];
+    const int64_t T = dst->src[0]->ne[2];
     const int64_t C = dst->ne[0];
-    const int64_t H = dst->src[0]->ne[2];
+    const int64_t H = dst->src[0]->ne[1];

     GGML_ASSERT(dst->src[5]->type == GGML_TYPE_F32);
     GGML_ASSERT(C % H == 0);
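The per-op SYCL entry points above all follow the same refactor as the dispatcher: the explicit `src0`/`src1` parameters are dropped and the operands are recovered from `dst->src[]` inside the function. A minimal sketch of that callee-side pattern, using a hypothetical `ggml_sycl_op_example` as a stand-in for ops such as `out_prod`, `timestep_embedding` or `rwkv_wkv6` (the function name and the second assert are illustrative, not from the patch):

```cpp
// Sketch of the updated single-argument signature; assumes the ggml-sycl headers.
static void ggml_sycl_op_example(ggml_backend_sycl_context & ctx, ggml_tensor * dst) {
    // operands are no longer passed explicitly - they are read from the dst node
    const ggml_tensor * src0 = dst->src[0];
    const ggml_tensor * src1 = dst->src[1];

    GGML_ASSERT(src0->type == GGML_TYPE_F32);
    GGML_ASSERT(src1 == nullptr || src1->type == GGML_TYPE_F32);

    // ... launch the SYCL kernel on ctx.stream() using src0/src1/dst as before ...
    GGML_UNUSED(src1);
}
```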
@@ -3,8 +3,7 @@

 #include "common.hpp"

-void ggml_sycl_op_rwkv_wkv6(ggml_backend_sycl_context & ctx, const ggml_tensor *src0,
+void ggml_sycl_op_rwkv_wkv6(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
-    const ggml_tensor *src1, ggml_tensor * dst);


 #endif // GGML_SYCL_WKV6_HPP

@@ -2277,6 +2277,7 @@ static vk_device ggml_vk_get_device(size_t idx) {
         if (device->subgroup_size_control) {
             device->subgroup_min_size = subgroup_size_control_props.minSubgroupSize;
             device->subgroup_max_size = subgroup_size_control_props.maxSubgroupSize;
+            device_extensions.push_back("VK_EXT_subgroup_size_control");
         }

         device->subgroup_size_control = device->subgroup_size_control &&
@@ -2285,7 +2286,6 @@ static vk_device ggml_vk_get_device(size_t idx) {

         if (device->subgroup_size_control) {
             device->subgroup_require_full_support = subgroup_size_control_features.computeFullSubgroups;
-            device_extensions.push_back("VK_EXT_subgroup_size_control");
         }

 #if defined(VK_KHR_cooperative_matrix)
@@ -5633,9 +5633,9 @@ static void ggml_vk_op_f32_rwkv6(ggml_backend_vk_context * ctx, vk_context& subc
 }

 static void ggml_vk_rwkv_wkv6(ggml_backend_vk_context * ctx, vk_context& subctx, ggml_tensor * dst, bool dryrun = false) {
-    const size_t seq_length = dst->src[0]->ne[3];
+    const size_t seq_length = dst->src[0]->ne[2];
     const size_t n_embed = dst->ne[0];
-    const size_t n_heads = dst->src[0]->ne[2];
+    const size_t n_heads = dst->src[0]->ne[1];
     const size_t n_seqs = dst->src[5]->ne[1];

     ggml_vk_op_f32_rwkv6(
@@ -1,9 +1,6 @@
 #version 450

-#ifdef FLOAT16
+#extension GL_EXT_shader_explicit_arithmetic_types_int32 : require
-#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require
-#endif
-#extension GL_EXT_shader_explicit_arithmetic_types : require

 #include "mul_mat_vec_base.comp"

@@ -27,8 +24,8 @@ void iter(inout FLOAT_TYPE temp[NUM_COLS][NUM_ROWS], const uint first_row, const

 #if K_PER_ITER == 8
 #if QUANT_R == 2
-        const B_TYPE_VEC4 bv02 = data_b_v4[(j*p.batch_stride_b + b_offset + iybs + iqs) / 4];
+        const vec4 bv02 = vec4(data_b_v4[(j*p.batch_stride_b + b_offset + iybs + iqs) / 4]);
-        const B_TYPE_VEC4 bv13 = data_b_v4[(j*p.batch_stride_b + b_offset + iybs + iqs + y_offset) / 4];
+        const vec4 bv13 = vec4(data_b_v4[(j*p.batch_stride_b + b_offset + iybs + iqs + y_offset) / 4]);
         const vec4 bv0 = vec4(bv02.x, bv13.x, bv02.y, bv13.y);
         const vec4 bv1 = vec4(bv02.z, bv13.z, bv02.w, bv13.w);
 #else

@@ -1,5 +1,5 @@
 #version 450
-#extension GL_EXT_shader_explicit_arithmetic_types : require
+#extension GL_EXT_shader_explicit_arithmetic_types_int32 : require

 #include "mul_mat_vec_base.comp"

@@ -40,9 +40,9 @@ void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {

     [[unroll]] for (uint n = 0; n < num_rows; ++n) {
         const uint ib0 = a_offset / QUANT_K + (first_row+n)*num_blocks_per_row;
-        f16vec2 d = data_a[ib0 + i].d;
+        vec2 d = vec2(data_a[ib0 + i].d);
-        const FLOAT_TYPE dall = d.x;
+        const FLOAT_TYPE dall = FLOAT_TYPE(d.x);
-        const FLOAT_TYPE dmin = d.y;
+        const FLOAT_TYPE dmin = FLOAT_TYPE(d.y);

         uint32_t s0_u32 = data_a_packed32[ib0 + i].scales[s_offset / 4 + 0];
         uint32_t s4_u32 = data_a_packed32[ib0 + i].scales[s_offset / 4 + 1];
@@ -63,14 +63,14 @@ void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {
         uvec2 qs16 = uvec2(unpack8(qs16_u16));

         [[unroll]] for (uint j = 0; j < NUM_COLS; ++j) {
-            B_TYPE_VEC2 b0 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 0];
+            vec2 b0 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 0]);
-            B_TYPE_VEC2 b16 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 8];
+            vec2 b16 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 8]);
-            B_TYPE_VEC2 b32 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 16];
+            vec2 b32 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 16]);
-            B_TYPE_VEC2 b48 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 24];
+            vec2 b48 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 24]);
-            B_TYPE_VEC2 b64 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 32];
+            vec2 b64 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 32]);
-            B_TYPE_VEC2 b80 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 40];
+            vec2 b80 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 40]);
-            B_TYPE_VEC2 b96 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 48];
+            vec2 b96 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 48]);
-            B_TYPE_VEC2 b112 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 56];
+            vec2 b112 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 56]);

             FLOAT_TYPE sum1 = FLOAT_TYPE(0.0);
             FLOAT_TYPE sum2 = FLOAT_TYPE(0.0);

@@ -1,5 +1,5 @@
 #version 450
-#extension GL_EXT_shader_explicit_arithmetic_types : require
+#extension GL_EXT_shader_explicit_arithmetic_types_int32 : require

 #include "mul_mat_vec_base.comp"

@@ -60,14 +60,14 @@ void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {

         [[unroll]] for (uint j = 0; j < NUM_COLS; ++j) {

-            B_TYPE_VEC2 b0 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 0];
+            vec2 b0 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 0]);
-            B_TYPE_VEC2 b16 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 8];
+            vec2 b16 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 8]);
-            B_TYPE_VEC2 b32 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 16];
+            vec2 b32 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 16]);
-            B_TYPE_VEC2 b48 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 24];
+            vec2 b48 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 24]);
-            B_TYPE_VEC2 b64 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 32];
+            vec2 b64 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 32]);
-            B_TYPE_VEC2 b80 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 40];
+            vec2 b80 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 40]);
-            B_TYPE_VEC2 b96 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 48];
+            vec2 b96 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 48]);
-            B_TYPE_VEC2 b112 = data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 56];
+            vec2 b112 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y_idx) / 2 + 56]);

             FLOAT_TYPE sum = FLOAT_TYPE(0.0);
             [[unroll]] for (int l = 0; l < 2; ++l) {

@@ -1,6 +1,6 @@
 #version 450

-#extension GL_EXT_shader_explicit_arithmetic_types : require
+#extension GL_EXT_shader_explicit_arithmetic_types_int32 : require

 #include "mul_mat_vec_base.comp"

@@ -45,7 +45,7 @@ void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {

     [[unroll]] for (uint n = 0; n < num_rows; ++n) {
         const uint ib0 = a_offset / QUANT_K + (first_row+n)*num_blocks_per_row;
-        f16vec2 d = data_a[ib0 + i].d;
+        vec2 d = vec2(data_a[ib0 + i].d);
         const FLOAT_TYPE dall = FLOAT_TYPE(d.x);
         const FLOAT_TYPE dmin = FLOAT_TYPE(d.y);

@@ -96,10 +96,10 @@ void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {
         const uint32_t q4_15 = qs64_hi4.w;

         [[unroll]] for (uint j = 0; j < NUM_COLS; ++j) {
-            B_TYPE_VEC4 by10 = data_b_v4[(j*p.batch_stride_b + b_offset + y1_idx) / 4];
+            vec4 by10 = vec4(data_b_v4[(j*p.batch_stride_b + b_offset + y1_idx) / 4 ]);
-            B_TYPE_VEC4 by132 = data_b_v4[(j*p.batch_stride_b + b_offset + y1_idx) / 4 + 8];
+            vec4 by132 = vec4(data_b_v4[(j*p.batch_stride_b + b_offset + y1_idx) / 4 + 8]);
-            B_TYPE_VEC4 by20 = data_b_v4[(j*p.batch_stride_b + b_offset + y2_idx) / 4];
+            vec4 by20 = vec4(data_b_v4[(j*p.batch_stride_b + b_offset + y2_idx) / 4 ]);
-            B_TYPE_VEC4 by232 = data_b_v4[(j*p.batch_stride_b + b_offset + y2_idx) / 4 + 8];
+            vec4 by232 = vec4(data_b_v4[(j*p.batch_stride_b + b_offset + y2_idx) / 4 + 8]);

             const FLOAT_TYPE sx = fma(FLOAT_TYPE(by10.x), q4_0, fma(FLOAT_TYPE(by10.y), q4_1, fma(FLOAT_TYPE(by10.z), q4_2, FLOAT_TYPE(by10.w) * q4_3)));
             const FLOAT_TYPE sy = fma(FLOAT_TYPE(by132.x), q4_4, fma(FLOAT_TYPE(by132.y), q4_5, fma(FLOAT_TYPE(by132.z), q4_6, FLOAT_TYPE(by132.w) * q4_7)));

@@ -1,6 +1,6 @@
 #version 450

-#extension GL_EXT_shader_explicit_arithmetic_types : require
+#extension GL_EXT_shader_explicit_arithmetic_types_int32 : require

 #include "mul_mat_vec_base.comp"

@@ -42,7 +42,7 @@ void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {

     [[unroll]] for (uint n = 0; n < num_rows; ++n) {
         const uint ib0 = a_offset / QUANT_K + (first_row+n)*num_blocks_per_row;
-        f16vec2 d = data_a[ib0 + i].d;
+        vec2 d = vec2(data_a[ib0 + i].d);
         const FLOAT_TYPE dall = FLOAT_TYPE(d.x);
         const FLOAT_TYPE dmin = FLOAT_TYPE(d.y);

@@ -105,14 +105,14 @@ void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {
         const uint32_t q4_15 = qs64_80_hi4.w;

         [[unroll]] for (uint j = 0; j < NUM_COLS; ++j) {
-            B_TYPE_VEC2 by10 = data_b_v2[(j*p.batch_stride_b + b_offset + y1_idx) / 2];
+            vec2 by10 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y1_idx) / 2 ]);
-            B_TYPE_VEC2 by116 = data_b_v2[(j*p.batch_stride_b + b_offset + y1_idx) / 2 + 8];
+            vec2 by116 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y1_idx) / 2 + 8]);
-            B_TYPE_VEC2 by132 = data_b_v2[(j*p.batch_stride_b + b_offset + y1_idx) / 2 + 16];
+            vec2 by132 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y1_idx) / 2 + 16]);
-            B_TYPE_VEC2 by148 = data_b_v2[(j*p.batch_stride_b + b_offset + y1_idx) / 2 + 24];
+            vec2 by148 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y1_idx) / 2 + 24]);
-            B_TYPE_VEC2 by20 = data_b_v2[(j*p.batch_stride_b + b_offset + y2_idx) / 2];
+            vec2 by20 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y2_idx) / 2 ]);
-            B_TYPE_VEC2 by216 = data_b_v2[(j*p.batch_stride_b + b_offset + y2_idx) / 2 + 8];
+            vec2 by216 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y2_idx) / 2 + 8]);
-            B_TYPE_VEC2 by232 = data_b_v2[(j*p.batch_stride_b + b_offset + y2_idx) / 2 + 16];
+            vec2 by232 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y2_idx) / 2 + 16]);
-            B_TYPE_VEC2 by248 = data_b_v2[(j*p.batch_stride_b + b_offset + y2_idx) / 2 + 24];
+            vec2 by248 = vec2(data_b_v2[(j*p.batch_stride_b + b_offset + y2_idx) / 2 + 24]);

             const FLOAT_TYPE sx =
                 fma(FLOAT_TYPE(by10.x), q4_0,

@@ -1,6 +1,6 @@
 #version 450

-#extension GL_EXT_shader_explicit_arithmetic_types : require
+#extension GL_EXT_shader_explicit_arithmetic_types_int32 : require

 #include "mul_mat_vec_base.comp"

@@ -77,10 +77,10 @@ void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {
         uvec4 q3 = uvec4(unpack8(q3_u32));

         [[unroll]] for (uint j = 0; j < NUM_COLS; ++j) {
-            B_TYPE_VEC4 by0 = data_b_v4[(j*p.batch_stride_b + b_offset + y_idx) / 4];
+            vec4 by0 = vec4(data_b_v4[(j*p.batch_stride_b + b_offset + y_idx) / 4 ]);
-            B_TYPE_VEC4 by32 = data_b_v4[(j*p.batch_stride_b + b_offset + y_idx) / 4 + 8];
+            vec4 by32 = vec4(data_b_v4[(j*p.batch_stride_b + b_offset + y_idx) / 4 + 8]);
-            B_TYPE_VEC4 by64 = data_b_v4[(j*p.batch_stride_b + b_offset + y_idx) / 4 + 16];
+            vec4 by64 = vec4(data_b_v4[(j*p.batch_stride_b + b_offset + y_idx) / 4 + 16]);
-            B_TYPE_VEC4 by96 = data_b_v4[(j*p.batch_stride_b + b_offset + y_idx) / 4 + 24];
+            vec4 by96 = vec4(data_b_v4[(j*p.batch_stride_b + b_offset + y_idx) / 4 + 24]);

             FLOAT_TYPE sum = FLOAT_TYPE(0.0);
             [[unroll]] for (int l = 0; l < 4; ++l) {

@@ -1,6 +1,5 @@
 #version 450

-#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require
 #extension GL_EXT_control_flow_attributes : enable

 layout (push_constant) uniform parameter

@@ -2,7 +2,10 @@
 #if !defined(GGML_TYPES_COMP)
 #define GGML_TYPES_COMP

-#extension GL_EXT_shader_explicit_arithmetic_types : require
+#extension GL_EXT_shader_explicit_arithmetic_types_int32 : require
+#extension GL_EXT_shader_explicit_arithmetic_types_int16 : require
+#extension GL_EXT_shader_explicit_arithmetic_types_int8 : require
+#extension GL_EXT_shader_16bit_storage : require

 #if defined(DATA_A_F32)
 #define QUANT_K 1
@@ -968,6 +968,7 @@ static const char * GGML_OP_NAME[GGML_OP_COUNT] = {
     "GET_REL_POS",
     "ADD_REL_POS",
     "RWKV_WKV6",
+    "GATED_LINEAR_ATTN",

     "UNARY",

@@ -987,7 +988,7 @@ static const char * GGML_OP_NAME[GGML_OP_COUNT] = {
     "OPT_STEP_ADAMW",
 };

-static_assert(GGML_OP_COUNT == 82, "GGML_OP_COUNT != 82");
+static_assert(GGML_OP_COUNT == 83, "GGML_OP_COUNT != 83");

 static const char * GGML_OP_SYMBOL[GGML_OP_COUNT] = {
     "none",
@@ -1064,6 +1065,7 @@ static const char * GGML_OP_SYMBOL[GGML_OP_COUNT] = {
     "get_rel_pos(x)",
     "add_rel_pos(x)",
     "rwkv_wkv6(k, v, r, tf, td, s)",
+    "gated_linear_attn(k, v, q, gate, s)",

     "unary(x)",

@@ -1083,7 +1085,7 @@ static const char * GGML_OP_SYMBOL[GGML_OP_COUNT] = {
     "adamw(x)",
 };

-static_assert(GGML_OP_COUNT == 82, "GGML_OP_COUNT != 82");
+static_assert(GGML_OP_COUNT == 83, "GGML_OP_COUNT != 83");

 static_assert(GGML_OP_POOL_COUNT == 2, "GGML_OP_POOL_COUNT != 2");

@ -4629,15 +4631,13 @@ struct ggml_tensor * ggml_rwkv_wkv6(
|
|||||||
GGML_ASSERT(ggml_is_contiguous(state));
|
GGML_ASSERT(ggml_is_contiguous(state));
|
||||||
|
|
||||||
const int64_t S = k->ne[0];
|
const int64_t S = k->ne[0];
|
||||||
const int64_t H = k->ne[2];
|
const int64_t H = k->ne[1];
|
||||||
const int64_t n_tokens = k->ne[3];
|
const int64_t n_tokens = k->ne[2];
|
||||||
const int64_t n_seqs = state->ne[1];
|
const int64_t n_seqs = state->ne[1];
|
||||||
{
|
{
|
||||||
GGML_ASSERT(k->ne[1] == 1);
|
GGML_ASSERT(v->ne[0] == S && v->ne[1] == H && v->ne[2] == n_tokens);
|
||||||
GGML_ASSERT(v->ne[0] == 1 && v->ne[1] == S && v->ne[2] == H && v->ne[3] == n_tokens);
|
GGML_ASSERT(r->ne[0] == S && r->ne[1] == H && r->ne[2] == n_tokens);
|
||||||
GGML_ASSERT(r->ne[0] == 1 && r->ne[1] == S && r->ne[2] == H && r->ne[3] == n_tokens);
|
GGML_ASSERT(td->ne[0] == S && td->ne[1] == H && td->ne[2] == n_tokens);
|
||||||
// TODO: RWKV v4 and v5
|
|
||||||
GGML_ASSERT(td->ne[0] == 1 && td->ne[1] == S && td->ne[2] == H && td->ne[3] == n_tokens);
|
|
||||||
GGML_ASSERT(ggml_nelements(state) == S * S * H * n_seqs);
|
GGML_ASSERT(ggml_nelements(state) == S * S * H * n_seqs);
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -4656,6 +4656,49 @@ struct ggml_tensor * ggml_rwkv_wkv6(
|
|||||||
return result;
|
return result;
|
||||||
}
|
}
|
||||||
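The hunk above changes the shape contract of the RWKV6 operands: the leading singleton dimension is dropped, so `k`, `v`, `r` and `td` are now laid out as `[S, H, n_tokens]` instead of `[1, S, H, n_tokens]`, which is why the SYCL and Vulkan backends earlier in this diff read `H` from `ne[1]` and the token count from `ne[2]`. A hedged sketch of how operands would be shaped under the new convention, assuming a valid `ggml_context` and purely illustrative sizes (the `tf` operand and the actual `ggml_rwkv_wkv6` call are omitted because their shapes are not part of this hunk):

```cpp
#include "ggml.h"

// Sketch only: RWKV6 operand shapes under the new convention.
void build_rwkv6_operands_sketch(ggml_context * ctx) {
    const int64_t S        = 64;  // head size (hypothetical)
    const int64_t H        = 32;  // number of heads (hypothetical)
    const int64_t n_tokens = 128; // tokens in the batch (hypothetical)
    const int64_t n_seqs   = 1;   // sequences sharing the recurrent state (hypothetical)

    // previously [1, S, H, n_tokens]; after this change the leading unit dimension is gone
    ggml_tensor * k  = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, S, H, n_tokens);
    ggml_tensor * v  = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, S, H, n_tokens);
    ggml_tensor * r  = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, S, H, n_tokens);
    ggml_tensor * td = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, S, H, n_tokens);

    // the recurrent state still holds S*S*H elements per sequence
    ggml_tensor * state = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, S * S * H, n_seqs);

    (void) k; (void) v; (void) r; (void) td; (void) state;
}
```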

+// ggml_gated_linear_attn
+
+struct ggml_tensor * ggml_gated_linear_attn(
+        struct ggml_context * ctx,
+        struct ggml_tensor * k,
+        struct ggml_tensor * v,
+        struct ggml_tensor * q,
+        struct ggml_tensor * g,
+        struct ggml_tensor * state,
+        float scale) {
+    GGML_ASSERT(ggml_is_contiguous(k));
+    GGML_ASSERT(ggml_is_contiguous(v));
+    GGML_ASSERT(ggml_is_contiguous(q));
+    GGML_ASSERT(ggml_is_contiguous(g));
+    GGML_ASSERT(ggml_is_contiguous(state));
+
+    const int64_t S = k->ne[0];
+    const int64_t H = k->ne[1];
+    const int64_t n_tokens = k->ne[2];
+    const int64_t n_seqs = state->ne[1];
+    {
+        GGML_ASSERT(v->ne[0] == S && v->ne[1] == H && v->ne[2] == n_tokens);
+        GGML_ASSERT(q->ne[0] == S && q->ne[1] == H && q->ne[2] == n_tokens);
+        GGML_ASSERT(g->ne[0] == S && g->ne[1] == H && g->ne[2] == n_tokens);
+        GGML_ASSERT(ggml_nelements(state) == S * S * H * n_seqs);
+    }
+
+    // concat output and new_state
+    const int64_t ne[4] = { S * H, n_tokens + S * n_seqs, 1, 1 };
+    struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne);
+
+    ggml_set_op_params_f32(result, 0, scale);
+
+    result->op = GGML_OP_GATED_LINEAR_ATTN;
+    result->src[0] = k;
+    result->src[1] = v;
+    result->src[2] = q;
+    result->src[3] = g;
+    result->src[4] = state;
+
+    return result;
+}

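The new builder above only records the op, its scale parameter and its five sources in the graph; the actual computation is done by the backends. A minimal usage sketch, assuming a valid `ggml_context` — the helper name, tensor sizes and the `1/sqrt(S)` scale are illustrative choices, not taken from the patch:

```cpp
#include "ggml.h"
#include <cmath>

// Usage sketch for the new graph-builder API, under the shape contract asserted above.
ggml_tensor * build_gla_node_sketch(ggml_context * ctx) {
    const int64_t S = 64, H = 16, n_tokens = 32, n_seqs = 1; // hypothetical sizes

    ggml_tensor * k     = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, S, H, n_tokens);
    ggml_tensor * v     = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, S, H, n_tokens);
    ggml_tensor * q     = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, S, H, n_tokens);
    ggml_tensor * g     = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, S, H, n_tokens);
    ggml_tensor * state = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, S * S * H, n_seqs);

    // the result packs the per-token output followed by the updated state,
    // as an F32 tensor of shape [S*H, n_tokens + S*n_seqs]
    return ggml_gated_linear_attn(ctx, k, v, q, g, state, 1.0f / sqrtf((float) S));
}
```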
 // ggml_unary

 static struct ggml_tensor * ggml_unary_impl(
@@ -15,13 +15,15 @@ pip install gguf

 [examples/writer.py](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/examples/writer.py) — Generates `example.gguf` in the current directory to demonstrate generating a GGUF file. Note that this file cannot be used as a model.

-[scripts/gguf_dump.py](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/scripts/gguf_dump.py) — Dumps a GGUF file's metadata to the console.
+[examples/reader.py](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/examples/reader.py) — Extracts and displays key-value pairs and tensor details from a GGUF file in a readable format.

-[scripts/gguf_set_metadata.py](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/scripts/gguf_set_metadata.py) — Allows changing simple metadata values in a GGUF file by key.
+[gguf/scripts/gguf_dump.py](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/gguf/scripts/gguf_dump.py) — Dumps a GGUF file's metadata to the console.

-[scripts/gguf_convert_endian.py](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/scripts/gguf_convert_endian.py) — Allows converting the endianness of GGUF files.
+[gguf/scripts/gguf_set_metadata.py](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/gguf/scripts/gguf_set_metadata.py) — Allows changing simple metadata values in a GGUF file by key.

-[scripts/gguf_new_metadata.py](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/scripts/gguf_new_metadata.py) — Copies a GGUF file with added/modified/removed metadata values.
+[gguf/scripts/gguf_convert_endian.py](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/gguf/scripts/gguf_convert_endian.py) — Allows converting the endianness of GGUF files.

+[gguf/scripts/gguf_new_metadata.py](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/gguf/scripts/gguf_new_metadata.py) — Copies a GGUF file with added/modified/removed metadata values.

 ## Development
 Maintainers who participate in development of this package are advised to install it in editable mode:
@@ -115,6 +115,7 @@ class Keys:
         TIME_DECAY_EXTRA_DIM = "{arch}.time_decay_extra_dim"
         RESIDUAL_SCALE = "{arch}.residual_scale"
         EMBEDDING_SCALE = "{arch}.embedding_scale"
+        TOKEN_SHIFT_COUNT = "{arch}.token_shift_count"

     class Attention:
         HEAD_COUNT = "{arch}.attention.head_count"
@@ -183,7 +184,6 @@ class Keys:
         UNK_ID = "tokenizer.ggml.unknown_token_id"
         SEP_ID = "tokenizer.ggml.seperator_token_id"
         PAD_ID = "tokenizer.ggml.padding_token_id"
-        CLS_ID = "tokenizer.ggml.cls_token_id"
         MASK_ID = "tokenizer.ggml.mask_token_id"
         ADD_BOS = "tokenizer.ggml.add_bos_token"
         ADD_EOS = "tokenizer.ggml.add_eos_token"
@@ -244,6 +244,7 @@ class MODEL_ARCH(IntEnum):
     QWEN2VL = auto()
     PHI2 = auto()
     PHI3 = auto()
+    PHIMOE = auto()
     PLAMO = auto()
     CODESHELL = auto()
     ORION = auto()
@@ -254,6 +255,7 @@ class MODEL_ARCH(IntEnum):
     GEMMA2 = auto()
     STARCODER2 = auto()
     RWKV6 = auto()
+    RWKV6QWEN2 = auto()
     MAMBA = auto()
     XVERSE = auto()
     COMMAND_R = auto()
@@ -333,6 +335,7 @@ class MODEL_TENSOR(IntEnum):
     TIME_MIX_LERP_V = auto()
     TIME_MIX_LERP_R = auto()
     TIME_MIX_LERP_G = auto()
+    TIME_MIX_LERP_FUSED = auto()
     TIME_MIX_LERP_W = auto()
     TIME_MIX_FIRST = auto()
     TIME_MIX_DECAY = auto()
@@ -428,6 +431,7 @@ MODEL_ARCH_NAMES: dict[MODEL_ARCH, str] = {
     MODEL_ARCH.QWEN2VL: "qwen2vl",
     MODEL_ARCH.PHI2: "phi2",
     MODEL_ARCH.PHI3: "phi3",
+    MODEL_ARCH.PHIMOE: "phimoe",
     MODEL_ARCH.PLAMO: "plamo",
     MODEL_ARCH.CODESHELL: "codeshell",
     MODEL_ARCH.ORION: "orion",
@@ -438,6 +442,7 @@ MODEL_ARCH_NAMES: dict[MODEL_ARCH, str] = {
     MODEL_ARCH.GEMMA2: "gemma2",
     MODEL_ARCH.STARCODER2: "starcoder2",
     MODEL_ARCH.RWKV6: "rwkv6",
+    MODEL_ARCH.RWKV6QWEN2: "rwkv6qwen2",
     MODEL_ARCH.MAMBA: "mamba",
     MODEL_ARCH.XVERSE: "xverse",
     MODEL_ARCH.COMMAND_R: "command-r",
@@ -517,6 +522,7 @@ TENSOR_NAMES: dict[MODEL_TENSOR, str] = {
     MODEL_TENSOR.TIME_MIX_LERP_V: "blk.{bid}.time_mix_lerp_v",
     MODEL_TENSOR.TIME_MIX_LERP_R: "blk.{bid}.time_mix_lerp_r",
     MODEL_TENSOR.TIME_MIX_LERP_G: "blk.{bid}.time_mix_lerp_g",
+    MODEL_TENSOR.TIME_MIX_LERP_FUSED: "blk.{bid}.time_mix_lerp_fused",
     MODEL_TENSOR.TIME_MIX_LERP_W: "blk.{bid}.time_mix_lerp_w",
     MODEL_TENSOR.TIME_MIX_FIRST: "blk.{bid}.time_mix_first",
     MODEL_TENSOR.TIME_MIX_DECAY: "blk.{bid}.time_mix_decay",
@@ -940,6 +946,24 @@ MODEL_TENSORS: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
         MODEL_TENSOR.FFN_DOWN,
         MODEL_TENSOR.FFN_UP,
     ],
+    MODEL_ARCH.PHIMOE: [
+        MODEL_TENSOR.TOKEN_EMBD,
+        MODEL_TENSOR.OUTPUT_NORM,
+        MODEL_TENSOR.OUTPUT,
+        MODEL_TENSOR.ROPE_FACTORS_LONG,
+        MODEL_TENSOR.ROPE_FACTORS_SHORT,
+        MODEL_TENSOR.ATTN_NORM,
+        MODEL_TENSOR.ATTN_QKV,
+        MODEL_TENSOR.ATTN_Q,
+        MODEL_TENSOR.ATTN_K,
+        MODEL_TENSOR.ATTN_V,
+        MODEL_TENSOR.ATTN_OUT,
+        MODEL_TENSOR.FFN_NORM,
+        MODEL_TENSOR.FFN_GATE_INP,
+        MODEL_TENSOR.FFN_GATE_EXP,
+        MODEL_TENSOR.FFN_DOWN_EXP,
+        MODEL_TENSOR.FFN_UP_EXP,
+    ],
     MODEL_ARCH.CODESHELL: [
         MODEL_TENSOR.TOKEN_EMBD,
         MODEL_TENSOR.POS_EMBD,
@@ -1083,6 +1107,7 @@ MODEL_TENSORS: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
         MODEL_TENSOR.TIME_MIX_LERP_R,
         MODEL_TENSOR.TIME_MIX_LERP_G,
         MODEL_TENSOR.TIME_MIX_LERP_W,
+        MODEL_TENSOR.TIME_MIX_LERP_FUSED,
         MODEL_TENSOR.TIME_MIX_FIRST,
         MODEL_TENSOR.TIME_MIX_DECAY,
         MODEL_TENSOR.TIME_MIX_DECAY_W1,
@@ -1099,6 +1124,35 @@ MODEL_TENSORS: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
         MODEL_TENSOR.CHANNEL_MIX_RECEPTANCE,
         MODEL_TENSOR.CHANNEL_MIX_VALUE,
     ],
+    MODEL_ARCH.RWKV6QWEN2: [
+        MODEL_TENSOR.TOKEN_EMBD,
+        MODEL_TENSOR.OUTPUT_NORM,
+        MODEL_TENSOR.OUTPUT,
+        MODEL_TENSOR.ATTN_NORM,
+        MODEL_TENSOR.TIME_MIX_W1,
+        MODEL_TENSOR.TIME_MIX_W2,
+        MODEL_TENSOR.TIME_MIX_LERP_X,
+        MODEL_TENSOR.TIME_MIX_LERP_K,
+        MODEL_TENSOR.TIME_MIX_LERP_V,
+        MODEL_TENSOR.TIME_MIX_LERP_R,
+        MODEL_TENSOR.TIME_MIX_LERP_G,
+        MODEL_TENSOR.TIME_MIX_LERP_W,
+        MODEL_TENSOR.TIME_MIX_LERP_FUSED,
+        MODEL_TENSOR.TIME_MIX_FIRST,
+        MODEL_TENSOR.TIME_MIX_DECAY,
+        MODEL_TENSOR.TIME_MIX_DECAY_W1,
+        MODEL_TENSOR.TIME_MIX_DECAY_W2,
+        MODEL_TENSOR.TIME_MIX_KEY,
+        MODEL_TENSOR.TIME_MIX_VALUE,
+        MODEL_TENSOR.TIME_MIX_RECEPTANCE,
+        MODEL_TENSOR.TIME_MIX_GATE,
+        MODEL_TENSOR.TIME_MIX_LN,
+        MODEL_TENSOR.TIME_MIX_OUTPUT,
+        MODEL_TENSOR.FFN_NORM,
+        MODEL_TENSOR.FFN_GATE,
+        MODEL_TENSOR.FFN_DOWN,
+        MODEL_TENSOR.FFN_UP,
+    ],
     MODEL_ARCH.MAMBA: [
         MODEL_TENSOR.TOKEN_EMBD,
         MODEL_TENSOR.OUTPUT_NORM,
|
|||||||
KEY_TOKENIZER_UNK_ID = Keys.Tokenizer.UNK_ID
|
KEY_TOKENIZER_UNK_ID = Keys.Tokenizer.UNK_ID
|
||||||
KEY_TOKENIZER_SEP_ID = Keys.Tokenizer.SEP_ID
|
KEY_TOKENIZER_SEP_ID = Keys.Tokenizer.SEP_ID
|
||||||
KEY_TOKENIZER_PAD_ID = Keys.Tokenizer.PAD_ID
|
KEY_TOKENIZER_PAD_ID = Keys.Tokenizer.PAD_ID
|
||||||
KEY_TOKENIZER_CLS_ID = Keys.Tokenizer.CLS_ID
|
|
||||||
KEY_TOKENIZER_MASK_ID = Keys.Tokenizer.MASK_ID
|
KEY_TOKENIZER_MASK_ID = Keys.Tokenizer.MASK_ID
|
||||||
KEY_TOKENIZER_HF_JSON = Keys.Tokenizer.HF_JSON
|
KEY_TOKENIZER_HF_JSON = Keys.Tokenizer.HF_JSON
|
||||||
KEY_TOKENIZER_RWKV = Keys.Tokenizer.RWKV
|
KEY_TOKENIZER_RWKV = Keys.Tokenizer.RWKV
|
||||||
|
@ -743,6 +743,9 @@ class GGUFWriter:
|
|||||||
def add_wkv_head_size(self, size: int) -> None:
|
def add_wkv_head_size(self, size: int) -> None:
|
||||||
self.add_uint32(Keys.WKV.HEAD_SIZE.format(arch=self.arch), size)
|
self.add_uint32(Keys.WKV.HEAD_SIZE.format(arch=self.arch), size)
|
||||||
|
|
||||||
|
def add_token_shift_count(self, count: int) -> None:
|
||||||
|
self.add_uint32(Keys.LLM.TOKEN_SHIFT_COUNT.format(arch=self.arch), count)
|
||||||
|
|
||||||
def add_layer_norm_eps(self, value: float) -> None:
|
def add_layer_norm_eps(self, value: float) -> None:
|
||||||
self.add_float32(Keys.Attention.LAYERNORM_EPS.format(arch=self.arch), value)
|
self.add_float32(Keys.Attention.LAYERNORM_EPS.format(arch=self.arch), value)
|
||||||
|
|
||||||
@ -854,9 +857,6 @@ class GGUFWriter:
|
|||||||
def add_pad_token_id(self, id: int) -> None:
|
def add_pad_token_id(self, id: int) -> None:
|
||||||
self.add_uint32(Keys.Tokenizer.PAD_ID, id)
|
self.add_uint32(Keys.Tokenizer.PAD_ID, id)
|
||||||
|
|
||||||
def add_cls_token_id(self, id: int) -> None:
|
|
||||||
self.add_uint32(Keys.Tokenizer.CLS_ID, id)
|
|
||||||
|
|
||||||
def add_mask_token_id(self, id: int) -> None:
|
def add_mask_token_id(self, id: int) -> None:
|
||||||
self.add_uint32(Keys.Tokenizer.MASK_ID, id)
|
self.add_uint32(Keys.Tokenizer.MASK_ID, id)
|
||||||
|
|
||||||
|
@@ -11,8 +11,8 @@ from pathlib import Path
 import numpy as np

 # Necessary to load the local gguf package
-if "NO_LOCAL_GGUF" not in os.environ and (Path(__file__).parent.parent.parent / 'gguf-py').exists():
+if "NO_LOCAL_GGUF" not in os.environ and (Path(__file__).parent.parent.parent.parent / 'gguf-py').exists():
-    sys.path.insert(0, str(Path(__file__).parent.parent))
+    sys.path.insert(0, str(Path(__file__).parent.parent.parent))

 import gguf

@@ -12,8 +12,8 @@ from typing import Any
 import numpy as np

 # Necessary to load the local gguf package
-if "NO_LOCAL_GGUF" not in os.environ and (Path(__file__).parent.parent.parent / 'gguf-py').exists():
+if "NO_LOCAL_GGUF" not in os.environ and (Path(__file__).parent.parent.parent.parent / 'gguf-py').exists():
-    sys.path.insert(0, str(Path(__file__).parent.parent))
+    sys.path.insert(0, str(Path(__file__).parent.parent.parent))

 from gguf import GGUFReader, GGUFValueType, ReaderTensor # noqa: E402

@@ -13,8 +13,8 @@ from pathlib import Path
 from tqdm import tqdm

 # Necessary to load the local gguf package
-if "NO_LOCAL_GGUF" not in os.environ and (Path(__file__).parent.parent.parent / 'gguf-py').exists():
+if "NO_LOCAL_GGUF" not in os.environ and (Path(__file__).parent.parent.parent.parent / 'gguf-py').exists():
-    sys.path.insert(0, str(Path(__file__).parent.parent))
+    sys.path.insert(0, str(Path(__file__).parent.parent.parent))

 from gguf import GGUFReader # noqa: E402

@@ -13,8 +13,8 @@ from tqdm import tqdm
 from typing import Any, Sequence, NamedTuple

 # Necessary to load the local gguf package
-if "NO_LOCAL_GGUF" not in os.environ and (Path(__file__).parent.parent.parent / 'gguf-py').exists():
+if "NO_LOCAL_GGUF" not in os.environ and (Path(__file__).parent.parent.parent.parent / 'gguf-py').exists():
-    sys.path.insert(0, str(Path(__file__).parent.parent))
+    sys.path.insert(0, str(Path(__file__).parent.parent.parent))

 import gguf

@@ -6,8 +6,8 @@ import sys
 from pathlib import Path

 # Necessary to load the local gguf package
-if "NO_LOCAL_GGUF" not in os.environ and (Path(__file__).parent.parent.parent / 'gguf-py').exists():
+if "NO_LOCAL_GGUF" not in os.environ and (Path(__file__).parent.parent.parent.parent / 'gguf-py').exists():
-    sys.path.insert(0, str(Path(__file__).parent.parent))
+    sys.path.insert(0, str(Path(__file__).parent.parent.parent))

 from gguf import GGUFReader # noqa: E402

@@ -13,7 +13,7 @@ class TensorNameMap:
             "transformer.wte", # gpt2 gpt-j mpt refact qwen dbrx jais exaone
             "transformer.word_embeddings", # falcon
             "word_embeddings", # bloom
-            "model.embed_tokens", # llama-hf nemotron olmoe olmo2
+            "model.embed_tokens", # llama-hf nemotron olmoe olmo2 rwkv6qwen2
             "tok_embeddings", # llama-pth
             "embeddings.word_embeddings", # bert nomic-bert
             "language_model.embedding.word_embeddings", # persimmon
@@ -55,7 +55,7 @@ class TensorNameMap:
         # Output
         MODEL_TENSOR.OUTPUT: (
             "embed_out", # gptneox
-            "lm_head", # gpt2 mpt falcon llama-hf baichuan qwen mamba dbrx jais nemotron exaone olmoe olmo2
+            "lm_head", # gpt2 mpt falcon llama-hf baichuan qwen mamba dbrx jais nemotron exaone olmoe olmo2 phimoe
             "output", # llama-pth bloom internlm2
             "word_embeddings_for_head", # persimmon
             "lm_head.linear", # phi2
@@ -68,7 +68,7 @@ class TensorNameMap:
         MODEL_TENSOR.OUTPUT_NORM: (
             "gpt_neox.final_layer_norm", # gptneox
             "transformer.ln_f", # gpt2 gpt-j falcon jais exaone
-            "model.norm", # llama-hf baichuan internlm2 olmoe olmo2
+            "model.norm", # llama-hf baichuan internlm2 olmoe olmo2 phimoe
             "norm", # llama-pth
             "transformer.norm_f", # mpt dbrx
             "ln_f", # refact bloom qwen gpt2
@@ -108,7 +108,7 @@ class TensorNameMap:
             "transformer.h.{bid}.input_layernorm", # falcon7b
             "h.{bid}.input_layernorm", # bloom
             "transformer.h.{bid}.ln_mlp", # falcon40b
-            "model.layers.{bid}.input_layernorm", # llama-hf nemotron olmoe
+            "model.layers.{bid}.input_layernorm", # llama-hf nemotron olmoe phimoe
             "layers.{bid}.attention_norm", # llama-pth
             "language_model.encoder.layers.{bid}.input_layernorm", # persimmon
             "model.layers.{bid}.ln1", # yi
@@ -152,7 +152,7 @@ class TensorNameMap:

         # Attention query
         MODEL_TENSOR.ATTN_Q: (
-            "model.layers.{bid}.self_attn.q_proj", # llama-hf nemotron olmoe olmo2
+            "model.layers.{bid}.self_attn.q_proj", # llama-hf nemotron olmoe olmo2 phimoe
             "model.layers.{bid}.self_attn.q_proj_no_perm", # llama-custom
             "layers.{bid}.attention.wq", # llama-pth
             "encoder.layer.{bid}.attention.self.query", # bert
@@ -165,7 +165,7 @@ class TensorNameMap:

         # Attention key
         MODEL_TENSOR.ATTN_K: (
-            "model.layers.{bid}.self_attn.k_proj", # llama-hf nemotron olmoe olmo2
+            "model.layers.{bid}.self_attn.k_proj", # llama-hf nemotron olmoe olmo2 phimoe
             "model.layers.{bid}.self_attn.k_proj_no_perm", # llama-custom
             "layers.{bid}.attention.wk", # llama-pth
             "encoder.layer.{bid}.attention.self.key", # bert
@@ -179,7 +179,7 @@ class TensorNameMap:

         # Attention value
         MODEL_TENSOR.ATTN_V: (
-            "model.layers.{bid}.self_attn.v_proj", # llama-hf nemotron olmoe olmo2
+            "model.layers.{bid}.self_attn.v_proj", # llama-hf nemotron olmoe olmo2 phimoe
             "layers.{bid}.attention.wv", # llama-pth
             "encoder.layer.{bid}.attention.self.value", # bert
             "transformer.h.{bid}.attn.v_proj", # gpt-j
@@ -197,7 +197,7 @@ class TensorNameMap:
             "transformer.blocks.{bid}.attn.out_proj", # mpt
             "transformer.h.{bid}.self_attention.dense", # falcon
             "h.{bid}.self_attention.dense", # bloom
-            "model.layers.{bid}.self_attn.o_proj", # llama-hf nemotron olmoe olmo2
+            "model.layers.{bid}.self_attn.o_proj", # llama-hf nemotron olmoe olmo2 phimoe
             "model.layers.{bid}.self_attn.linear_attn", # deci
             "layers.{bid}.attention.wo", # llama-pth
             "encoder.layer.{bid}.attention.output.dense", # bert
@@ -242,7 +242,7 @@ class TensorNameMap:
             "transformer.h.{bid}.ln_2", # gpt2 refact qwen jais exaone
             "h.{bid}.post_attention_layernorm", # bloom
             "transformer.blocks.{bid}.norm_2", # mpt
-            "model.layers.{bid}.post_attention_layernorm", # llama-hf nemotron olmoe
+            "model.layers.{bid}.post_attention_layernorm", # llama-hf nemotron olmoe phimoe
             "layers.{bid}.ffn_norm", # llama-pth
             "language_model.encoder.layers.{bid}.post_attention_layernorm", # persimmon
             "model.layers.{bid}.ln2", # yi
@@ -265,7 +265,7 @@ class TensorNameMap:

         MODEL_TENSOR.FFN_GATE_INP: (
             "layers.{bid}.feed_forward.gate", # mixtral
-            "model.layers.{bid}.block_sparse_moe.gate", # mixtral
+            "model.layers.{bid}.block_sparse_moe.gate", # mixtral phimoe
             "model.layers.{bid}.mlp.gate", # qwen2moe olmoe
             "transformer.decoder_layer.{bid}.router", # Grok
             "transformer.blocks.{bid}.ffn.router.layer", # dbrx
@@ -310,10 +310,11 @@ class TensorNameMap:
         ),

         MODEL_TENSOR.FFN_UP_EXP: (
             "layers.{bid}.feed_forward.experts.w3", # mixtral (merged)
             "transformer.decoder_layer.{bid}.moe.linear_v", # Grok (merged)
             "transformer.blocks.{bid}.ffn.experts.mlp.v1", # dbrx
             "model.layers.{bid}.mlp.experts.up_proj", # qwen2moe olmoe (merged)
+            "model.layers.{bid}.block_sparse_moe.experts.w3", # phimoe (merged)
         ),

         MODEL_TENSOR.FFN_UP_SHEXP: (
@@ -342,10 +343,11 @@ class TensorNameMap:
         ),

         MODEL_TENSOR.FFN_GATE_EXP: (
             "layers.{bid}.feed_forward.experts.w1", # mixtral (merged)
             "transformer.decoder_layer.{bid}.moe.linear", # Grok (merged)
             "transformer.blocks.{bid}.ffn.experts.mlp.w1", # dbrx
             "model.layers.{bid}.mlp.experts.gate_proj", # qwen2moe olmoe (merged)
+            "model.layers.{bid}.block_sparse_moe.experts.w1", # phimoe (merged)
         ),

         MODEL_TENSOR.FFN_GATE_SHEXP: (
@@ -387,6 +389,7 @@ class TensorNameMap:
             "transformer.blocks.{bid}.ffn.experts.mlp.w2", # dbrx
             "model.layers.{bid}.mlp.experts.down_proj", # qwen2moe olmoe (merged)
             "model.layers.{bid}.block_sparse_moe.output_linear", # granitemoe
+            "model.layers.{bid}.block_sparse_moe.experts.w2", # phimoe (merged)
         ),

         MODEL_TENSOR.FFN_DOWN_SHEXP: (
@@ -461,34 +464,42 @@ class TensorNameMap:

         MODEL_TENSOR.TIME_MIX_W1: (
             "rwkv.blocks.{bid}.attention.time_maa_w1", # rwkv v6
+            "model.layers.{bid}.self_attn.time_maa_w1", # rwkv6qwen2
         ),

         MODEL_TENSOR.TIME_MIX_W2: (
             "rwkv.blocks.{bid}.attention.time_maa_w2", # rwkv v6
+            "model.layers.{bid}.self_attn.time_maa_w2", # rwkv6qwen2
         ),

         MODEL_TENSOR.TIME_MIX_LERP_X: (
             "rwkv.blocks.{bid}.attention.time_maa_x", # rwkv v6
+            "model.layers.{bid}.self_attn.time_maa_x", # rwkv6qwen2
         ),

         MODEL_TENSOR.TIME_MIX_LERP_K: (
             "rwkv.blocks.{bid}.attention.time_maa_k", # rwkv v6
+            "model.layers.{bid}.self_attn.time_maa_k", # rwkv6qwen2
         ),

         MODEL_TENSOR.TIME_MIX_LERP_V: (
             "rwkv.blocks.{bid}.attention.time_maa_v", # rwkv v6
+            "model.layers.{bid}.self_attn.time_maa_v", # rwkv6qwen2
         ),

         MODEL_TENSOR.TIME_MIX_LERP_R: (
             "rwkv.blocks.{bid}.attention.time_maa_r", # rwkv v6
+            "model.layers.{bid}.self_attn.time_maa_r", # rwkv6qwen2
         ),

         MODEL_TENSOR.TIME_MIX_LERP_G: (
|
MODEL_TENSOR.TIME_MIX_LERP_G: (
|
||||||
"rwkv.blocks.{bid}.attention.time_maa_g", # rwkv v6
|
"rwkv.blocks.{bid}.attention.time_maa_g", # rwkv v6
|
||||||
|
"model.layers.{bid}.self_attn.time_maa_g", # rwkv6qwen2
|
||||||
),
|
),
|
||||||
|
|
||||||
MODEL_TENSOR.TIME_MIX_LERP_W: (
|
MODEL_TENSOR.TIME_MIX_LERP_W: (
|
||||||
"rwkv.blocks.{bid}.attention.time_maa_w", # rwkv v6
|
"rwkv.blocks.{bid}.attention.time_maa_w", # rwkv v6
|
||||||
|
"model.layers.{bid}.self_attn.time_maa_w", # rwkv6qwen2
|
||||||
),
|
),
|
||||||
|
|
||||||
MODEL_TENSOR.TIME_MIX_FIRST: (
|
MODEL_TENSOR.TIME_MIX_FIRST: (
|
||||||
@ -497,30 +508,37 @@ class TensorNameMap:
|
|||||||
|
|
||||||
MODEL_TENSOR.TIME_MIX_DECAY: (
|
MODEL_TENSOR.TIME_MIX_DECAY: (
|
||||||
"rwkv.blocks.{bid}.attention.time_decay", # rwkv v6
|
"rwkv.blocks.{bid}.attention.time_decay", # rwkv v6
|
||||||
|
"model.layers.{bid}.self_attn.time_decay", # rwkv6qwen2
|
||||||
),
|
),
|
||||||
|
|
||||||
MODEL_TENSOR.TIME_MIX_DECAY_W1: (
|
MODEL_TENSOR.TIME_MIX_DECAY_W1: (
|
||||||
"rwkv.blocks.{bid}.attention.time_decay_w1", # rwkv v6
|
"rwkv.blocks.{bid}.attention.time_decay_w1", # rwkv v6
|
||||||
|
"model.layers.{bid}.self_attn.time_decay_w1", # rwkv6qwen2
|
||||||
),
|
),
|
||||||
|
|
||||||
MODEL_TENSOR.TIME_MIX_DECAY_W2: (
|
MODEL_TENSOR.TIME_MIX_DECAY_W2: (
|
||||||
"rwkv.blocks.{bid}.attention.time_decay_w2", # rwkv v6
|
"rwkv.blocks.{bid}.attention.time_decay_w2", # rwkv v6
|
||||||
|
"model.layers.{bid}.self_attn.time_decay_w2", # rwkv6qwen2
|
||||||
),
|
),
|
||||||
|
|
||||||
MODEL_TENSOR.TIME_MIX_KEY: (
|
MODEL_TENSOR.TIME_MIX_KEY: (
|
||||||
"rwkv.blocks.{bid}.attention.key", # rwkv
|
"rwkv.blocks.{bid}.attention.key", # rwkv
|
||||||
|
"model.layers.{bid}.self_attn.k_proj", # rwkv6qwen2
|
||||||
),
|
),
|
||||||
|
|
||||||
MODEL_TENSOR.TIME_MIX_VALUE: (
|
MODEL_TENSOR.TIME_MIX_VALUE: (
|
||||||
"rwkv.blocks.{bid}.attention.value", # rwkv
|
"rwkv.blocks.{bid}.attention.value", # rwkv
|
||||||
|
"model.layers.{bid}.self_attn.v_proj", # rwkv6qwen2
|
||||||
),
|
),
|
||||||
|
|
||||||
MODEL_TENSOR.TIME_MIX_RECEPTANCE: (
|
MODEL_TENSOR.TIME_MIX_RECEPTANCE: (
|
||||||
"rwkv.blocks.{bid}.attention.receptance", # rwkv
|
"rwkv.blocks.{bid}.attention.receptance", # rwkv
|
||||||
|
"model.layers.{bid}.self_attn.q_proj", # rwkv6qwen2
|
||||||
),
|
),
|
||||||
|
|
||||||
MODEL_TENSOR.TIME_MIX_GATE: (
|
MODEL_TENSOR.TIME_MIX_GATE: (
|
||||||
"rwkv.blocks.{bid}.attention.gate", # rwkv
|
"rwkv.blocks.{bid}.attention.gate", # rwkv
|
||||||
|
"model.layers.{bid}.self_attn.gate", # rwkv6qwen2
|
||||||
),
|
),
|
||||||
|
|
||||||
MODEL_TENSOR.TIME_MIX_LN: (
|
MODEL_TENSOR.TIME_MIX_LN: (
|
||||||
@ -528,7 +546,8 @@ class TensorNameMap:
|
|||||||
),
|
),
|
||||||
|
|
||||||
MODEL_TENSOR.TIME_MIX_OUTPUT: (
|
MODEL_TENSOR.TIME_MIX_OUTPUT: (
|
||||||
"rwkv.blocks.{bid}.attention.output", # rwkv
|
"rwkv.blocks.{bid}.attention.output", # rwkv
|
||||||
|
"model.layers.{bid}.self_attn.o_proj", # rwkv6qwen2
|
||||||
),
|
),
|
||||||
|
|
||||||
MODEL_TENSOR.CHANNEL_MIX_LERP_K: (
|
MODEL_TENSOR.CHANNEL_MIX_LERP_K: (
|
||||||
|
@@ -1,12 +1,11 @@
 [tool.poetry]
 name = "gguf"
-version = "0.13.0"
+version = "0.15.0"
 description = "Read and write ML models in GGUF for GGML"
 authors = ["GGML <ggml@ggml.ai>"]
 packages = [
     {include = "gguf"},
     {include = "gguf/py.typed"},
-    {include = "scripts"},
 ]
 readme = "README.md"
 homepage = "https://ggml.ai"
@@ -33,7 +32,7 @@ requires = ["poetry-core>=1.0.0"]
 build-backend = "poetry.core.masonry.api"

 [tool.poetry.scripts]
-gguf-convert-endian = "scripts:gguf_convert_endian_entrypoint"
-gguf-dump = "scripts:gguf_dump_entrypoint"
-gguf-set-metadata = "scripts:gguf_set_metadata_entrypoint"
-gguf-new-metadata = "scripts:gguf_new_metadata_entrypoint"
+gguf-convert-endian = "gguf.scripts:gguf_convert_endian_entrypoint"
+gguf-dump = "gguf.scripts:gguf_dump_entrypoint"
+gguf-set-metadata = "gguf.scripts:gguf_set_metadata_entrypoint"
+gguf-new-metadata = "gguf.scripts:gguf_new_metadata_entrypoint"
@@ -20,11 +20,11 @@ struct llama_sampler_deleter {
     void operator()(llama_sampler * sampler) { llama_sampler_free(sampler); }
 };

-struct llama_lora_adapter_deleter {
-    void operator()(llama_lora_adapter * lora_adapter) { llama_lora_adapter_free(lora_adapter); }
+struct llama_adapter_lora_deleter {
+    void operator()(llama_adapter_lora * adapter) { llama_adapter_lora_free(adapter); }
 };

 typedef std::unique_ptr<llama_model, llama_model_deleter> llama_model_ptr;
 typedef std::unique_ptr<llama_context, llama_context_deleter> llama_context_ptr;
 typedef std::unique_ptr<llama_sampler, llama_sampler_deleter> llama_sampler_ptr;
-typedef std::unique_ptr<llama_lora_adapter, llama_lora_adapter_deleter> llama_lora_adapter_ptr;
+typedef std::unique_ptr<llama_adapter_lora, llama_adapter_lora_deleter> llama_adapter_lora_ptr;
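For orientation, here is a minimal usage sketch, not part of this commit, of how the renamed smart-pointer alias is meant to be used. The helper function is hypothetical, the header name is assumed to be llama-cpp.h as in the upstream tree, and `llama_adapter_lora_init` / `llama_set_adapter_lora` are the C functions declared further down in this diff:

```cpp
#include "llama-cpp.h"  // assumed: the header that defines the *_ptr aliases shown above

// Hypothetical helper: wrap a freshly loaded adapter so that
// llama_adapter_lora_free() is invoked automatically by the deleter.
static bool try_lora(llama_model * model, const char * path_lora) {
    llama_adapter_lora_ptr adapter(llama_adapter_lora_init(model, path_lora));
    if (!adapter) {
        return false;   // loading failed, nothing to free
    }
    // ... use adapter.get() with llama_set_adapter_lora(ctx, adapter.get(), scale) ...
    return true;        // adapter is released here when the unique_ptr goes out of scope
}
```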
include/llama.h (175 changed lines)

@@ -56,7 +56,7 @@ extern "C" {
     // TODO: show sample usage
     //

-    // struct llama_vocab; // TODO: add in the future
+    struct llama_vocab;
     struct llama_model;
     struct llama_context;
     struct llama_sampler;
@@ -385,8 +385,7 @@ extern "C" {
     } llama_chat_message;

     // lora adapter
-    // TODO: rename to llama_adapter_lora
-    struct llama_lora_adapter;
+    struct llama_adapter_lora;

     // Helpers for getting default parameters
     // TODO: update API to start accepting pointers to params structs (https://github.com/ggerganov/llama.cpp/discussions/9172)
@@ -400,18 +399,19 @@ extern "C" {
     // Call once at the start of the program
     LLAMA_API void llama_backend_init(void);

+    // Call once at the end of the program - currently only used for MPI
+    LLAMA_API void llama_backend_free(void);
+
     //optional:
     LLAMA_API void llama_numa_init(enum ggml_numa_strategy numa);

     // Optional: an auto threadpool gets created in ggml if not passed explicitly
     LLAMA_API void llama_attach_threadpool(
             struct llama_context * ctx,
             ggml_threadpool_t threadpool,
             ggml_threadpool_t threadpool_batch);
-    LLAMA_API void llama_detach_threadpool(struct llama_context * ctx);

-    // Call once at the end of the program - currently only used for MPI
-    LLAMA_API void llama_backend_free(void);
+    LLAMA_API void llama_detach_threadpool(struct llama_context * ctx);

     DEPRECATED(LLAMA_API struct llama_model * llama_load_model_from_file(
                              const char * path_model,
@@ -427,11 +427,15 @@ extern "C" {

     LLAMA_API void llama_model_free(struct llama_model * model);

-    // TODO: rename to llama_init_from_model
-    LLAMA_API struct llama_context * llama_new_context_with_model(
+    LLAMA_API struct llama_context * llama_init_from_model(
                      struct llama_model * model,
             struct llama_context_params params);

+    DEPRECATED(LLAMA_API struct llama_context * llama_new_context_with_model(
+                     struct llama_model * model,
+            struct llama_context_params params),
+        "use llama_init_from_model instead");
+
     // Frees all allocated memory
     LLAMA_API void llama_free(struct llama_context * ctx);

@@ -449,20 +453,30 @@ extern "C" {
     LLAMA_API uint32_t llama_n_ubatch  (const struct llama_context * ctx);
     LLAMA_API uint32_t llama_n_seq_max (const struct llama_context * ctx);

-    LLAMA_API int32_t llama_n_vocab    (const struct llama_model * model);
-    LLAMA_API int32_t llama_n_ctx_train(const struct llama_model * model);
-    LLAMA_API int32_t llama_n_embd     (const struct llama_model * model);
-    LLAMA_API int32_t llama_n_layer    (const struct llama_model * model);
-    LLAMA_API int32_t llama_n_head     (const struct llama_model * model);
+    DEPRECATED(LLAMA_API int32_t llama_n_ctx_train(const struct llama_model * model), "use llama_model_n_ctx_train instead");
+    DEPRECATED(LLAMA_API int32_t llama_n_embd     (const struct llama_model * model), "use llama_model_n_embd instead");
+    DEPRECATED(LLAMA_API int32_t llama_n_layer    (const struct llama_model * model), "use llama_model_n_layer instead");
+    DEPRECATED(LLAMA_API int32_t llama_n_head     (const struct llama_model * model), "use llama_model_n_head instead");

-    LLAMA_API const struct llama_model * llama_get_model(const struct llama_context * ctx);
+    DEPRECATED(LLAMA_API int32_t llama_n_vocab    (const struct llama_vocab * vocab), "use llama_vocab_n_tokens instead");

-    LLAMA_API enum llama_pooling_type llama_pooling_type(const struct llama_context * ctx);
-    LLAMA_API enum llama_vocab_type llama_vocab_type (const struct llama_model * model);
-    LLAMA_API enum llama_rope_type llama_rope_type (const struct llama_model * model);
+    LLAMA_API const struct llama_model * llama_get_model   (const struct llama_context * ctx);
+    LLAMA_API enum llama_pooling_type    llama_pooling_type(const struct llama_context * ctx);
+
+    LLAMA_API const struct llama_vocab * llama_model_get_vocab(const struct llama_model * model);
+    LLAMA_API enum llama_rope_type       llama_model_rope_type(const struct llama_model * model);
+
+    LLAMA_API int32_t llama_model_n_ctx_train(const struct llama_model * model);
+    LLAMA_API int32_t llama_model_n_embd     (const struct llama_model * model);
+    LLAMA_API int32_t llama_model_n_layer    (const struct llama_model * model);
+    LLAMA_API int32_t llama_model_n_head     (const struct llama_model * model);

     // Get the model's RoPE frequency scaling factor
-    LLAMA_API float llama_rope_freq_scale_train(const struct llama_model * model);
+    LLAMA_API float llama_model_rope_freq_scale_train(const struct llama_model * model);
+
+    LLAMA_API enum llama_vocab_type llama_vocab_type(const struct llama_vocab * vocab);
+
+    LLAMA_API int32_t llama_vocab_n_tokens(const struct llama_vocab * vocab);

     // Functions to access the model's GGUF metadata scalar values
     // - The functions return the length of the string on success, or -1 on failure
@@ -488,6 +502,9 @@ extern "C" {
     // Returns the total size of all the tensors in the model in bytes
     LLAMA_API uint64_t llama_model_size(const struct llama_model * model);

+    // Get the default chat template. Returns nullptr if not available
+    LLAMA_API const char * llama_model_chat_template(const struct llama_model * model);
+
     // Returns the total number of parameters in the model
     LLAMA_API uint64_t llama_model_n_params(const struct llama_model * model);

@@ -515,34 +532,31 @@ extern "C" {
     //

     // Load a LoRA adapter from file
-    // TODO: rename to llama_adapter_lora_init
-    LLAMA_API struct llama_lora_adapter * llama_lora_adapter_init(
+    LLAMA_API struct llama_adapter_lora * llama_adapter_lora_init(
             struct llama_model * model,
             const char * path_lora);

+    // Manually free a LoRA adapter
+    // Note: loaded adapters will be free when the associated model is deleted
+    LLAMA_API void llama_adapter_lora_free(struct llama_adapter_lora * adapter);
+
+    // The following functions operate on a llama_context, hence the naming: llama_verb_...
+
     // Add a loaded LoRA adapter to given context
     // This will not modify model's weight
-    // TODO: rename to llama_set_adapter_lora
-    LLAMA_API int32_t llama_lora_adapter_set(
+    LLAMA_API int32_t llama_set_adapter_lora(
             struct llama_context * ctx,
-            struct llama_lora_adapter * adapter,
+            struct llama_adapter_lora * adapter,
             float scale);

     // Remove a specific LoRA adapter from given context
     // Return -1 if the adapter is not present in the context
-    // TODO: rename to llama_rm_adapter_lora
-    LLAMA_API int32_t llama_lora_adapter_remove(
+    LLAMA_API int32_t llama_rm_adapter_lora(
             struct llama_context * ctx,
-            struct llama_lora_adapter * adapter);
+            struct llama_adapter_lora * adapter);

     // Remove all LoRA adapters from given context
-    // TODO: rename to llama_clear_adapter_lora
-    LLAMA_API void llama_lora_adapter_clear(struct llama_context * ctx);
-
-    // Manually free a LoRA adapter
-    // Note: loaded adapters will be free when the associated model is deleted
-    // TODO: rename to llama_adapter_lora_free
-    LLAMA_API void llama_lora_adapter_free(struct llama_lora_adapter * adapter);
+    LLAMA_API void llama_clear_adapter_lora(struct llama_context * ctx);

     // Apply a loaded control vector to a llama_context, or if data is NULL, clear
     // the currently loaded vector.
@@ -550,9 +564,8 @@ extern "C" {
     // to an n_embd x n_layers buffer starting from layer 1.
     // il_start and il_end are the layer range the vector should apply to (both inclusive)
    // See llama_control_vector_load in common to load a control vector.
-    // TODO: rename to llama_adapter_cvec_apply
-    LLAMA_API int32_t llama_control_vector_apply(
-            struct llama_context * lctx,
+    LLAMA_API int32_t llama_apply_adapter_cvec(
+            struct llama_context * ctx,
             const float * data,
             size_t len,
             int32_t n_embd,
@@ -908,41 +921,60 @@ extern "C" {
     // Vocab
     //

-    LLAMA_API const char * llama_token_get_text(const struct llama_model * model, llama_token token);
+    LLAMA_API const char * llama_vocab_get_text(const struct llama_vocab * vocab, llama_token token);

-    LLAMA_API float llama_token_get_score(const struct llama_model * model, llama_token token);
+    LLAMA_API float llama_vocab_get_score(const struct llama_vocab * vocab, llama_token token);

-    LLAMA_API enum llama_token_attr llama_token_get_attr(const struct llama_model * model, llama_token token);
+    LLAMA_API enum llama_token_attr llama_vocab_get_attr(const struct llama_vocab * vocab, llama_token token);

     // Check if the token is supposed to end generation (end-of-generation, eg. EOS, EOT, etc.)
-    LLAMA_API bool llama_token_is_eog(const struct llama_model * model, llama_token token);
+    LLAMA_API bool llama_vocab_is_eog(const struct llama_vocab * vocab, llama_token token);

     // Identify if Token Id is a control token or a render-able token
-    LLAMA_API bool llama_token_is_control(const struct llama_model * model, llama_token token);
+    LLAMA_API bool llama_vocab_is_control(const struct llama_vocab * vocab, llama_token token);

     // Special tokens
-    LLAMA_API llama_token llama_token_bos(const struct llama_model * model); // beginning-of-sentence
-    LLAMA_API llama_token llama_token_eos(const struct llama_model * model); // end-of-sentence
-    LLAMA_API llama_token llama_token_eot(const struct llama_model * model); // end-of-turn
-    LLAMA_API llama_token llama_token_cls(const struct llama_model * model); // classification
-    LLAMA_API llama_token llama_token_sep(const struct llama_model * model); // sentence separator
-    LLAMA_API llama_token llama_token_nl (const struct llama_model * model); // next-line
-    LLAMA_API llama_token llama_token_pad(const struct llama_model * model); // padding
+    LLAMA_API llama_token llama_vocab_bos(const struct llama_vocab * vocab); // beginning-of-sentence
+    LLAMA_API llama_token llama_vocab_eos(const struct llama_vocab * vocab); // end-of-sentence
+    LLAMA_API llama_token llama_vocab_eot(const struct llama_vocab * vocab); // end-of-turn
+    LLAMA_API llama_token llama_vocab_sep(const struct llama_vocab * vocab); // sentence separator
+    LLAMA_API llama_token llama_vocab_nl (const struct llama_vocab * vocab); // next-line
+    LLAMA_API llama_token llama_vocab_pad(const struct llama_vocab * vocab); // padding

-    LLAMA_API bool llama_add_bos_token(const struct llama_model * model);
-    LLAMA_API bool llama_add_eos_token(const struct llama_model * model);
+    LLAMA_API bool llama_vocab_get_add_bos(const struct llama_vocab * vocab);
+    LLAMA_API bool llama_vocab_get_add_eos(const struct llama_vocab * vocab);

-    // infill tokens
-    DEPRECATED(LLAMA_API llama_token llama_token_prefix(const struct llama_model * model), "use llama_token_fim_pre instead");
-    DEPRECATED(LLAMA_API llama_token llama_token_middle(const struct llama_model * model), "use llama_token_fim_mid instead");
-    DEPRECATED(LLAMA_API llama_token llama_token_suffix(const struct llama_model * model), "use llama_token_fim_suf instead");
+    LLAMA_API llama_token llama_vocab_fim_pre(const struct llama_vocab * vocab);
+    LLAMA_API llama_token llama_vocab_fim_suf(const struct llama_vocab * vocab);
+    LLAMA_API llama_token llama_vocab_fim_mid(const struct llama_vocab * vocab);
+    LLAMA_API llama_token llama_vocab_fim_pad(const struct llama_vocab * vocab);
+    LLAMA_API llama_token llama_vocab_fim_rep(const struct llama_vocab * vocab);
+    LLAMA_API llama_token llama_vocab_fim_sep(const struct llama_vocab * vocab);

-    LLAMA_API llama_token llama_token_fim_pre(const struct llama_model * model);
-    LLAMA_API llama_token llama_token_fim_suf(const struct llama_model * model);
-    LLAMA_API llama_token llama_token_fim_mid(const struct llama_model * model);
-    LLAMA_API llama_token llama_token_fim_pad(const struct llama_model * model);
-    LLAMA_API llama_token llama_token_fim_rep(const struct llama_model * model);
-    LLAMA_API llama_token llama_token_fim_sep(const struct llama_model * model);
+    DEPRECATED(LLAMA_API const char * llama_token_get_text(const struct llama_vocab * vocab, llama_token token), "use llama_vocabable_get_text instead");
+    DEPRECATED(LLAMA_API float llama_token_get_score(const struct llama_vocab * vocab, llama_token token), "use llama_vocab_get_score instead");
+    DEPRECATED(LLAMA_API enum llama_token_attr llama_token_get_attr(const struct llama_vocab * vocab, llama_token token), "use llama_vocab_get_attr instead");
+    DEPRECATED(LLAMA_API bool llama_token_is_eog(const struct llama_vocab * vocab, llama_token token), "use llama_vocab_is_eog instead");
+    DEPRECATED(LLAMA_API bool llama_token_is_control(const struct llama_vocab * vocab, llama_token token), "use llama_vocab_is_control instead");
+    DEPRECATED(LLAMA_API llama_token llama_token_bos(const struct llama_vocab * vocab), "use llama_vocab_bos instead");
+    DEPRECATED(LLAMA_API llama_token llama_token_eos(const struct llama_vocab * vocab), "use llama_vocab_eos instead");
+    DEPRECATED(LLAMA_API llama_token llama_token_eot(const struct llama_vocab * vocab), "use llama_vocab_eot instead");
+    DEPRECATED(LLAMA_API llama_token llama_token_cls(const struct llama_vocab * vocab), "use llama_vocab_cls instead");
+    DEPRECATED(LLAMA_API llama_token llama_token_sep(const struct llama_vocab * vocab), "use llama_vocab_sep instead");
+    DEPRECATED(LLAMA_API llama_token llama_token_nl (const struct llama_vocab * vocab), "use llama_vocab_nl instead");
+    DEPRECATED(LLAMA_API llama_token llama_token_pad(const struct llama_vocab * vocab), "use llama_vocab_pad instead");
+    DEPRECATED(LLAMA_API bool llama_add_bos_token(const struct llama_vocab * vocab), "use llama_vocab_get_add_bos instead");
+    DEPRECATED(LLAMA_API bool llama_add_eos_token(const struct llama_vocab * vocab), "use llama_vocab_get_add_eos instead");
+    DEPRECATED(LLAMA_API llama_token llama_token_fim_pre(const struct llama_vocab * vocab), "use llama_vocab_fim_pre instead");
+    DEPRECATED(LLAMA_API llama_token llama_token_fim_suf(const struct llama_vocab * vocab), "use llama_vocab_fim_suf instead");
+    DEPRECATED(LLAMA_API llama_token llama_token_fim_mid(const struct llama_vocab * vocab), "use llama_vocab_fim_mid instead");
+    DEPRECATED(LLAMA_API llama_token llama_token_fim_pad(const struct llama_vocab * vocab), "use llama_vocab_fim_pad instead");
+    DEPRECATED(LLAMA_API llama_token llama_token_fim_rep(const struct llama_vocab * vocab), "use llama_vocab_fim_rep instead");
+    DEPRECATED(LLAMA_API llama_token llama_token_fim_sep(const struct llama_vocab * vocab), "use llama_vocab_fim_sep instead");
+
+    // CLS is equivalent to BOS
+    DEPRECATED(LLAMA_API llama_token llama_vocab_cls(const struct llama_vocab * vocab), // classification
+            "use llama_vocab_bos instead");

     //
     // Tokenization
@@ -958,7 +990,7 @@ extern "C" {
     /// @param parse_special Allow tokenizing special and/or control tokens which otherwise are not exposed and treated
     ///                      as plaintext. Does not insert a leading space.
     LLAMA_API int32_t llama_tokenize(
-        const struct llama_model * model,
+        const struct llama_vocab * vocab,
                       const char * text,
                          int32_t   text_len,
                      llama_token * tokens,
@@ -972,7 +1004,7 @@ extern "C" {
    // User can skip up to 'lstrip' leading spaces before copying (useful when encoding/decoding multiple tokens with 'add_space_prefix')
    // @param special If true, special tokens are rendered in the output.
    LLAMA_API int32_t llama_token_to_piece(
-              const struct llama_model * model,
+              const struct llama_vocab * vocab,
                            llama_token   token,
                                   char * buf,
                                int32_t   length,
@@ -986,7 +1018,7 @@ extern "C" {
    /// @param remove_special Allow to remove BOS and EOS tokens if model is configured to do so.
    /// @param unparse_special If true, special tokens are rendered in the output.
    LLAMA_API int32_t llama_detokenize(
-        const struct llama_model * model,
+        const struct llama_vocab * vocab,
               const llama_token * tokens,
                         int32_t   n_tokens,
                            char * text,
@@ -1009,7 +1041,6 @@ extern "C" {
    /// @param length The size of the allocated buffer
    /// @return The total number of bytes of the formatted prompt. If is it larger than the size of buffer, you may need to re-alloc it and then re-apply the template.
    LLAMA_API int32_t llama_chat_apply_template(
-              const struct llama_model * model,
                            const char * tmpl,
       const struct llama_chat_message * chat,
                                 size_t   n_msg,
@@ -1057,7 +1088,6 @@ extern "C" {
    //    llama_sampler_free(smpl);
    //
    // TODO: In the future, llama_sampler will be utilized to offload the sampling to the backends (e.g. GPU).
-    // TODO: in the future, the entire sampling API that uses llama_model should start using llama_vocab
    //

    typedef void * llama_sampler_context_t;
@@ -1160,7 +1190,7 @@ extern "C" {
                             float   eta);

    LLAMA_API struct llama_sampler * llama_sampler_init_grammar(
-            const struct llama_model * model,
+            const struct llama_vocab * vocab,
                            const char * grammar_str,
                            const char * grammar_root);

@@ -1172,8 +1202,9 @@ extern "C" {
                             float   penalty_present);   // 0.0 = disabled

    /// @details DRY sampler, designed by p-e-w, as described in: https://github.com/oobabooga/text-generation-webui/pull/5677, porting Koboldcpp implementation authored by pi6am: https://github.com/LostRuins/koboldcpp/pull/982
    LLAMA_API struct llama_sampler * llama_sampler_init_dry(
-            const struct llama_model * model,
+            const struct llama_vocab * vocab,
+                             int32_t   n_ctx_train,
                               float   dry_multiplier,
                               float   dry_base,
                             int32_t   dry_allowed_length,
@@ -1207,7 +1238,7 @@ extern "C" {
    // 3. discard non-EOG tokens with low prob
    // 4. if no tokens are left -> pick EOT
    //
-    LLAMA_API struct llama_sampler * llama_sampler_init_infill(const struct llama_model * model);
+    LLAMA_API struct llama_sampler * llama_sampler_init_infill(const struct llama_vocab * vocab);

    // Returns the seed used by the sampler if applicable, LLAMA_DEFAULT_SEED otherwise
    LLAMA_API uint32_t llama_sampler_get_seed(const struct llama_sampler * smpl);
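Taken together, the hunks above move vocabulary and hyperparameter queries from `llama_model` to the new `llama_model_*` / `llama_vocab_*` entry points and deprecate the old names. Below is a rough migration sketch, not part of this commit; the helper function and the `printf` output are illustrative assumptions, everything else uses only functions declared in the diff above:

```cpp
#include <cstdio>
#include "llama.h"

// Hypothetical call site, assuming `model` was loaded elsewhere.
static void print_model_info(llama_model * model) {
    // vocab queries now go through the llama_vocab object
    const llama_vocab * vocab = llama_model_get_vocab(model);

    const int32_t n_embd  = llama_model_n_embd(model);    // was: llama_n_embd(model)
    const int32_t n_vocab = llama_vocab_n_tokens(vocab);  // was: llama_n_vocab(model)
    const llama_token bos = llama_vocab_bos(vocab);       // was: llama_token_bos(model)

    printf("n_embd = %d, n_vocab = %d, bos = %d\n", n_embd, n_vocab, bos);

    // was: llama_new_context_with_model(model, params), now deprecated
    llama_context * ctx = llama_init_from_model(model, llama_context_default_params());
    llama_free(ctx);
}
```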
Binary file not shown (before: 195 KiB image).
@@ -1,5 +1,7 @@
 #include "llama-adapter.h"

+#include "llama-impl.h"
+#include "llama-mmap.h"
 #include "llama-model.h"

 #include <algorithm>
@@ -9,7 +11,7 @@

 // vec

-struct ggml_tensor * llama_control_vector::tensor_for(int il) const {
+struct ggml_tensor * llama_adapter_cvec::tensor_for(int il) const {
     if (il < 0 || il < layer_start || il > layer_end || (size_t) il >= tensors.size()) {
         return nullptr;
     }
@@ -17,7 +19,7 @@ struct ggml_tensor * llama_control_vector::tensor_for(int il) const {
     return tensors[il];
 }

-struct ggml_tensor * llama_control_vector::apply_to(struct ggml_context * ctx, struct ggml_tensor * cur, int il) const {
+struct ggml_tensor * llama_adapter_cvec::apply_to(struct ggml_context * ctx, struct ggml_tensor * cur, int il) const {
     ggml_tensor * layer_dir = tensor_for(il);
     if (layer_dir != nullptr) {
         cur = ggml_add(ctx, cur, layer_dir);
@@ -26,12 +28,12 @@ struct ggml_tensor * llama_control_vector::apply_to(struct ggml_context * ctx, s
     return cur;
 }

-static bool llama_control_vector_init(struct llama_control_vector & cvec, const llama_model & model) {
+bool llama_adapter_cvec::init(const llama_model & model) {
     const auto & hparams = model.hparams;

-    GGML_ASSERT(cvec.tensors.empty());
-    GGML_ASSERT(cvec.ctxs.empty());
-    GGML_ASSERT(cvec.bufs.empty());
+    GGML_ASSERT(tensors.empty());
+    GGML_ASSERT(ctxs.empty());
+    GGML_ASSERT(bufs.empty());

     // create a context for each buffer type
     std::map<ggml_backend_buffer_type_t, ggml_context *> ctx_map;
@@ -50,7 +52,7 @@ static bool llama_control_vector_init(struct llama_control_vector & cvec, const
            }

            ctx_map[buft] = ctx;
-            cvec.ctxs.emplace_back(ctx);
+            ctxs.emplace_back(ctx);

            return ctx;
        }
@@ -59,21 +61,21 @@ static bool llama_control_vector_init(struct llama_control_vector & cvec, const
    };

    // make tensors
-    cvec.tensors.reserve(hparams.n_layer);
-    cvec.tensors.push_back(nullptr); // there's never a tensor for layer 0
+    tensors.reserve(hparams.n_layer);
+    tensors.push_back(nullptr); // there's never a tensor for layer 0
    for (size_t il = 1; il < hparams.n_layer; il++) {
-        ggml_backend_buffer_type_t buft = llama_model_select_buft(model, il);
+        ggml_backend_buffer_type_t buft = model.select_buft(il);
        ggml_context * ctx = ctx_for_buft(buft);
        if (!ctx) {
            LLAMA_LOG_ERROR("%s: failed to allocate context for control vector\n", __func__);
            return false;
        }
        ggml_tensor * tensor = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, hparams.n_embd);
-        cvec.tensors.push_back(tensor);
+        tensors.push_back(tensor);
    }

    // allocate tensors / buffers and zero
-    cvec.bufs.reserve(ctx_map.size());
+    bufs.reserve(ctx_map.size());
    for (auto it : ctx_map) {
        ggml_backend_buffer_type_t buft = it.first;
        ggml_context * ctx              = it.second;
@@ -83,14 +85,13 @@ static bool llama_control_vector_init(struct llama_control_vector & cvec, const
            return false;
        }
        ggml_backend_buffer_clear(buf, 0);
-        cvec.bufs.emplace_back(buf);
+        bufs.emplace_back(buf);
    }

    return true;
 }

-int32_t llama_control_vector_apply(
-        struct llama_control_vector & cvec,
+int32_t llama_adapter_cvec::apply(
        const llama_model & model,
        const float * data,
        size_t len,
@@ -101,8 +102,8 @@ int32_t llama_control_vector_apply(

    if (data == nullptr) {
        // disable the current control vector (but leave allocated for later)
-        cvec.layer_start = -1;
-        cvec.layer_end   = -1;
+        layer_start = -1;
+        layer_end   = -1;
        return 0;
    }

@@ -111,21 +112,21 @@ int32_t llama_control_vector_apply(
        return 1;
    }

-    if (cvec.tensors.empty()) {
-        if (!llama_control_vector_init(cvec, model)) {
+    if (tensors.empty()) {
+        if (!init(model)) {
            return 1;
        }
    }

-    cvec.layer_start = il_start;
-    cvec.layer_end   = il_end;
+    layer_start = il_start;
+    layer_end   = il_end;

    for (size_t il = 1; il < hparams.n_layer; il++) {
-        assert(cvec.tensors[il] != nullptr);
+        assert(tensors[il] != nullptr);

        const size_t off = n_embd * (il - 1); // buffer doesn't have data for layer 0, since it's never present
        if (off + n_embd <= len) {
-            ggml_backend_tensor_set(cvec.tensors[il], data + off, 0, n_embd * ggml_element_size(cvec.tensors[il]));
+            ggml_backend_tensor_set(tensors[il], data + off, 0, n_embd * ggml_element_size(tensors[il]));
        }
    }

@@ -134,7 +135,7 @@ int32_t llama_control_vector_apply(

 // lora

-llama_lora_weight * llama_lora_adapter::get_weight(struct ggml_tensor * w) {
+llama_adapter_lora_weight * llama_adapter_lora::get_weight(struct ggml_tensor * w) {
    const std::string name(w->name);

    const auto pos = ab_map.find(name);
@@ -145,11 +146,7 @@ llama_lora_weight * llama_lora_adapter::get_weight(struct ggml_tensor * w) {
    return nullptr;
 }

-void llama_lora_adapter_free(struct llama_lora_adapter * adapter) {
-    delete adapter;
-}
-
-static void llama_lora_adapter_init_impl(struct llama_model & model, const char * path_lora, struct llama_lora_adapter & adapter) {
+static void llama_adapter_lora_init_impl(struct llama_model & model, const char * path_lora, struct llama_adapter_lora & adapter) {
    LLAMA_LOG_INFO("%s: loading lora adapter from '%s' ...\n", __func__, path_lora);

    ggml_context * ctx_init;
@@ -221,7 +218,7 @@ static void llama_lora_adapter_init_impl(struct llama_model & model, const char
    };

    // bundle lora_a and lora_b into pairs
-    std::map<std::string, llama_lora_weight> ab_map;
+    std::map<std::string, llama_adapter_lora_weight> ab_map;
    auto str_endswith = [](const std::string & str, const std::string & suffix) {
        return str.size() >= suffix.size() && str.compare(str.size()-suffix.size(), suffix.size(), suffix) == 0;
    };
@@ -231,14 +228,14 @@ static void llama_lora_adapter_init_impl(struct llama_model & model, const char
        if (str_endswith(name, ".lora_a")) {
            replace_all(name, ".lora_a", "");
            if (ab_map.find(name) == ab_map.end()) {
-                ab_map[name] = llama_lora_weight(cur, nullptr);
+                ab_map[name] = llama_adapter_lora_weight(cur, nullptr);
            } else {
                ab_map[name].a = cur;
            }
        } else if (str_endswith(name, ".lora_b")) {
            replace_all(name, ".lora_b", "");
            if (ab_map.find(name) == ab_map.end()) {
-                ab_map[name] = llama_lora_weight(nullptr, cur);
+                ab_map[name] = llama_adapter_lora_weight(nullptr, cur);
            } else {
                ab_map[name].b = cur;
            }
@@ -254,7 +251,7 @@ static void llama_lora_adapter_init_impl(struct llama_model & model, const char
    // add tensors
    for (auto & it : ab_map) {
        const std::string & name = it.first;
-        llama_lora_weight & w = it.second;
+        llama_adapter_lora_weight & w = it.second;
        bool is_token_embd = str_endswith(name, "token_embd.weight");

        if (!w.a || !w.b) {
@@ -262,7 +259,7 @@ static void llama_lora_adapter_init_impl(struct llama_model & model, const char
        }

        // device buft and device ctx
-        auto * model_tensor = llama_model_get_tensor(model, name.c_str());
+        const auto * model_tensor = model.get_tensor(name.c_str());
        if (!model_tensor) {
            throw std::runtime_error("LoRA tensor '" + name + "' does not exist in base model (hint: maybe wrong base model?)");
        }
@@ -288,7 +285,7 @@ static void llama_lora_adapter_init_impl(struct llama_model & model, const char
        struct ggml_tensor * tensor_b = ggml_dup_tensor(dev_ctx, w.b);
        ggml_set_name(tensor_a, w.a->name);
        ggml_set_name(tensor_b, w.b->name);
-        adapter.ab_map[name] = llama_lora_weight(tensor_a, tensor_b);
+        adapter.ab_map[name] = llama_adapter_lora_weight(tensor_a, tensor_b);
    }

    // allocate tensors / buffers and zero
@@ -330,11 +327,11 @@ static void llama_lora_adapter_init_impl(struct llama_model & model, const char
    LLAMA_LOG_INFO("%s: loaded %zu tensors from lora file\n", __func__, adapter.ab_map.size()*2);
 }

-struct llama_lora_adapter * llama_lora_adapter_init(struct llama_model * model, const char * path_lora) {
-    struct llama_lora_adapter * adapter = new llama_lora_adapter();
+struct llama_adapter_lora * llama_adapter_lora_init(struct llama_model * model, const char * path_lora) {
+    struct llama_adapter_lora * adapter = new llama_adapter_lora();

    try {
-        llama_lora_adapter_init_impl(*model, path_lora, *adapter);
+        llama_adapter_lora_init_impl(*model, path_lora, *adapter);
        return adapter;
    } catch (const std::exception & err) {
        LLAMA_LOG_ERROR("%s: failed to apply lora adapter: %s\n", __func__, err.what());
@@ -344,3 +341,7 @@ struct llama_lora_adapter * llama_lora_adapter_init(struct llama_model * model,

    return nullptr;
 }
+
+void llama_adapter_lora_free(struct llama_adapter_lora * adapter) {
+    delete adapter;
+}
@@ -1,73 +1,74 @@
 #pragma once

-#include "llama-impl.h"
-#include "llama-hparams.h"
+#include "llama.h"

 #include "ggml-cpp.h"

+#include <string>
 #include <unordered_map>
 #include <vector>

+// TODO: pimpl
+
 //
 // llama_adapter_cvec
 //

-// TODO: rename to llama_adapter_cvec
-struct llama_control_vector {
-    std::vector<ggml_context_ptr>        ctxs;
-    std::vector<ggml_backend_buffer_ptr> bufs;
+struct llama_adapter_cvec {
+    struct ggml_tensor * tensor_for(int il) const;

-    std::vector<struct ggml_tensor *> tensors; // per layer
+    struct ggml_tensor * apply_to(struct ggml_context * ctx, struct ggml_tensor * cur, int il) const;

+    int32_t apply(
+            const llama_model & model,
+            const float * data,
+            size_t len,
+            int32_t n_embd,
+            int32_t il_start,
+            int32_t il_end);
+
+private:
+    bool init(const llama_model & model);
+
     int32_t layer_start = -1;
     int32_t layer_end   = -1;

-    struct ggml_tensor * tensor_for(int il) const;
+    std::vector<ggml_context_ptr>        ctxs;
+    std::vector<ggml_backend_buffer_ptr> bufs;

-    struct ggml_tensor * apply_to(struct ggml_context * ctx, struct ggml_tensor * cur, int il) const;
+    std::vector<struct ggml_tensor *> tensors; // per layer
 };

-int32_t llama_control_vector_apply(
-        struct llama_control_vector & cvec,
-        const llama_model & model,
-        const float * data,
-        size_t len,
-        int32_t n_embd,
-        int32_t il_start,
-        int32_t il_end);
-
 //
 // llama_adapter_lora
 //

-// TODO: rename to llama_adapter_lora_weight
-struct llama_lora_weight {
+struct llama_adapter_lora_weight {
     struct ggml_tensor * a = nullptr;
     struct ggml_tensor * b = nullptr;

     // get actual scale based on rank and alpha
-    float get_scale(float alpha, float adapter_scale) {
+    float get_scale(float alpha, float adapter_scale) const {
         const float rank  = (float) b->ne[0];
         const float scale = alpha ? adapter_scale * alpha / rank : adapter_scale;
         return scale;
     }

-    llama_lora_weight() = default;
-    llama_lora_weight(struct ggml_tensor * a, struct ggml_tensor * b) : a(a), b(b) {}
+    llama_adapter_lora_weight() = default;
+    llama_adapter_lora_weight(struct ggml_tensor * a, struct ggml_tensor * b) : a(a), b(b) {}
 };

-// TODO: rename to llama_adapter_lora
-struct llama_lora_adapter {
+struct llama_adapter_lora {
     // map tensor name to lora_a_b
-    std::unordered_map<std::string, struct llama_lora_weight> ab_map;
+    std::unordered_map<std::string, struct llama_adapter_lora_weight> ab_map;

     std::vector<ggml_context_ptr>        ctxs;
     std::vector<ggml_backend_buffer_ptr> bufs;

     float alpha;

-    llama_lora_adapter() = default;
-    ~llama_lora_adapter() = default;
+    llama_adapter_lora() = default;
+    ~llama_adapter_lora() = default;

-    llama_lora_weight * get_weight(struct ggml_tensor * w);
+    llama_adapter_lora_weight * get_weight(struct ggml_tensor * w);
 };
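The two files above complete the `llama_lora_adapter` to `llama_adapter_lora` rename on the implementation side. As a usage sketch under the new names, illustrative only, the helper function is a hypothetical wrapper around the public C API declared earlier in this diff:

```cpp
#include "llama.h"

// Hypothetical helper: load a LoRA adapter and attach it to an existing context.
// Attaching does not modify the model weights, and a loaded adapter is also
// freed automatically when the associated model is deleted.
static llama_adapter_lora * attach_lora(llama_context * ctx, llama_model * model,
                                        const char * path_lora, float scale) {
    llama_adapter_lora * adapter = llama_adapter_lora_init(model, path_lora);
    if (adapter == nullptr) {
        return nullptr;                    // loading failed (the loader logs the reason)
    }
    if (llama_set_adapter_lora(ctx, adapter, scale) != 0) {
        llama_adapter_lora_free(adapter);  // could not attach, release manually
        return nullptr;
    }
    return adapter;                        // later: llama_rm_adapter_lora(ctx, adapter)
}
```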
@@ -27,6 +27,7 @@ static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
     { LLM_ARCH_QWEN2VL,     "qwen2vl"   },
     { LLM_ARCH_PHI2,        "phi2"      },
     { LLM_ARCH_PHI3,        "phi3"      },
+    { LLM_ARCH_PHIMOE,      "phimoe"    },
     { LLM_ARCH_PLAMO,       "plamo"     },
     { LLM_ARCH_CODESHELL,   "codeshell" },
     { LLM_ARCH_ORION,       "orion"     },
@@ -56,6 +57,7 @@ static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
     { LLM_ARCH_NEMOTRON,    "nemotron"   },
     { LLM_ARCH_EXAONE,      "exaone"     },
     { LLM_ARCH_RWKV6,       "rwkv6"      },
+    { LLM_ARCH_RWKV6QWEN2,  "rwkv6qwen2" },
     { LLM_ARCH_GRANITE,     "granite"    },
     { LLM_ARCH_GRANITE_MOE, "granitemoe" },
     { LLM_ARCH_CHAMELEON,   "chameleon"  },
@@ -105,6 +107,7 @@ static const std::map<llm_kv, const char *> LLM_KV_NAMES = {
     { LLM_KV_TIME_DECAY_EXTRA_DIM,  "%s.time_decay_extra_dim"  },
     { LLM_KV_RESIDUAL_SCALE,        "%s.residual_scale"        },
     { LLM_KV_EMBEDDING_SCALE,       "%s.embedding_scale"       },
+    { LLM_KV_TOKEN_SHIFT_COUNT,     "%s.token_shift_count"     },

     { LLM_KV_ATTENTION_HEAD_COUNT,    "%s.attention.head_count"    },
     { LLM_KV_ATTENTION_HEAD_COUNT_KV, "%s.attention.head_count_kv" },
@@ -175,6 +178,7 @@ static const std::map<llm_kv, const char *> LLM_KV_NAMES = {
     { LLM_KV_TOKENIZER_PRECOMPILED_CHARSMAP, "tokenizer.ggml.precompiled_charsmap" },
     { LLM_KV_TOKENIZER_HF_JSON,              "tokenizer.huggingface.json"          },
     { LLM_KV_TOKENIZER_RWKV,                 "tokenizer.rwkv.world"                },
+    { LLM_KV_TOKENIZER_CHAT_TEMPLATE,        "tokenizer.chat_template"             },
     { LLM_KV_TOKENIZER_FIM_PRE_ID,           "tokenizer.ggml.fim_pre_token_id"     },
     { LLM_KV_TOKENIZER_FIM_SUF_ID,           "tokenizer.ggml.fim_suf_token_id"     },
     { LLM_KV_TOKENIZER_FIM_MID_ID,           "tokenizer.ggml.fim_mid_token_id"     },
@@ -584,6 +588,27 @@ static const std::map<llm_arch, std::map<llm_tensor, const char *>> LLM_TENSOR_N
             { LLM_TENSOR_FFN_UP, "blk.%d.ffn_up" },
         },
     },
+    {
+        LLM_ARCH_PHIMOE,
+        {
+            { LLM_TENSOR_TOKEN_EMBD,         "token_embd" },
+            { LLM_TENSOR_OUTPUT_NORM,        "output_norm" },
+            { LLM_TENSOR_OUTPUT,             "output" },
+            { LLM_TENSOR_ROPE_FACTORS_LONG,  "rope_factors_long" },
+            { LLM_TENSOR_ROPE_FACTORS_SHORT, "rope_factors_short" },
+            { LLM_TENSOR_ATTN_NORM,          "blk.%d.attn_norm" },
+            { LLM_TENSOR_ATTN_QKV,           "blk.%d.attn_qkv" },
+            { LLM_TENSOR_ATTN_Q,             "blk.%d.attn_q" },
+            { LLM_TENSOR_ATTN_K,             "blk.%d.attn_k" },
+            { LLM_TENSOR_ATTN_V,             "blk.%d.attn_v" },
+            { LLM_TENSOR_ATTN_OUT,           "blk.%d.attn_output" },
+            { LLM_TENSOR_FFN_NORM,           "blk.%d.ffn_norm" },
+            { LLM_TENSOR_FFN_GATE_INP,       "blk.%d.ffn_gate_inp" },
+            { LLM_TENSOR_FFN_GATE_EXPS,      "blk.%d.ffn_gate_exps" },
+            { LLM_TENSOR_FFN_DOWN_EXPS,      "blk.%d.ffn_down_exps" },
+            { LLM_TENSOR_FFN_UP_EXPS,        "blk.%d.ffn_up_exps" },
+        },
+    },
     {
         LLM_ARCH_PLAMO,
         {
@@ -1144,6 +1169,7 @@ static const std::map<llm_arch, std::map<llm_tensor, const char *>> LLM_TENSOR_N
             { LLM_TENSOR_TIME_MIX_LERP_V,     "blk.%d.time_mix_lerp_v" },
             { LLM_TENSOR_TIME_MIX_LERP_R,     "blk.%d.time_mix_lerp_r" },
             { LLM_TENSOR_TIME_MIX_LERP_G,     "blk.%d.time_mix_lerp_g" },
+            { LLM_TENSOR_TIME_MIX_LERP_FUSED, "blk.%d.time_mix_lerp_fused" },
             { LLM_TENSOR_TIME_MIX_FIRST,      "blk.%d.time_mix_first" },
             { LLM_TENSOR_TIME_MIX_DECAY,      "blk.%d.time_mix_decay" },
             { LLM_TENSOR_TIME_MIX_DECAY_W1,   "blk.%d.time_mix_decay_w1" },
@@ -1161,6 +1187,32 @@ static const std::map<llm_arch, std::map<llm_tensor, const char *>> LLM_TENSOR_N
             { LLM_TENSOR_CHANNEL_MIX_RECEPTANCE, "blk.%d.channel_mix_receptance" },
         },
     },
+    {
+        LLM_ARCH_RWKV6QWEN2,
+        {
+            { LLM_TENSOR_TOKEN_EMBD,          "token_embd" },
+            { LLM_TENSOR_OUTPUT_NORM,         "output_norm" },
+            { LLM_TENSOR_OUTPUT,              "output" },
+            { LLM_TENSOR_ATTN_NORM,           "blk.%d.attn_norm" },
+            { LLM_TENSOR_TIME_MIX_W1,         "blk.%d.time_mix_w1" },
+            { LLM_TENSOR_TIME_MIX_W2,         "blk.%d.time_mix_w2" },
+            { LLM_TENSOR_TIME_MIX_LERP_X,     "blk.%d.time_mix_lerp_x" },
+            { LLM_TENSOR_TIME_MIX_LERP_FUSED, "blk.%d.time_mix_lerp_fused" },
+            { LLM_TENSOR_TIME_MIX_FIRST,      "blk.%d.time_mix_first" },
+            { LLM_TENSOR_TIME_MIX_DECAY,      "blk.%d.time_mix_decay" },
+            { LLM_TENSOR_TIME_MIX_DECAY_W1,   "blk.%d.time_mix_decay_w1" },
+            { LLM_TENSOR_TIME_MIX_DECAY_W2,   "blk.%d.time_mix_decay_w2" },
+            { LLM_TENSOR_TIME_MIX_KEY,        "blk.%d.time_mix_key" },
+            { LLM_TENSOR_TIME_MIX_VALUE,      "blk.%d.time_mix_value" },
+            { LLM_TENSOR_TIME_MIX_RECEPTANCE, "blk.%d.time_mix_receptance" },
+            { LLM_TENSOR_TIME_MIX_GATE,       "blk.%d.time_mix_gate" },
+            { LLM_TENSOR_TIME_MIX_OUTPUT,     "blk.%d.time_mix_output" },
+            { LLM_TENSOR_FFN_NORM,            "blk.%d.ffn_norm" },
+            { LLM_TENSOR_FFN_GATE,            "blk.%d.ffn_gate" },
+            { LLM_TENSOR_FFN_DOWN,            "blk.%d.ffn_down" },
+            { LLM_TENSOR_FFN_UP,              "blk.%d.ffn_up" },
+        },
+    },
     {
         LLM_ARCH_GRANITE,
         {
@@ -1343,6 +1395,7 @@ static const std::map<llm_tensor, llm_tensor_info> LLM_TENSOR_INFOS = {
     {LLM_TENSOR_TIME_MIX_LERP_V,     {LLM_TENSOR_LAYER_REPEATING, GGML_OP_ADD}},
     {LLM_TENSOR_TIME_MIX_LERP_R,     {LLM_TENSOR_LAYER_REPEATING, GGML_OP_ADD}},
     {LLM_TENSOR_TIME_MIX_LERP_G,     {LLM_TENSOR_LAYER_REPEATING, GGML_OP_ADD}},
+    {LLM_TENSOR_TIME_MIX_LERP_FUSED, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_ADD}},
     {LLM_TENSOR_TIME_MIX_DECAY,      {LLM_TENSOR_LAYER_REPEATING, GGML_OP_ADD}},
     {LLM_TENSOR_TIME_MIX_FIRST,      {LLM_TENSOR_LAYER_REPEATING, GGML_OP_RWKV_WKV6}},
     {LLM_TENSOR_ATTN_NORM,           {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL}},
Some files were not shown because too many files have changed in this diff.