* finetune: rename feed-forward tensors (w1/w2/w3)
This commit renames the feed-forward tensors w1, w2 and w3 to ffn_gate,
ffn_down and ffn_up respectively.
The motivation for this change is to make it easier to understand the
purpose of the tensors. This also seems to be inline with the names
used in the llama_layer struct in llama.cpp.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* train-text-from-scratch: rename ff tensors
This commit renames the feed-forward tensors w1, w2 and w3 to ffn_gate,
ffn_down and ffn_up respectively.
The motivation for this change is to make it easier to understand the
purpose of the tensors. This also seems to be inline with the names
used in the llama_layer struct in llama.cpp
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
---------
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* batched embedding: pool outputs by sequence id. updated embedding example
* bring back non-causal attention
* embd : minor improvements
* llama : minor
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* make: add error message for bad CUDA version
* Update Makefile
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
---------
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* llava: remove prog parameter from ArgumentParser
This commit removes the `prog` parameter from `ArgumentParser`
so that it uses the default value which is the name of the script.
The motivation for this change is that currently the usage output looks
like this:
```console
$ python examples/llava/convert-image-encoder-to-gguf.py --help
usage: convert_hf_to_gguf.py [-h] ...
```
And with this change it will look like this:
```console
$ python examples/llava/convert-image-encoder-to-gguf.py --help
usage: convert-image-encoder-to-gguf.py [-h] ...
```
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* ci: add W503 to flake8 ignore list
This commit adds W503 to the ignore list for flake8. This is done to
avoid the following error:
W503 line break before binary operator
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
---------
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* BERT model graph construction (build_bert)
* WordPiece tokenizer (llm_tokenize_wpm)
* Add flag for non-causal attention models
* Allow for models that only output embeddings
* Support conversion of BERT models to GGUF
* Based on prior work by @xyzhang626 and @skeskinen
---------
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* vulkan: refactor guess_matmul_pipeline for vendor
Refactor ggml_vk_guess_matmul_pipeline to simplify adding per-vendor
conditionals.
Signed-off-by: Sergio Lopez <slp@redhat.com>
* vulkan: only use M-sized matmul on Apple GPUs
L-sized and S-sized matmuls are broken on Apple GPUs, force using
M-size with this vendor.
Signed-off-by: Sergio Lopez <slp@redhat.com>
---------
Signed-off-by: Sergio Lopez <slp@redhat.com>
* common: use enums for sampler types
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* minor : spaces
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server: allow to specify tokens as strings in logit_bias
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* ggml: aarch64: implement smmla kernel for q8_0_q8_0 quantized gemm
armv8.2-a and above supports MMLA instructions that have higher
throughput than DOT. this commit adds mmla kernel for
q8_0_q8_0 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8"
On AWS Graviton3 processors this kernel resulted up to 1.5x
improvement for prompt evaluation throughput compared to the
default sdot kernel.
* ggml: aarch64: implement smmla kernel for q4_0_q8_0 quantized gemm
armv8.2-a and above supports MMLA instructions that have higher
throughput than DOT. this commit adds mmla kernel for
q4_0_q8_0 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8"
On AWS Graviton3 processors this kernel resulted up to 1.5x
improvement for prompt evaluation throughput compared to the
default sdot kernel.
* ggml: aarch64: implement smmla kernel for q4_1_q8_1 quantized gemm
armv8.2-a and above supports MMLA instructions that have higher
throughput than DOT. this commit adds mmla kernel for
q4_1_q8_1 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8"
On AWS Graviton3 processors this kernel resulted up to 1.5x
improvement for prompt evaluation throughput compared to the
default sdot kernel.
* ggml: update unit tests for the new vec_dot interface
* llama.cpp: add MATMUL_INT8 capability to system_info
A common default for the maximum number of open files is 256, which can
lead to `asyncio.gather(*tasks)` failing with Too many open files.
$ python ggml_vk_generate_shaders.py --glslc=$ANDROID_NDK_PATH/shader-tools/darwin-x86_64/glslc
ggml_vulkan: Generating and compiling shaders to SPIR-V
Traceback (most recent call last):
File "/Users/neuman/Code.noindex/github/llama.cpp/ggml_vk_generate_shaders.py", line 2326, in <module>
asyncio.run(main())
File "/Users/neuman/Code.noindex/miniforge3/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/Users/neuman/Code.noindex/miniforge3/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/Users/neuman/Code.noindex/github/llama.cpp/ggml_vk_generate_shaders.py", line 2294, in main
await asyncio.gather(*tasks)
[...snip...]
OSError: [Errno 24] Too many open files
This change sets a reasonable concurrency limit for tasks (and therefore
open files), without significant impact on run time.
* llava: add requirements.txt and update README.md
This commit adds a `requirements.txt` file to the `examples/llava`
directory. This file contains the required Python packages to run the
scripts in the `examples/llava` directory.
The motivation of this to make it easier for users to run the scripts in
`examples/llava`. This will avoid users from having to possibly run into
missing package issues if the packages are not installed on their system.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* llava: fix typo in llava-surgery.py output
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
---------
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
This commit adds the missing .py extension to the convert-image-encoder-to-gguf
script. It also fixes the paths for the `model` and `mmproj` options in the
example llava-cli command.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* fix bug for norm_rms_eps missing
* to align with the same order as convert.py for model write
* fix: undo HF models permute tensor
* update for flake8 lint
This commit fixes a typo in the README.md file for the llava example
which is causing the formatting to look a little off:
Clone llava-v15-7b`` and clip-vit-large-patch14-336`` locally
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>