There are couple things in this architecture:
1. Shared input and output embedding parameters.
2. Key length and value length are not derived from `n_embd`.
More information about the models can be found at
https://ai.google.dev/gemma. GGUFs can be downloaded from
https://huggingface.co/google.
* iq4_nl: squash commits for easier rebase
* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scaler dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels
* iq4_nl: Fix after merging with master
* iq4_nl: another fix after merging with master
* Use IQ4_NL instead of Q4_K when using k-quants is not possible
* Fix typo that makes several tests fail
* It was the ggml_vdotq thing missed inside the brackets
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
This commit contains a suggestion for the README.md in the llava
example. The suggestion adds explicit instructions for how to convert
a llava-1.6 model and run it using llava-cli.
The motivation for this is that having explicit instructions similar to
the 1.5 instructions will make it easier for users to try this out.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* add build support for embedded metal library
* Update Makefile
---------
Co-authored-by: Haoxiang Fei <feihaoxiang@idea.edu.cn>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* support minLength and maxLength in JSON schema grammar converter
* Update examples/json-schema-to-grammar.py
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* ggml : embed Metal library source (ggml-metal.metal) into binary
enable by setting WHISPER_EMBED_METAL_LIBRARY
* rename the build option
* rename the preprocessor directive
* generate Metal library embedding assembly on-fly during build process
This is a follup of Commit fc0c8d286a
("llava : update surgery script to not remove tensors") but this time
the change is to the BakLLaVA specific part of the surgery script.
I've been able to test this using SkunkworksAI/BakLLaVA-1 and it works
as expected using the instructions in README.md.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* llama: add llama_chat_apply_template
* test-chat-template: remove dedundant vector
* chat_template: do not use std::string for buffer
* add clarification for llama_chat_apply_template
* llama_chat_apply_template: add zephyr template
* llama_chat_apply_template: correct docs
* llama_chat_apply_template: use term "chat" everywhere
* llama_chat_apply_template: change variable name to "tmpl"
* #ifdef out some code NUMA blocks for Android due to lack of support
* added in some __ANDROID__ if def gates around numa code and forced GLIBC prior to 2.29 to use a syscall for getcpu instead of the wrapper
* Changed gates on numa platform specific stuff to __gnu_linux__ to skip any platforms without glibc
* harmonizing #if defined blocks for numa code to __gnu_linux__ since that's the only model that's being followed anyways
---------
Co-authored-by: root <root@nenya.lothlorien.ca>
* build : pass all warning flags to nvcc via -Xcompiler
* make : fix apparent mis-merge from #3952
* make : fix incorrect GF_CC_VER for CUDA host compiler
* server: enrich health endpoint with available slots, return 503 if not slots are available
* server: document new status no slot available in the README.md