* Try IQ4_NL with blocks of 64 - does not look good
* iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32
* iq4_xs: CUDA works - 133.2 t/s
* iq4_xs: AVX2 dot product
* iq4_xs: ARM_NEON dot product
* iq4_nl: Metal implementation
As usual, Metal / Apple Silicon don't like my quants.
* iq3_xs: minor fix
* iq4_xs: shrink by using IQ3_S for attn_k and attn_q
* iq4_xs: revert using IQ3_S for attn_k and attn_v
PPL vs size is good, but CPU performance suffers: on M2 Max
TG-128 drops to 21.7 t/s from 28.8, and on a Ryzen-7950X
to 14.5 t/s from 15.8 t/s. On CUDA we have 135 t/s when
using IQ3_S vs 133 t/s with pure IQ4_XS.
* Fix CI
* iq4_xs: Added forgotten check for 256 divisibility
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Adding IQ2_S and IQ2_M as a single cumulative commit
* Update examples/quantize/quantize.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server: docs - refresh and tease a little bit more the http server
* Rephrase README.md server doc
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update examples/server/README.md
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update examples/server/README.md
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update README.md
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
The system prompt is now decoded in batches.
* server : fix off-by-one n_past when start of prompt matches whole cache
The tokens right after the matching part would otherwise skip a pos value.
* [ggml-quants] Provide ggml_vqtbl1q_u8 for 64bit compatibility
vqtbl1q_u8 is not part of arm v7 neon library
* [android-example] Remove abi filter after arm v7a fix
* [github-workflows] Do not skip Android armeabi-v7a build
* server: logs - always use JSON logger, add add thread_id in message, log task_id and slot_id
* server : skip GH copilot requests from logging
* server : change message format of server_log()
* server : no need to repeat log in comment
* server : log style consistency
* server : fix compile warning
* server : fix tests regex patterns on M2 Ultra
* server: logs: PR feedback on log level
* server: logs: allow to choose log format in json or plain text
* server: tests: output server logs in text
* server: logs switch init logs to server logs macro
* server: logs ensure value json value does not raised error
* server: logs reduce level VERBOSE to VERB to max 4 chars
* server: logs lower case as other log messages
* server: logs avoid static in general
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server: logs PR feedback: change text log format to: LEVEL [function_name] message | additional=data
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server: monitoring - add /metrics prometheus compatible endpoint
* server: concurrency issue, when 2 task are waiting for results, only one call thread is notified
* server: metrics - move to a dedicated struct
* Fix issues during StableLM models conversion
* Fix hard coded layer_norm_eps
* Support layer_norm_eps for LlavaStableLM
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Add missing parenthesis
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Support rotary_factor for LlavaStableLM
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* fix typo
* Add StableLMEpochForCausalLM for safety
Co-authored-by: compilade <113953597+compilade@users.noreply.github.com>
* Add StableLMEpochForCausalLM for safety 2
Co-authored-by: compilade <113953597+compilade@users.noreply.github.com>
---------
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
Co-authored-by: compilade <113953597+compilade@users.noreply.github.com>
* iq4_nl: squash commits for easier rebase
* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scaler dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels
* Resurrecting iq3_xs
After all the experimentation, nothing was better than this.
* Minor PPL improvement via a block scale fudge factor
* Minor improvement via 3 neighbours
* iq3_xs: working scalar and AVX2 dot products
* iq3_xs: ARM_NEON dot product - works but extremely slow (10 t/s)
* iq3_xs: working Metal implementation
* Adding IQ3_M - IQ3_XS mix with mostly Q4_K
* iiq3_xs: a 3.4375 bpw variant
* iq3_xs: make CUDA work for new version
* iq3_xs: make scalar and AVX2 work for new version
* iq3_s: make ARM_NEON work with new version
* iq3_xs: make new version work on metal
Performance is very similar to Q3_K_S
* iq3_xs: tiny Metal speed improvement
* iq3_xs: tiny Metal speed improvement
* Fix stupid warning
* Q3_K_XS now uses a mix of IQ3_XS and IQ3_XXS
* iq3_xs: rename to iq3_s
* iq3_s: make tests pass
* Move Q3_K_XS mix to 3.25 bpw
* Attempt to fix failing tests
* Another attempt to fix the Windows builds
* Attempt to fix ROCm
* ROCm again
* iq3_s: partial fix for QK_K = 64
* iq3_s: make it work on metal for QK_K = 64
Pleasent surprise: the coding was super-block size independent,
so all it took was to delete some QK_K == 256 guards.
* Will this fix ROCm?
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* server: tests: init scenarios
- health and slots endpoints
- completion endpoint
- OAI compatible chat completion requests w/ and without streaming
- completion multi users scenario
- multi users scenario on OAI compatible endpoint with streaming
- multi users with total number of tokens to predict exceeds the KV Cache size
- server wrong usage scenario, like in Infinite loop of "context shift" #3969
- slots shifting
- continuous batching
- embeddings endpoint
- multi users embedding endpoint: Segmentation fault #5655
- OpenAI-compatible embeddings API
- tokenize endpoint
- CORS and api key scenario
* server: CI GitHub workflow
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Exposes a few attributes demonstrating how to build [singularity](https://docs.sylabs.io/guides/latest/user-guide/)/[apptainer](https://apptainer.org/) and Docker images re-using llama.cpp's Nix expression.
Built locally on `x86_64-linux` with `nix build github:someoneserge/llama.cpp/feat/nix/images#llamaPackages.{docker,docker-min,sif,llama-cpp}` and it's fast and effective.
GitHub does not expose environment and repository variables to PRs coming from forks implies that we've been disabling the Nix CI actions for most PRs.
The `if:` also didn't make much sense, because we can always pull from cachix, and there's no point (albeit no risk either) in pushing cache for the untrusted code.
* server: fallback to chatml
* add new chat template
* server: add AlphaMonarch to test chat template
* server: only check model template if there is no custom tmpl
* remove TODO