mirror of https://github.com/ggerganov/llama.cpp.git synced 2025-01-21 00:59:23 +01:00
Commit Graph

2287 Commits

Author SHA1 Message Date
Iwan Kawrakow
f0cbb6ddf6 iq1_s: turn off SIMD implementation for QK_K = 64 (it does not work) 2024-02-28 08:28:10 +02:00
Iwan Kawrakow
47d52b2b24 Q2_K: fixed bug in imatrix quantization for QK_K = 64 2024-02-28 08:15:52 +02:00
Iwan Kawrakow
2540a290ed Make CUDA compile with QK_K = 64
Tests don't pass, plus we get misaligned access
2024-02-27 21:35:11 +02:00
Iwan Kawrakow
de64e061da QK_K = 64 tests pass on ARM_NEON and Metal
Sadly, that does not mean it actually works.
2024-02-27 20:12:54 +02:00
Iwan Kawrakow
28e6146c11 iq2_xs: attempt to fix AVX dot product for QK_K = 64
Tests pass, but I get gibberish.
2024-02-27 18:41:31 +02:00
Iwan Kawrakow
13ba37f1aa WIP: make i-quants work for QK_K = 64 2024-02-27 17:30:11 +02:00
Kawrakow
0becb22ac0
IQ4_XS: a 4.25 bpw quantization ()
* Try IQ4_NL with blocks of 64 - does not look good

* iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32

* iq4_xs: CUDA works - 133.2 t/s

* iq4_xs: AVX2 dot product

* iq4_xs: ARM_NEON dot product

* iq4_nl: Metal implementation

As usual, Metal / Apple Silicon don't like my quants.

* iq3_xs: minor fix

* iq4_xs: shrink by using IQ3_S for attn_k and attn_q

* iq4_xs: revert using IQ3_S for attn_k and attn_v

PPL vs size is good, but CPU performance suffers: on M2 Max
TG-128 drops to 21.7 t/s from 28.8, and on a Ryzen-7950X
to 14.5 t/s from 15.8 t/s. On CUDA we have 135 t/s when
using IQ3_S vs 133 t/s with pure IQ4_XS.

* Fix CI

* iq4_xs: Added forgotten check for 256 divisibility

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-27 16:34:24 +02:00
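The 4.25 bpw for IQ4_XS follows from the layout described in the bullets above: super-blocks of 256 weights split into eight blocks of 32, each block with a 6-bit scale. A minimal accounting sketch (field names are illustrative; see ggml's block_iq4_xs for the real definition):

```cpp
#include <cstdint>

#define QK_K 256 // super-block size

// Sketch of the IQ4_XS super-block layout.
typedef struct {
    uint16_t d;                  // fp16 super-block scale            ->   16 bits
    uint16_t scales_h;           // high 2 bits of eight 6-bit scales ->   16 bits
    uint8_t  scales_l[QK_K/64];  // low 4 bits of eight 6-bit scales  ->   32 bits
    uint8_t  qs[QK_K/2];         // 256 weights at 4 bits each        -> 1024 bits
} block_iq4_xs;                  // (16 + 16 + 32 + 1024) / 256 = 4.25 bpw
```

Since the eight blocks share the fp16 super-block scale, the per-block 6-bit scales cost only 48/256 = 0.1875 bpw on top of the 4-bit quants.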
Engininja2
c24a2a6e60
cuda : replace remaining shfl_xor with calls to warp_reduce functions () 2024-02-27 14:22:45 +01:00
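For context, the warp_reduce helpers that replace the raw shuffles wrap the classic XOR-butterfly reduction. A minimal CUDA sketch of a warp-wide sum in that style (this mirrors the common pattern, not necessarily ggml-cuda's exact code):

```cpp
// Each of the 32 lanes ends up holding the sum of all 32 inputs:
// XOR-ing the lane id with 16, 8, 4, 2, 1 pairs every lane with a partner.
static __device__ __forceinline__ float warp_reduce_sum(float x) {
#pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1) {
        x += __shfl_xor_sync(0xffffffff, x, offset, 32);
    }
    return x;
}
```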
Engininja2
1f30b7a9f1
ggml-quants : fix avx2 iq1_s vec_dot when compiled with gcc () 2024-02-27 14:50:18 +02:00
Georgi Gerganov
9d533a77d0
llama : fix defrag bugs + add parameter ()
* llama : fix defrag bugs + enable by default

ggml-ci

* llama : add defrag_thold parameter

ggml-ci

* llama : cont

* llama : disable log message

ggml-ci

* llama : fix graph size check during defrag
2024-02-27 14:35:51 +02:00
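A minimal sketch of how the new parameter is meant to be used, assuming the defrag_thold field this PR adds to llama_context_params:

```cpp
#include "llama.h"

// given a previously loaded llama_model * model
llama_context_params cparams = llama_context_default_params();
// defragment the KV cache once holes exceed ~10% of its size;
// a negative value disables automatic defragmentation
cparams.defrag_thold = 0.1f;
llama_context * ctx = llama_new_context_with_model(model, cparams);
```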
le.chang
cbbd1efa06
Makefile: use variables for cublas ()
* make: use arch variable for cublas

* fix UNAME_M

* check opt first

---------

Co-authored-by: lindeer <le.chang118@gmail.com>
2024-02-27 03:03:06 +01:00
Xuan Son Nguyen
b11a93df41
fix server hangs on empty prompt () 2024-02-26 23:15:48 +01:00
Kawrakow
a33e6a0d2a
Adding IQ2_S and IQ2_M to complete coverage of the 2-3 bit quantization range ()
* Adding IQ2_S and IQ2_M as a single cumulative commit

* Update examples/quantize/quantize.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-26 18:28:38 +02:00
Johannes Gäßler
47bb7b48c7
CUDA: fix DEBUG_CUDA_MALLOC () 2024-02-26 15:36:38 +01:00
Artem
c4d7f81786
readme : update ui list ()
* Add LLMFarm (ui for iOS) to list
2024-02-26 16:15:28 +02:00
AidanBeltonS
e849078c6e
[SYCL] Add support for soft_max ALiBi ()
* Add support for bias

* Update pre-processor

* rm commented code

* fix format

* fix CI

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-02-26 19:32:11 +05:30
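For reference, the ALiBi bias that this soft_max variant folds in is the standard head-dependent linear penalty (a hedged summary of the ALiBi formulation, not SYCL-specific):

$$s_{ij} = \frac{q_i \cdot k_j}{\sqrt{d}} + m_h\,(j - i), \qquad m_h = 2^{-8h/H}, \quad h = 1,\dots,H$$

with the softmax taken over keys $j \le i$, so more distant keys receive a larger negative bias.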
Georgi Gerganov
67fd33132f
unicode : reuse iterator () 2024-02-26 14:02:12 +02:00
Pierrick Hymbert
4804215cb8
server: CI fix trailing space () 2024-02-26 12:41:34 +02:00
Pierrick Hymbert
8a533f0d90
server: CI tests reduce build matrix () 2024-02-26 09:56:10 +01:00
Georgi Gerganov
269de86ba0
llama : fix Gemma rope type () 2024-02-26 08:30:17 +02:00
github-actions[bot]
c393733988 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/5863c27340ba4de8f83e7e3c023b9599c3cb3c80' (2024-02-16)
  → 'github:NixOS/nixpkgs/cbc4211f0afffe6dfd2478a62615dd5175a13f9a' (2024-02-23)
2024-02-25 22:24:22 +00:00
Pierrick Hymbert
e3965cf35a
server: tests - slow inference causes timeout on the CI ()
* server: tests - longer inference timeout for CI
2024-02-25 22:48:33 +01:00
Pierrick Hymbert
8b350356b2
server: docs - refresh and tease the HTTP server a little bit more ()
* server: docs - refresh and tease the HTTP server a little bit more

* Rephrase README.md server doc

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/server/README.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/server/README.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-25 21:46:29 +01:00
Georgi Gerganov
bf08e00643
llama : refactor k-shift implementation + KV defragmentation ()
* llama : refactor k-shift implementation

ggml-ci

* llama : rename llama_kv_cache_seq_shift to llama_kv_cache_seq_add

* llama : cont k-shift refactoring + normalize type names

ggml-ci

* minor : fix MPI builds

* llama : reuse n_rot from the build context

ggml-ci

* llama : revert enum name changes from this PR

ggml-ci

* llama : update llama_rope_type

* llama : add comment about rope values

* llama : fix build

* passkey : apply kv cache updates explicitly

ggml-ci

* llama : change name to llama_kv_cache_update()

* llama : add llama_kv_cache_seq_pos_max()

* passkey : fix llama_kv_cache_seq_pos_max() usage

* llama : some llama_kv_cell simplifications

* llama : add llama_kv_cache_compress (EXPERIMENTAL)

* llama : add alternative KV cache merging (EXPERIMENTAL)

* llama : add llama_kv_cache_defrag

* llama : comments

* llama : remove llama_kv_cache_compress

will add in a separate PR

ggml-ci

* llama : defragment via non-overlapping moves

* llama : ggml_graph based defrag implementation

ggml-ci

* llama : switch the loop order in build_defrag

* llama : add comments
2024-02-25 22:12:24 +02:00
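A sketch of the refactored API in action, as used for context shifting (function names per llama.h after this PR; the surrounding variables are illustrative):

```cpp
// Drop the oldest n_discard tokens and shift the remainder left, then let
// the backend apply the pending K-shift (and defrag, if scheduled).
const int n_discard = (n_past - n_keep) / 2;
llama_kv_cache_seq_rm (ctx, 0, n_keep, n_keep + n_discard);
llama_kv_cache_seq_add(ctx, 0, n_keep + n_discard, n_past, -n_discard);
llama_kv_cache_update(ctx);
n_past -= n_discard;
```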
compilade
f7625019c5
server : fix crash when system prompt is bigger than batch size ()
The system prompt is now decoded in batches.

* server : fix off-by-one n_past when start of prompt matches whole cache

The tokens right after the matching part would otherwise skip a pos value.
2024-02-25 20:43:50 +02:00
Radosław Gryta
abbabc5e51
ggml-quants : provide ggml_vqtbl1q_u8 for 64-bit compatibility ()
* [ggml-quants] Provide ggml_vqtbl1q_u8 for 64-bit compatibility

vqtbl1q_u8 is not part of the ARMv7 NEON library

* [android-example] Remove abi filter after arm v7a fix

* [github-workflows] Do not skip Android armeabi-v7a build
2024-02-25 20:43:00 +02:00
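On AArch64, vqtbl1q_u8 does a 16-byte table lookup in one instruction; ARMv7 NEON only has the 8-byte vtbl variants, so a compatibility shim has to split the lookup. A minimal sketch of such a fallback (ggml's actual ggml_vqtbl1q_u8 may differ):

```c
#include <arm_neon.h>

#if !defined(__aarch64__)
// Emulate AArch64 vqtbl1q_u8 on ARMv7: vtbl2_u8 indexes a 16-byte table
// (two d-registers) with 8 indices at a time; out-of-range indices yield 0,
// matching vqtbl1q_u8 semantics.
static inline uint8x16_t ggml_vqtbl1q_u8(uint8x16_t t, uint8x16_t idx) {
    uint8x8x2_t tab = { { vget_low_u8(t), vget_high_u8(t) } };
    return vcombine_u8(vtbl2_u8(tab, vget_low_u8(idx)),
                       vtbl2_u8(tab, vget_high_u8(idx)));
}
#endif
```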
kwin1412
f1a98c5254
make : fix handling of empty nvcc version ()
fix the case where the nvcc version is empty
2024-02-25 18:46:49 +02:00
Ashok Gelal
7d548a1827
readme : add Msty to UI list () 2024-02-25 17:57:34 +02:00
Pierrick Hymbert
930b178026
server: logs - unified format and --log-format option ()
* server: logs - always use JSON logger, add thread_id in message, log task_id and slot_id

* server : skip GH copilot requests from logging

* server : change message format of server_log()

* server : no need to repeat log in comment

* server : log style consistency

* server : fix compile warning

* server : fix tests regex patterns on M2 Ultra

* server: logs: PR feedback on log level

* server: logs: allow to choose log format in json or plain text

* server: tests: output server logs in text

* server: logs: switch init logs to the server logs macro

* server: logs: ensure json values do not raise errors

* server: logs: shorten level VERBOSE to VERB (max 4 chars)

* server: logs: lower case like other log messages

* server: logs: avoid static in general

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server: logs PR feedback: change text log format to: LEVEL [function_name] message | additional=data

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-25 13:50:32 +01:00
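A hypothetical line in the resulting plain-text format, purely to illustrate the `LEVEL [function_name] message | additional=data` shape (all values made up):

```
INFO [launch_slot_with_data] slot is processing task | slot_id=0 task_id=42
```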
Pierrick Hymbert
d52d7819b8
server: concurrency fix + monitoring - add /metrics prometheus compatible endpoint ()
* server: monitoring - add /metrics prometheus compatible endpoint

* server: fix concurrency issue: when 2 tasks are waiting for results, only one caller thread is notified

* server: metrics - move to a dedicated struct
2024-02-25 13:49:43 +01:00
Radosław Gryta
1289408817
cmake : fix compilation for Android armeabi-v7a () 2024-02-25 12:53:11 +02:00
Georgi Gerganov
ab336a9d5e
code : normalize enum names ()
* code : normalize enum names

ggml-ci

* code : cont

* code : cont
2024-02-25 12:09:09 +02:00
Anas Ahouzi
69917dfa55
py : fix StableLM conversion after config.json changes ()
* Fix issues during StableLM model conversion

* Fix hard coded layer_norm_eps

* Support layer_norm_eps for LlavaStableLM

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* Add missing parenthesis

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* Support rotary_factor for LlavaStableLM

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* fix typo

* Add StableLMEpochForCausalLM for safety

Co-authored-by: compilade <113953597+compilade@users.noreply.github.com>

* Add StableLMEpochForCausalLM for safety 2

Co-authored-by: compilade <113953597+compilade@users.noreply.github.com>

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
Co-authored-by: compilade <113953597+compilade@users.noreply.github.com>
2024-02-25 11:54:04 +02:00
Pierrick Hymbert
9e359a4f47
server: continue to update other slots on concurrent embedding requests ()
* server: continue to update other slots on concurrent embedding requests.

* server: tests: add multi users embeddings as fixed

* server: tests: adding OAI compatible embedding concurrent endpoint

* server: tests: adding OAI compatible embedding with multiple inputs
2024-02-24 19:16:04 +01:00
Kawrakow
4c4cb30736
IQ3_S: a much better alternative to Q3_K ()
* iq4_nl: squash commits for easier rebase

* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scalar dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels

* Resurrecting iq3_xs

After all the experimentation, nothing was better than this.

* Minor PPL improvement via a block scale fudge factor

* Minor improvement via 3 neighbours

* iq3_xs: working scalar and AVX2 dot products

* iq3_xs: ARM_NEON dot product - works but extremely slow (10 t/s)

* iq3_xs: working Metal implementation

* Adding IQ3_M - IQ3_XS mix with mostly Q4_K

* iq3_xs: a 3.4375 bpw variant

* iq3_xs: make CUDA work for new version

* iq3_xs: make scalar and AVX2 work for new version

* iq3_s: make ARM_NEON work with new version

* iq3_xs: make new version work on metal

Performance is very similar to Q3_K_S

* iq3_xs: tiny Metal speed improvement

* iq3_xs: tiny Metal speed improvement

* Fix stupid warning

* Q3_K_XS now uses a mix of IQ3_XS and IQ3_XXS

* iq3_xs: rename to iq3_s

* iq3_s: make tests pass

* Move Q3_K_XS mix to 3.25 bpw

* Attempt to fix failing tests

* Another attempt to fix the Windows builds

* Attempt to fix ROCm

* ROCm again

* iq3_s: partial fix for QK_K = 64

* iq3_s: make it work on metal for QK_K = 64

Pleasant surprise: the code was super-block-size independent,
so all it took was deleting some QK_K == 256 guards.

* Will this fix ROCm?

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-24 16:23:52 +02:00
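The 3.4375 bpw variant mentioned above is consistent with this per-super-block accounting (a sketch assuming ggml's block_iq3_s layout: fp16 scale, quants, high bits, signs, and block scales at 2 + 64 + 8 + 32 + 4 bytes per 256 weights):

$$\frac{(2 + 64 + 8 + 32 + 4) \times 8\ \text{bits}}{256\ \text{weights}} = \frac{880}{256} = 3.4375\ \text{bpw}$$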
Pierrick Hymbert
525213d2f5
server: init functional tests ()
* server: tests: init scenarios
 - health and slots endpoints
 - completion endpoint
 - OAI-compatible chat completion requests with and without streaming
 - multi-user completion scenario
 - multi-user scenario on the OAI-compatible endpoint with streaming
 - multi-user scenario where the total number of tokens to predict exceeds the KV cache size
 - server wrong-usage scenario, like in the "infinite loop of context shift" issue
 - slots shifting
 - continuous batching
 - embeddings endpoint
 - multi-user embedding endpoint: segmentation fault
 - OpenAI-compatible embeddings API
 - tokenize endpoint
 - CORS and API key scenario

* server: CI GitHub workflow


---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-24 12:28:55 +01:00
AlpinDale
fd43d66f46
server : add KV cache quantization options () 2024-02-23 21:31:54 +02:00
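A sketch of what those server options map to on the C API side (assuming the type_k/type_v fields in llama_context_params; which V-cache types are supported depends on the backend):

```cpp
#include "llama.h"

llama_context_params cparams = llama_context_default_params();
cparams.type_k = GGML_TYPE_Q8_0; // quantize the K cache to 8 bits
cparams.type_v = GGML_TYPE_Q8_0; // quantize the V cache to 8 bits
```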
Jared Van Bortel
54fbcd2ce6
convert : fix missing ftype for gemma () 2024-02-23 20:39:14 +02:00
Jared Van Bortel
15499eb942
mpt : do not duplicate token_embd.weight on disk () 2024-02-22 17:05:23 -05:00
Georgi Gerganov
96633eeca1
gemma : use more bits for the token_embd.weight tensor ()
* gemma : use Q8_0 for the token_embd.weight tensor

* llama : quantize token_embd.weight using output type
2024-02-22 23:23:46 +02:00
Georgi Gerganov
847eedbdb2
py : add Gemma conversion from HF models ()
* py : add gemma conversion from HF models

* Update convert-hf-to-gguf.py

Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update convert-hf-to-gguf.py

Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update convert-hf-to-gguf.py

Co-authored-by: Jared Van Bortel <jared@nomic.ai>

---------

Co-authored-by: Aarni Koskela <akx@iki.fi>
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2024-02-22 23:22:48 +02:00
Georgi Gerganov
7e4f339c40
ggml : always define ggml_fp16_t as uint16_t ()
* ggml : always define ggml_fp16_t as uint16_t

ggml-ci

* ggml : cont

ggml-ci

* ggml : cont

* ggml : cont

ggml-ci

* ggml : cont

ggml-ci

* cuda : no longer ggml headers last

ggml-ci

* ggml : fix q6_K FP16 -> FP32 conversion

ggml-ci

* ggml : more FP16 -> FP32 conversion fixes

ggml-ci
2024-02-22 23:21:39 +02:00
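Defining ggml_fp16_t as a plain uint16_t makes every FP16 -> FP32 conversion explicit, which is what the fixes above are about. For reference, a standard bit-exact conversion (ggml itself uses lookup tables or hardware intrinsics where available; this sketch assumes IEEE 754 binary16):

```cpp
#include <cstdint>
#include <cstring>

typedef uint16_t ggml_fp16_t; // storage-only: no accidental implicit arithmetic

static float fp16_to_fp32(ggml_fp16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    uint32_t exp  = (h >> 10) & 0x1f;
    uint32_t mant = h & 0x3ff;
    uint32_t bits;
    if (exp == 0) {
        if (mant == 0) {
            bits = sign;                           // +/- zero
        } else {
            exp = 113;                             // 127 - 15 + 1
            while ((mant & 0x400) == 0) {          // normalize the subnormal
                mant <<= 1;
                exp--;
            }
            bits = sign | (exp << 23) | ((mant & 0x3ff) << 13);
        }
    } else if (exp == 31) {
        bits = sign | 0x7f800000 | (mant << 13);   // inf / NaN
    } else {
        bits = sign | ((exp + 112) << 23) | (mant << 13); // 112 = 127 - 15
    }
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}
```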
Georgi Gerganov
334f76fa38
sync : ggml 2024-02-22 23:21:05 +02:00
Georgi Gerganov
efd56b1c21
ggml : 32-bit arm compat (whisper/1891)
* ggml : 32-bit arm compat

* ggml : add ggml_vqtbl1q_s8 impl

* ggml : cont
2024-02-22 23:20:50 +02:00
Someone
201294ae17
nix: init singularity and docker images ()
Exposes a few attributes demonstrating how to build [singularity](https://docs.sylabs.io/guides/latest/user-guide/)/[apptainer](https://apptainer.org/) and Docker images re-using llama.cpp's Nix expression.

Built locally on `x86_64-linux` with `nix build github:someoneserge/llama.cpp/feat/nix/images#llamaPackages.{docker,docker-min,sif,llama-cpp}` and it's fast and effective.
2024-02-22 11:44:10 -08:00
Georgi Gerganov
5a9e2f60ba
py : minor fixes () 2024-02-22 20:13:25 +02:00
Xuan Son Nguyen
373ee3fbba
Add Gemma chat template ()
* add gemma chat template

* gemma: only apply system_prompt on non-model message
2024-02-22 19:10:21 +01:00
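For reference, a single user turn under the Gemma template renders roughly like this (a sketch based on the upstream Gemma format; exact whitespace is an assumption):

```cpp
// What applying the gemma template to one user message looks like; the
// trailing "<start_of_turn>model\n" is the assistant prompt appended when
// the template is asked to add the assistant prefix.
const char * gemma_example =
    "<start_of_turn>user\n"
    "Why is the sky blue?<end_of_turn>\n"
    "<start_of_turn>model\n";
```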
Someone
4cb4d8b22d
workflows: nix: hardcode cachix ids, build unconditionally ()
GitHub does not expose environment and repository variables to PRs coming from forks, which means the Nix CI actions have effectively been disabled for most PRs.

The `if:` also didn't make much sense, because we can always pull from cachix, and there's no point (albeit no risk either) in pushing cache for untrusted code.
2024-02-22 08:32:09 -08:00
Georgi Gerganov
3a03541ced
minor : fix trailing whitespace () 2024-02-22 13:54:03 +02:00
Georgi Gerganov
56d03d92be
readme : update hot topics 2024-02-22 10:35:54 +02:00