Commit Graph

355 Commits

Author SHA1 Message Date
Georgi Gerganov
c5650ed470
server : avoid context swaps by shifting the KV cache 2023-09-28 19:03:36 +03:00
Georgi Gerganov
ce2d995af2
server : clear the KV cache beyond n_past before llama_decode 2023-09-28 18:12:39 +03:00
Georgi Gerganov
2b8830af71
examples : do not eval prompt 2 times (close #3348) 2023-09-28 17:48:46 +03:00
Georgi Gerganov
a207561503
examples : add example for batched decoding 2023-09-28 17:32:04 +03:00
Georgi Gerganov
d008733e6b
examples : utilize new llama_get_logits_ith() 2023-09-28 16:05:37 +03:00
Georgi Gerganov
4ad0676927
parallel : fix crash when -n -1 2023-09-28 15:48:38 +03:00
Georgi Gerganov
25856900db
Merge branch 'master' into custom-attention-mask 2023-09-28 15:19:57 +03:00
Richard Roberson
ac43576124
make-ggml.py : compatibility with more models and GGUF (#3290)
* Resync my fork with new llama.cpp commits

* examples : rename to use dash instead of underscore

* New model conversions

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-09-27 19:25:12 +03:00
Cebtenzzre
20c7e1e804
gguf : fix a few general keys (#3341) 2023-09-27 12:18:07 -04:00
Rickard Hallerbäck
dc6897404e
metal : reusing llama.cpp logging (#3152)
* metal : reusing llama.cpp logging

* cmake : build fix

* metal : logging callback

* metal : logging va_args memory fix

* metal : minor cleanup

* metal : setting function like logging macro to capital letters

* llama.cpp : trailing whitespace fix

* ggml : log level enum used by llama

* Makefile : cleanup ggml-metal recipe

* ggml : ggml_log_callback typedef

* ggml : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-09-27 18:48:33 +03:00
BarfingLemurs
ffe88a36a9
readme : add some recent perplexity and bpw measurements to READMES, link for k-quants (#3340)
* Update README.md

* Update README.md

* Update README.md with k-quants bpw measurements
2023-09-27 18:30:36 +03:00
slaren
c091cdfb24
llama-bench : add README (#3317)
* llama-bench : add README

* minor edit
2023-09-23 21:48:24 +02:00
Georgi Gerganov
8845160058
simple : add README.md 2023-09-21 20:10:14 +02:00
yuiseki
f56c418ab0
embedding : update README.md (#3224) 2023-09-21 11:57:40 +03:00
Georgi Gerganov
b2debf65f2
parallel : add disabled experimental batch chunking in powers of two 2023-09-20 20:14:05 +03:00
Cebtenzzre
a5661d7e71
llama : allow gguf RoPE keys to be overridden with defaults (#3240) 2023-09-20 12:12:47 -04:00
Georgi Gerganov
ded9b43cad
parallel : fix cases where the input prompts can overflow the batch 2023-09-20 19:09:25 +03:00
Cebtenzzre
65c2c1c5ab
benchmark-matmult : do not use integer abs() on a float (#3277) 2023-09-20 12:06:08 -04:00
Georgi Gerganov
ee1d670cc6
parallel : fix bug (extra BOS) + smaller token_prev array 2023-09-20 17:32:21 +03:00
Georgi Gerganov
2f3a46fccf
train : make KQ_pos memory buffer permanent via dummy scale op 2023-09-20 14:14:50 +03:00
Georgi Gerganov
db0fc2da06
simple : improve comments + free batch 2023-09-20 13:54:20 +03:00
Georgi Gerganov
b377bf2266
simple : add parallel decoding support 2023-09-20 13:06:34 +03:00
Georgi Gerganov
addae65fd4
llama : improve llama_batch API + simplify parallel example 2023-09-20 11:03:18 +03:00
Georgi Gerganov
d119c04c15
examples : fix benchmark-matmult (#1554)
The precision for Q4_0 has degraded since #1508
2023-09-20 10:02:39 +03:00
Georgi Gerganov
a1327c71c6
parallel : rename hot-plug to continuous-batching 2023-09-20 09:24:41 +03:00
Georgi Gerganov
7b7472ee26
parallel : minor 2023-09-20 00:35:10 +03:00
Georgi Gerganov
6028879f56
parallel : print misses on each request 2023-09-19 23:50:05 +03:00
Georgi Gerganov
eed3fd4234
parallel : count cache misses 2023-09-19 23:47:47 +03:00
Georgi Gerganov
8a9aca37c1
parallel : remove question with short answers 2023-09-19 23:34:30 +03:00
Georgi Gerganov
4b5f3cd6bf
parallel : process system prompt once + configurable paramters + llama API 2023-09-19 17:00:42 +03:00
Georgi Gerganov
82e20e9ba0
parallel : remove new line from prompt 2023-09-19 13:54:41 +03:00
Georgi Gerganov
16090a5dde
parallel : fix sequence termination criteria 2023-09-19 13:29:29 +03:00
Georgi Gerganov
806d397c1a
parallel : try smaller batches when the KV cache is fragmented 2023-09-19 13:21:36 +03:00
Georgi Gerganov
36714e16d0
parallel : various improvements 2023-09-19 12:29:37 +03:00
Georgi Gerganov
467e307931
simple : fix token counting 2023-09-19 11:45:33 +03:00
Georgi Gerganov
daf4c6d360
llama : fix worst case graph build 2023-09-19 11:05:08 +03:00
Georgi Gerganov
fa0e677820
llama : extend batch API to select which logits to output 2023-09-19 00:24:13 +03:00
Georgi Gerganov
897caccdf4
fixes : speculative KV cache + llama worst-case graph 2023-09-18 22:32:28 +03:00
Georgi Gerganov
466b513851
parallel : disable hot-plug to avoid cache fragmentation 2023-09-18 21:34:20 +03:00
Georgi Gerganov
0161372b9a
parallel : example for serving multiple users in parallel 2023-09-18 20:37:28 +03:00
Georgi Gerganov
1f17ea631c
speculative : fix KV cache management 2023-09-18 19:01:20 +03:00
Georgi Gerganov
0cbf3bfef8
llama : add llama_kv_cache_shift_seq + no more context swaps 2023-09-18 18:10:43 +03:00
Georgi Gerganov
f015b26689
llama : more robust cell_max heuristic + wip shift 2023-09-18 17:15:58 +03:00
Cebtenzzre
8781013ef6
make : restore build-info.h dependency for several targets (#3205) 2023-09-18 10:03:53 -04:00
Georgi Gerganov
4d76d762ef
llama : extend llama_kv_cache API 2023-09-18 15:53:03 +03:00
Georgi Gerganov
9f42e75489
llama : add new llama_decode() API that works with llama_batch 2023-09-18 14:23:52 +03:00
Georgi Gerganov
58bb5110ca
Merge branch 'master' into custom-attention-mask 2023-09-18 11:15:18 +03:00
Georgi Gerganov
d29e76937c
llama : unified KV cache + batch inference API 2023-09-18 11:08:15 +03:00
Georgi Gerganov
1fb033fd85
ggml : ggml_rope now takes a vector with positions instead of n_past 2023-09-17 21:17:10 +03:00
goerch
b08e75baea
Fixing the last deviations from sentencepiece indicated by test-tokenizer-1 (#3170)
* Fix for #2721

* Reenable tokenizer test for LLaMa

* Add `console.cpp` dependency

* Fix dependency to `common`

* Fixing wrong fix.

* Make console usage platform specific

Work on compiler warnings.

* Adapting makefile

* Remove trailing whitespace

* Adapting the other parts of the makefile

* Fix typo.

* Fixing the last deviations from sentencepiece indicated by test-tokenizer-1

* Simplify logic

* Add missing change...

* Fix ugly compiler warning

* llama_tokenize should accept strings containing NUL now

* Adding huichen's test case
2023-09-16 13:41:33 +02:00