Georgi Gerganov
c5650ed470
server : avoid context swaps by shifting the KV cache
2023-09-28 19:03:36 +03:00
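The commit above replaces full context swaps with KV-cache shifting. As an illustrative sketch (not llama.cpp code): when the context fills up, the first `n_keep` entries are retained, a block of entries after them is discarded, and the positions of the surviving entries are shifted down so the cache stays contiguous and nothing has to be re-evaluated. The names `n_keep` and `n_discard` mirror common llama.cpp parameters, but the logic here is a toy model.

```python
# Toy simulation of KV-cache shifting: drop a block of old entries and
# re-position the survivors instead of re-evaluating the whole context.

def shift_context(cache, n_keep, n_discard):
    """Keep the first n_keep entries, drop the next n_discard, then shift
    the positions of the surviving entries so the cache is contiguous."""
    kept = cache[:n_keep]
    survivors = cache[n_keep + n_discard:]
    # Each surviving (pos, token) pair moves down by n_discard positions.
    shifted = [(pos - n_discard, tok) for pos, tok in survivors]
    return kept + shifted

cache = [(i, f"tok{i}") for i in range(8)]          # a full 8-slot context
cache = shift_context(cache, n_keep=2, n_discard=3)
# Positions are contiguous again: 0,1 (kept) then 2,3,4 (shifted survivors).
print([pos for pos, _ in cache])  # [0, 1, 2, 3, 4]
```

In the real implementation the shift is applied to the cached keys (whose RoPE encoding depends on position), which is why this change depends on the positions-as-a-vector work further down in this log.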
Georgi Gerganov
ce2d995af2
server : clear the KV cache beyond n_past before llama_decode
2023-09-28 18:12:39 +03:00
Georgi Gerganov
2b8830af71
examples : do not eval prompt 2 times ( close #3348 )
2023-09-28 17:48:46 +03:00
Georgi Gerganov
a207561503
examples : add example for batched decoding
2023-09-28 17:32:04 +03:00
Georgi Gerganov
d008733e6b
examples : utilize new llama_get_logits_ith()
2023-09-28 16:05:37 +03:00
Georgi Gerganov
4ad0676927
parallel : fix crash when -n -1
2023-09-28 15:48:38 +03:00
Georgi Gerganov
25856900db
Merge branch 'master' into custom-attention-mask
2023-09-28 15:19:57 +03:00
Richard Roberson
ac43576124
make-ggml.py : compatibility with more models and GGUF ( #3290 )
* Resync my fork with new llama.cpp commits
* examples : rename to use dash instead of underscore
* New model conversions
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-09-27 19:25:12 +03:00
Cebtenzzre
20c7e1e804
gguf : fix a few general keys ( #3341 )
2023-09-27 12:18:07 -04:00
Rickard Hallerbäck
dc6897404e
metal : reusing llama.cpp logging ( #3152 )
* metal : reusing llama.cpp logging
* cmake : build fix
* metal : logging callback
* metal : logging va_args memory fix
* metal : minor cleanup
* metal : setting function like logging macro to capital letters
* llama.cpp : trailing whitespace fix
* ggml : log level enum used by llama
* Makefile : cleanup ggml-metal recipe
* ggml : ggml_log_callback typedef
* ggml : minor
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-09-27 18:48:33 +03:00
BarfingLemurs
ffe88a36a9
readme : add some recent perplexity and bpw measurements to READMEs, link for k-quants ( #3340 )
* Update README.md
* Update README.md
* Update README.md with k-quants bpw measurements
2023-09-27 18:30:36 +03:00
slaren
c091cdfb24
llama-bench : add README ( #3317 )
* llama-bench : add README
* minor edit
2023-09-23 21:48:24 +02:00
Georgi Gerganov
8845160058
simple : add README.md
2023-09-21 20:10:14 +02:00
yuiseki
f56c418ab0
embedding : update README.md ( #3224 )
2023-09-21 11:57:40 +03:00
Georgi Gerganov
b2debf65f2
parallel : add disabled experimental batch chunking in powers of two
2023-09-20 20:14:05 +03:00
Cebtenzzre
a5661d7e71
llama : allow gguf RoPE keys to be overridden with defaults ( #3240 )
2023-09-20 12:12:47 -04:00
Georgi Gerganov
ded9b43cad
parallel : fix cases where the input prompts can overflow the batch
2023-09-20 19:09:25 +03:00
Cebtenzzre
65c2c1c5ab
benchmark-matmult : do not use integer abs() on a float ( #3277 )
2023-09-20 12:06:08 -04:00
Georgi Gerganov
ee1d670cc6
parallel : fix bug (extra BOS) + smaller token_prev array
2023-09-20 17:32:21 +03:00
Georgi Gerganov
2f3a46fccf
train : make KQ_pos memory buffer permanent via dummy scale op
2023-09-20 14:14:50 +03:00
Georgi Gerganov
db0fc2da06
simple : improve comments + free batch
2023-09-20 13:54:20 +03:00
Georgi Gerganov
b377bf2266
simple : add parallel decoding support
2023-09-20 13:06:34 +03:00
Georgi Gerganov
addae65fd4
llama : improve llama_batch API + simplify parallel example
2023-09-20 11:03:18 +03:00
Georgi Gerganov
d119c04c15
examples : fix benchmark-matmult ( #1554 )
The precision for Q4_0 has degraded since #1508
2023-09-20 10:02:39 +03:00
Georgi Gerganov
a1327c71c6
parallel : rename hot-plug to continuous-batching
2023-09-20 09:24:41 +03:00
Georgi Gerganov
7b7472ee26
parallel : minor
2023-09-20 00:35:10 +03:00
Georgi Gerganov
6028879f56
parallel : print misses on each request
2023-09-19 23:50:05 +03:00
Georgi Gerganov
eed3fd4234
parallel : count cache misses
2023-09-19 23:47:47 +03:00
Georgi Gerganov
8a9aca37c1
parallel : remove question with short answers
2023-09-19 23:34:30 +03:00
Georgi Gerganov
4b5f3cd6bf
parallel : process system prompt once + configurable parameters + llama API
2023-09-19 17:00:42 +03:00
Georgi Gerganov
82e20e9ba0
parallel : remove new line from prompt
2023-09-19 13:54:41 +03:00
Georgi Gerganov
16090a5dde
parallel : fix sequence termination criteria
2023-09-19 13:29:29 +03:00
Georgi Gerganov
806d397c1a
parallel : try smaller batches when the KV cache is fragmented
2023-09-19 13:21:36 +03:00
Georgi Gerganov
36714e16d0
parallel : various improvements
2023-09-19 12:29:37 +03:00
Georgi Gerganov
467e307931
simple : fix token counting
2023-09-19 11:45:33 +03:00
Georgi Gerganov
daf4c6d360
llama : fix worst case graph build
2023-09-19 11:05:08 +03:00
Georgi Gerganov
fa0e677820
llama : extend batch API to select which logits to output
2023-09-19 00:24:13 +03:00
Georgi Gerganov
897caccdf4
fixes : speculative KV cache + llama worst-case graph
2023-09-18 22:32:28 +03:00
Georgi Gerganov
466b513851
parallel : disable hot-plug to avoid cache fragmentation
2023-09-18 21:34:20 +03:00
Georgi Gerganov
0161372b9a
parallel : example for serving multiple users in parallel
2023-09-18 20:37:28 +03:00
Georgi Gerganov
1f17ea631c
speculative : fix KV cache management
2023-09-18 19:01:20 +03:00
Georgi Gerganov
0cbf3bfef8
llama : add llama_kv_cache_shift_seq + no more context swaps
2023-09-18 18:10:43 +03:00
Georgi Gerganov
f015b26689
llama : more robust cell_max heuristic + wip shift
2023-09-18 17:15:58 +03:00
Cebtenzzre
8781013ef6
make : restore build-info.h dependency for several targets ( #3205 )
2023-09-18 10:03:53 -04:00
Georgi Gerganov
4d76d762ef
llama : extend llama_kv_cache API
2023-09-18 15:53:03 +03:00
Georgi Gerganov
9f42e75489
llama : add new llama_decode() API that works with llama_batch
2023-09-18 14:23:52 +03:00
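The `llama_decode()` / `llama_batch` commit above is the pivot for the batched-decoding work in this log. As a hypothetical Python mirror of the idea (not the actual C struct): each token in a batch carries its own position, its sequence id, and a flag requesting logits, so a single decode call can advance several independent sequences at once.

```python
# Toy sketch of the llama_batch idea: per-token position, sequence id, and
# a logits-requested flag, so one decode call serves multiple sequences.

from dataclasses import dataclass, field

@dataclass
class Batch:
    tokens: list = field(default_factory=list)   # token ids
    pos: list = field(default_factory=list)      # per-token positions
    seq_id: list = field(default_factory=list)   # owning sequence per token
    logits: list = field(default_factory=list)   # request logits for this token?

    def add(self, token, pos, seq_id, want_logits=False):
        self.tokens.append(token)
        self.pos.append(pos)
        self.seq_id.append(seq_id)
        self.logits.append(want_logits)

batch = Batch()
batch.add(token=101, pos=0, seq_id=0)
batch.add(token=102, pos=1, seq_id=0, want_logits=True)
batch.add(token=7,   pos=0, seq_id=1, want_logits=True)  # second sequence, same batch
print(len(batch.tokens), batch.seq_id)  # 3 tokens spanning sequences 0 and 1
```

The "extend batch API to select which logits to output" entry further down corresponds to the per-token logits flag here: only tokens that actually need sampling pay for the output projection.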
Georgi Gerganov
58bb5110ca
Merge branch 'master' into custom-attention-mask
2023-09-18 11:15:18 +03:00
Georgi Gerganov
d29e76937c
llama : unified KV cache + batch inference API
2023-09-18 11:08:15 +03:00
Georgi Gerganov
1fb033fd85
ggml : ggml_rope now takes a vector with positions instead of n_past
2023-09-17 21:17:10 +03:00
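The `ggml_rope` commit above is what makes both cache shifting and batched multi-sequence decoding possible: with an explicit position vector, tokens in one batch can sit at arbitrary positions instead of all deriving their position from a single shared `n_past` counter. A stdlib-only sketch of the mechanism (illustrative, not the ggml implementation):

```python
# RoPE with explicit per-token positions: consecutive pairs of the vector are
# rotated by angles proportional to that token's position.

import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of vec by angles determined by pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

# A batch where each token carries its own position:
positions = [5, 0, 3]                       # per-token, not one shared n_past
batch = [[1.0, 0.0, 1.0, 0.0] for _ in positions]
rotated = [rope(v, p) for v, p in zip(batch, positions)]
# pos == 0 leaves the vector unchanged (all rotation angles are zero):
print(rotated[1])  # [1.0, 0.0, 1.0, 0.0]
```

Because the rotation depends only on the position passed in, shifting a cached key to a new position amounts to applying the rotation for the position delta, which is the trick the KV-cache-shift commits at the top of this log rely on.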
goerch
b08e75baea
Fixing the last deviations from sentencepiece indicated by test-tokenizer-1 ( #3170 )
* Fix for #2721
* Reenable tokenizer test for LLaMa
* Add `console.cpp` dependency
* Fix dependency to `common`
* Fixing wrong fix.
* Make console usage platform specific
Work on compiler warnings.
* Adapting makefile
* Remove trailing whitespace
* Adapting the other parts of the makefile
* Fix typo.
* Fixing the last deviations from sentencepiece indicated by test-tokenizer-1
* Simplify logic
* Add missing change...
* Fix ugly compiler warning
* llama_tokenize should accept strings containing NUL now
* Adding huichen's test case
2023-09-16 13:41:33 +02:00