Georgi Gerganov
755a9b2bf0
llama : add infill sampler ( #9896 )
...
ggml-ci
2024-10-15 16:35:33 +03:00
Georgi Gerganov
11ac9800af
llama : improve infill support and special token detection ( #9798 )
...
* llama : improve infill support
ggml-ci
* llama : add more FIM token strings
ggml-ci
* server : update prompt on slot restore (#9800 )
* gguf : deprecate old FIM token KVs
2024-10-12 08:21:51 +03:00
Georgi Gerganov
f4d2b8846a
llama : add reranking support ( #9510 )
...
* py : add XLMRobertaForSequenceClassification [no ci]
* py : fix scalar-tensor conversion [no ci]
* py : fix position embeddings chop [no ci]
* llama : read new cls tensors [no ci]
* llama : add classigication head (wip) [no ci]
* llama : add "rank" pooling type
ggml-ci
* server : add rerank endpoint
ggml-ci
* llama : aboud ggml_repeat during classification
* rerank : cleanup + comments
* server : accept /rerank endpoint in addition to /v1/rerank [no ci]
* embedding : parse special tokens
* jina : support v1 reranker
* vocab : minor style
ggml-ci
* server : initiate tests for later
ggml-ci
* server : add docs
* llama : add comment [no ci]
* llama : fix uninitialized tensors
* ci : add rerank tests
ggml-ci
* add reranking test
* change test data
* Update examples/server/server.cpp
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* add `--reranking` argument
* update server docs
* llama : fix comment [no ci]
ggml-ci
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-09-28 17:42:03 +03:00
Zhenwei Jin
6102037bbb
vocab : refactor tokenizer to reduce init overhead ( #9449 )
...
* refactor tokenizer
* llama : make llm_tokenizer more private
ggml-ci
* refactor tokenizer
* refactor tokenizer
* llama : make llm_tokenizer more private
ggml-ci
* remove unused files
* remove unused fileds to avoid unused filed build error
* avoid symbol link error
* Update src/llama.cpp
* Update src/llama.cpp
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-28 15:10:58 +03:00
nopperl
9a913110cf
llama : add support for Chameleon ( #8543 )
...
* convert chameleon hf to gguf
* add chameleon tokenizer tests
* fix lint
* implement chameleon graph
* add swin norm param
* return qk norm weights and biases to original format
* implement swin norm
* suppress image token output
* rem tabs
* add comment to conversion
* fix ci
* check for k norm separately
* adapt to new lora implementation
* fix layer input for swin norm
* move swin_norm in gguf writer
* add comment regarding special token regex in chameleon pre-tokenizer
* Update src/llama.cpp
Co-authored-by: compilade <git@compilade.net>
* fix punctuation regex in chameleon pre-tokenizer (@compilade)
Co-authored-by: compilade <git@compilade.net>
* fix lint
* trigger ci
---------
Co-authored-by: compilade <git@compilade.net>
2024-09-28 15:08:43 +03:00
Georgi Gerganov
31ac5834fe
llama : keep track of all EOG tokens in the vocab ( #9609 )
...
ggml-ci
2024-09-24 10:16:06 +03:00
Molly Sophia
8f1d81a0b6
llama : support RWKV v6 models ( #8980 )
...
* convert_hf_to_gguf: Add support for RWKV v6
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Add RWKV tokenization
* Fix build
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Do not use special tokens when matching in RWKV tokenizer
* Fix model loading
* Add (broken) placeholder graph builder for RWKV
* Add workaround for kv cache
* Add logits conversion to rwkv5
* Add rwkv5 layer norms
* Add time mix KVRG & correct merge mistake
* Add remaining time mix parameters
* Add time mix output loading
* Add placeholder llm_build_time_mix
* Fix build
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Load more tensors for rwkv v6
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Fix rwkv tokenizer
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* ggml: Add unary operator Exp
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* RWKV v6 graph building
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Add ``rescale_every_n_layers`` parameter
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Add ``wkv.head_size`` key for RWKV
so it doesn't reuse Mamba ssm parameters
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Fix offloading layers to CUDA
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Fix parallel inferencing for RWKV
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Remove trailing whitespaces
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* build_rwkv: Avoid using inplace operations
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* convert_hf_to_gguf: rwkv: Avoid using ``eval``
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* convert_hf_to_gguf: rwkv tokenizer: Don't escape sequences manually
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Update convert_hf_to_gguf.py
Co-authored-by: compilade <git@compilade.net>
* ggml: Add backward computation for unary op ``exp``
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Update convert_hf_to_gguf.py
Co-authored-by: compilade <git@compilade.net>
* Update convert_hf_to_gguf.py
Co-authored-by: compilade <git@compilade.net>
* Use MODEL_ARCH.RWKV6 instead of MODEL_ARCH.RWKV
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* build_rwkv6: Simplify graph
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: rwkv6: Detect model.type
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: rwkv6: Fix tensor loading for 7B/14B models
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: rwkv6: Fix group_norm assertion failure with Metal
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: rwkv6: Clean up
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: rwkv6: Add quantization tensor exclusion
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: rwkv6: Use the new advanced batch splits
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Update src/llama.cpp
Co-authored-by: compilade <git@compilade.net>
* llama: rwkv6: Use ``ggml_norm`` instead of ``ggml_group_norm``
Co-authored-by: compilade <git@compilade.net>
* llama: rwkv6: Apply code style and misc changes
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* converter: Use class name ``Rwkv6Model``
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: rwkv6: Make use of key ``feed_forward_length``
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: rwkv6: Add kv ``time_mix_extra_dim`` and ``time_decay_extra_dim``
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* converter: Match ``new_name`` instead of ``name`` for float32 explicit tensors
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: rwkv6: Keep ``time_mix_w1/w2`` as F32
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: rwkv6: Remove unused nodes
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: rwkv6: Apply code format changes
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: rwkv6: Add lora for some supported tensors
Currently att.key/receptance/value/gate/output, ffn.receptance/key/value, as well as head.weight
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* rwkv : speed-up tokenization using trie
* minor : style + indentation
* llama: rwkv6: Avoid division by zero
Co-authored-by: compilade <git@compilade.net>
* ggml: rwkv_wkv: Avoid copying the state
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
---------
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Layl Bongers <3094382+LaylBongers@users.noreply.github.com>
Co-authored-by: compilade <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-01 17:38:17 +03:00
Daniel Bevenius
49271efbaf
llama : fix typo in xcda_array_view comment [no ci] ( #9132 )
2024-08-31 10:50:22 +03:00
Daniel Bevenius
8455340b87
llama : std::move llm_bigram_bpe from work_queue ( #9062 )
...
* llama : std::move llm_bigram_bpe from work_queue
This commit updates the retrieval of llm_bigram_bpe objects from
work_queue.top() by using std::move.
The motivation for this is to avoid the copying of the std::string
`text` member of the llm_bigram_bpe struct.
* squash! llama : std::move llm_bigram_bpe from work_queue
Introduced a MovablePriorityQueue class to allow moving elements
out of the priority queue for llm_bigram_bpe.
* squash! llama : std::move llm_bigram_bpe from work_queue
Rename MovablePriorityQueue to lama_priority_queue.
* squash! llama : std::move llm_bigram_bpe from work_queue
Rename lama_priority_queue -> llama_priority_queue.
2024-08-21 10:32:58 +03:00
Minsoo Cheong
c679e0cb5c
llama : add EXAONE model support ( #9025 )
...
* add exaone model support
* add chat template
* fix whitespace
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add ftype
* add exaone pre-tokenizer in `llama-vocab.cpp`
Co-Authored-By: compilade <113953597+compilade@users.noreply.github.com>
* fix lint
Co-Authored-By: compilade <113953597+compilade@users.noreply.github.com>
* add `EXAONE` to supported models in `README.md`
* fix space
Co-authored-by: compilade <git@compilade.net>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: compilade <113953597+compilade@users.noreply.github.com>
Co-authored-by: compilade <git@compilade.net>
2024-08-16 09:35:18 +03:00
Zhenwei Jin
4af8420afb
common : remove duplicate function llama_should_add_bos_token ( #8778 )
2024-08-15 10:23:23 +03:00
Esko Toivonen
6bda7ce6c3
llama : add pre-tokenizer regexes for BLOOM and gpt3-finnish ( #8850 )
2024-08-15 10:17:12 +03:00
Georgi Gerganov
45a55b91aa
llama : better replace_all (cont) ( #8926 )
...
* llama : better replace_all (cont)
ggml-ci
* code : deduplicate replace_all
ggml-ci
2024-08-09 18:23:52 +03:00
Douglas Hanley
cdd1889de6
convert : add support for XLMRoberta embedding models ( #8658 )
...
* add conversion for bge-m3; small fix in unigram tokenizer
* clean up and simplify XLMRoberta conversion
2024-08-06 10:20:54 +03:00
fairydreaming
d3f0c7166a
Stop the generation when <|eom_id|> token is encountered - needed for Llama 3.1 tool call support ( #8858 )
...
* gguf-py, llama : add constants and methods related to Llama-3.1 <|eom_id|> token
* llama : find Llama-3.1 <|eom_id|> token id during vocab loading
* llama-vocab : add Llama-3.1 <|eom_id|> token to the set of tokens stopping the generation
---------
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-08-05 09:38:01 +02:00
slaren
2b1f616b20
ggml : reduce hash table reset cost ( #8698 )
...
* ggml : reduce hash table reset cost
* fix unreachable code warnings after GGML_ASSERT(false)
* GGML_ASSERT(false) -> GGML_ABORT("fatal error")
* GGML_ABORT use format string
2024-07-27 04:41:55 +02:00
Georgi Gerganov
938943cdbf
llama : move vocab, grammar and sampling into separate files ( #8508 )
...
* llama : move sampling code into llama-sampling
ggml-ci
* llama : move grammar code into llama-grammar
ggml-ci
* cont
ggml-ci
* cont : pre-fetch rules
* cont
ggml-ci
* llama : deprecate llama_sample_grammar
* llama : move tokenizers into llama-vocab
ggml-ci
* make : update llama.cpp deps [no ci]
* llama : redirect external API to internal APIs
ggml-ci
* llama : suffix the internal APIs with "_impl"
ggml-ci
* llama : clean-up
2024-07-23 13:10:17 +03:00