* * add multiprompt support
* * cleanup
* * more cleanup
* * remove atomicity of id_gen, and change lock_guard to unique_lock on completion requests
* * remove all references to mutex_multitasks
* Update examples/server/server.cpp
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update examples/server/server.cpp
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update examples/server/server.cpp
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update examples/server/server.cpp
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* * change to set
---------
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* cmake : fix joining of REAL_GIT_DIR
* fix includes with help from include-what-you-use
* make : remove unneeded deps and add test-rope target
* fix C includes in C++ source files
* Revert "fix includes with help from include-what-you-use"
This reverts commit 635e9fadfd.
* ShareGPT4 compatibility (vision encoder only loading)
Load only a CLIP vision encoder (as supplied by ShareGPT finetunes)
Corrects the argument parsing for --img_mean and --img_std (which were previously not parsed but attempted to access)
Defines defaults for img_mean and img_std which are equal to the llava 1.5 CLIP encoder, so you do not have to provide them
* Update convert-image-encoder-to-gguf.py
* llama: fix alignment of general.name in print meta
This commit fixes the alignment of the general.name field in the
llm_load_print_meta function.
Currently the output looks like this:
```console
llm_load_print_meta: model ftype = mostly Q4_0
llm_load_print_meta: model params = 13.02 B
llm_load_print_meta: model size = 6.86 GiB (4.53 BPW)
llm_load_print_meta: general.name = LLaMA v2
```
And with this commit it looks like this:
```console
llm_load_print_meta: model ftype = mostly Q4_0
llm_load_print_meta: model params = 13.02 B
llm_load_print_meta: model size = 6.86 GiB (4.53 BPW)
llm_load_print_meta: general.name = LLaMA v2
```
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* llama: fix alignment of special tokens
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
---------
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Typical sampling was broken because after copying new_candidates into canditates, the "sorted" bool is left at "true", but the new data is no longer sorted according to probability. Patch to set "sorted" to false.
Test: Generating with temp=0.0001 (approx. argmax) should generate the same sequence at typical>=1.0 and typical=0.9999 (approx. disabled, but enters the typical sampling codepath).
* fix oai proxy
fix generation not stoped while bot stop talking in chat mode
fix possible `slot_id` not exist
response for cors (and pre flight)
* oai proxy: workaround for some client (such as Chatbox)
* use stop as separator to replace hardcoded `\n`
* ggml : use blas even if src0 is not F32
* llama : use n_threads_batch only when n_tokens >= 32
ggml-ci
* llama : revert n_threads_batch logic
ggml-ci
* copy to llama.cpp as subdir
* attempt enabling metal, fails
* ggml metal compiles!
* Update README.md
* initial conversion to new format, utf8 errors?
* bug fixes, but now has an invalid memory access :(
* added O3, now has insufficient memory access
* begin sync with master
* update to match latest code, new errors
* fixed it!
* fix for loop conditionals, increase result size
* fix current workflow errors
* attempt a llama.swiftui workflow
* Update .github/workflows/build.yml
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Add openai-compatible POST /v1/chat/completions API endpoint to server example
* fix code style
* Update server README.md
* Improve server README.md
* Fix server.cpp code style according to review
* server : some style changes
* server : indentation
* server : enable special tokens during tokenization by default
* server : minor code style
* server : change random string generator
* straightforward /v1/models endpoint
---------
Co-authored-by: kir-gadjello <111190790+kir-gadjello@users.noreply.github.com>
Co-authored-by: Tobi Lütke <tobi@Tobis-MacBook-Pro.local>
* llama : keep track of used KV cells + better KV cache management
* llama : zero KV cache used upon clear
ggml-ci
* llama : allow exporting a view of the KV cache (#4180)
* Allow exporting a view of the KV cache
* Allow dumping the sequences per cell in common
* Track max contiguous cells value and position as well
* Fix max contiguous empty cells index calculation
Make dump functions deal with lengths or sequences counts > 10 better
* Fix off by one error in dump_kv_cache_view
* Add doc comments for KV cache view functions
Eliminate cell sequence struct; use llama_seq_id directly
Minor cleanups
* common : add -dkvc arg for enabling kv cache dumps
---------
Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
* Update README.md
* Update README.md
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
---------
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Disabled rules:
* E203 Whitespace before ':' - disabled because we often use 'C' Style where values are aligned
* E211 Whitespace before '(' (E211) - disabled because we often use 'C' Style where values are aligned
* E221 Multiple spaces before operator - disabled because we often use 'C' Style where values are aligned
* E225 Missing whitespace around operator - disabled because it's broken so often it seems like a standard
* E231 Missing whitespace after ',', ';', or ':' - disabled because we often use 'C' Style where values are aligned
* E241 Multiple spaces after ',' - disabled because we often use 'C' Style where values are aligned
* E251 Unexpected spaces around keyword / parameter equals - disabled because it's broken so often it seems like a standard
* E261 At least two spaces before inline comment - disabled because it's broken so often it seems like a standard
* E266 Too many leading '#' for block comment - sometimes used as "section" separator
* E501 Line too long - disabled because it's broken so often it seems like a standard
* E701 Multiple statements on one line (colon) - broken only in convert.py when defining abstract methods (we can use# noqa instead)
* E704 Multiple statements on one line - broken only in convert.py when defining abstract methods (we can use# noqa instead)
* Support special tokens and not adding BOS to prompt in speculative
* Adapt to new should_add_bos function
* Ensure tgt and dft have same add_bos setting