* llama : ggml-backend integration
* ggml-backend : add names to buffers
* fix unmap after loading
* batched-bench : add tensor_split param
* llama : check for null tensor_split
* ggml-backend : increase GGML_MAX_BACKENDS
* improve graph splitting, partial fix for --no-kv-offload
* cuda : add ggml-backend split buffer support
* cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available)
* ggml : fix null backend dereference (#4807)
* ggml : fix null backend dereference
* ggml : also check ggml_backend_is_cpu
* test-backend-ops : check buffer allocation failures
* llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row)
* ggml : fix mul_mat_id work size
* llama : rewrite session kv load/set without graphs
* minor
* llama : only initialize used backends, free backends on context free
* llama : abort ctx if cuda backend init fails
* llama : rewrite lora with ggml-backend and compute on CPU
ggml-ci
* llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer
* opencl : add ggml-backend buffer type
* cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)
* llama : on Metal, by default offload the full model
ggml-ci
* metal : page align the data ptr (#4854)
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* cuda : fix split buffer free
* address review comments
* llama-bench : add split-mode parameter
* fix whitespace
* opencl : fix double initialization
* server : add --split-mode parameter
* use async copy and compute to improve multi-gpu performance
ggml-ci
* use async memcpys to copy the graph outputs to the CPU
* fix opencl
* use a host buffer for the cpu compute buffer for faster copies to the gpu
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* server: added support for multiple api keys, added loading api keys from file
* minor: fix whitespace
* added file error handling to --api-key-file, changed code to better
reflect current style
* server: update README.md for --api-key-file
---------
Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
* added /health endpoint to the server
* added comments on the additional /health endpoint
* Better handling of server state
When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value.
* initialized server_state
* fixed a typo
* starting http server before initializing the model
* Update server.cpp
* Update server.cpp
* fixes
* fixes
* fixes
* made ServerState atomic and turned two-line spaces into one-line
* updated `server` readme to document the `/health` endpoint too
* used LOG_INFO after successful model loading
* added /health endpoint to the server
* added comments on the additional /health endpoint
* Better handling of server state
When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value.
* initialized server_state
* fixed a typo
* starting http server before initializing the model
* Update server.cpp
* Update server.cpp
* fixes
* fixes
* fixes
* made ServerState atomic and turned two-line spaces into one-line
* Changes to server to allow metadata override
* documentation
* flake.nix: expose full scope in legacyPackages
* flake.nix: rocm not yet supported on aarch64, so hide the output
* flake.nix: expose checks
* workflows: nix-ci: init; build flake outputs
* workflows: nix-ci: add a job for eval
* workflows: weekly `nix flake update`
* workflows: nix-flakestry: drop tag filters
...and add a job for flakehub.com
* workflows: nix-ci: add a qemu job for jetsons
* flake.nix: suggest the binary caches
* flake.lock: update
to a commit recently cached by nixpkgs-cuda-ci
---------
Co-authored-by: John <john@jLap.lan>
Co-authored-by: Someone Serge <sergei.kozlukov@aalto.fi>
The server currently schedules tasks using a sleep(5ms) busy loop. This
adds unnecessary latency since most sleep implementations do a round up
to the system scheduling quantum (usually 10ms). Other libc sleep impls
spin for smaller time intervals which results in the server's busy loop
consuming all available cpu. Having the explicit notify() / wait() code
also helps aid in the readability of the server code.
See mozilla-Ocho/llamafile@711344b
The default values for tfs_z and typical_p were being set to zero, which
caused the token candidates array to get shrunk down to one element thus
preventing any sampling. Note this only applies to OpenAI API compatible
HTTP server requests.
The solution is to use the default values that OpenAI documents, as well
as ensuring we use the llama.cpp defaults for the rest. I've tested this
change still ensures deterministic output by default. If a "temperature"
greater than 0 is explicitly passed, then output is unique each time. If
"seed" is specified in addition to "temperature" then the output becomes
deterministic once more.
See mozilla-Ocho/llamafile#117
See mozilla-Ocho/llamafile@9e4bf29
* Add API key authentication for enhanced server-client security
* server : to snake_case
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* * add multiprompt support
* * cleanup
* * more cleanup
* * remove atomicity of id_gen, and change lock_guard to unique_lock on completion requests
* * remove all references to mutex_multitasks
* Update examples/server/server.cpp
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update examples/server/server.cpp
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update examples/server/server.cpp
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update examples/server/server.cpp
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* * change to set
---------
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Add openai-compatible POST /v1/chat/completions API endpoint to server example
* fix code style
* Update server README.md
* Improve server README.md
* Fix server.cpp code style according to review
* server : some style changes
* server : indentation
* server : enable special tokens during tokenization by default
* server : minor code style
* server : change random string generator
* straightforward /v1/models endpoint
---------
Co-authored-by: kir-gadjello <111190790+kir-gadjello@users.noreply.github.com>
Co-authored-by: Tobi Lütke <tobi@Tobis-MacBook-Pro.local>
* gguf-py: gguf-dump: Respect --no-tensor flag in JSON mode.
* Respect add_bos_token GGUF metadata value
* gguf-py: Try to fix SpecialVocab giving up too easily for the Nth time
* Update server.cpp with min_p after it was introduced in https://github.com/ggerganov/llama.cpp/pull/3841
* Use spaces instead of tabs
* Update index.html.hpp after running deps.sh
* Fix test - fix line ending
* cmake : fix build when .git does not exist
* cmake : simplify BUILD_INFO target
* cmake : add missing dependencies on BUILD_INFO
* build : link against build info instead of compiling against it
* zig : make build info a .cpp source instead of a header
Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>
* cmake : revert change to CMP0115
---------
Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>
* Extend llama_kv_cache_seq_rm to allow matichng any sequence
* Replace llama_kv_cache_tokens_rm with llama_kv_cache_clear
Use llama_kv_cache_clear for cache clearing
Change calls to llama_kv_cache_tokens_rm that want to delete by position to use llama_kv_cache_seq_rm functionality
* added `llama_model_token_*` variants to all the `llama_token_*` functions.
* added `LLAMA_API`
* formatting
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* removed old `llama_token` functions
* changed 3 more functions to take in model
- `llama_token_get_text`
- `llama_token_get_score`
- `llama_token_get_type`
* added back docs
* fixed main.cpp
* changed token functions to use new model variants
* changed token functions to use new model variants
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* implementing parallel decoding in server example
* crash fixed
* save dev progress
* refactored sampling function
* completion endpoint working
* multiple client support
* grammar + no stream completion
* cached prompt support
* chat.mjs support cached prompt + some fixes
* server ui now support multiple clients
* unused change reverted
* fixed timings per slot
* add context swap
* add changes to README.md
* llava multimodal integration
* fixed tokens probs
* add multimodal input - alfa
* refactor code + remove unused comments + improved README.md
* fix compilation errors with llvm
* notify the user from server ui that multimodality is unavialable
* some ci fixes
* fix ci make build undefined ref errors
* fix long prompt than ctx proposed in #3639
* fixed premature end due stop word
* context shift fixed
* fix llava implementation
* sync README.md changes
* readme change
* update api like OpenAI
* multimodal support enabled by default
* fix make bui;d errors
* fix multiple clients
* fix zig build
* new sampling API
* latest changes of sampling API
* server : coding-style normalization
* server : coding-style normalization (part 2)
* server : remove beam-search functionality
* server : bug fix in ingest_images
n_tokens is incremented internally by llama_batch_add
* server : use refs + use llama_batch_clear()
* server : snake case
* server : minor sync
* added thread safe pipeline
* server : bach has to be allocated for n_parallel sequences
* server : no need for atomic int - already using mutex
* server : logs + minor code style
* server : fix multibyte handle in partial response (#3706)
* fix image load + view image in chat
* make : silence stb warnings
* clip : link to ggml, not to llama
* server : fix switch fallthrough
* server : fix crash in Debug on macOS (I have no idea why this fixes it!?)
* server : refactor ctx_sampling init + n_ctx + names
* server : bug fix for prompt caching
* Do not save/load image_data to localStorage
* editorconfig : new line in index.html
* server : completion requests remember slot_id
* Update readme to document multimodal in server
* server : minor style
* Update readme to document multimodal in server
* server : hide ctx_sampling->prev behind API (#3696)
* server : apply fix from #3722
* server : fix slot reuse
* server : add comment about changing slot_state to bool
---------
Co-authored-by: FSSRepo <go778sgt@gmail.com>
Co-authored-by: Damian Stewart <d@damianstewart.com>
Co-authored-by: Steward Garcia <57494570+FSSRepo@users.noreply.github.com>
Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>
Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>