* An empty tool_call_id is better than none!
* sync: minja (tool call name optional https://github.com/google/minja/pull/36)
* Force-disable parallel_tool_calls if template doesn't support it
* More debug logs
* Llama 3.x tools: accept / trigger on more varied spaced outputs
* Fix empty content for functionary v3.2 tool call
* Add proper tool call docs to server README
* readme: function calling *is* supported now
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This commit updates the help text for the metrics `requests_processing`
and `requests_deferred` to be more grammatically correct.
Currently the returned metrics look like this:
```console
\# HELP llamacpp:requests_processing Number of request processing.
\# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
\# HELP llamacpp:requests_deferred Number of request deferred.
\# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0
```
With this commit, the metrics will look like this:
```console
\# HELP llamacpp:requests_processing Number of requests processing.
\# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
\# HELP llamacpp:requests_deferred Number of requests deferred.
\# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0
```
This is also consistent with the description of the metrics in the
server examples [README.md](https://github.com/ggerganov/llama.cpp/tree/master/examples/server#get-metrics-prometheus-compatible-metrics-exporter).
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
This commit replaces the two usages of `std::bind` in favor of lambdas for
the callback functions for `callback_new_task` and
`callback_update_slots`.
The motivation for this changes is consistency with the rest of the code
in server.cpp (lambdas are used for all other callbacks/handlers). Also
lambdas are more readable (perhaps this is subjective) but also they are
recommended over `std::bind` in modern C++.
Ref: https://github.com/LithoCoders/dailycpp/blob/master/EffectiveModernC%2B%2B/chapter6/Item34_Prefer_lambdas_to_std::bind.md
This commit updates some of JSON snippets in README.md file and
removes the `json` language tag from the code blocks.
The motivation for this changes is that if there is invalid json in a
code snippet these are highlighted in red which can make it somewhat
difficult to read and can be a little distracting.
* add /apply-template endpoint to server
* remove unnecessary line
* add /apply-template documentation
* return only "prompt" field in /apply-template
* use suggested idea instead of my overly verbose way
* server : update auto gen files comments
This commit updates the 'auto generated files' comments in server.cpp
and removes `deps.sh` from the comment.
The motivation for this change is that `deps.sh` was removed in
Commit 91c36c269b ("server : (web ui)
Various improvements, now use vite as bundler (#10599)").
* squash! server : update auto gen files comments [no ci]
Move comments about file generation to README.md.
* squash! server : update auto gen files comments [no ci]
Remove the comments in server.cpp that mention that information
can be found in the README.md file.
The test_completion_stream_with_openai_library() function is actually with stream=False by default, and test_completion_with_openai_library() with stream=True
* webui : put DeepSeek R1 CoT in a collapsible <details> element
* webui: refactor split
* webui: don't use regex to split cot and response
* webui: format+qol
* webui: no loading icon if the model isn't generating
* ui fix, add configs
* add jsdoc types
* only filter </think> for assistant msg
* build
* update build
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* server : add tooltips to settings and themes btn
This commit adds tooltips to the settings and themes buttons in the
webui. The tooltip will be displayed below the actual buttons when
hovered over.
The motivation for this change is to clarify the purpose of the themes
button.
* squash! server : add tooltips to settings and themes btn
This commit adds a tooltip to the '...' button when a chat has been
started. The tooltip is "Chat options" which think could be a good
description as the dropdown contains options to delete or download the
current chat.
* rm tooltip for 3 dots button
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* server/bench:
- support openAI streaming standard output with [DONE]\n\n
- export k6 raw results in csv
- fix too many tcp idle connection in tcp_wait
- add metric time to emit first token
* server/bench:
- fix when prometheus not started
- wait for server to be ready before starting bench
* more perfo with llamafile tinyblas on x86_64.
- add bf16 suport
- change dispache strategie (thanks:
https://github.com/ikawrakow/ik_llama.cpp/pull/71 )
- reduce memory bandwidth
simple tinyblas dispache and more cache freindly
* tinyblas dynamic dispaching
* sgemm: add M blocs.
* - git 2.47 use short id of len 9.
- show-progress is not part of GNU Wget2
* remove not stable test
* server: avoid overwriting Authorization header
If no API key is set, leave the Authorization header as is. It may be
used by another part of the Web stack, such as an authenticating proxy.
Fixes https://github.com/ggerganov/llama.cpp/issues/10854
* rebuild
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* server : add "tokens" output
ggml-ci
* server : output embeddings for all tokens when pooling = none
ggml-ci
* server : update readme [no ci]
* server : fix spacing [no ci]
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* server : be explicit about the pooling type in the tests
ggml-ci
* server : update /embeddings and /v1/embeddings endpoints
ggml-ci
* server : do not normalize embeddings when there is no pooling
ggml-ci
* server : update readme
ggml-ci
* server : fixes
* tests : update server tests
ggml-ci
* server : update readme [no ci]
* server : remove rebase artifact
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* server : add "tokens" output
ggml-ci
* server : update readme
ggml-ci
* server : return tokens ids only if requested
ggml-ci
* tests : improve "tokens" type check
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* server : remove "tokens" from the OAI endpoint
ggml-ci
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>