Commit Graph

57 Commits

Author SHA1 Message Date
Pierrick Hymbert
d52d7819b8
server: concurrency fix + monitoring - add /metrics prometheus compatible endpoint (#5708)
* server: monitoring - add /metrics prometheus compatible endpoint

* server: concurrency issue, when 2 task are waiting for results, only one call thread is notified

* server: metrics - move to a dedicated struct
2024-02-25 13:49:43 +01:00
Pierrick Hymbert
1ecea255eb
server: health: fix race condition on slots data using tasks queue (#5634)
* server: health: fix race condition on slots data using tasks queue

* server: health:
    * include_slots only if slots_endpoint
    * fix compile warning task.target_id not initialized.
2024-02-21 15:47:48 +01:00
Xuan Son Nguyen
9c405c9f9a
Server: use llama_chat_apply_template (#5593)
* server: use llama_chat_apply_template

* server: remove trailing space

* server: fix format_chat

* server: fix help message

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server: fix formatted_chat

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-20 15:58:27 +01:00
Daniel Hiltgen
66c1968f7a
server : graceful server shutdown (#5244)
This updates the server queue to support graceful shutdown of the server on signals.
2024-02-18 18:23:16 +02:00
Xuan Son Nguyen
907e08c110
server : add llama2 chat template (#5425)
* server: add mistral chat template

* server: fix typo

* server: rename template mistral to llama2

* server: format_llama2: remove BOS

* server: validate "--chat-template" argument

* server: clean up using_chatml variable

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2024-02-11 12:16:22 +02:00
Georgi Gerganov
753eafed0e
sync : ggml 2024-01-27 17:00:24 +02:00
Xuan Son Nguyen
48c857aa10
server : refactored the task processing logic (#5065)
* server: add llama_server_queue struct

* server: add llama_server_response_event

* server: add comments

* server: move all mutexes away from server.cpp

* server: correct multitask response

* server: only add back deferred tasks when one slot is available

* server: fix a race condition cause by "request_completion"
2024-01-26 14:42:20 +02:00