2024-03-07 10:41:53 +01:00
#include "utils.hpp"
2023-05-21 19:51:18 +02:00
#include "common.h"
json-schema-to-grammar improvements (+ added to server) (#5978)
* json: fix arrays (disallow `[,1]`)
* json: support tuple types (`[number, string]`)
* json: support additionalProperties (`{[k: string]: [string,number][]}`)
* json: support required / optional properties
* json: add support for pattern
* json: resolve $ref (and support https schema urls)
* json: fix $ref resolution
* json: support union types (mostly for nullable types I think)
* json: support allOf + nested anyOf
* json: support any (`{}` or `{type: object}`)
* json: fix merge
* json: temp fix for escapes
* json: spaces in output and unrestricted output spaces
* json: add typings
* json: fix typo
* Create ts-type-to-grammar.sh
* json: fix _format_literal (json.dumps already escapes quotes)
* json: merge lit sequences and handle negatives
{"type": "string", "pattern": "^({\"question\": \"[^\"]+\", \"response\": \"[^\"]+\"}\\n)+$"}
* json: handle pattern repetitions
* Update json-schema-to-grammar.mjs
* Create regex-to-grammar.py
* json: extract repeated regexp patterns to subrule
* Update json-schema-to-grammar.py
* Update json-schema-to-grammar.py
* Update json-schema-to-grammar.py
* json: handle schema from pydantic Optional fields
* Update json-schema-to-grammar.py
* Update json-schema-to-grammar.py
* Update ts-type-to-grammar.sh
* Update ts-type-to-grammar.sh
* json: simplify nullable fields handling
* json: accept duplicate identical rules
* json: revert space to 1 at most
* json: reuse regexp pattern subrules
* json: handle uuid string format
* json: fix literal escapes
* json: add --allow-fetch
* json: simplify range escapes
* json: support negative ranges in patterns
* Delete commit.txt
* json: custom regex parser, adds dot support & JS-portable
* json: rm trailing spaces
* Update json-schema-to-grammar.mjs
* json: updated server & chat `( cd examples/server && ./deps.sh )`
* json: port fixes from mjs to python
* Update ts-type-to-grammar.sh
* json: support prefixItems alongside array items
* json: add date format + fix uuid
* json: add date, time, date-time formats
* json: preserve order of props from TS defs
* json: port schema converter to C++, wire in ./server
* json: nits
* Update json-schema-to-grammar.cpp
* Update json-schema-to-grammar.cpp
* Update json-schema-to-grammar.cpp
* json: fix mjs implementation + align outputs
* Update json-schema-to-grammar.mjs.hpp
* json: test C++, JS & Python versions
* json: nits + regen deps
* json: cleanup test
* json: revert from c++17 to 11
* json: nit fixes
* json: dirty include for test
* json: fix zig build
* json: pass static command to std::system in tests (fixed temp files)
* json: fix top-level $refs
* json: don't use c++20 designated initializers
* nit
* json: basic support for reserved names `{number:{number:{root:number}}}`
* Revamp test cmake to allow args (WORKING_DIRECTORY needed for JSON test)
* json: re-ran server deps.sh
* json: simplify test
* json: support mix of additional props & required/optional
* json: add tests for some expected failures
* json: fix type=const in c++, add failure expectations for non-str const&enum
* json: test (& simplify output of) empty schema
* json: check parsing in test + fix value & string refs
* json: add server tests for OAI JSON response_format
* json: test/fix top-level anyOf
* json: improve grammar parsing failures
* json: test/fix additional props corner cases
* json: fix string patterns (was missing quotes)
* json: ws nit
* json: fix json handling in server when there's no response_format
* json: catch schema conversion errors in server
* json: don't complain about unknown format type in server if unset
* json: cleaner build of test
* json: create examples/json-schema-pydantic-example.py
* json: fix date pattern
* json: move json.hpp & json-schema-to-grammar.{cpp,h} to common
* json: indent 4 spaces
* json: fix naming of top-level c++ function (+ drop unused one)
* json: avoid using namespace std
* json: fix zig build
* Update server.feature
* json: iostream -> fprintf
* json: space before & refs for consistency
* json: nits
2024-03-21 12:50:43 +01:00
#include "json-schema-to-grammar.h"
2023-05-21 19:51:18 +02:00
#include "llama.h"
2024-05-08 21:53:08 +02:00
// Change JSON_ASSERT from assert() to GGML_ASSERT:
#define JSON_ASSERT GGML_ASSERT
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort: 24 PRs were closed in the submitter's repo alone, with over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
#include "json.hpp"
2024-08-16 17:19:05 +02:00
// mime type for sending response
#define MIMETYPE_JSON "application/json; charset=utf-8"
2023-06-17 13:53:04 +02:00
2023-07-04 16:05:27 +02:00
// auto generated files (update with ./deps.sh)
2024-06-01 21:31:48 +02:00
#include "colorthemes.css.hpp"
#include "style.css.hpp"
#include "theme-beeninorder.css.hpp"
#include "theme-ketivah.css.hpp"
#include "theme-mangotango.css.hpp"
#include "theme-playground.css.hpp"
#include "theme-polarnight.css.hpp"
#include "theme-snowstorm.css.hpp"
2023-07-04 16:05:27 +02:00
#include "index.html.hpp"
2024-06-01 21:31:48 +02:00
#include "index-new.html.hpp"
2023-07-04 16:05:27 +02:00
#include "index.js.hpp"
#include "completion.js.hpp"
2024-06-01 21:31:48 +02:00
#include "system-prompts.js.hpp"
#include "prompt-formats.js.hpp"
2023-08-15 00:14:14 +02:00
#include "json-schema-to-grammar.mjs.hpp"
2023-07-04 16:05:27 +02:00
2024-03-07 10:41:53 +01:00
#include <atomic>
2023-10-22 21:53:08 +02:00
#include <chrono>
2023-12-29 15:24:12 +01:00
#include <condition_variable>
2024-03-07 10:41:53 +01:00
#include <cstddef>
#include <mutex>
#include <thread>
2024-02-18 17:23:16 +01:00
#include <signal.h>
2024-03-09 10:57:09 +01:00
#include <memory>
2024-09-02 17:11:51 +02:00
#include <unordered_set>
#include <unordered_map>
#include <deque>
2023-09-01 15:34:50 +02:00
2024-03-22 14:07:44 +01:00
using json = nlohmann::ordered_json;
2023-06-17 13:53:04 +02:00
2024-01-26 13:42:20 +01:00
bool server_verbose = false;
2024-02-25 13:50:32 +01:00
bool server_log_json = true;
2023-07-02 23:38:44 +02:00
2024-02-29 21:42:11 +01:00
enum stop_type {
2024-03-07 10:41:53 +01:00
    STOP_TYPE_FULL,
    STOP_TYPE_PARTIAL,
2023-06-17 13:53:04 +02:00
};
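The two stop types above can be illustrated with a small matcher. This is a minimal sketch, not the server's implementation: `stop_kind_sketch` and `find_stop_sketch` are hypothetical names, and it assumes that a full match means the stop string occurs anywhere in the generated text, while a partial match means the text ends with a proper prefix of the stop string (so streamed output may need to be held back until the next token arrives).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for stop_type so the sketch compiles on its own.
enum stop_kind_sketch {
    STOP_KIND_FULL,
    STOP_KIND_PARTIAL,
};

// Returns the offset in `text` where the stop match begins,
// or std::string::npos when there is no match of the requested kind.
static size_t find_stop_sketch(const std::string & text,
                               const std::string & stop,
                               stop_kind_sketch kind) {
    if (stop.empty()) {
        return std::string::npos;
    }
    if (kind == STOP_KIND_FULL) {
        // the whole stop string appears somewhere in the text
        return text.find(stop);
    }
    // partial: the text ends with a proper prefix of the stop string;
    // try the longest candidate prefix first
    for (size_t len = std::min(stop.size() - 1, text.size()); len > 0; --len) {
        if (text.compare(text.size() - len, len, stop, 0, len) == 0) {
            return text.size() - len;
        }
    }
    return std::string::npos;
}
```

With the stop string `</s>`, the text `Hello</` has no full match but does have a partial match starting at offset 5, which is the point up to which it would be safe to stream.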
2024-09-06 23:21:29 +02:00
// state diagram: https://github.com/ggerganov/llama.cpp/pull/9283
2024-02-29 21:42:11 +01:00
enum slot_state {
2024-03-07 10:41:53 +01:00
    SLOT_STATE_IDLE,
2024-09-06 23:21:29 +02:00
    SLOT_STATE_PROCESSING_PROMPT,
    SLOT_STATE_DONE_PROMPT,
    SLOT_STATE_GENERATING,
2024-03-07 10:41:53 +01:00
};
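The slot states read like a simple lifecycle. The sketch below encodes one plausible set of transitions as a checker; the transition table is an assumption inferred from the state names alone (the linked state diagram is the authoritative source), and `slot_state_sketch` and `transition_allowed_sketch` are hypothetical names.

```cpp
#include <cassert>

// Hypothetical stand-in for slot_state so the sketch compiles on its own.
enum slot_state_sketch {
    SKETCH_IDLE,
    SKETCH_PROCESSING_PROMPT,
    SKETCH_DONE_PROMPT,
    SKETCH_GENERATING,
};

// Assumed transitions (not taken from the server source):
//   IDLE -> PROCESSING_PROMPT -> DONE_PROMPT -> GENERATING
// plus a release back to IDLE from any state (finish, cancel, error).
static bool transition_allowed_sketch(slot_state_sketch from, slot_state_sketch to) {
    if (to == SKETCH_IDLE) {
        return true;
    }
    switch (from) {
        case SKETCH_IDLE:              return to == SKETCH_PROCESSING_PROMPT;
        case SKETCH_PROCESSING_PROMPT: return to == SKETCH_DONE_PROMPT;
        case SKETCH_DONE_PROMPT:       return to == SKETCH_GENERATING;
        case SKETCH_GENERATING:        return false;
    }
    return false;
}
```

A checker like this is mainly useful in debug assertions, to catch a slot that skips a phase.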
enum server_state {
    SERVER_STATE_LOADING_MODEL, // Server is starting up, model not fully loaded yet
    SERVER_STATE_READY,         // Server is ready and model is loaded
};
enum server_task_type {
    SERVER_TASK_TYPE_COMPLETION,
    SERVER_TASK_TYPE_CANCEL,
    SERVER_TASK_TYPE_NEXT_RESPONSE,
2024-04-08 14:43:30 +02:00
    SERVER_TASK_TYPE_METRICS,
    SERVER_TASK_TYPE_SLOT_SAVE,
    SERVER_TASK_TYPE_SLOT_RESTORE,
    SERVER_TASK_TYPE_SLOT_ERASE,
2024-08-06 17:33:39 +02:00
    SERVER_TASK_TYPE_SET_LORA,
2024-03-07 10:41:53 +01:00
};
2024-09-02 17:11:51 +02:00
enum server_task_cmpl_type {
    SERVER_TASK_CMPL_TYPE_NORMAL,
    SERVER_TASK_CMPL_TYPE_EMBEDDING,
    SERVER_TASK_CMPL_TYPE_INFILL,
};
2024-03-07 10:41:53 +01:00
struct server_task {
    int id = -1; // to be filled by server_queue
2024-09-02 17:11:51 +02:00
    int id_target = -1; // used by SERVER_TASK_TYPE_CANCEL
2024-03-07 10:41:53 +01:00
    server_task_type type;
    json data;
2024-09-02 17:11:51 +02:00
    server_task_cmpl_type cmpl_type = SERVER_TASK_CMPL_TYPE_NORMAL;
    // utility function
    static std::unordered_set<int> get_list_id(const std::vector<server_task> & tasks) {
        std::unordered_set<int> ids(tasks.size());
        for (size_t i = 0; i < tasks.size(); i++) {
            ids.insert(tasks[i].id);
        }
        return ids;
    }
2024-03-07 10:41:53 +01:00
};
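A possible use of `get_list_id` is to track which task ids a multi-task request is still waiting on. The following is a rough sketch under that assumption: `task_sketch` is a stripped-down stand-in for `server_task` (only the `id` field), and `mark_done_sketch` is a hypothetical helper that does not appear in the source.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Stripped-down stand-in for server_task: only the id is kept.
struct task_sketch {
    int id = -1;

    // Same shape as the utility above: collect the ids of a batch of tasks.
    // The constructor argument is a bucket-count hint, not an element count.
    static std::unordered_set<int> get_list_id(const std::vector<task_sketch> & tasks) {
        std::unordered_set<int> ids(tasks.size());
        for (size_t i = 0; i < tasks.size(); i++) {
            ids.insert(tasks[i].id);
        }
        return ids;
    }
};

// Hypothetical helper: remove a finished id from the waiting set and
// report whether every task in the batch has now completed.
static bool mark_done_sketch(std::unordered_set<int> & waiting, int id) {
    waiting.erase(id);
    return waiting.empty();
}
```

Because the result is a set, duplicate ids collapse into one entry, so the caller waits once per distinct task.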
struct server_task_result {
    int id = -1;
    json data;
    bool stop;
    bool error;
};
2024-02-29 21:42:11 +01:00
struct slot_params {
    bool stream = true;
    bool cache_prompt = false; // remember the prompt to avoid reprocessing all prompt
2023-06-17 13:53:04 +02:00
2024-02-29 21:42:11 +01:00
    int32_t n_keep = 0; // number of tokens to keep from initial prompt
2024-03-26 09:47:43 +01:00
    int32_t n_discard = 0; // number of tokens after n_keep that may be discarded when shifting context, 0 defaults to half
2024-02-29 21:42:11 +01:00
    int32_t n_predict = -1; // new tokens to predict
2023-07-02 23:38:44 +02:00
2024-02-29 21:42:11 +01:00
    std::vector<std::string> antiprompt;
2023-07-02 23:38:44 +02:00
2024-02-29 21:42:11 +01:00
    json input_prefix;
    json input_suffix;
};
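The comments on `n_keep` and `n_discard` describe a context shift: the first `n_keep` tokens stay, and some of the tokens after them are dropped, with 0 meaning "half". The sketch below works through that arithmetic on a plain token vector. It is an illustration under stated assumptions, not the server's code: `shift_plan_sketch`, `plan_shift_sketch` and `apply_shift_sketch` are hypothetical names, and "half" is assumed to mean half of the tokens that follow the kept prefix.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct shift_plan_sketch {
    int32_t n_keep;    // tokens preserved at the start
    int32_t n_discard; // tokens dropped right after the kept prefix
};

// Decide how many tokens to drop. A requested value of 0 falls back to
// half of the tokens after the kept prefix (assumed reading of "defaults to half").
static shift_plan_sketch plan_shift_sketch(int32_t n_past, int32_t n_keep, int32_t n_discard_param) {
    const int32_t n_left = std::max<int32_t>(0, n_past - n_keep);
    const int32_t wanted = n_discard_param > 0 ? n_discard_param : n_left / 2;
    return { n_keep, std::min(wanted, n_left) };
}

// Drop tokens [n_keep, n_keep + n_discard) from the cached sequence.
// Assumes n_keep + n_discard <= tokens.size().
static void apply_shift_sketch(std::vector<int32_t> & tokens, const shift_plan_sketch & plan) {
    tokens.erase(tokens.begin() + plan.n_keep,
                 tokens.begin() + plan.n_keep + plan.n_discard);
}
```

For example, with 100 cached tokens, `n_keep = 10` and `n_discard = 0`, the plan drops 45 tokens and leaves 55, with the first 10 untouched.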
struct server_slot {
2023-10-22 21:53:08 +02:00
    int id;
2024-03-07 10:41:53 +01:00
    int id_task = -1;
2024-09-02 17:11:51 +02:00
    // the index relative to completion multi-task request
    size_t index = 0;
2023-06-17 13:53:04 +02:00
2023-10-22 21:53:08 +02:00
    struct slot_params params;
2023-06-17 13:53:04 +02:00
2024-03-07 10:41:53 +01:00
    slot_state state = SLOT_STATE_IDLE;
2023-06-17 13:53:04 +02:00
2023-10-22 21:53:08 +02:00
    // used to determine the slot that has been used the longest
    int64_t t_last_used = -1;
2023-10-20 20:07:23 +02:00
2023-10-22 21:53:08 +02:00
    // generation props
    int32_t n_ctx = 0; // context size per slot
    int32_t n_past = 0;
    int32_t n_decoded = 0;
    int32_t n_remaining = -1;
    int32_t i_batch = -1;
2024-03-13 18:54:21 +01:00
    int32_t n_predict = -1; // TODO: disambiguate from params.n_predict
2023-10-20 20:07:23 +02:00
2024-02-29 21:42:11 +01:00
    int32_t n_prompt_tokens = 0;
    int32_t n_prompt_tokens_processed = 0;
2023-10-22 21:53:08 +02:00
2024-06-12 13:42:29 +02:00
    json prompt; // can be either a string, array of strings or array of token ids
2024-03-07 10:41:53 +01:00
    // when a task is submitted, we first tokenize the prompt and store it here
    std::vector<llama_token> prompt_tokens;
2023-10-22 21:53:08 +02:00
    std::string generated_text;
    std::vector<llama_token> cache_tokens;
    std::vector<completion_token_output> generated_token_probs;
2023-06-17 13:53:04 +02:00
2024-09-02 17:11:51 +02:00
    server_task_cmpl_type cmpl_type = SERVER_TASK_CMPL_TYPE_NORMAL;
2023-10-22 21:53:08 +02:00
    bool has_next_token = true;
2024-03-07 10:41:53 +01:00
    bool truncated = false;
    bool stopped_eos = false;
    bool stopped_word = false;
    bool stopped_limit = false;
2023-10-22 21:53:08 +02:00
2023-11-25 10:29:06 +01:00
    bool oaicompat = false;
2024-03-07 10:41:53 +01:00
    std::string oaicompat_model;
2023-06-17 13:53:04 +02:00
std : : string stopping_word ;
2023-10-22 21:53:08 +02:00
// sampling
2024-03-21 12:50:43 +01:00
json json_schema ;
2023-10-22 21:53:08 +02:00
2024-09-07 14:16:19 +02:00
struct gpt_sampler_params sparams ;
struct gpt_sampler * smpl = nullptr ;
llama_token sampled ;
2024-01-27 14:38:05 +01:00
int32_t ga_i = 0 ; // group-attention state
2024-01-30 19:17:30 +01:00
int32_t ga_n = 1 ; // group-attention factor
2024-01-27 14:38:05 +01:00
int32_t ga_w = 512 ; // group-attention width
int32_t n_past_se = 0 ; // self-extend
2023-10-22 21:53:08 +02:00
// stats
2024-02-29 21:42:11 +01:00
    size_t n_sent_text = 0; // number of text characters sent so far
size_t n_sent_token_probs = 0 ;
2023-10-22 21:53:08 +02:00
int64_t t_start_process_prompt ;
2024-03-07 10:41:53 +01:00
int64_t t_start_generation ;
2023-10-22 21:53:08 +02:00
double t_prompt_processing ; // ms
double t_token_generation ; // ms
2024-09-06 23:21:29 +02:00
std : : function < void ( int ) > callback_on_release ;
2023-10-22 21:53:08 +02:00
void reset ( ) {
2024-03-07 10:41:53 +01:00
n_prompt_tokens = 0 ;
generated_text = " " ;
truncated = false ;
stopped_eos = false ;
stopped_word = false ;
stopped_limit = false ;
stopping_word = " " ;
n_past = 0 ;
n_sent_text = 0 ;
n_sent_token_probs = 0 ;
2024-09-02 17:11:51 +02:00
cmpl_type = SERVER_TASK_CMPL_TYPE_NORMAL ;
2024-03-07 10:41:53 +01:00
ga_i = 0 ;
n_past_se = 0 ;
2024-01-30 19:17:30 +01:00
2023-10-22 21:53:08 +02:00
generated_token_probs . clear ( ) ;
}
    bool has_budget(gpt_params & global_params) {
2024-02-29 21:42:11 +01:00
        if (params.n_predict == -1 && global_params.n_predict == -1) {
2024-01-07 07:45:26 +01:00
            return true; // limitless
        }
2023-10-22 21:53:08 +02:00
        n_remaining = -1;
2024-01-07 07:45:26 +01:00
2024-02-29 21:42:11 +01:00
        if (params.n_predict != -1) {
2023-10-22 21:53:08 +02:00
            n_remaining = params.n_predict - n_decoded;
2024-02-29 21:42:11 +01:00
        } else if (global_params.n_predict != -1) {
2023-10-22 21:53:08 +02:00
            n_remaining = global_params.n_predict - n_decoded;
        }
2024-01-07 07:45:26 +01:00
        return n_remaining > 0; // false means no budget left
2023-10-22 21:53:08 +02:00
    }
bool is_processing ( ) const {
2024-09-06 23:21:29 +02:00
return state ! = SLOT_STATE_IDLE ;
2023-10-22 21:53:08 +02:00
}
2024-03-07 10:41:53 +01:00
void add_token_string ( const completion_token_output & token ) {
2024-09-06 23:21:29 +02:00
if ( ! is_processing ( ) ) {
2023-10-22 21:53:08 +02:00
return ;
}
generated_token_probs . push_back ( token ) ;
}
void release ( ) {
2024-09-06 23:21:29 +02:00
if ( is_processing ( ) ) {
2024-03-07 10:41:53 +01:00
t_token_generation = ( ggml_time_us ( ) - t_start_generation ) / 1e3 ;
2024-09-06 23:21:29 +02:00
state = SLOT_STATE_IDLE ;
LOG_INFO ( " slot released " , {
{ " id_slot " , id } ,
{ " id_task " , id_task } ,
{ " n_past " , n_past } ,
{ " truncated " , truncated } ,
} ) ;
callback_on_release ( id ) ;
2023-10-22 21:53:08 +02:00
}
}
2024-03-07 10:41:53 +01:00
json get_formated_timings ( ) const {
return json {
2024-02-29 21:42:11 +01:00
{ " prompt_n " , n_prompt_tokens_processed } ,
2023-10-22 21:53:08 +02:00
{ " prompt_ms " , t_prompt_processing } ,
2024-02-29 21:42:11 +01:00
{ " prompt_per_token_ms " , t_prompt_processing / n_prompt_tokens_processed } ,
{ " prompt_per_second " , 1e3 / t_prompt_processing * n_prompt_tokens_processed } ,
2023-10-22 21:53:08 +02:00
{ " predicted_n " , n_decoded } ,
{ " predicted_ms " , t_token_generation } ,
{ " predicted_per_token_ms " , t_token_generation / n_decoded } ,
{ " predicted_per_second " , 1e3 / t_token_generation * n_decoded } ,
} ;
}
2024-03-07 10:41:53 +01:00
    size_t find_stopping_strings(const std::string & text, const size_t last_token_size, const stop_type type) {
        size_t stop_pos = std::string::npos;
        for (const std::string & word : params.antiprompt) {
            size_t pos;
            if (type == STOP_TYPE_FULL) {
                const size_t tmp      = word.size() + last_token_size;
                const size_t from_pos = text.size() > tmp ? text.size() - tmp : 0;
                pos = text.find(word, from_pos);
            } else {
                pos = find_partial_stop_string(word, text);
            }
            if (pos != std::string::npos && (stop_pos == std::string::npos || pos < stop_pos)) {
                if (type == STOP_TYPE_FULL) {
                    stopped_word   = true;
                    stopping_word  = word;
                    has_next_token = false;
                }
                stop_pos = pos;
            }
        }
        return stop_pos;
    }
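The two matching modes in `find_stopping_strings` can be illustrated with a minimal, self-contained sketch. The names `full_stop_pos` and `partial_stop_pos` are hypothetical and introduced here only for illustration; in particular, `partial_stop_pos` is an assumption about what a partial match means (a suffix of the text that is a prefix of the stop word) and is not the server's `find_partial_stop_string`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Full match: search only the tail of `text` that the newest token could
// have completed (stop word length + size of the last token).
static size_t full_stop_pos(const std::string & stop, const std::string & text, size_t last_token_size) {
    const size_t window = stop.size() + last_token_size;
    const size_t from   = text.size() > window ? text.size() - window : 0;
    return text.find(stop, from);
}

// Partial match (hypothetical helper): index where the longest suffix of
// `text` that is also a prefix of `stop` begins, or npos if there is none.
static size_t partial_stop_pos(const std::string & stop, const std::string & text) {
    const size_t max_len = std::min(stop.size(), text.size());
    for (size_t len = max_len; len > 0; --len) {
        if (text.compare(text.size() - len, len, stop, 0, len) == 0) {
            return text.size() - len;
        }
    }
    return std::string::npos;
}
```

While streaming, a partial match is a reason to hold back the text from that index onward until the next token either completes the stop word or rules it out.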
2023-11-25 10:29:06 +01:00
void print_timings ( ) const {
2024-03-07 10:41:53 +01:00
char buffer [ 512 ] ;
2024-02-29 21:42:11 +01:00
double t_token = t_prompt_processing / n_prompt_tokens_processed ;
double n_tokens_second = 1e3 / t_prompt_processing * n_prompt_tokens_processed ;
2024-03-07 10:41:53 +01:00
snprintf ( buffer , 512 , " prompt eval time = %10.2f ms / %5d tokens (%8.2f ms per token, %8.2f tokens per second) " ,
2024-02-29 21:42:11 +01:00
t_prompt_processing , n_prompt_tokens_processed ,
2024-02-25 13:50:32 +01:00
t_token , n_tokens_second ) ;
2024-03-07 10:41:53 +01:00
2024-02-25 13:50:32 +01:00
LOG_INFO ( buffer , {
2024-03-07 10:41:53 +01:00
{ " id_slot " , id } ,
{ " id_task " , id_task } ,
2024-02-29 21:42:11 +01:00
{ " t_prompt_processing " , t_prompt_processing } ,
{ " n_prompt_tokens_processed " , n_prompt_tokens_processed } ,
{ " t_token " , t_token } ,
{ " n_tokens_second " , n_tokens_second } ,
2024-02-25 13:50:32 +01:00
} ) ;
t_token = t_token_generation / n_decoded ;
n_tokens_second = 1e3 / t_token_generation * n_decoded ;
2024-03-07 10:41:53 +01:00
snprintf ( buffer , 512 , " generation eval time = %10.2f ms / %5d runs (%8.2f ms per token, %8.2f tokens per second) " ,
2024-02-25 13:50:32 +01:00
t_token_generation , n_decoded ,
t_token , n_tokens_second ) ;
2024-03-07 10:41:53 +01:00
2024-02-25 13:50:32 +01:00
LOG_INFO ( buffer , {
2024-03-07 10:41:53 +01:00
{ " id_slot " , id } ,
{ " id_task " , id_task } ,
2024-02-25 13:50:32 +01:00
{ " t_token_generation " , t_token_generation } ,
{ " n_decoded " , n_decoded } ,
{ " t_token " , t_token } ,
{ " n_tokens_second " , n_tokens_second } ,
} ) ;
2024-03-07 10:41:53 +01:00
snprintf ( buffer , 512 , " total time = %10.2f ms " , t_prompt_processing + t_token_generation ) ;
2024-02-25 13:50:32 +01:00
LOG_INFO ( buffer , {
2024-03-07 10:41:53 +01:00
{ " id_slot " , id } ,
{ " id_task " , id_task } ,
2024-02-25 13:50:32 +01:00
{ " t_prompt_processing " , t_prompt_processing } ,
{ " t_token_generation " , t_token_generation } ,
{ " t_total " , t_prompt_processing + t_token_generation } ,
} ) ;
2023-07-04 16:05:27 +02:00
}
2023-10-22 21:53:08 +02:00
} ;
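The per-token and per-second figures reported by `get_formated_timings` and `print_timings` come from two formulas: `ms / n` and `1e3 / ms * n`. The sketch below restates them in isolation. `compute_rate` and `token_rate` are hypothetical names, and the zero guard is an addition made here for illustration; the original code divides without checking.

```cpp
#include <cassert>

// Hypothetical result type for the sketch.
struct token_rate {
    double ms_per_token;
    double tokens_per_second;
};

// elapsed_ms: wall time spent on the phase, n_tokens: tokens handled in it.
static token_rate compute_rate(double elapsed_ms, int n_tokens) {
    if (n_tokens <= 0 || elapsed_ms <= 0.0) {
        return { 0.0, 0.0 }; // nothing measured, avoid dividing by zero
    }
    return { elapsed_ms / n_tokens, 1e3 / elapsed_ms * n_tokens };
}
```

For example, 100 tokens processed in 500 ms gives 5 ms per token and 200 tokens per second.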
2024-02-29 21:42:11 +01:00
struct server_metrics {
2024-03-09 16:34:15 +01:00
int64_t t_start = 0 ;
2024-03-08 12:25:04 +01:00
2024-02-25 13:49:43 +01:00
uint64_t n_prompt_tokens_processed_total = 0 ;
2024-03-08 12:25:04 +01:00
uint64_t t_prompt_processing_total = 0 ;
2024-02-25 13:49:43 +01:00
uint64_t n_tokens_predicted_total = 0 ;
2024-03-08 12:25:04 +01:00
uint64_t t_tokens_generation_total = 0 ;
2024-02-25 13:49:43 +01:00
uint64_t n_prompt_tokens_processed = 0 ;
uint64_t t_prompt_processing = 0 ;
2024-03-07 10:41:53 +01:00
uint64_t n_tokens_predicted = 0 ;
uint64_t t_tokens_generation = 0 ;
2024-02-25 13:49:43 +01:00
2024-09-06 23:21:29 +02:00
uint64_t n_decode_total = 0 ;
uint64_t n_busy_slots_total = 0 ;
2024-03-09 16:34:15 +01:00
void init ( ) {
t_start = ggml_time_us ( ) ;
}
void on_prompt_eval ( const server_slot & slot ) {
2024-02-29 21:42:11 +01:00
n_prompt_tokens_processed_total + = slot . n_prompt_tokens_processed ;
n_prompt_tokens_processed + = slot . n_prompt_tokens_processed ;
t_prompt_processing + = slot . t_prompt_processing ;
2024-03-08 12:25:04 +01:00
t_prompt_processing_total + = slot . t_prompt_processing ;
2024-02-25 13:49:43 +01:00
}
2024-03-09 16:34:15 +01:00
void on_prediction ( const server_slot & slot ) {
2024-03-08 12:25:04 +01:00
n_tokens_predicted_total + = slot . n_decoded ;
n_tokens_predicted + = slot . n_decoded ;
t_tokens_generation + = slot . t_token_generation ;
t_tokens_generation_total + = slot . t_token_generation ;
2024-02-25 13:49:43 +01:00
}
2024-09-06 23:21:29 +02:00
void on_decoded ( const std : : vector < server_slot > & slots ) {
n_decode_total + + ;
for ( const auto & slot : slots ) {
if ( slot . is_processing ( ) ) {
n_busy_slots_total + + ;
}
}
}
2024-02-25 13:49:43 +01:00
void reset_bucket ( ) {
n_prompt_tokens_processed = 0 ;
t_prompt_processing = 0 ;
n_tokens_predicted = 0 ;
t_tokens_generation = 0 ;
}
} ;
2024-03-07 10:41:53 +01:00
struct server_queue {
int id = 0 ;
bool running ;
// queues
2024-09-02 17:11:51 +02:00
std : : deque < server_task > queue_tasks ;
std : : deque < server_task > queue_tasks_deferred ;
2024-03-07 10:41:53 +01:00
std : : mutex mutex_tasks ;
std : : condition_variable condition_tasks ;
// callback functions
2024-09-02 17:11:51 +02:00
std : : function < void ( server_task & ) > callback_new_task ;
std : : function < void ( void ) > callback_update_slots ;
2024-03-07 10:41:53 +01:00
    // Add a new task to the back of the queue (or to the front when `front` is true)
2024-09-02 17:11:51 +02:00
    int post(server_task task, bool front = false) {
2024-03-07 10:41:53 +01:00
        std::unique_lock<std::mutex> lock(mutex_tasks);
        if (task.id == -1) {
            task.id = id++;
            LOG_VERBOSE("new task id", {{"new_id", task.id}});
        }
        const int task_id = task.id; // read the id before the task is moved from
2024-09-02 17:11:51 +02:00
        if (front) {
            queue_tasks.push_front(std::move(task));
        } else {
            queue_tasks.push_back(std::move(task));
        }
2024-03-07 10:41:53 +01:00
        condition_tasks.notify_one();
        return task_id;
    }
2024-09-02 17:11:51 +02:00
// multi-task version of post()
int post ( std : : vector < server_task > & tasks , bool front = false ) {
2024-09-06 14:06:04 +02:00
std : : unique_lock < std : : mutex > lock ( mutex_tasks ) ;
2024-09-02 17:11:51 +02:00
for ( auto & task : tasks ) {
if ( task . id = = - 1 ) {
task . id = id + + ;
LOG_VERBOSE ( " new task id " , { { " new_id " , task . id } } ) ;
}
if ( front ) {
queue_tasks . push_front ( std : : move ( task ) ) ;
} else {
queue_tasks . push_back ( std : : move ( task ) ) ;
}
}
condition_tasks . notify_one ( ) ;
return 0 ;
}
2024-03-07 10:41:53 +01:00
// Add a new task, but defer until one slot is available
void defer ( server_task task ) {
std : : unique_lock < std : : mutex > lock ( mutex_tasks ) ;
queue_tasks_deferred . push_back ( std : : move ( task ) ) ;
2024-09-06 23:21:29 +02:00
condition_tasks . notify_one ( ) ;
2024-03-07 10:41:53 +01:00
}
2024-09-02 17:11:51 +02:00
// Get the next id for creating a new task
2024-03-07 10:41:53 +01:00
int get_new_id ( ) {
std : : unique_lock < std : : mutex > lock ( mutex_tasks ) ;
int new_id = id + + ;
LOG_VERBOSE ( " new task id " , { { " new_id " , new_id } } ) ;
return new_id ;
}
// Register function to process a new task
void on_new_task ( std : : function < void ( server_task & ) > callback ) {
callback_new_task = std : : move ( callback ) ;
}
// Register the function to be called when all slots data is ready to be processed
2024-03-11 10:56:41 +01:00
void on_update_slots ( std : : function < void ( void ) > callback ) {
callback_update_slots = std : : move ( callback ) ;
2024-03-07 10:41:53 +01:00
}
2024-09-06 23:21:29 +02:00
    // Call when the state of one slot changes; moves one task from the deferred queue to the main queue
void pop_deferred_task ( ) {
2024-03-07 10:41:53 +01:00
std : : unique_lock < std : : mutex > lock ( mutex_tasks ) ;
2024-09-06 23:21:29 +02:00
if ( ! queue_tasks_deferred . empty ( ) ) {
queue_tasks . emplace_back ( std : : move ( queue_tasks_deferred . front ( ) ) ) ;
queue_tasks_deferred . pop_front ( ) ;
2024-03-07 10:41:53 +01:00
}
2024-09-06 23:21:29 +02:00
condition_tasks . notify_one ( ) ;
2024-03-07 10:41:53 +01:00
}
// end the start_loop routine
void terminate ( ) {
std : : unique_lock < std : : mutex > lock ( mutex_tasks ) ;
running = false ;
condition_tasks . notify_all ( ) ;
}
    /**
     * Main loop consists of these steps:
     * - Wait until a new task arrives
     * - Process the task (i.e. maybe copy data into slot)
     * - Check if multitask is finished
2024-03-11 10:56:41 +01:00
     * - Update all slots
2024-03-07 10:41:53 +01:00
     */
void start_loop ( ) {
running = true ;
while ( true ) {
LOG_VERBOSE ( " new task may arrive " , { } ) ;
while ( true ) {
std : : unique_lock < std : : mutex > lock ( mutex_tasks ) ;
if ( queue_tasks . empty ( ) ) {
lock . unlock ( ) ;
break ;
}
server_task task = queue_tasks . front ( ) ;
2024-09-06 23:21:29 +02:00
queue_tasks . pop_front ( ) ;
2024-03-07 10:41:53 +01:00
lock . unlock ( ) ;
LOG_VERBOSE ( " callback_new_task " , { { " id_task " , task . id } } ) ;
callback_new_task ( task ) ;
}
            // all tasks in the current loop are processed, slots data is now ready
2024-03-11 10:56:41 +01:00
LOG_VERBOSE ( " callback_update_slots " , { } ) ;
2024-03-07 10:41:53 +01:00
2024-03-11 10:56:41 +01:00
callback_update_slots ( ) ;
2024-03-07 10:41:53 +01:00
LOG_VERBOSE ( " wait for new task " , { } ) ;
{
std : : unique_lock < std : : mutex > lock ( mutex_tasks ) ;
if ( queue_tasks . empty ( ) ) {
if ( ! running ) {
LOG_VERBOSE ( " ending start_loop " , { } ) ;
return ;
}
condition_tasks . wait ( lock , [ & ] {
return ( ! queue_tasks . empty ( ) | | ! running ) ;
} ) ;
}
}
}
}
} ;
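The interplay between the main queue and the deferred queue in `server_queue` can be sketched in a stripped-down form. Everything below (`mini_task`, `mini_queue`, `try_pop`) is hypothetical and exists only for illustration; the condition variable, logging, and callbacks of the real struct are left out.

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <utility>

struct mini_task {
    int id = -1; // -1 means "assign an id on post"
};

struct mini_queue {
    int next_id = 0;
    std::deque<mini_task> ready;    // tasks the loop will pick up
    std::deque<mini_task> deferred; // tasks parked until a slot frees up
    std::mutex mtx;

    int post(mini_task task, bool front = false) {
        std::lock_guard<std::mutex> lock(mtx);
        if (task.id == -1) {
            task.id = next_id++;
        }
        const int id = task.id; // read before moving the task
        if (front) {
            ready.push_front(std::move(task));
        } else {
            ready.push_back(std::move(task));
        }
        return id;
    }

    void defer(mini_task task) {
        std::lock_guard<std::mutex> lock(mtx);
        deferred.push_back(std::move(task));
    }

    // Move at most one parked task back to the ready queue.
    void pop_deferred() {
        std::lock_guard<std::mutex> lock(mtx);
        if (!deferred.empty()) {
            ready.push_back(std::move(deferred.front()));
            deferred.pop_front();
        }
    }

    bool try_pop(mini_task & out) {
        std::lock_guard<std::mutex> lock(mtx);
        if (ready.empty()) {
            return false;
        }
        out = std::move(ready.front());
        ready.pop_front();
        return true;
    }
};
```

Releasing one parked task per slot release keeps the number of re-queued tasks in step with the number of slots that actually became free.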
struct server_response {
// for keeping track of all tasks waiting for the result
2024-09-02 17:11:51 +02:00
std : : unordered_set < int > waiting_task_ids ;
2024-03-07 10:41:53 +01:00
// the main result queue
std : : vector < server_task_result > queue_results ;
std : : mutex mutex_results ;
std : : condition_variable condition_results ;
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
// add the id_task to the list of tasks waiting for response
void add_waiting_task_id ( int id_task ) {
LOG_VERBOSE ( " waiting for task id " , { { " id_task " , id_task } } ) ;
std : : unique_lock < std : : mutex > lock ( mutex_results ) ;
waiting_task_ids . insert ( id_task ) ;
}
2024-09-02 17:11:51 +02:00
void add_waiting_tasks ( const std : : vector < server_task > & tasks ) {
for ( const auto & t : tasks ) {
add_waiting_task_id ( t . id ) ;
}
}
2024-03-07 10:41:53 +01:00
// when the request is finished, we can remove task associated with it
void remove_waiting_task_id ( int id_task ) {
LOG_VERBOSE ( " remove waiting for task id " , { { " id_task " , id_task } } ) ;
std : : unique_lock < std : : mutex > lock ( mutex_results ) ;
waiting_task_ids . erase ( id_task ) ;
}
2024-09-02 17:11:51 +02:00
    // This function blocks the thread until there is a response for one of the id_tasks
    server_task_result recv(const std::unordered_set<int> & id_tasks) {
2024-03-07 10:41:53 +01:00
        while (true) {
            std::unique_lock<std::mutex> lock(mutex_results);
            condition_results.wait(lock, [&] {
                return !queue_results.empty();
            });
            for (int i = 0; i < (int) queue_results.size(); i++) {
2024-09-02 17:11:51 +02:00
                if (id_tasks.find(queue_results[i].id) != id_tasks.end()) {
2024-03-07 10:41:53 +01:00
                    server_task_result res = queue_results[i];
                    queue_results.erase(queue_results.begin() + i);
                    return res;
                }
            }
        }
        // should never reach here
    }
2024-09-02 17:11:51 +02:00
// single-task version of recv()
server_task_result recv ( int id_task ) {
std : : unordered_set < int > id_tasks = { id_task } ;
return recv ( id_tasks ) ;
2024-03-07 10:41:53 +01:00
}
// Send a new result to a waiting id_task
2024-09-02 17:11:51 +02:00
void send ( server_task_result & result ) {
2024-03-07 10:41:53 +01:00
LOG_VERBOSE ( " send new result " , { { " id_task " , result . id } } ) ;
std : : unique_lock < std : : mutex > lock ( mutex_results ) ;
for ( const auto & id_task : waiting_task_ids ) {
if ( result . id = = id_task ) {
LOG_VERBOSE ( " queue_results.push_back " , { { " id_task " , id_task } } ) ;
2024-09-02 17:11:51 +02:00
queue_results . push_back ( std : : move ( result ) ) ;
2024-03-07 10:41:53 +01:00
condition_results . notify_all ( ) ;
return ;
}
}
}
} ;
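The send/recv handshake of `server_response` can be sketched as a small mailbox. The types `mini_result` and `mini_mailbox` are hypothetical. One deliberate difference from the code above, labeled here as an assumption about what may be preferable: the wait predicate checks for a matching result rather than for any result, so a waiter does not wake up and rescan when only other tasks' results are queued.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

struct mini_result {
    int id;
    std::string text;
};

struct mini_mailbox {
    std::unordered_set<int> waiting_ids;
    std::vector<mini_result> results;
    std::mutex mtx;
    std::condition_variable cv;

    void add_waiting(int id) {
        std::lock_guard<std::mutex> lock(mtx);
        waiting_ids.insert(id);
    }

    // Returns false when nobody waits for this id; the result is dropped.
    bool send(mini_result res) {
        std::lock_guard<std::mutex> lock(mtx);
        if (waiting_ids.count(res.id) == 0) {
            return false;
        }
        results.push_back(std::move(res));
        cv.notify_all();
        return true;
    }

    // Blocks until a result for one of `ids` is available, then removes it.
    mini_result recv(const std::unordered_set<int> & ids) {
        std::unique_lock<std::mutex> lock(mtx);
        auto match = results.end();
        cv.wait(lock, [&] {
            match = std::find_if(results.begin(), results.end(),
                [&](const mini_result & r) { return ids.count(r.id) > 0; });
            return match != results.end();
        });
        mini_result res = std::move(*match);
        results.erase(match);
        return res;
    }
};
```

In this sketch a result sent before its id is registered is silently dropped, which mirrors the behaviour of `send` above: the waiting id has to be added before the task is posted.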
struct server_context {
llama_model * model = nullptr ;
llama_context * ctx = nullptr ;
2024-08-06 17:33:39 +02:00
std : : vector < llama_lora_adapter_container > lora_adapters ;
2023-10-22 21:53:08 +02:00
gpt_params params ;
llama_batch batch ;
2024-03-07 10:41:53 +01:00
bool clean_kv_cache = true ;
bool add_bos_token = true ;
2024-08-12 09:21:50 +02:00
bool has_eos_token = false ;
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
int32_t n_ctx ; // total context for all clients / slots
2023-10-22 21:53:08 +02:00
// system prompt
bool system_need_update = false ;
std : : string system_prompt ;
std : : vector < llama_token > system_tokens ;
// slots / clients
2024-02-29 21:42:11 +01:00
std : : vector < server_slot > slots ;
2024-02-05 09:10:22 +01:00
json default_generation_settings_for_props ;
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
server_queue queue_tasks ;
server_response queue_results ;
2023-07-04 16:05:27 +02:00
2024-02-29 21:42:11 +01:00
server_metrics metrics ;
2024-02-25 13:49:43 +01:00
2024-06-08 09:50:31 +02:00
// Necessary similarity of prompt for slot selection
float slot_prompt_similarity = 0.0f ;
2024-03-07 10:41:53 +01:00
~ server_context ( ) {
if ( ctx ) {
2023-06-17 13:53:04 +02:00
llama_free ( ctx ) ;
ctx = nullptr ;
2023-05-21 19:51:18 +02:00
}
2024-03-07 10:41:53 +01:00
if ( model ) {
2023-06-24 10:47:58 +02:00
llama_free_model ( model ) ;
model = nullptr ;
}
2024-05-11 10:13:02 +02:00
2024-05-14 16:11:24 +02:00
// Clear any sampling context
for ( server_slot & slot : slots ) {
2024-09-07 14:16:19 +02:00
if ( slot . smpl ! = nullptr ) {
gpt_sampler_free ( slot . smpl ) ;
2024-05-14 16:11:24 +02:00
}
}
2024-05-11 10:13:02 +02:00
llama_batch_free ( batch ) ;
2023-06-17 13:53:04 +02:00
}
2024-03-07 10:41:53 +01:00
bool load_model ( const gpt_params & params_ ) {
2023-10-22 21:53:08 +02:00
params = params_ ;
2023-06-17 13:53:04 +02:00
llama : support Mamba Selective State Space Models (#5328)
* mamba : begin working on support for Mamba SSM
* mamba : begin figuring out how to (ab)use the kv cache for Mamba
* mamba : recurrent inference almost works, but incoherent
* mamba : recurrent inference WORKS!!!
* convert : optionally use d_conv and d_state from config.json for Mamba
* mamba : refactor recurrent conv, resulting in 20% perf increase
It's still slower than I'd like, but I did not really optimize `ggml_exp` yet.
I also refactored `ggml_exp` to work with tensors with more than 2 dimensions.
* ggml : parallelize ggml_exp
This results in 8% faster token generation for Mamba-130M.
* mamba : simplify the conv step with a self-overlapping view
Turns out the conv_state can be made smaller by one column.
Note that this breaks existing GGUFs of Mamba,
because the key_value_length field is tied to the conv_state size.
Convolution with a self-overlapping view is cool!
And it's much simpler than what I initially thought would be necessary
to make the convolution step work with more than 1 token at a time.
Next step is to make the SSM step work on batches of tokens too,
and thus I need to figure out a way to make a parallel selective scan
which will keep the ssm_state small and won't make it bigger
by a factor of (n_layer * batch_size).
* llama : fix Mamba KV self size wrongly displaying as f16 instead of f32
Relatedly, I also tried to see if other types than f32 worked for the states,
but they don't, because of the operators used.
It's probably better anyway to keep lots of precision there,
since the states are small anyway.
* mamba : fix self-overlapping view depth stride
* mamba : handle batches of more than 1 token
This means running Mamba no longer crashes when using the default settings!
And probably also slightly faster prompt processing.
Both batched and non-batched processing yield the same output.
Previously, the state was not cleared when starting a sequence.
Next step is to make the KV cache API work as expected for Mamba models.
* ggml: add ggml_ssm_scan to help with parallel selective scan
If the selective scan was implemented without a custom operator,
there would be waaay too many nodes in the graph. For example,
for Mamba-130M, with a batch size of 512 (the default),
a naive selective scan could add at least 24*512=12288 nodes,
which is more than LLAMA_MAX_NODES (8192),
and that's only for the smallest Mamba model.
So it's much cleaner with a custom operator.
Not sure about the name, though.
* ggml : in ggml_ssm_scan, merge multiple rows in the same vec operation
This will help with performance on CPU if ggml_vec_mul_f32
and ggml_vec_add_f32 are ever optimized with SIMD.
* mamba : very basic quantization support
Mostly works, but there is currently no difference
between the variants of a k-quant (e.g. Q4_K_S and Q4_K_M are the same).
Most of the SSM-specific weights can be kept in f32 without affecting
the size that much, since they are relatively small.
(the linear projection weights are responsible for most of Mamba's size)
Too much quantization seems to make the state degrade quite fast, and
the model begins to output gibberish.
It seems to affect bigger models to a lesser extent than small models,
but I'm not sure by how much.
Experimentation will be needed to figure out which weights are more important
for the _M (and _L?) variants of k-quants for Mamba.
* convert : fix wrong name for layer norm weight of official Mamba models
I was using Q-bert/Mamba-* models before, which have a slightly different
naming scheme for the weights.
(they start with "model.layers" instead of "backbone.layers")
* mamba : fuse more steps of the SSM scan in the ggml_ssm_scan operator
This increases performance on CPU by around 30% for prompt processing,
and by around 20% for text generation.
However, it also makes the ggml_exp and ggml_soft_plus operators unused.
Whether or not they should be kept will be decided later.
* convert : for Mamba, also consider the "MambaLMHeadModel" arch name
It's the name of the class of the official implementation,
though they don't use it (yet) in the "architectures" field of config.json
* mamba : fix vocab size problems with official models
The perplexity was waaaay too high for models with a non-round vocab size.
Not sure why, but it needed to be fixed in the metadata.
Note that this breaks existing GGUF-converted Mamba models,
but **only if** the vocab size was not already rounded.
* ggml : remove ggml_exp and ggml_soft_plus
They did not exist anyway outside of this branch,
and since ggml_ssm_scan fused operations together, they are unused.
It's always possible to bring them back if needed.
* mamba : remove some useless comments
No code change.
* convert : fix flake8 linter errors
* mamba : apply suggestions from code review
* mamba : remove unnecessary branch for row-wise ssm_state and C multiplication
It was previously done to avoid permuting when only one token is processed
at a time (like when generating text), but permuting is cheap,
and dynamically changing the compute graph is not future-proof.
* ggml : in ggml_ssm_scan, use more appropriate asserts
* ggml : rename the destination pointer in ggml_compute_forward_ssm_scan_f32
* mamba : multiple sequences, but one at a time
This is a step towards making this Mamba implementation usable
with the server example (the way the system prompt is kept when clearing
the client slots will need to be changed before this can work, though).
The KV cache size for this kind of model is tied to the maximum number
of sequences kept at any single time.
For now, this number is obtained from n_parallel (plus one,
to have an extra sequence to dedicate to the system prompt),
but there might be a better way to do this which won't also
make the main example use 2 cells even if only 1 is really used.
(for this specific case, --parallel 0 helps)
Simultaneous sequence processing will probably require changes to
ggml_ssm_scan, and possibly a new operator for the conv step.
* mamba : support llama_kv_cache_seq_cp
This (mis)uses the logic around K shifts, because tokens in a state
can't be shifted anyway, and because inp_K_shift has the right shape and type.
Using ggml_get_rows is a nice way to do copies, but copy chains can't work.
Fortunately, copy chains don't really seem to be used in the examples.
Each KV cell is dedicated to the sequence ID corresponding to its own index.
* mamba : use a state mask
It's cleaner than the previous heuristic of
checking for the pos of the first token in the batch.
inp_KQ_mask could not be re-used for this, because it has the wrong shape
and because it seems more suited to the next step of
simultaneous sequence processing (helping with the problem of
remembering which token belongs to which sequence(s)/state(s)).
* llama : replace the usage of n_ctx with kv_self.size in many places
* mamba : use n_tokens directly instead of n_tok
* mamba : in comments, properly refer to KV cells instead of slots
* mamba : reduce memory usage of ggml_ssm_scan
From 290.37 MiB to 140.68 MiB of CPU compute buffer size
with Mamba 3B with a batch size of 512.
The result tensor of ggml_ssm_scan was previously a big part
of the CPU compute buffer size. To make it smaller,
it does not contain the intermediate ssm states anymore.
Both y and the last ssm state are combined in the result tensor,
because it seems only a single tensor can be returned by an operator
with the way the graph is built.
* mamba : simultaneous sequence processing
A batch can now contain tokens from multiple sequences.
This is necessary for at least the parallel example, the server example,
and the HellaSwag test in the perplexity example.
However, for this to be useful, uses of llama_kv_cache_seq_rm/cp
will need to be changed to work on whole sequences.
* ggml : add ggml_ssm_conv as a new operator for the conv step of Mamba
This operator makes it possible to use and update the correct states
for each token of the batch in the same way as ggml_ssm_scan.
Other solutions which use existing operators would need loops which would
add too many nodes to the graph (at least the ones I thought of).
Using this operator further reduces the size of the CPU compute buffer
from 140.68 MiB to 103.20 MiB with Mamba 3B with a batch size of 512.
And (at least on CPU), it's a bit faster than before.
Note that "ggml_ssm_conv" is probably not the most appropriate name,
and it could be changed if a better one is found.
* llama : add inp_s_seq as a new input tensor
The most convenient implementation to select the correct state (for Mamba)
for each token is to directly get the correct index from a tensor.
This is why inp_s_seq is storing int32_t and not floats.
The other, less convenient way to select the correct state would be
to have inp_KQ_mask contain 1.0f for each state used by a token
and 0.0f otherwise. This complicates quickly fetching the first used
state of a token, and is also less efficient because a whole row
of the mask would always need to be read for each token.
Using indexes makes it easy to stop searching when there are
no more sequences for a token, and the first sequence assigned
is always very quickly available (it's the first element of each row).
* mamba : support llama_kv_cache_seq_cp copy chains
* mamba : support shifting and dividing the kv cache pos
* mamba : make the server and parallel examples work with whole sequences
A seq_id is dedicated to the system prompt in both cases.
* llama : make llama_kv_cache_seq_rm return whether it succeeded or not
* mamba : dedicate an input tensor for state copy indices
This is cleaner and makes it easier to adapt when/if token positions
(and by extension, inp_K_shift) are no longer integers.
* mamba : adapt perplexity, batched, and batched-bench examples
* perplexity : limit the max number of sequences
This adapts to what the loaded model can provide.
* llama : add llama_n_max_seq to get the upper limit for seq_ids
Used by the perplexity example.
* batched : pass n_parallel to the model's context params
This should have been there already, but it wasn't.
* batched-bench : reserve sequences to support Mamba
* batched-bench : fix tokens being put in wrong sequences
Generation quality isn't what's measured in there anyway,
but at least using the correct sequences avoids using non-consecutive
token positions.
* mamba : stop abusing attention metadata
This breaks existing converted-to-GGUF Mamba models,
but will allow supporting mixed architectures like MambaFormer
without needing to break Mamba models.
This will also allow changing the size of Mamba's states
without having to reconvert models in the future.
(e.g. using something else than d_conv - 1 columns for the conv_states
will not require breaking existing converted Mamba models again)
* gguf-py : add new KV metadata key-value pairs for Mamba
* llama : add new metadata key-value pairs for Mamba
* llama : guard against divisions by zero when n_head is 0
* mamba : rename "unlimited" KV cache property to "recurrent"
* mamba : more correctly update the "used" field of the KV cache
* ggml : in ggml_ssm_scan, use a threshold for soft_plus
This is how the official Mamba implementation does it,
and it's also what torch.nn.Softplus does.
* convert : for Mamba, fallback to internal NeoX tokenizer
The resulting models are exactly the same
as if the tokenizer.json and tokenizer_config.json of GPT-NeoX were there.
* mamba : support state saving and restoring
* ggml : implicitly pass src tensors through dst for Mamba-related ops
* mamba : clarify some comments
* server : fix cache_tokens not getting correctly resized
Otherwise, when the "we have to evaluate at least 1 token" special case
was triggered, an extra token was kept in cache_tokens even if it was
removed from the KV cache.
For Mamba, this caused useless prompt reprocessing when the previous
request triggered the above case.
* convert-hf : support new metadata keys for Mamba
For the models available at
https://huggingface.co/collections/state-spaces/transformers-compatible-mamba-65e7b40ab87e5297e45ae406
* mamba : rename metadata to be more similar to transformers library
This breaks existing converted-to-GGUF models,
but the metadata names are more "standard".
* mamba : support mamba-*-hf models
These models share their token_embd.weight with their output.weight
* mamba : add missing spaces
This is purely a formatting change.
* convert-hf : omit output.weight when identical with token_embd.weight
Only for Mamba for now, but it might be relevant for other models eventually.
Most Mamba models actually share these two tensors, albeit implicitly.
* readme : add Mamba to supported models, and add recent API changes
* mamba : move state_seq and state_mask views outside layer loop
A few tensors were also missing `struct` in front of `ggml_tensor`.
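The list above notes that ggml_ssm_scan uses a threshold for soft_plus, matching torch.nn.Softplus. A minimal sketch of a thresholded softplus follows. The function name is illustrative, and the default threshold of 20 is an assumption taken from the documented default of torch.nn.Softplus, not from this commit message.

```cpp
#include <cmath>

// Illustrative helper; the name and the default threshold are assumptions.
// softplus(x) = log(1 + exp(x)). Above the threshold the result is numerically
// indistinguishable from x in float precision, and exp(x) can overflow for
// large inputs, so x is returned directly.
static float softplus_thresholded(float x, float threshold = 20.0f) {
    return x > threshold ? x : std::log1p(std::exp(x));
}
```

The shortcut avoids an `exp` and a `log1p` call for large inputs and keeps the result finite where the unguarded formula would overflow to infinity.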
2024-03-08 23:31:00 +01:00
// dedicate one sequence to the system prompt
params.n_parallel += 1;
2024-08-05 18:14:10 +02:00
llama_init_result llama_init = llama_init_from_gpt_params(params);
model = llama_init.model;
ctx = llama_init.context;
2024-08-06 17:33:39 +02:00
lora_adapters = llama_init.lora_adapters;
2024-03-08 23:31:00 +01:00
params.n_parallel -= 1; // but be sneaky about it
2024-03-07 10:41:53 +01:00
if (model == nullptr) {
2023-10-22 21:53:08 +02:00
LOG_ERROR("unable to load model", {{"model", params.model}});
2023-06-17 13:53:04 +02:00
return false;
2023-05-21 19:51:18 +02:00
}
2023-10-22 21:53:08 +02:00
2023-09-28 21:42:38 +02:00
n_ctx = llama_n_ctx(ctx);
2023-10-22 21:53:08 +02:00
2024-08-15 09:23:23 +02:00
add_bos_token = llama_add_bos_token(model);
has_eos_token = !llama_add_eos_token(model);
2024-08-16 17:19:05 +02:00
2023-06-17 13:53:04 +02:00
return true;
2023-05-21 19:51:18 +02:00
}
2023-06-17 13:53:04 +02:00
2024-03-07 10:41:53 +01:00
    bool validate_model_chat_template() const {
2024-02-22 09:33:24 +01:00
        llama_chat_message chat[] = {{"user", "test"}};
2024-03-07 10:41:53 +01:00
        const int res = llama_chat_apply_template(model, nullptr, chat, 1, true, nullptr, 0);
        return res > 0;
2024-02-22 09:33:24 +01:00
    }
2024-03-09 16:34:15 +01:00
    void init() {
2023-10-22 21:53:08 +02:00
        const int32_t n_ctx_slot = n_ctx / params.n_parallel;
2024-02-25 13:50:32 +01:00
        LOG_INFO("initializing slots", {{"n_slots", params.n_parallel}});
2024-03-09 16:34:15 +01:00
2024-03-07 10:41:53 +01:00
        for (int i = 0; i < params.n_parallel; i++) {
2024-02-29 21:42:11 +01:00
            server_slot slot;
2023-10-22 21:53:08 +02:00
            slot.id = i;
            slot.n_ctx = n_ctx_slot;
2024-02-18 17:30:09 +01:00
            slot.n_predict = params.n_predict;
2023-10-22 21:53:08 +02:00
2024-02-25 13:50:32 +01:00
            LOG_INFO("new slot", {
2024-03-07 10:41:53 +01:00
                {"id_slot", slot.id},
2024-02-25 13:50:32 +01:00
                {"n_ctx_slot", slot.n_ctx}
            });
2024-01-27 14:38:05 +01:00
            const int ga_n = params.grp_attn_n;
            const int ga_w = params.grp_attn_w;
            if (ga_n != 1) {
2024-03-02 22:00:14 +01:00
                GGML_ASSERT(ga_n > 0 && "ga_n must be positive"); // NOLINT
                GGML_ASSERT(ga_w % ga_n == 0 && "ga_w must be a multiple of ga_n"); // NOLINT
2024-01-27 14:38:05 +01:00
                //GGML_ASSERT(n_ctx_train % ga_w == 0 && "n_ctx_train must be a multiple of ga_w"); // NOLINT
                //GGML_ASSERT(n_ctx >= n_ctx_train * ga_n && "n_ctx must be at least n_ctx_train * ga_n"); // NOLINT
2024-02-25 13:50:32 +01:00
                LOG_INFO("slot self-extend", {
2024-03-07 10:41:53 +01:00
                    {"id_slot", slot.id},
                    {"ga_n", ga_n},
                    {"ga_w", ga_w}
2024-02-25 13:50:32 +01:00
                });
2024-01-27 14:38:05 +01:00
            }
            slot.ga_i = 0;
            slot.ga_n = ga_n;
            slot.ga_w = ga_w;
2024-07-11 02:08:17 +02:00
            slot.sparams = params.sparams;
2024-09-06 23:21:29 +02:00
            slot.callback_on_release = [this](int) {
                queue_tasks.pop_deferred_task();
            };
2024-01-27 14:38:05 +01:00
            slot.reset();
2023-10-22 21:53:08 +02:00
            slots.push_back(slot);
        }
2024-02-05 09:10:22 +01:00
        default_generation_settings_for_props = get_formated_generation(slots.front());
        default_generation_settings_for_props["seed"] = -1;
2024-08-14 08:51:02 +02:00
        // the update_slots() logic will always submit a maximum of n_batch or n_parallel tokens
2024-03-13 18:54:21 +01:00
        // note that n_batch can be > n_ctx (e.g. for non-causal attention models such as BERT where the KV cache is not used)
        {
            const int32_t n_batch = llama_n_batch(ctx);
llama : greatly reduce output buffer memory usage (#6122)
* llama : greatly reduce logits memory usage
* llama : more compact state saving and reloading
* llama : fix lctx.n_outputs not being set before building graph
* perplexity : adapt to the logits API changes
* perplexity : fix Winogrande, use correct logits for second choice start
The first logits used to evaluate the second choice were not from
the end of the common prefix; instead, they were the logits from the end
of the first choice. This has been corrected.
The previous implementation sometimes had outliers in the scores of
choices for some tasks, and the logic to skip choices words
in the log-likelihood evaluation probably was an attempt to reduce those,
but it was complex and didn't quite seem to be the right thing.
This is simpler now, and the outlier scores aren't there anymore.
* perplexity : normalize spaces and punctuation in Winogrande sentences
* llama : fix embedding conditions
* llama : fix llama_get_embeddings_ith when the resulting id is 0
* llama : fix wrong n_outputs in llama_set_inputs
A mismatch happened when using a smaller n_ubatch than n_batch and then using
llama_batch_get_one(). The decision of what n_outputs should be now almost
fully depends on how lctx.n_outputs is set in llama_decode_internal.
The conditions are simpler this way.
* llama : when saving the state, recalculate n_outputs
This ensures the correct number of outputs for the entire previous batch
is stored in the session file, even when n_ubatch is smaller than n_batch.
* llama : fix not-skipping outputs of non-causal models
* llama : fix running a batch with n_outputs == 0
It previously worked because lctx.inp_out_ids was not initialized,
so it pointed to some garbage address which was somehow still valid when I
ran my tests.
* llama : keep same graph topology even when n_outputs == 0
* ggml : saner ggml_can_repeat with empty tensors
* ggml : future-proof ggml_is_empty by using GGML_MAX_DIMS - 1
* ggml : do not multi-thread ops returning empty tensors
* ggml : make ggml_is_empty public and work with views
* llama : use a vector for ctx->output_ids
* llama : rework reallocation logic for llama_output_reserve
Now comparing the actual size with the new total size of the output buffer
to allow more efficient enabling and disabling of the embeddings
and/or logits output in the future.
* ggml : skip empty tensors in all backends
* llama : fix llama_output_reserve nullptr deref when new_size is 0
* perplexity : make Winogrande work as it does on master
The problems with the Winogrande implementation will
need to be fixed in a separate PR to ease review.
* llama : clearer error messages for invalid logits or embeddings ids
* llama : assert all models that can have inp_out_ids
Since the graph topology is now constant, this presence check
can be done even when there are no outputs.
* llama : assert logits and embd buffers exist before writing to them
* llama : handle errors from llama_output_reserve at call sites
* perplexity : make hellaswag and multiple-choice outputs identical to master
Due to how the KV cache is updated, the logprobs for tokens in a batch
are very slightly affected by the other tokens present in the batch,
so to make hellaswag and multiple-choice return exactly the same results
as on master, the last token of each sequence needs to be evaluated
even though its output is not used at all.
This will probably be changed back in the future to make these benchmarks
a tiny bit faster.
* perplexity : fix division by zero when using less than 100 multiple-choice tasks
* llama : allow loading state saved with a different ctx size
When loading a session file, the context size is now only required to be
at least enough to load the KV cells contained in that session file,
instead of requiring to use exactly the same context size as when saving.
Doing this enables the use-case of extending or shrinking the context size
of a saved session.
This breaks existing session files because the meaning of kv_buf_size
is slightly changed (previously it was the size of the whole KV cache,
now it's only the size of the saved part of it). This allows for
finer-grained sanity checks when loading in an effort to keep kv_buf_size
useful even when the kv_size is changed.
* llama : minor
ggml-ci
* readme : update recent API changes, and warn about Vulkan
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-26 15:46:41 +01:00
            // only a single seq_id per token is needed
2024-08-14 08:51:02 +02:00
            batch = llama_batch_init(std::max(n_batch, params.n_parallel), 0, 1);
2024-03-13 18:54:21 +01:00
        }
2024-03-09 16:34:15 +01:00
        metrics.init();
2023-10-22 21:53:08 +02:00
    }
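The sizing arithmetic in `init()` can be tried in isolation. The following is a minimal sketch, not the server's code: the helper names `slot_ctx_size` and `batch_capacity` are hypothetical and only mirror the expressions `n_ctx / params.n_parallel` and `std::max(n_batch, params.n_parallel)` used above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: context size per slot when n_ctx is split evenly.
// Integer division drops any remainder, as in `n_ctx / params.n_parallel`.
int32_t slot_ctx_size(int32_t n_ctx, int32_t n_parallel) {
    assert(n_parallel > 0 && "n_parallel must be positive");
    return n_ctx / n_parallel;
}

// Hypothetical helper: token capacity of the shared batch. It must hold
// n_batch tokens, and at least one token per slot when n_parallel > n_batch.
int32_t batch_capacity(int32_t n_batch, int32_t n_parallel) {
    return std::max(n_batch, n_parallel);
}
```

For example, a 4096-token context split over 3 slots gives each slot 1365 tokens; the single leftover token is unused.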
2024-04-09 19:44:08 +02:00
    std::vector<llama_token> tokenize(const json & json_prompt, bool add_special) const {
2023-11-25 10:29:06 +01:00
        // TODO: currently, we tokenize using special tokens by default
        // this is not always correct (see https://github.com/ggerganov/llama.cpp/pull/4160#issuecomment-1824826216)
        // but it's better compared to completely ignoring ChatML and other chat templates
        const bool TMP_FORCE_SPECIAL = true;
2023-08-23 09:12:12 +02:00
        // If `add_special` is true, special tokens (e.g. BOS) are added only when json_prompt is a string,
        // or when the first element of the json_prompt array is a string.
        std::vector<llama_token> prompt_tokens;
2024-03-07 10:41:53 +01:00
        if (json_prompt.is_array()) {
2023-08-23 09:12:12 +02:00
            bool first = true;
2024-03-07 10:41:53 +01:00
            for (const auto & p : json_prompt) {
                if (p.is_string()) {
2023-08-23 09:12:12 +02:00
                    auto s = p.template get<std::string>();
2024-03-07 10:41:53 +01:00
2023-08-23 09:12:12 +02:00
                    std::vector<llama_token> p;
2024-03-07 10:41:53 +01:00
                    if (first) {
2024-04-09 19:44:08 +02:00
                        p = ::llama_tokenize(ctx, s, add_special, TMP_FORCE_SPECIAL);
2023-08-23 09:12:12 +02:00
                        first = false;
2024-03-07 10:41:53 +01:00
                    } else {
2023-11-25 10:29:06 +01:00
                        p = ::llama_tokenize(ctx, s, false, TMP_FORCE_SPECIAL);
2023-08-23 09:12:12 +02:00
                    }
2024-03-07 10:41:53 +01:00
2023-08-23 09:12:12 +02:00
                    prompt_tokens.insert(prompt_tokens.end(), p.begin(), p.end());
2024-03-07 10:41:53 +01:00
                } else {
                    if (first) {
2023-08-23 09:12:12 +02:00
                        first = false;
                    }
2024-03-07 10:41:53 +01:00
2023-08-23 09:12:12 +02:00
                    prompt_tokens.push_back(p.template get<llama_token>());
                }
            }
2024-03-07 10:41:53 +01:00
        } else {
2023-08-23 09:12:12 +02:00
            auto s = json_prompt.template get<std::string>();
2024-04-09 19:44:08 +02:00
            prompt_tokens = ::llama_tokenize(ctx, s, add_special, TMP_FORCE_SPECIAL);
2023-08-23 09:12:12 +02:00
        }
        return prompt_tokens;
    }
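The "only the first element may receive special tokens" rule in `tokenize()` is easier to see with a toy tokenizer. This is a sketch under stated assumptions: `toy_tokenize`, `tokenize_mixed`, `prompt_piece` and `TOY_BOS` are hypothetical stand-ins (one token per byte, BOS id 1), not llama.cpp APIs, and `std::variant` (C++17) replaces the JSON array.

```cpp
#include <string>
#include <variant>
#include <vector>

using token = int;
// A prompt element is either a text fragment or an already-tokenized id.
using prompt_piece = std::variant<std::string, token>;

constexpr token TOY_BOS = 1; // hypothetical BOS id for this sketch

// Toy stand-in for a real tokenizer: one token per byte, optional BOS prefix.
std::vector<token> toy_tokenize(const std::string & text, bool add_special) {
    std::vector<token> out;
    if (add_special) {
        out.push_back(TOY_BOS);
    }
    for (unsigned char c : text) {
        out.push_back(static_cast<token>(c));
    }
    return out;
}

// Mirrors the loop above: special tokens are considered only for the first
// element, and only if that element is a string.
std::vector<token> tokenize_mixed(const std::vector<prompt_piece> & pieces, bool add_special) {
    std::vector<token> out;
    bool first = true;
    for (const auto & piece : pieces) {
        if (const auto * text = std::get_if<std::string>(&piece)) {
            const auto toks = toy_tokenize(*text, first && add_special);
            out.insert(out.end(), toks.begin(), toks.end());
        } else {
            out.push_back(std::get<token>(piece));
        }
        first = false;
    }
    return out;
}
```

With `add_special` set, `{"ab", 42, "c"}` starts with `TOY_BOS`, while `{42, "a"}` gets no BOS at all because its first element is a raw token id.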
2024-06-08 09:50:31 +02:00
    server_slot * get_slot_by_id(int id) {
2024-03-07 10:41:53 +01:00
        for (server_slot & slot : slots) {
2024-06-08 09:50:31 +02:00
            if (slot.id == id) {
2023-10-22 21:53:08 +02:00
                return &slot;
            }
2024-06-08 09:50:31 +02:00
        }
        return nullptr;
    }
    server_slot * get_available_slot(const std::string & prompt) {
        server_slot * ret = nullptr;
        // find the slot that has at least n% prompt similarity
        if (ret == nullptr && slot_prompt_similarity != 0.0f && !prompt.empty()) {
            int max_lcp_len = 0;
            float similarity = 0;
            for (server_slot & slot : slots) {
                // skip the slot if it is not available
2024-09-06 23:21:29 +02:00
                if (slot.is_processing()) {
2024-06-08 09:50:31 +02:00
                    continue;
                }
2024-06-12 13:42:29 +02:00
                // skip the slot if it does not contain a string prompt
                if (!slot.prompt.is_string()) {
                    continue;
                }
2024-06-08 09:50:31 +02:00
                // current slot's prompt
2024-06-12 13:42:29 +02:00
                std::string slot_prompt = slot.prompt.get<std::string>();
2024-06-08 09:50:31 +02:00
                // length of the current slot's prompt
                int slot_prompt_len = slot_prompt.size();
                // length of the Longest Common Prefix between the current slot's prompt and the input prompt
                int lcp_len = common_part(slot_prompt, prompt);
                // fraction of the common prefix length compared to the current slot's prompt length
                similarity = static_cast<float>(lcp_len) / slot_prompt_len;
                // select the current slot if the criteria match
                if (lcp_len > max_lcp_len && similarity > slot_prompt_similarity) {
                    max_lcp_len = lcp_len;
                    ret = &slot;
                }
            }
2023-10-20 20:07:23 +02:00
2024-06-08 09:50:31 +02:00
            if (ret != nullptr) {
                LOG_VERBOSE("selected slot by lcp similarity", {
                    {"id_slot", ret->id},
                    {"max_lcp_len", max_lcp_len},
                    {"similarity", similarity},
                });
2023-10-22 21:53:08 +02:00
            }
        }
2023-10-20 20:07:23 +02:00
2024-06-08 09:50:31 +02:00
        // find the slot that has been least recently used
        if (ret == nullptr) {
            int64_t t_last = ggml_time_us();
            for (server_slot & slot : slots) {
                // skip the slot if it is not available
2024-09-06 23:21:29 +02:00
                if (slot.is_processing()) {
2024-06-08 09:50:31 +02:00
                    continue;
                }
                // select the current slot if the criteria match
                if (slot.t_last_used < t_last) {
                    t_last = slot.t_last_used;
                    ret = &slot;
                }
            }
            if (ret != nullptr) {
                LOG_VERBOSE("selected slot by lru", {
                    {"id_slot", ret->id},
                    {"t_last", t_last},
                });
            }
        }
        return ret;
2023-08-08 15:29:19 +02:00
    }
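The two-stage selection in `get_available_slot()` (longest common prefix first, least recently used as fallback) can be sketched without the server types. Everything below is an illustrative assumption: `toy_slot`, `common_prefix_len` and `pick_slot` are hypothetical names, slots hold plain strings instead of JSON prompts, and unlike the code above this sketch skips slots with an empty prompt to avoid dividing by zero.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Hypothetical, simplified slot: just the fields the selection logic reads.
struct toy_slot {
    int         id;
    bool        processing;
    std::string prompt;
    int64_t     t_last_used;
};

// Length of the longest common prefix of two strings.
size_t common_prefix_len(const std::string & a, const std::string & b) {
    size_t i = 0;
    while (i < a.size() && i < b.size() && a[i] == b[i]) {
        ++i;
    }
    return i;
}

// Returns the index of the chosen slot, or -1 if every slot is busy.
int pick_slot(const std::vector<toy_slot> & slots, const std::string & prompt, float min_similarity) {
    int best = -1;

    // Stage 1: idle slot with the longest shared prefix above the threshold.
    if (min_similarity != 0.0f && !prompt.empty()) {
        size_t best_lcp = 0;
        for (size_t i = 0; i < slots.size(); ++i) {
            const toy_slot & slot = slots[i];
            if (slot.processing || slot.prompt.empty()) {
                continue;
            }
            const size_t lcp = common_prefix_len(slot.prompt, prompt);
            const float similarity = static_cast<float>(lcp) / slot.prompt.size();
            if (lcp > best_lcp && similarity > min_similarity) {
                best_lcp = lcp;
                best = static_cast<int>(i);
            }
        }
    }
    if (best != -1) {
        return best;
    }

    // Stage 2: fall back to the least recently used idle slot.
    int64_t oldest = std::numeric_limits<int64_t>::max();
    for (size_t i = 0; i < slots.size(); ++i) {
        if (!slots[i].processing && slots[i].t_last_used < oldest) {
            oldest = slots[i].t_last_used;
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

Preferring the slot with the longest shared prefix lets a follow-up request reuse more of that slot's cached prompt; the LRU fallback only decides which idle slot to overwrite when no prompt is similar enough.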
2024-03-11 10:56:41 +01:00
    bool launch_slot_with_task(server_slot & slot, const server_task & task) {
2023-10-22 21:53:08 +02:00
        slot_params default_params;
2024-07-10 00:26:40 +02:00
        // Sampling parameter defaults are loaded from the global server context (but individual requests can still override them)
2024-09-07 14:16:19 +02:00
        auto default_sparams = params.sparams;
        const auto & data = task.data;
2023-10-22 21:53:08 +02:00
2023-11-25 10:29:06 +01:00
        if (data.count("__oaicompat") != 0) {
2024-03-07 10:41:53 +01:00
            slot.oaicompat = true;
            slot.oaicompat_model = json_value(data, "model", std::string(DEFAULT_OAICOMPAT_MODEL));
2023-11-25 10:29:06 +01:00
        } else {
2024-03-07 10:41:53 +01:00
            slot.oaicompat = false;
            slot.oaicompat_model = "";
        }
        slot.params.stream = json_value(data, "stream", false);
        slot.params.cache_prompt = json_value(data, "cache_prompt", false);
2024-08-04 20:16:23 +02:00
        slot.params.n_predict = json_value(data, "n_predict", json_value(data, "max_tokens", default_params.n_predict));
2024-03-07 10:41:53 +01:00
        slot.sparams.top_k = json_value(data, "top_k", default_sparams.top_k);
        slot.sparams.top_p = json_value(data, "top_p", default_sparams.top_p);
        slot.sparams.min_p = json_value(data, "min_p", default_sparams.min_p);
        slot.sparams.tfs_z = json_value(data, "tfs_z", default_sparams.tfs_z);
2024-09-07 14:16:19 +02:00
        slot.sparams.typ_p = json_value(data, "typical_p", default_sparams.typ_p);
2024-03-07 10:41:53 +01:00
        slot.sparams.temp = json_value(data, "temperature", default_sparams.temp);
        slot.sparams.dynatemp_range = json_value(data, "dynatemp_range", default_sparams.dynatemp_range);
        slot.sparams.dynatemp_exponent = json_value(data, "dynatemp_exponent", default_sparams.dynatemp_exponent);
        slot.sparams.penalty_last_n = json_value(data, "repeat_last_n", default_sparams.penalty_last_n);
        slot.sparams.penalty_repeat = json_value(data, "repeat_penalty", default_sparams.penalty_repeat);
        slot.sparams.penalty_freq = json_value(data, "frequency_penalty", default_sparams.penalty_freq);
        slot.sparams.penalty_present = json_value(data, "presence_penalty", default_sparams.penalty_present);
        slot.sparams.mirostat = json_value(data, "mirostat", default_sparams.mirostat);
        slot.sparams.mirostat_tau = json_value(data, "mirostat_tau", default_sparams.mirostat_tau);
        slot.sparams.mirostat_eta = json_value(data, "mirostat_eta", default_sparams.mirostat_eta);
        slot.sparams.penalize_nl = json_value(data, "penalize_nl", default_sparams.penalize_nl);
        slot.params.n_keep = json_value(data, "n_keep", slot.params.n_keep);
2024-03-26 09:47:43 +01:00
        slot.params.n_discard = json_value(data, "n_discard", default_params.n_discard);
2024-04-24 11:08:36 +02:00
        slot.sparams.seed = json_value(data, "seed", default_sparams.seed);
2024-03-25 09:42:17 +01:00
        slot.sparams.n_probs = json_value(data, "n_probs", default_sparams.n_probs);
        slot.sparams.min_keep = json_value(data, "min_keep", default_sparams.min_keep);
        // process "json_schema" and "grammar"
2024-05-08 21:53:08 +02:00
        if (data.contains("json_schema") && !data.at("json_schema").is_null() && data.contains("grammar") && !data.at("grammar").is_null()) {
2024-03-25 09:42:17 +01:00
            send_error(task, "Either \"json_schema\" or \"grammar\" can be specified, but not both", ERROR_TYPE_INVALID_REQUEST);
            return false;
2024-09-07 14:16:19 +02:00
        }
        if (data.contains("json_schema") && !data.contains("grammar")) {
2024-03-21 12:50:43 +01:00
            try {
2024-03-25 09:42:17 +01:00
                auto schema = json_value(data, "json_schema", json::object());
2024-03-21 12:50:43 +01:00
                slot.sparams.grammar = json_schema_to_grammar(schema);
            } catch (const std::exception & e) {
                send_error(task, std::string("\"json_schema\": ") + e.what(), ERROR_TYPE_INVALID_REQUEST);
                return false;
            }
        } else {
            slot.sparams.grammar = json_value(data, "grammar", default_sparams.grammar);
        }
2024-03-07 10:41:53 +01:00
        if (slot.params.cache_prompt && slot.ga_n != 1) {
            LOG_WARNING("cache_prompt is not supported with group-attention", {});
            slot.params.cache_prompt = false;
        }
        if (slot.n_predict > 0 && slot.params.n_predict > slot.n_predict) {
2024-02-18 17:30:09 +01:00
            // Might be better to reject the request with a 400 ?
            LOG_WARNING("Max tokens to predict exceeds server configuration", {
2024-03-07 10:41:53 +01:00
                {"params.n_predict", slot.params.n_predict},
                {"slot.n_predict", slot.n_predict},
2024-02-18 17:30:09 +01:00
            });
2024-03-07 10:41:53 +01:00
            slot.params.n_predict = slot.n_predict;
2024-02-18 17:30:09 +01:00
        }
2023-10-22 21:53:08 +02:00
        // infill
2024-03-07 10:41:53 +01:00
        slot.params.input_prefix = json_value(data, "input_prefix", default_params.input_prefix);
        slot.params.input_suffix = json_value(data, "input_suffix", default_params.input_suffix);
2024-03-09 12:16:53 +01:00
        // get prompt
2024-09-02 17:11:51 +02:00
        if (task.cmpl_type != SERVER_TASK_CMPL_TYPE_INFILL) {
2024-03-09 12:16:53 +01:00
            const auto & prompt = data.find("prompt");
            if (prompt == data.end()) {
2024-06-10 13:59:55 +02:00
                send_error(task, "\"prompt\" must be provided", ERROR_TYPE_INVALID_REQUEST);
2024-03-11 10:56:41 +01:00
                return false;
2024-03-09 12:16:53 +01:00
            }
2024-06-10 13:59:55 +02:00
2024-06-12 13:42:29 +02:00
            if ((prompt->is_string()) ||
                (prompt->is_array() && prompt->size() == 1 && prompt->at(0).is_string()) ||
                (prompt->is_array() && !prompt->empty() && prompt->at(0).is_number_integer())) {
                slot.prompt = *prompt;
2024-08-09 08:32:02 +02:00
            } else if (prompt->is_array() && prompt->size() == 1 && prompt->at(0).is_array()) {
                slot.prompt = prompt->at(0);
2024-06-10 13:59:55 +02:00
            } else {
2024-06-12 13:42:29 +02:00
                send_error(task, "\"prompt\" must be a string or an array of integers", ERROR_TYPE_INVALID_REQUEST);
2024-03-11 10:56:41 +01:00
                return false;
            }
2024-03-09 12:16:53 +01:00
        }
2023-10-10 09:31:21 +02:00
2023-10-02 09:42:02 +02:00
        {
2024-03-07 10:41:53 +01:00
            slot.sparams.logit_bias.clear();
2023-10-02 09:42:02 +02:00
2024-08-12 09:21:50 +02:00
            if (json_value(data, "ignore_eos", false) && has_eos_token) {
2024-09-07 14:16:19 +02:00
                slot.sparams.logit_bias.push_back({llama_token_eos(model), -INFINITY});
2024-03-07 10:41:53 +01:00
            }
2024-02-11 14:38:14 +01:00
2024-03-07 10:41:53 +01:00
            const auto & logit_bias = data.find("logit_bias");
            if (logit_bias != data.end() && logit_bias->is_array()) {
                const int n_vocab = llama_n_vocab(model);
                for (const auto & el : *logit_bias) {
2024-03-11 10:56:41 +01:00
                    // TODO: we may want to throw errors here, in case "el" is incorrect
2024-03-07 10:41:53 +01:00
                    if (el.is_array() && el.size() == 2) {
                        float bias;
                        if (el[1].is_number()) {
                            bias = el[1].get<float>();
                        } else if (el[1].is_boolean() && !el[1].get<bool>()) {
                            bias = -INFINITY;
                        } else {
                            continue;
2023-10-22 21:53:08 +02:00
                        }
2024-03-07 10:41:53 +01:00
                        if (el[0].is_number_integer()) {
                            llama_token tok = el[0].get<llama_token>();
                            if (tok >= 0 && tok < n_vocab) {
2024-09-07 14:16:19 +02:00
                                slot.sparams.logit_bias.push_back({tok, bias});
2024-03-07 10:41:53 +01:00
                            }
                        } else if (el[0].is_string()) {
                            auto toks = llama_tokenize(model, el[0].get<std::string>(), false);
                            for (auto tok : toks) {
2024-09-07 14:16:19 +02:00
                                slot.sparams.logit_bias.push_back({tok, bias});
2024-03-07 10:41:53 +01:00
                            }
2023-10-22 21:53:08 +02:00
                        }
                    }
                }
            }
2023-10-02 09:42:02 +02:00
        }
2023-10-20 20:07:23 +02:00
2023-10-02 09:42:02 +02:00
        {
2024-03-07 10:41:53 +01:00
            slot.params.antiprompt.clear();
2023-10-02 09:42:02 +02:00
2024-03-07 10:41:53 +01:00
            const auto & stop = data.find("stop");
            if (stop != data.end() && stop->is_array()) {
                for (const auto & word : *stop) {
                    if (!word.empty()) {
                        slot.params.antiprompt.push_back(word);
                    }
2024-02-16 12:33:25 +01:00
                }
            }
        }
2023-10-22 21:53:08 +02:00
        {
2024-09-07 14:16:19 +02:00
            const auto & samplers = data.find("samplers");
            if (samplers != data.end() && samplers->is_array()) {
2024-03-07 10:41:53 +01:00
                std::vector<std::string> sampler_names;
2024-09-07 14:16:19 +02:00
                for (const auto & name : *samplers) {
                    if (name.is_string()) {
                        sampler_names.emplace_back(name);
2023-10-22 21:53:08 +02:00
                    }
                }
2024-09-07 14:16:19 +02:00
                slot.sparams.samplers = gpt_sampler_types_from_names(sampler_names, false);
2024-03-07 10:41:53 +01:00
            } else {
2024-09-07 14:16:19 +02:00
                slot.sparams.samplers = default_sparams.samplers;
2023-10-22 21:53:08 +02:00
            }
        }
2023-10-12 08:29:04 +02:00
2023-10-02 09:42:02 +02:00
        {
2024-09-07 14:16:19 +02:00
            if (slot.smpl != nullptr) {
                gpt_sampler_free(slot.smpl);
2024-03-07 10:41:53 +01:00
            }
2024-09-07 14:16:19 +02:00
            slot.smpl = gpt_sampler_init(model, slot.sparams);
            if (slot.smpl == nullptr) {
2024-03-11 10:56:41 +01:00
                // for now, the only error that may happen here is invalid grammar
                send_error(task, "Failed to parse grammar", ERROR_TYPE_INVALID_REQUEST);
                return false;
            }
2023-10-02 09:42:02 +02:00
        }
2024-09-06 23:21:29 +02:00
        slot.state = SLOT_STATE_PROCESSING_PROMPT;
2024-03-07 10:41:53 +01:00
        slot.prompt_tokens.clear();
2023-10-12 08:29:04 +02:00
2024-02-25 13:50:32 +01:00
        LOG_INFO("slot is processing task", {
2024-03-07 10:41:53 +01:00
            {"id_slot", slot.id},
            {"id_task", slot.id_task},
2024-02-25 13:50:32 +01:00
        });
2023-10-02 09:42:02 +02:00
2023-10-22 21:53:08 +02:00
        return true;
    }
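One detail of `launch_slot_with_task()` that is easy to miss is how `n_predict` is resolved: the request's `n_predict` wins, `max_tokens` is the fallback, and the result is clamped to the server-wide limit when that limit is positive. The sketch below assumes a toy request type (`toy_request`, a `std::map` of integers) and hypothetical helpers `value_or` and `resolve_n_predict`; it is not the `json_value` helper from `utils.hpp`.

```cpp
#include <map>
#include <string>

// Hypothetical request body for this sketch: integer fields only.
using toy_request = std::map<std::string, int>;

// Returns req[key] if present, otherwise the fallback.
int value_or(const toy_request & req, const std::string & key, int fallback) {
    const auto it = req.find(key);
    return it != req.end() ? it->second : fallback;
}

// Resolution order: "n_predict", then "max_tokens", then the default.
// A positive server-side limit caps the result, as in the warning branch above.
int resolve_n_predict(const toy_request & req, int default_n_predict, int server_n_predict) {
    int n_predict = value_or(req, "n_predict", value_or(req, "max_tokens", default_n_predict));
    if (server_n_predict > 0 && n_predict > server_n_predict) {
        n_predict = server_n_predict;
    }
    return n_predict;
}
```

A request asking for 500 tokens against a server limit of 128 is silently reduced to 128 rather than rejected, which matches the "might be better to reject with a 400" remark in the code.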
    void kv_cache_clear() {
2024-03-07 10:41:53 +01:00
        LOG_VERBOSE("clearing KV cache", {});
2023-10-22 21:53:08 +02:00
        // clear the entire KV cache
2023-10-29 18:31:40 +01:00
        llama_kv_cache_clear(ctx);
2023-10-22 21:53:08 +02:00
        clean_kv_cache = false;
2023-10-02 09:42:02 +02:00
    }
2023-08-23 09:12:12 +02:00
2024-02-29 21:42:11 +01:00
    void system_prompt_update() {
2024-03-07 10:41:53 +01:00
        LOG_VERBOSE("system prompt update", {
            {"system_prompt", system_prompt},
        });
2024-02-16 11:00:56 +01:00
        kv_cache_clear();
        system_tokens.clear();
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2024-02-16 11:00:56 +01:00
if (!system_prompt.empty()) {
2024-04-09 19:44:08 +02:00
system_tokens = ::llama_tokenize(ctx, system_prompt, true);
2023-10-22 21:53:08 +02:00
2024-08-14 08:51:02 +02:00
const int32_t n_batch = llama_n_batch(ctx);
const int32_t n_tokens_prompt = system_tokens.size();
2023-10-22 21:53:08 +02:00
2024-08-14 08:51:02 +02:00
for (int32_t i = 0; i < n_tokens_prompt; i += n_batch) {
const int32_t n_tokens = std::min(n_batch, n_tokens_prompt - i);
2023-06-17 13:53:04 +02:00
2024-08-14 08:51:02 +02:00
llama_batch_clear(batch);
for (int32_t j = 0; j < n_tokens; ++j) {
llama_batch_add(batch, system_tokens[i + j], i + j, { 0 }, false);
}
2024-03-13 18:54:21 +01:00
2024-08-14 08:51:02 +02:00
if (llama_decode(ctx, batch) != 0) {
2024-04-12 13:49:21 +02:00
LOG_ERROR("llama_decode() failed", {});
2024-02-25 19:43:50 +01:00
return;
}
2024-02-16 11:00:56 +01:00
}
2023-10-20 20:07:23 +02:00
2024-02-16 11:00:56 +01:00
// assign the system KV cache to all parallel sequences
llama : support Mamba Selective State Space Models (#5328)
* mamba : begin working on support for Mamba SSM
* mamba : begin figuring out how to (ab)use the kv cache for Mamba
* mamba : recurrent inference almost works, but incoherent
* mamba : recurrent inference WORKS!!!
* convert : optionally use d_conv and d_state from config.json for Mamba
* mamba : refactor recurrent conv, resulting in 20% perf increase
It's still slower than I'd like, but I did not really optimize `ggml_exp` yet.
I also refactored `ggml_exp` to work with tensors with more than 2 dimensions.
* ggml : parallelize ggml_exp
This results in 8% faster token generation for Mamba-130M.
* mamba : simplify the conv step with a self-overlapping view
Turns out the conv_state can be made smaller by one column.
Note that this breaks existing GGUFs of Mamba,
because the key_value_length field is tied to the conv_state size.
Convolution with a self-overlapping view is cool!
And it's much simpler than what I initially thought would be necessary
to make the convolution step work with more than 1 token at a time.
Next step is to make the SSM step work on batches of tokens too,
and thus I need to figure out a way to make a parallel selective scan
which will keep the ssm_state small and won't make it bigger
by a factor of (n_layer * batch_size).
* llama : fix Mamba KV self size wrongly displaying as f16 instead of f32
Relatedly, I also tried to see if other types than f32 worked for the states,
but they don't, because of the operators used.
It's probably better anyway to keep lots of precision there,
since the states are small anyway.
* mamba : fix self-overlapping view depth stride
* mamba : handle batches of more than 1 token
This means running Mamba no longer crashes when using the default settings!
And probably also slightly faster prompt processing.
Both batched and non-batched processing yield the same output.
Previously, the state was not cleared when starting a sequence.
Next step is to make the KV cache API work as expected for Mamba models.
* ggml: add ggml_ssm_scan to help with parallel selective scan
If the selective scan was implemented without a custom operator,
there would be waaay too many nodes in the graph. For example,
for Mamba-130M, with a batch size of 512 (the default),
a naive selective scan could add at least 24*512=12288 nodes,
which is more than LLAMA_MAX_NODES (8192),
and that's only for the smallest Mamba model.
So it's much cleaner with a custom operator.
Not sure about the name, though.
* ggml : in ggml_ssm_scan, merge multiple rows in the same vec operation
This will help with performance on CPU if ggml_vec_mul_f32
and ggml_vec_add_f32 are ever optimized with SIMD.
* mamba : very basic quantization support
Mostly works, but there is currently no difference
between the variants of a k-quant (e.g. Q4_K_S and Q4_K_M are the same).
Most of the SSM-specific weights can be kept in f32 without affecting
the size that much, since they are relatively small.
(the linear projection weights are responsible for most of Mamba's size)
Too much quantization seems to make the state degrade quite fast, and
the model begins to output gibberish.
It seems to affect bigger models to a lesser extent than small models,
but I'm not sure by how much.
Experimentation will be needed to figure out which weights are more important
for the _M (and _L?) variants of k-quants for Mamba.
* convert : fix wrong name for layer norm weight of offical Mamba models
I was using Q-bert/Mamba-* models before, which have a slighlty different
naming scheme for the weights.
(they start with "model.layers" instead of "backbone.layers")
* mamba : fuse more steps of the SSM scan in the ggml_ssm_scan operator
This increases performance on CPU by around 30% for prompt processing,
and by around 20% for text generation.
However, it also makes the ggml_exp and ggml_soft_plus operators unused.
Whether or not they should be kept will be decided later.
* convert : for Mamba, also consider the "MambaLMHeadModel" arch name
It's the name of the class of the official implementation,
though they don't use it (yet) in the "architectures" field of config.json
* mamba : fix vocab size problems with official models
The perplexity was waaaay to high for models with a non-round vocab size.
Not sure why, but it needed to be fixed in the metadata.
Note that this breaks existing GGUF-converted Mamba models,
but **only if** the vocab size was not already rounded.
* ggml : remove ggml_exp and ggml_soft_plus
They did not exist anyway outside of this branch,
and since ggml_ssm_scan fused operations together, they are unused.
It's always possible to bring them back if needed.
* mamba : remove some useless comments
No code change.
* convert : fix flake8 linter errors
* mamba : apply suggestions from code review
* mamba : remove unecessary branch for row-wise ssm_state and C multiplication
It was previously done to avoid permuting when only one token is processed
at a time (like when generating text), but permuting is cheap,
and dynamically changing the compute graph is not future-proof.
* ggml : in ggml_ssm_scan, use more appropriate asserts
* ggml : rename the destination pointer in ggml_compute_forward_ssm_scan_f32
* mamba : multiple sequences, but one at a time
This is a step towards making this Mamba implementation usable
with the server example (the way the system prompt is kept when clearing
the client slots will need to be changed before this can work, though).
The KV cache size for this kind of model is tied to the maximum number
of sequences kept at any single time.
For now, this number is obtained from n_parallel (plus one,
to have an extra sequence to dedicate to the system prompt),
but there might be a better way to do this which won't also
make the main example use 2 cells even if only 1 is really used.
(for this specific case, --parallel 0 helps)
Simultaneous sequence processing will probably require changes to
ggml_ssm_scan, and possibly a new operator for the conv step.
* mamba : support llama_kv_cache_seq_cp
This (mis)uses the logic around K shifts, because tokens in a state
can't be shifted anyway, and because inp_K_shift has the right shape and type.
Using ggml_get_rows is a nice way to do copies, but copy chains can't work.
Fortunately, copy chains don't really seem to be used in the examples.
Each KV cell is dedicated to the sequence ID corresponding to its own index.
* mamba : use a state mask
It's cleaner than the previous heuristic of
checking for the pos of the first token in the batch.
inp_KQ_mask could not be re-used for this, because it has the wrong shape
and because it seems more suited to the next step of
simultaneous sequence processing (helping with the problem of
remembering which token belongs to which sequence(s)/state(s)).
* llama : replace the usage of n_ctx with kv_self.size in many places
* mamba : use n_tokens directly instead of n_tok
* mamba : in comments, properly refer to KV cells instead of slots
* mamba : reduce memory usage of ggml_ssm_scan
From 290.37 MiB to 140.68 MiB of CPU compute buffer size
with Mamba 3B with a batch size of 512.
The result tensor of ggml_ssm_scan was previously a big part
of the CPU compute buffer size. To make it smaller,
it does not contain the intermediate ssm states anymore.
Both y and the last ssm state are combined in the result tensor,
because it seems only a single tensor can be returned by an operator
with the way the graph is built.
* mamba : simultaneous sequence processing
A batch can now contain tokens from multiple sequences.
This is necessary for at least the parallel example, the server example,
and the HellaSwag test in the perplexity example.
However, for this to be useful, uses of llama_kv_cache_seq_rm/cp
will need to be changed to work on whole sequences.
* ggml : add ggml_ssm_conv as a new operator for the conv step of Mamba
This operator makes it possible to use and update the correct states
for each token of the batch in the same way as ggml_ssm_scan.
Other solutions which use existing operators would need loops which would
add too many nodes to the graph (at least the ones I thought of).
Using this operator further reduces the size of the CPU compute buffer
from 140.68 MiB to 103.20 MiB with Mamba 3B with a batch size of 512.
And (at least on CPU), it's a bit faster than before.
Note that "ggml_ssm_conv" is probably not the most appropriate name,
and it could be changed if a better one is found.
* llama : add inp_s_seq as a new input tensor
The most convenient implementation to select the correct state (for Mamba)
for each token is to directly get the correct index from a tensor.
This is why inp_s_seq is storing int32_t and not floats.
The other, less convenient way to select the correct state would be
to have inp_KQ_mask contain 1.0f for each state used by a token
and 0.0f otherwise. This complicates quickly fetching the first used
state of a token, and is also less efficient because a whole row
of the mask would always need to be read for each token.
Using indexes makes it easy to stop searching when there are
no more sequences for a token, and the first sequence assigned
is always very quickly available (it's the first element of each row).
* mamba : support llama_kv_cache_seq_cp copy chains
* mamba : support shifting and dividing the kv cache pos
* mamba : make the server and parallel examples work with whole sequences
A seq_id is dedicated to the system prompt in both cases.
* llama : make llama_kv_cache_seq_rm return whether it succeeded or not
* mamba : dedicate an input tensor for state copy indices
This is cleaner and makes it easier to adapt when/if token positions
(and by extension, inp_K_shift) are no longer integers.
* mamba : adapt perplexity, batched, and batched-bench examples
* perplexity : limit the max number of sequences
This adapts to what the loaded model can provide.
* llama : add llama_n_max_seq to get the upper limit for seq_ids
Used by the perplexity example.
* batched : pass n_parallel to the model's context params
This should have been there already, but it wasn't.
* batched-bench : reserve sequences to support Mamba
* batched-bench : fix tokens being put in wrong sequences
Generation quality isn't what's measured in there anyway,
but at least using the correct sequences avoids using non-consecutive
token positions.
* mamba : stop abusing attention metadata
This breaks existing converted-to-GGUF Mamba models,
but will allow supporting mixed architectures like MambaFormer
without needing to break Mamba models.
This will also allow changing the size of Mamba's states
without having to reconvert models in the future.
(e.g. using something else than d_conv - 1 columns for the conv_states
will not require breaking existing converted Mamba models again)
* gguf-py : add new KV metadata key-value pairs for Mamba
* llama : add new metadata key-value pairs for Mamba
* llama : guard against divisions by zero when n_head is 0
* mamba : rename "unlimited" KV cache property to "recurrent"
* mamba : more correctly update the "used" field of the KV cache
* ggml : in ggml_ssm_scan, use a threshold for soft_plus
This is how the official Mamba implementation does it,
and it's also what torch.nn.Softplus does.
* convert : for Mamba, fallback to internal NeoX tokenizer
The resulting models are exactly the same
as if the tokenizer.json and tokenizer_config.json of GPT-NeoX were there.
* mamba : support state saving and restoring
* ggml : implicitly pass src tensors through dst for Mamba-related ops
* mamba : clarify some comments
* server : fix cache_tokens not getting correctly resized
Otherwise, when the "we have to evaluate at least 1 token" special case
was triggered, an extra token was kept in cache_tokens even if it was
removed from the KV cache.
For Mamba, this caused useless prompt reprocessing when the previous
request triggered the above case.
* convert-hf : support new metadata keys for Mamba
For the models available at
https://huggingface.co/collections/state-spaces/transformers-compatible-mamba-65e7b40ab87e5297e45ae406
* mamba : rename metadata to be more similar to transformers library
This breaks existing converted-to-GGUF models,
but the metadata names are more "standard".
* mamba : support mamba-*-hf models
These models share their token_embd.weight with their output.weight
* mamba : add missing spaces
This is purely a formatting change.
* convert-hf : omit output.weight when identical with token_embd.weight
Only for Mamba for now, but it might be relevant for other models eventually.
Most Mamba models actually share these two tensors, albeit implicitly.
* readme : add Mamba to supported models, and add recent API changes
* mamba : move state_seq and state_mask views outside layer loop
A few tensors were also missing `struct` in front of `ggml_tensor`.
2024-03-08 23:31:00 +01:00
for (int32_t i = 1; i <= params.n_parallel; ++i) {
llama_kv_cache_seq_cp(ctx, 0, i, -1, -1);
2024-02-16 11:00:56 +01:00
}
2023-05-21 19:51:18 +02:00
}
2023-10-22 21:53:08 +02:00
system_need_update = false;
}
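The chunking done by `system_prompt_update` above can be illustrated in isolation. The following is a minimal sketch, assuming only the loop bounds shown in the source; `split_into_batches` is a hypothetical helper that is not part of the server, and it returns plain (offset, count) pairs instead of calling `llama_decode`.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper (not in the server source): split n_tokens_total tokens
// into consecutive (offset, count) chunks of at most n_batch tokens each,
// using the same bounds as the system prompt loop.
std::vector<std::pair<int32_t, int32_t>> split_into_batches(int32_t n_tokens_total, int32_t n_batch) {
    std::vector<std::pair<int32_t, int32_t>> chunks;
    if (n_batch <= 0) {
        return chunks; // assumption: treat a non-positive batch size as "nothing to do"
    }
    for (int32_t i = 0; i < n_tokens_total; i += n_batch) {
        // the last chunk may be shorter than n_batch
        chunks.emplace_back(i, std::min(n_batch, n_tokens_total - i));
    }
    return chunks;
}
```

With 10 tokens and a batch size of 4, the sketch yields the chunks (0, 4), (4, 4) and (8, 2); each chunk corresponds to one batch that the real code would fill and decode.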
2023-09-28 18:04:36 +02:00
2024-05-11 17:28:10 +02:00
bool system_prompt_set(const std::string & sys_prompt) {
system_prompt = sys_prompt;
2023-06-17 13:53:04 +02:00
2024-03-07 10:41:53 +01:00
LOG_VERBOSE("system prompt process", {
{"system_prompt", system_prompt},
});
2024-02-16 11:00:56 +01:00
2024-03-07 10:41:53 +01:00
// release all slots
for (server_slot & slot : slots) {
slot.release();
2023-10-22 21:53:08 +02:00
}
2024-03-07 10:41:53 +01:00
system_need_update = true;
2024-05-11 17:28:10 +02:00
return true;
2023-10-22 21:53:08 +02:00
}
2024-03-07 10:41:53 +01:00
bool process_token(completion_token_output & result, server_slot & slot) {
2023-10-22 21:53:08 +02:00
// remember which tokens were sampled - used for repetition penalties during sampling
2024-07-18 10:06:22 +02:00
const std::string token_str = llama_token_to_piece(ctx, result.tok, params.special);
2023-10-22 21:53:08 +02:00
slot.sampled = result.tok;
2023-06-17 13:53:04 +02:00
2023-10-22 21:53:08 +02:00
// search stop word and delete it
slot.generated_text += token_str;
slot.has_next_token = true;
2023-12-13 20:57:15 +01:00
// check if there is an incomplete UTF-8 character at the end
bool incomplete = false;
2024-03-07 10:41:53 +01:00
for (unsigned i = 1; i < 5 && i <= slot.generated_text.size(); ++i) {
2023-12-13 20:57:15 +01:00
unsigned char c = slot.generated_text[slot.generated_text.size() - i];
2024-03-07 10:41:53 +01:00
if ((c & 0xC0) == 0x80) {
2023-12-13 20:57:15 +01:00
// continuation byte: 10xxxxxx
continue;
}
2024-03-07 10:41:53 +01:00
if ((c & 0xE0) == 0xC0) {
2023-12-13 20:57:15 +01:00
// 2-byte character: 110xxxxx ...
incomplete = i < 2;
2024-03-07 10:41:53 +01:00
} else if ((c & 0xF0) == 0xE0) {
2023-12-13 20:57:15 +01:00
// 3-byte character: 1110xxxx ...
incomplete = i < 3;
2024-03-07 10:41:53 +01:00
} else if ((c & 0xF8) == 0xF0) {
2023-12-13 20:57:15 +01:00
// 4-byte character: 11110xxx ...
incomplete = i < 4;
2023-10-22 21:53:08 +02:00
}
2023-12-13 20:57:15 +01:00
// else 1-byte character or invalid byte
break;
2023-05-21 19:51:18 +02:00
}
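The trailing-byte scan above can be tried on its own. The following is a minimal standalone sketch of the same check; the function name `ends_with_incomplete_utf8` is an assumption made for illustration and does not exist in the server source.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (name is an assumption): returns true when `text` ends
// in the middle of a multi-byte UTF-8 sequence, using the same bit masks as
// the loop in process_token.
bool ends_with_incomplete_utf8(const std::string & text) {
    // A UTF-8 sequence is at most 4 bytes long, so look back at most 4 bytes.
    for (std::size_t i = 1; i < 5 && i <= text.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(text[text.size() - i]);
        if ((c & 0xC0) == 0x80) {
            continue; // continuation byte 10xxxxxx: keep scanning backwards
        }
        if ((c & 0xE0) == 0xC0) {
            return i < 2; // lead byte of a 2-byte sequence
        }
        if ((c & 0xF0) == 0xE0) {
            return i < 3; // lead byte of a 3-byte sequence
        }
        if ((c & 0xF8) == 0xF0) {
            return i < 4; // lead byte of a 4-byte sequence
        }
        return false; // 1-byte character or invalid byte
    }
    return false; // empty input, or only continuation bytes in the last 4 bytes
}
```

For instance, a string ending in the single byte `0xC3` is reported as incomplete, while the full two-byte sequence `0xC3 0xA9` is not. Holding text back in the first case avoids sending half of a character to the client.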
2023-06-17 13:53:04 +02:00
2024-03-07 10:41:53 +01:00
if (!incomplete) {
2024-02-29 21:42:11 +01:00
size_t pos = std::min(slot.n_sent_text, slot.generated_text.size());
2024-03-07 10:41:53 +01:00
2023-10-22 21:53:08 +02:00
const std::string str_test = slot.generated_text.substr(pos);
bool is_stop_full = false;
2024-03-07 10:41:53 +01:00
size_t stop_pos = slot.find_stopping_strings(str_test, token_str.size(), STOP_TYPE_FULL);
if (stop_pos != std::string::npos) {
2023-10-22 21:53:08 +02:00
is_stop_full = true;
slot.generated_text.erase(
slot.generated_text.begin() + pos + stop_pos,
slot.generated_text.end());
2024-02-29 21:42:11 +01:00
pos = std::min(slot.n_sent_text, slot.generated_text.size());
2024-03-07 10:41:53 +01:00
} else {
2023-10-22 21:53:08 +02:00
is_stop_full = false;
2024-03-07 10:41:53 +01:00
stop_pos = slot.find_stopping_strings(str_test, token_str.size(), STOP_TYPE_PARTIAL);
2023-06-17 13:53:04 +02:00
}
2023-09-28 18:04:36 +02:00
2023-10-22 21:53:08 +02:00
            // check if there is any token to predict
2024-03-07 10:41:53 +01:00
            if (stop_pos == std::string::npos || (!slot.has_next_token && !is_stop_full && stop_pos > 0)) {
2023-10-22 21:53:08 +02:00
                // do not send the stop word in the response
                result.text_to_send = slot.generated_text.substr(pos, std::string::npos);
2024-02-29 21:42:11 +01:00
                slot.n_sent_text += result.text_to_send.size();
2023-10-22 21:53:08 +02:00
                // add the token to slot queue and cache
            }
2024-03-07 10:41:53 +01:00
2023-10-22 21:53:08 +02:00
            slot.add_token_string(result);
2024-03-07 10:41:53 +01:00
            if (slot.params.stream) {
2023-10-22 21:53:08 +02:00
                send_partial_response(slot, result);
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort: there are 24 PRs closed in the submitter's repo alone, over 160 commits, and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
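
The "fixes multi-byte character handling" item above depends on noticing that the generated text currently ends in the middle of a UTF-8 sequence, in which case the text is held back until more bytes arrive. The following is a minimal sketch of such a check, assuming a hypothetical free function named `ends_with_incomplete_utf8`; it inspects at most the last four bytes and compares the length announced by the lead byte with the number of bytes actually present.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (illustrative only): true when `text` ends with a
// UTF-8 lead byte whose continuation bytes have not all arrived yet.
bool ends_with_incomplete_utf8(const std::string & text) {
    // A UTF-8 sequence is at most 4 bytes long, so look back at most 4 bytes.
    for (std::size_t i = 1; i < 5 && i <= text.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(text[text.size() - i]);
        if ((c & 0xC0) == 0x80) {
            continue;            // 10xxxxxx: continuation byte, keep looking back
        }
        if ((c & 0xE0) == 0xC0) {
            return i < 2;        // 110xxxxx: 2-byte sequence
        }
        if ((c & 0xF0) == 0xE0) {
            return i < 3;        // 1110xxxx: 3-byte sequence
        }
        if ((c & 0xF8) == 0xF0) {
            return i < 4;        // 11110xxx: 4-byte sequence
        }
        return false;            // ASCII or invalid lead byte: nothing pending
    }
    return false;                // empty text, or only continuation bytes seen
}
```

A caller could keep requesting tokens while this returns true, which is the role the `incomplete` flag plays in the surrounding code.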
2023-06-17 13:53:04 +02:00
}
2023-05-21 19:51:18 +02:00
}
2023-06-17 13:53:04 +02:00
2024-03-07 10:41:53 +01:00
        if (incomplete) {
2023-10-22 21:53:08 +02:00
            slot.has_next_token = true;
2023-06-20 00:12:39 +02:00
        }
2023-10-22 21:53:08 +02:00
        // check the limits
2024-03-07 10:41:53 +01:00
        if (slot.n_decoded > 0 && slot.has_next_token && !slot.has_budget(params)) {
            slot.stopped_limit = true;
2023-10-22 21:53:08 +02:00
            slot.has_next_token = false;
2024-03-07 10:41:53 +01:00
            LOG_VERBOSE("stopped by limit", {
                {"id_slot", slot.id},
2024-03-09 10:30:04 +01:00
                {"id_task", slot.id_task},
2024-03-07 10:41:53 +01:00
                {"n_decoded", slot.n_decoded},
                {"n_predict", slot.params.n_predict},
            });
2023-10-22 21:53:08 +02:00
        }
2023-06-17 13:53:04 +02:00
2024-04-21 13:50:41 +02:00
        if (llama_token_is_eog(model, result.tok)) {
2024-03-07 10:41:53 +01:00
            slot.stopped_eos = true;
2023-10-22 21:53:08 +02:00
            slot.has_next_token = false;
2024-03-07 10:41:53 +01:00
2023-10-22 21:53:08 +02:00
            LOG_VERBOSE("eos token found", {});
        }
2024-04-26 12:15:30 +02:00
        auto n_ctx_train = llama_n_ctx_train(model);
2024-04-27 17:50:48 +02:00
        if (slot.params.n_predict < 1 && slot.n_predict < 1 && slot.ga_n == 1
2024-04-26 12:15:30 +02:00
                && slot.n_prompt_tokens + slot.n_decoded >= n_ctx_train) {
            LOG_WARNING("n_predict is not set and self-context extend is disabled."
                        " Limiting generated tokens to n_ctx_train to avoid EOS-less generation infinite loop", {
                {"id_slot", slot.id},
                {"params.n_predict", slot.params.n_predict},
                {"slot.n_prompt_tokens", slot.n_prompt_tokens},
                {"slot.n_decoded", slot.n_decoded},
                {"slot.n_predict", slot.n_predict},
                {"n_slots", params.n_parallel},
                {"slot.n_ctx", slot.n_ctx},
                {"n_ctx", n_ctx},
                {"n_ctx_train", n_ctx_train},
                {"ga_n", slot.ga_n},
            });
            slot.truncated = true;
            slot.stopped_limit = true;
            slot.has_next_token = false; // stop prediction
        }
2023-10-22 21:53:08 +02:00
        LOG_VERBOSE("next token", {
2024-03-09 10:30:04 +01:00
            {"id_slot", slot.id},
            {"id_task", slot.id_task},
2024-03-07 10:41:53 +01:00
            {"token", result.tok},
            {"token_text", tokens_to_output_formatted_string(ctx, result.tok)},
            {"has_next_token", slot.has_next_token},
            {"n_remain", slot.n_remaining},
            {"n_decoded", slot.n_decoded},
            {"stopped_eos", slot.stopped_eos},
            {"stopped_word", slot.stopped_word},
            {"stopped_limit", slot.stopped_limit},
            {"stopping_word", slot.stopping_word},
        });
2023-10-22 21:53:08 +02:00
        return slot.has_next_token; // continue
    }
2023-08-08 15:29:19 +02:00
2024-03-07 10:41:53 +01:00
    json get_formated_generation(const server_slot & slot) const {
2024-09-07 14:16:19 +02:00
        std::vector<std::string> samplers;
        samplers.reserve(slot.sparams.samplers.size());
        for (const auto & sampler : slot.sparams.samplers) {
            samplers.emplace_back(gpt_sampler_type_to_str(sampler));
2024-02-16 12:33:25 +01:00
        }
2023-10-22 21:53:08 +02:00
        return json {
2024-03-07 10:41:53 +01:00
            {"n_ctx", slot.n_ctx},
2024-08-15 09:28:05 +02:00
            {"n_predict", slot.n_predict}, // Server configured n_predict
2024-03-07 10:41:53 +01:00
            {"model", params.model_alias},
2024-05-19 16:06:33 +02:00
            {"seed", slot.sparams.seed},
2024-03-07 10:41:53 +01:00
            {"temperature", slot.sparams.temp},
            {"dynatemp_range", slot.sparams.dynatemp_range},
            {"dynatemp_exponent", slot.sparams.dynatemp_exponent},
            {"top_k", slot.sparams.top_k},
            {"top_p", slot.sparams.top_p},
            {"min_p", slot.sparams.min_p},
            {"tfs_z", slot.sparams.tfs_z},
2024-09-07 14:16:19 +02:00
            {"typical_p", slot.sparams.typ_p},
2024-03-07 10:41:53 +01:00
            {"repeat_last_n", slot.sparams.penalty_last_n},
            {"repeat_penalty", slot.sparams.penalty_repeat},
            {"presence_penalty", slot.sparams.penalty_present},
            {"frequency_penalty", slot.sparams.penalty_freq},
            {"mirostat", slot.sparams.mirostat},
            {"mirostat_tau", slot.sparams.mirostat_tau},
            {"mirostat_eta", slot.sparams.mirostat_eta},
            {"penalize_nl", slot.sparams.penalize_nl},
            {"stop", slot.params.antiprompt},
2024-08-15 09:28:05 +02:00
            {"max_tokens", slot.params.n_predict}, // User configured n_predict
2024-03-22 12:12:05 +01:00
            {"n_keep", slot.params.n_keep},
2024-03-26 09:47:43 +01:00
            {"n_discard", slot.params.n_discard},
2024-09-07 14:16:19 +02:00
            {"ignore_eos", slot.sparams.ignore_eos},
2024-03-07 10:41:53 +01:00
            {"stream", slot.params.stream},
2024-09-07 14:16:19 +02:00
            //{"logit_bias", slot.sparams.logit_bias},
2024-03-07 10:41:53 +01:00
            {"n_probs", slot.sparams.n_probs},
            {"min_keep", slot.sparams.min_keep},
            {"grammar", slot.sparams.grammar},
2024-09-07 14:16:19 +02:00
            {"samplers", samplers},
2023-10-22 21:53:08 +02:00
        };
    }
2024-03-11 10:56:41 +01:00
    void send_error(const server_task & task, const std::string & error, const enum error_type type = ERROR_TYPE_SERVER) {
2024-09-02 17:11:51 +02:00
        send_error(task.id, error, type);
2024-03-11 10:56:41 +01:00
    }
    void send_error(const server_slot & slot, const std::string & error, const enum error_type type = ERROR_TYPE_SERVER) {
2024-09-02 17:11:51 +02:00
        send_error(slot.id_task, error, type);
2024-03-11 10:56:41 +01:00
    }
2024-09-02 17:11:51 +02:00
    void send_error(const int id_task, const std::string & error, const enum error_type type = ERROR_TYPE_SERVER) {
2024-04-12 13:49:21 +02:00
        LOG_ERROR("task error", {
            {"id_task", id_task},
            {"error", error},
        });
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
        server_task_result res;
2024-03-11 10:56:41 +01:00
        res.id = id_task;
2024-03-07 10:41:53 +01:00
        res.stop = false;
        res.error = true;
2024-03-11 10:56:41 +01:00
        res.data = format_error_response(error, type);
2024-03-07 10:41:53 +01:00
        queue_results.send(res);
    }
    void send_partial_response(server_slot & slot, completion_token_output tkn) {
        server_task_result res;
        res.id = slot.id_task;
        res.error = false;
        res.stop = false;
        res.data = json {
2023-10-22 21:53:08 +02:00
            {"content", tkn.text_to_send},
            {"stop", false},
2024-03-07 10:41:53 +01:00
            {"id_slot", slot.id},
2024-09-02 17:11:51 +02:00
            {"multimodal", false},
            {"index", slot.index},
2023-10-22 21:53:08 +02:00
        };
2024-03-07 10:41:53 +01:00
        if (slot.sparams.n_probs > 0) {
2023-10-22 21:53:08 +02:00
            const std::vector<llama_token> to_send_toks = llama_tokenize(ctx, tkn.text_to_send, false);
2024-03-07 10:41:53 +01:00
            const size_t probs_pos = std::min(slot.n_sent_token_probs, slot.generated_token_probs.size());
            const size_t probs_stop_pos = std::min(slot.n_sent_token_probs + to_send_toks.size(), slot.generated_token_probs.size());
            std::vector<completion_token_output> probs_output;
            if (probs_pos < probs_stop_pos) {
                probs_output = std::vector<completion_token_output>(
                    slot.generated_token_probs.begin() + probs_pos,
                    slot.generated_token_probs.begin() + probs_stop_pos);
2023-07-02 23:38:44 +02:00
            }
2024-02-29 21:42:11 +01:00
            slot.n_sent_token_probs = probs_stop_pos;
2024-03-07 10:41:53 +01:00
            res.data["completion_probabilities"] = probs_vector_to_json(ctx, probs_output);
2023-10-22 21:53:08 +02:00
        }
2023-08-08 15:29:19 +02:00
2024-03-07 10:41:53 +01:00
        if (slot.oaicompat) {
            res.data["oaicompat_token_ctr"] = slot.n_decoded;
            res.data["model"] = slot.oaicompat_model;
2023-11-25 10:29:06 +01:00
        }
2024-01-26 13:42:20 +01:00
        queue_results.send(res);
2023-10-22 21:53:08 +02:00
    }
2024-03-07 10:41:53 +01:00
    void send_final_response(const server_slot & slot) {
        server_task_result res;
        res.id = slot.id_task;
        res.error = false;
        res.stop = true;
        res.data = json {
2023-10-22 21:53:08 +02:00
            {"content", !slot.params.stream ? slot.generated_text : ""},
2024-03-07 10:41:53 +01:00
            {"id_slot", slot.id},
2023-10-22 21:53:08 +02:00
            {"stop", true},
            {"model", params.model_alias},
            {"tokens_predicted", slot.n_decoded},
2024-02-29 21:42:11 +01:00
            {"tokens_evaluated", slot.n_prompt_tokens},
2023-10-22 21:53:08 +02:00
            {"generation_settings", get_formated_generation(slot)},
            {"prompt", slot.prompt},
            {"truncated", slot.truncated},
            {"stopped_eos", slot.stopped_eos},
            {"stopped_word", slot.stopped_word},
            {"stopped_limit", slot.stopped_limit},
            {"stopping_word", slot.stopping_word},
            {"tokens_cached", slot.n_past},
2024-09-02 17:11:51 +02:00
            {"timings", slot.get_formated_timings()},
            {"index", slot.index},
2023-10-22 21:53:08 +02:00
        };
2024-03-07 10:41:53 +01:00
        if (slot.sparams.n_probs > 0) {
            std::vector<completion_token_output> probs;
            if (!slot.params.stream && slot.stopped_word) {
2023-10-22 21:53:08 +02:00
                const std::vector<llama_token> stop_word_toks = llama_tokenize(ctx, slot.stopping_word, false);
2024-03-07 10:41:53 +01:00
2024-05-04 11:06:40 +02:00
                size_t safe_offset = std::min(slot.generated_token_probs.size(), stop_word_toks.size());
2023-10-22 21:53:08 +02:00
                probs = std::vector<completion_token_output>(
2024-03-07 10:41:53 +01:00
                    slot.generated_token_probs.begin(),
2024-05-04 11:06:40 +02:00
                    slot.generated_token_probs.end() - safe_offset);
2024-03-07 10:41:53 +01:00
            } else {
                probs = std::vector<completion_token_output>(
                    slot.generated_token_probs.begin(),
                    slot.generated_token_probs.end());
2023-10-05 16:02:55 +02:00
            }
2024-03-07 10:41:53 +01:00
            res.data["completion_probabilities"] = probs_vector_to_json(ctx, probs);
2023-05-21 19:51:18 +02:00
        }
2024-03-07 10:41:53 +01:00
        if (slot.oaicompat) {
            res.data["oaicompat_token_ctr"] = slot.n_decoded;
            res.data["model"] = slot.oaicompat_model;
2023-11-25 10:29:06 +01:00
        }
2024-01-26 13:42:20 +01:00
        queue_results.send(res);
2023-10-22 21:53:08 +02:00
    }
2023-06-17 13:53:04 +02:00
2024-03-07 10:41:53 +01:00
    void send_embedding(const server_slot & slot, const llama_batch & batch) {
        server_task_result res;
        res.id = slot.id_task;
        res.error = false;
        res.stop = true;
2023-10-22 21:53:08 +02:00
        const int n_embd = llama_n_embd(model);
2024-03-04 21:31:20 +01:00
2024-03-09 13:27:58 +01:00
        std::vector<float> embd_res(n_embd, 0.0f);
2024-03-07 10:41:53 +01:00
        for (int i = 0; i < batch.n_tokens; ++i) {
llama : support Mamba Selective State Space Models (#5328)
* mamba : begin working on support for Mamba SSM
* mamba : begin figuring out how to (ab)use the kv cache for Mamba
* mamba : recurrent inference almost works, but incoherent
* mamba : recurrent inference WORKS!!!
* convert : optionally use d_conv and d_state from config.json for Mamba
* mamba : refactor recurrent conv, resulting in 20% perf increase
It's still slower than I'd like, but I did not really optimize `ggml_exp` yet.
I also refactored `ggml_exp` to work with tensors with more than 2 dimensions.
* ggml : parallelize ggml_exp
This results in 8% faster token generation for Mamba-130M.
* mamba : simplify the conv step with a self-overlapping view
Turns out the conv_state can be made smaller by one column.
Note that this breaks existing GGUFs of Mamba,
because the key_value_length field is tied to the conv_state size.
Convolution with a self-overlapping view is cool!
And it's much simpler than what I initially thought would be necessary
to make the convolution step work with more than 1 token at a time.
Next step is to make the SSM step work on batches of tokens too,
and thus I need to figure out a way to make a parallel selective scan
which will keep the ssm_state small and won't make it bigger
by a factor of (n_layer * batch_size).
* llama : fix Mamba KV self size wrongly displaying as f16 instead of f32
Relatedly, I also tried to see if other types than f32 worked for the states,
but they don't, because of the operators used.
It's probably better anyway to keep lots of precision there,
since the states are small anyway.
* mamba : fix self-overlapping view depth stride
* mamba : handle batches of more than 1 token
This means running Mamba no longer crashes when using the default settings!
And probably also slightly faster prompt processing.
Both batched and non-batched processing yield the same output.
Previously, the state was not cleared when starting a sequence.
Next step is to make the KV cache API work as expected for Mamba models.
* ggml: add ggml_ssm_scan to help with parallel selective scan
If the selective scan was implemented without a custom operator,
there would be waaay too many nodes in the graph. For example,
for Mamba-130M, with a batch size of 512 (the default),
a naive selective scan could add at least 24*512=12288 nodes,
which is more than LLAMA_MAX_NODES (8192),
and that's only for the smallest Mamba model.
So it's much cleaner with a custom operator.
Not sure about the name, though.
* ggml : in ggml_ssm_scan, merge multiple rows in the same vec operation
This will help with performance on CPU if ggml_vec_mul_f32
and ggml_vec_add_f32 are ever optimized with SIMD.
* mamba : very basic quantization support
Mostly works, but there is currently no difference
between the variants of a k-quant (e.g. Q4_K_S and Q4_K_M are the same).
Most of the SSM-specific weights can be kept in f32 without affecting
the size that much, since they are relatively small.
(the linear projection weights are responsible for most of Mamba's size)
Too much quantization seems to make the state degrade quite fast, and
the model begins to output gibberish.
It seems to affect bigger models to a lesser extent than small models,
but I'm not sure by how much.
Experimentation will be needed to figure out which weights are more important
for the _M (and _L?) variants of k-quants for Mamba.
* convert : fix wrong name for layer norm weight of official Mamba models
I was using Q-bert/Mamba-* models before, which have a slightly different
naming scheme for the weights.
(they start with "model.layers" instead of "backbone.layers")
* mamba : fuse more steps of the SSM scan in the ggml_ssm_scan operator
This increases performance on CPU by around 30% for prompt processing,
and by around 20% for text generation.
However, it also makes the ggml_exp and ggml_soft_plus operators unused.
Whether or not they should be kept will be decided later.
* convert : for Mamba, also consider the "MambaLMHeadModel" arch name
It's the name of the class of the official implementation,
though they don't use it (yet) in the "architectures" field of config.json
* mamba : fix vocab size problems with official models
The perplexity was waaaay too high for models with a non-round vocab size.
Not sure why, but it needed to be fixed in the metadata.
Note that this breaks existing GGUF-converted Mamba models,
but **only if** the vocab size was not already rounded.
* ggml : remove ggml_exp and ggml_soft_plus
They did not exist anyway outside of this branch,
and since ggml_ssm_scan fused operations together, they are unused.
It's always possible to bring them back if needed.
* mamba : remove some useless comments
No code change.
* convert : fix flake8 linter errors
* mamba : apply suggestions from code review
* mamba : remove unnecessary branch for row-wise ssm_state and C multiplication
It was previously done to avoid permuting when only one token is processed
at a time (like when generating text), but permuting is cheap,
and dynamically changing the compute graph is not future-proof.
* ggml : in ggml_ssm_scan, use more appropriate asserts
* ggml : rename the destination pointer in ggml_compute_forward_ssm_scan_f32
* mamba : multiple sequences, but one at a time
This is a step towards making this Mamba implementation usable
with the server example (the way the system prompt is kept when clearing
the client slots will need to be changed before this can work, though).
The KV cache size for this kind of model is tied to the maximum number
of sequences kept at any single time.
For now, this number is obtained from n_parallel (plus one,
to have an extra sequence to dedicate to the system prompt),
but there might be a better way to do this which won't also
make the main example use 2 cells even if only 1 is really used.
(for this specific case, --parallel 0 helps)
Simultaneous sequence processing will probably require changes to
ggml_ssm_scan, and possibly a new operator for the conv step.
* mamba : support llama_kv_cache_seq_cp
This (mis)uses the logic around K shifts, because tokens in a state
can't be shifted anyway, and because inp_K_shift has the right shape and type.
Using ggml_get_rows is a nice way to do copies, but copy chains can't work.
Fortunately, copy chains don't really seem to be used in the examples.
Each KV cell is dedicated to the sequence ID corresponding to its own index.
* mamba : use a state mask
It's cleaner than the previous heuristic of
checking for the pos of the first token in the batch.
inp_KQ_mask could not be re-used for this, because it has the wrong shape
and because it seems more suited to the next step of
simultaneous sequence processing (helping with the problem of
remembering which token belongs to which sequence(s)/state(s)).
* llama : replace the usage of n_ctx with kv_self.size in many places
* mamba : use n_tokens directly instead of n_tok
* mamba : in comments, properly refer to KV cells instead of slots
* mamba : reduce memory usage of ggml_ssm_scan
From 290.37 MiB to 140.68 MiB of CPU compute buffer size
with Mamba 3B with a batch size of 512.
The result tensor of ggml_ssm_scan was previously a big part
of the CPU compute buffer size. To make it smaller,
it does not contain the intermediate ssm states anymore.
Both y and the last ssm state are combined in the result tensor,
because it seems only a single tensor can be returned by an operator
with the way the graph is built.
* mamba : simultaneous sequence processing
A batch can now contain tokens from multiple sequences.
This is necessary for at least the parallel example, the server example,
and the HellaSwag test in the perplexity example.
However, for this to be useful, uses of llama_kv_cache_seq_rm/cp
will need to be changed to work on whole sequences.
* ggml : add ggml_ssm_conv as a new operator for the conv step of Mamba
This operator makes it possible to use and update the correct states
for each token of the batch in the same way as ggml_ssm_scan.
Other solutions which use existing operators would need loops which would
add too many nodes to the graph (at least the ones I thought of).
Using this operator further reduces the size of the CPU compute buffer
from 140.68 MiB to 103.20 MiB with Mamba 3B with a batch size of 512.
And (at least on CPU), it's a bit faster than before.
Note that "ggml_ssm_conv" is probably not the most appropriate name,
and it could be changed if a better one is found.
* llama : add inp_s_seq as a new input tensor
The most convenient implementation to select the correct state (for Mamba)
for each token is to directly get the correct index from a tensor.
This is why inp_s_seq is storing int32_t and not floats.
The other, less convenient way to select the correct state would be
to have inp_KQ_mask contain 1.0f for each state used by a token
and 0.0f otherwise. This complicates quickly fetching the first used
state of a token, and is also less efficient because a whole row
of the mask would always need to be read for each token.
Using indexes makes it easy to stop searching when there are
no more sequences for a token, and the first sequence assigned
is always very quickly available (it's the first element of each row).
* mamba : support llama_kv_cache_seq_cp copy chains
* mamba : support shifting and dividing the kv cache pos
* mamba : make the server and parallel examples work with whole sequences
A seq_id is dedicated to the system prompt in both cases.
* llama : make llama_kv_cache_seq_rm return whether it succeeded or not
* mamba : dedicate an input tensor for state copy indices
This is cleaner and makes it easier to adapt when/if token positions
(and by extension, inp_K_shift) are no longer integers.
* mamba : adapt perplexity, batched, and batched-bench examples
* perplexity : limit the max number of sequences
This adapts to what the loaded model can provide.
* llama : add llama_n_max_seq to get the upper limit for seq_ids
Used by the perplexity example.
* batched : pass n_parallel to the model's context params
This should have been there already, but it wasn't.
* batched-bench : reserve sequences to support Mamba
* batched-bench : fix tokens being put in wrong sequences
Generation quality isn't what's measured in there anyway,
but at least using the correct sequences avoids using non-consecutive
token positions.
* mamba : stop abusing attention metadata
This breaks existing converted-to-GGUF Mamba models,
but will allow supporting mixed architectures like MambaFormer
without needing to break Mamba models.
This will also allow changing the size of Mamba's states
without having to reconvert models in the future.
(e.g. using something else than d_conv - 1 columns for the conv_states
will not require breaking existing converted Mamba models again)
* gguf-py : add new KV metadata key-value pairs for Mamba
* llama : add new metadata key-value pairs for Mamba
* llama : guard against divisions by zero when n_head is 0
* mamba : rename "unlimited" KV cache property to "recurrent"
* mamba : more correctly update the "used" field of the KV cache
* ggml : in ggml_ssm_scan, use a threshold for soft_plus
This is how the official Mamba implementation does it,
and it's also what torch.nn.Softplus does.
* convert : for Mamba, fall back to internal NeoX tokenizer
The resulting models are exactly the same
as if the tokenizer.json and tokenizer_config.json of GPT-NeoX were there.
* mamba : support state saving and restoring
* ggml : implicitly pass src tensors through dst for Mamba-related ops
* mamba : clarify some comments
* server : fix cache_tokens not getting correctly resized
Otherwise, when the "we have to evaluate at least 1 token" special case
was triggered, an extra token was kept in cache_tokens even if it was
removed from the KV cache.
For Mamba, this caused useless prompt reprocessing when the previous
request triggered the above case.
* convert-hf : support new metadata keys for Mamba
For the models available at
https://huggingface.co/collections/state-spaces/transformers-compatible-mamba-65e7b40ab87e5297e45ae406
* mamba : rename metadata to be more similar to transformers library
This breaks existing converted-to-GGUF models,
but the metadata names are more "standard".
* mamba : support mamba-*-hf models
These models share their token_embd.weight with their output.weight
* mamba : add missing spaces
This is purely a formatting change.
* convert-hf : omit output.weight when identical with token_embd.weight
Only for Mamba for now, but it might be relevant for other models eventually.
Most Mamba models actually share these two tensors, albeit implicitly.
* readme : add Mamba to supported models, and add recent API changes
* mamba : move state_seq and state_mask views outside layer loop
A few tensors were also missing `struct` in front of `ggml_tensor`.
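The thresholded soft_plus mentioned above for ggml_ssm_scan can be sketched in isolation. This is a minimal illustration, not the ggml code: the helper name `softplus_thresholded` is hypothetical, and the cutoff of 20 is an assumption taken from the default `threshold` of torch.nn.Softplus.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of ggml): softplus(x) = log(1 + exp(x)).
// Above the threshold the result is indistinguishable from x in float precision,
// so x is returned directly, which also avoids overflow in exp().
// The default of 20 is an assumption borrowed from torch.nn.Softplus.
static float softplus_thresholded(float x, float threshold = 20.0f) {
    if (x > threshold) {
        return x;
    }
    return std::log1p(std::exp(x));
}
```

For large inputs the function stays finite and linear, while for small inputs it behaves like the usual softplus.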
2024-03-08 23:31:00 +01:00
if ( ! batch . logits [ i ] | | batch . seq_id [ i ] [ 0 ] ! = slot . id + 1 ) {
2024-03-07 10:41:53 +01:00
continue ;
}
2024-03-04 21:31:20 +01:00
2024-03-07 10:41:53 +01:00
const float * embd = llama_get_embeddings_seq ( ctx , batch . seq_id [ i ] [ 0 ] ) ;
if ( embd = = NULL ) {
embd = llama_get_embeddings_ith ( ctx , i ) ;
}
2024-03-04 21:31:20 +01:00
2024-03-07 10:41:53 +01:00
if ( embd = = NULL ) {
LOG_ERROR ( " failed to get embeddings " , {
{ " token " , batch . token [ i ] } ,
{ " seq_id " , batch . seq_id [ i ] [ 0 ] }
} ) ;
res . data = json {
{ " embedding " , std : : vector < float > ( n_embd , 0.0f ) } ,
2024-03-04 21:31:20 +01:00
} ;
2024-03-07 10:41:53 +01:00
continue ;
2024-03-04 21:31:20 +01:00
}
2024-03-07 10:41:53 +01:00
2024-03-09 13:27:58 +01:00
llama_embd_normalize ( embd , embd_res . data ( ) , n_embd ) ;
2024-03-07 10:41:53 +01:00
res . data = json {
2024-03-09 13:27:58 +01:00
{ " embedding " , embd_res } ,
2024-09-02 17:11:51 +02:00
{ " index " , slot . index } ,
2024-03-07 10:41:53 +01:00
} ;
2023-05-21 19:51:18 +02:00
}
2024-03-07 10:41:53 +01:00
2024-01-26 13:42:20 +01:00
queue_results . send ( res ) ;
2023-10-22 21:53:08 +02:00
}
2023-05-21 19:51:18 +02:00
2024-09-02 17:11:51 +02:00
//
// Functions to create new task(s) and receive result(s)
//
2024-02-06 09:16:23 +01:00
2024-09-02 17:11:51 +02:00
std::vector<server_task> create_tasks_cmpl(json data, server_task_cmpl_type cmpl_type) {
    std::vector<server_task> tasks;
    auto create_task = [&](json & task_data, bool replace_prompt, json prompt) {
        server_task task;
        task.id        = queue_tasks.get_new_id();
        task.cmpl_type = cmpl_type;
        task.type      = SERVER_TASK_TYPE_COMPLETION;
        if (replace_prompt) {
            task.data = task_data;
            task.data["prompt"] = prompt;
2024-02-06 09:16:23 +01:00
} else {
2024-09-02 17:11:51 +02:00
task . data = std : : move ( task_data ) ;
2024-02-06 09:16:23 +01:00
}
2024-09-02 17:11:51 +02:00
tasks . push_back ( std : : move ( task ) ) ;
} ;
    static constexpr const char * error_msg = "\"prompt\" must be a string, an array of token ids or an array of prompts";
    if (!data.contains("prompt")) {
        throw std::runtime_error(error_msg);
2024-02-06 09:16:23 +01:00
}
2023-10-22 21:53:08 +02:00
2024-09-02 17:11:51 +02:00
    json prompt = data.at("prompt");
    // if the prompt is a singleton (i.e. a string or a list of tokens), we only need to create a single task
    if (prompt.is_string() || json_is_array_of_numbers(prompt)) {
        data["index"] = 0;
        create_task(data, false, nullptr);
    }
    // otherwise, it's a multiple-prompt task, so we break it into smaller tasks
    else if (prompt.is_array()) {
        std::vector<json> prompts = prompt;
        for (size_t i = 0; i < prompts.size(); i++) {
            const auto & e = prompts[i];
            if (e.is_string() || json_is_array_of_numbers(e)) {
                data["index"] = i;
                create_task(data, true, e);
            } else {
                throw std::runtime_error(error_msg);
            }
        }
    }
    // invalid case
    else {
        throw std::runtime_error(error_msg);
    }
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort: 24 PRs were closed in the submitter's repo alone, with over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2024-09-02 17:11:51 +02:00
return tasks ;
2023-10-22 21:53:08 +02:00
}
2023-06-17 13:53:04 +02:00
2024-09-02 17:11:51 +02:00
void cancel_tasks(const std::unordered_set<int> & id_tasks) {
    std::vector<server_task> cancel_tasks;
    cancel_tasks.reserve(id_tasks.size());
    for (const auto & id_task : id_tasks) {
        LOG_VERBOSE("cancel task", {{"id_task", id_task}});
        server_task task;
        task.type      = SERVER_TASK_TYPE_CANCEL;
        task.id_target = id_task;
        cancel_tasks.push_back(task);
        queue_results.remove_waiting_task_id(id_task);
    }
    // push to beginning of the queue, so it has highest priority
    queue_tasks.post(cancel_tasks, true);
}
// receive the results from task(s) created by create_tasks_cmpl
void receive_cmpl_results(const std::unordered_set<int> & id_tasks, std::function<void(std::vector<server_task_result>&)> result_handler, std::function<void(json)> error_handler) {
    // TODO: currently, there is no way to detect that the client has cancelled the request
    std::vector<server_task_result> results(id_tasks.size());
    for (size_t i = 0; i < id_tasks.size(); i++) {
        server_task_result result = queue_results.recv(id_tasks);
        if (result.error) {
            error_handler(result.data);
            cancel_tasks(id_tasks);
            // return rather than break, so result_handler is not called with partially filled results
            return;
        }
2023-11-30 23:25:04 +01:00
2024-09-02 17:11:51 +02:00
size_t idx = result . data [ " index " ] ;
results [ idx ] = result ;
2024-01-26 13:42:20 +01:00
}
2024-09-02 17:11:51 +02:00
result_handler ( results ) ;
}
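The index bookkeeping shared by create_tasks_cmpl and receive_cmpl_results can be illustrated on its own. The sketch below is a simplified stand-in, not the server code: `mock_result` and `collect_in_order` are hypothetical names, and a plain struct replaces the JSON payload. It shows results that arrive in arbitrary order being stored at the position recorded in their index, so the caller sees them in prompt order.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for server_task_result: only the fields needed here.
struct mock_result {
    std::size_t index; // position of the originating prompt in the request
    std::string text;
};

// Store each result at results[index], regardless of arrival order.
// Assumes every index is unique and smaller than arrivals.size().
static std::vector<mock_result> collect_in_order(const std::vector<mock_result> & arrivals) {
    std::vector<mock_result> results(arrivals.size());
    for (const auto & r : arrivals) {
        results[r.index] = r;
    }
    return results;
}
```

Pre-sizing the vector and writing by index avoids a sort at the end, at the cost of trusting that each task reports a valid index.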
2024-01-26 13:42:20 +01:00
2024-09-02 17:11:51 +02:00
// receive the results from task(s) created by create_tasks_cmpl, in stream mode
void receive_cmpl_results_stream(const std::unordered_set<int> & id_tasks, std::function<bool(server_task_result&)> result_handler, std::function<void(json)> error_handler) {
    size_t n_finished = 0;
    while (true) {
        server_task_result result = queue_results.recv(id_tasks);
        if (!result_handler(result)) {
            cancel_tasks(id_tasks);
            break;
        }
2024-01-26 13:42:20 +01:00
2024-09-02 17:11:51 +02:00
if ( result . error ) {
error_handler ( result . data ) ;
cancel_tasks ( id_tasks ) ;
break ;
}
2023-11-30 23:25:04 +01:00
2024-09-02 17:11:51 +02:00
if ( result . stop ) {
if ( + + n_finished = = id_tasks . size ( ) ) {
break ;
}
}
2023-11-30 23:25:04 +01:00
}
}
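The termination rule of the streaming loop, which stops once every task has reported a final chunk, can be sketched without the queue machinery. This is an illustration under assumptions: `mock_chunk` and `consume_until_all_stopped` are hypothetical names, and a pre-recorded vector stands in for the blocking `recv` call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a streamed server_task_result.
struct mock_chunk {
    int  id_task;
    bool stop; // true on the last chunk of a task
};

// Consume chunks until n_tasks of them have carried stop == true.
// Returns how many chunks were consumed before the loop ended.
static std::size_t consume_until_all_stopped(const std::vector<mock_chunk> & chunks, std::size_t n_tasks) {
    std::size_t n_finished = 0;
    std::size_t n_consumed = 0;
    for (const auto & c : chunks) {
        n_consumed++;
        if (c.stop && ++n_finished == n_tasks) {
            break;
        }
    }
    return n_consumed;
}
```

Counting stop flags instead of tracking task ids is enough here because each task is expected to send exactly one final chunk.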
2024-09-02 17:11:51 +02:00
//
// Functions to process the task
//
2024-03-07 10:41:53 +01:00
void process_single_task ( const server_task & task ) {
switch ( task . type ) {
case SERVER_TASK_TYPE_COMPLETION :
2024-01-26 13:42:20 +01:00
{
2024-06-10 13:59:55 +02:00
const int id_slot = json_value ( task . data , " id_slot " , - 1 ) ;
2024-06-08 09:50:31 +02:00
server_slot * slot ;
if ( id_slot ! = - 1 ) {
slot = get_slot_by_id ( id_slot ) ;
} else {
2024-06-10 13:59:55 +02:00
std : : string prompt ;
if ( task . data . contains ( " prompt " ) & & task . data . at ( " prompt " ) . is_string ( ) ) {
2024-06-20 01:57:10 +02:00
prompt = json_value ( task . data , " prompt " , std : : string ( ) ) ;
2024-06-10 13:59:55 +02:00
}
2024-06-08 09:50:31 +02:00
slot = get_available_slot ( prompt ) ;
}
2024-03-07 10:41:53 +01:00
if ( slot = = nullptr ) {
// if no slot is available, we defer this task for processing later
LOG_VERBOSE ( " no slot is available " , { { " id_task " , task . id } } ) ;
queue_tasks . defer ( task ) ;
2024-01-13 18:31:26 +01:00
break ;
2023-10-22 21:53:08 +02:00
}
2024-09-06 23:21:29 +02:00
if ( slot - > is_processing ( ) ) {
2024-06-08 09:50:31 +02:00
// if requested slot is unavailable, we defer this task for processing later
LOG_VERBOSE ( " requested slot is unavailable " , { { " id_task " , task . id } } ) ;
queue_tasks . defer ( task ) ;
break ;
}
2024-03-07 10:41:53 +01:00
if ( task . data . contains ( " system_prompt " ) ) {
2024-05-11 17:28:10 +02:00
std : : string sys_prompt = json_value ( task . data , " system_prompt " , std : : string ( ) ) ;
system_prompt_set ( sys_prompt ) ;
2024-03-07 10:41:53 +01:00
for ( server_slot & slot : slots ) {
slot . n_past = 0 ;
slot . n_past_se = 0 ;
}
2023-10-22 21:53:08 +02:00
}
2024-03-07 10:41:53 +01:00
slot - > reset ( ) ;
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
slot - > id_task = task . id ;
2024-09-02 17:11:51 +02:00
slot - > cmpl_type = task . cmpl_type ;
slot - > index = json_value ( task . data , " index " , 0 ) ;
2023-10-22 21:53:08 +02:00
2024-03-11 10:56:41 +01:00
if ( ! launch_slot_with_task ( * slot , task ) ) {
LOG_ERROR ( " error while launching slot " , task . data ) ;
2023-10-22 21:53:08 +02:00
break ;
}
2024-03-07 10:41:53 +01:00
} break ;
case SERVER_TASK_TYPE_CANCEL :
{
// release slot linked with the task id
for ( auto & slot : slots ) {
if ( slot . id_task = = task . id_target ) {
slot . release ( ) ;
break ;
}
2024-02-24 12:28:55 +01:00
}
2024-03-07 10:41:53 +01:00
} break ;
case SERVER_TASK_TYPE_NEXT_RESPONSE :
{
// do nothing
} break ;
case SERVER_TASK_TYPE_METRICS :
{
json slots_data = json : : array ( ) ;
int n_idle_slots = 0 ;
int n_processing_slots = 0 ;
for ( server_slot & slot : slots ) {
json slot_data = get_formated_generation ( slot ) ;
slot_data [ " id " ] = slot . id ;
slot_data [ " id_task " ] = slot . id_task ;
slot_data [ " state " ] = slot . state ;
slot_data [ " prompt " ] = slot . prompt ;
slot_data [ " next_token " ] = {
{ " has_next_token " , slot . has_next_token } ,
{ " n_remain " , slot . n_remaining } ,
{ " n_decoded " , slot . n_decoded } ,
{ " stopped_eos " , slot . stopped_eos } ,
{ " stopped_word " , slot . stopped_word } ,
{ " stopped_limit " , slot . stopped_limit } ,
{ " stopping_word " , slot . stopping_word } ,
} ;
if ( slot_data [ " state " ] = = SLOT_STATE_IDLE ) {
n_idle_slots + + ;
} else {
n_processing_slots + + ;
}
slots_data . push_back ( slot_data ) ;
}
LOG_INFO ( " slot data " , {
{ " id_task " , task . id } ,
{ " n_idle_slots " , n_idle_slots } ,
{ " n_processing_slots " , n_processing_slots }
} ) ;
LOG_VERBOSE ( " slot data " , {
{ " id_task " , task . id } ,
{ " n_idle_slots " , n_idle_slots } ,
{ " n_processing_slots " , n_processing_slots } ,
{ " slots " , slots_data }
} ) ;
server_task_result res ;
res . id = task . id ;
res . stop = true ;
res . error = false ;
res . data = {
2024-02-25 13:49:43 +01:00
{ " idle " , n_idle_slots } ,
{ " processing " , n_processing_slots } ,
{ " deferred " , queue_tasks . queue_tasks_deferred . size ( ) } ,
2024-03-08 12:25:04 +01:00
{ " t_start " , metrics . t_start } ,
2024-02-25 13:49:43 +01:00
{ " n_prompt_tokens_processed_total " , metrics . n_prompt_tokens_processed_total } ,
2024-03-08 12:25:04 +01:00
{ " t_tokens_generation_total " , metrics . t_tokens_generation_total } ,
2024-02-25 13:49:43 +01:00
{ " n_tokens_predicted_total " , metrics . n_tokens_predicted_total } ,
2024-03-08 12:25:04 +01:00
{ " t_prompt_processing_total " , metrics . t_prompt_processing_total } ,
2024-02-25 13:49:43 +01:00
{ " n_prompt_tokens_processed " , metrics . n_prompt_tokens_processed } ,
{ " t_prompt_processing " , metrics . t_prompt_processing } ,
{ " n_tokens_predicted " , metrics . n_tokens_predicted } ,
{ " t_tokens_generation " , metrics . t_tokens_generation } ,
2024-09-06 23:21:29 +02:00
{ " n_decode_total " , metrics . n_decode_total } ,
{ " n_busy_slots_total " , metrics . n_busy_slots_total } ,
2024-02-29 21:42:11 +01:00
{ " kv_cache_tokens_count " , llama_get_kv_cache_token_count ( ctx ) } ,
{ " kv_cache_used_cells " , llama_get_kv_cache_used_cells ( ctx ) } ,
2024-02-25 13:49:43 +01:00
2024-02-29 21:42:11 +01:00
{ " slots " , slots_data } ,
2024-03-07 10:41:53 +01:00
} ;
2024-03-08 12:25:04 +01:00
if ( json_value ( task . data , " reset_bucket " , false ) ) {
metrics . reset_bucket ( ) ;
}
2024-03-07 10:41:53 +01:00
queue_results . send ( res ) ;
} break ;
2024-04-08 14:43:30 +02:00
case SERVER_TASK_TYPE_SLOT_SAVE :
{
2024-05-08 21:53:08 +02:00
int id_slot = task . data . at ( " id_slot " ) ;
2024-06-08 09:50:31 +02:00
server_slot * slot = get_slot_by_id ( id_slot ) ;
2024-04-08 14:43:30 +02:00
if ( slot = = nullptr ) {
send_error ( task , " Invalid slot ID " , ERROR_TYPE_INVALID_REQUEST ) ;
break ;
}
2024-09-06 23:21:29 +02:00
if ( slot - > is_processing ( ) ) {
2024-06-08 09:50:31 +02:00
// if requested slot is unavailable, we defer this task for processing later
LOG_VERBOSE ( " requested slot is unavailable " , { { " id_task " , task . id } } ) ;
queue_tasks . defer ( task ) ;
break ;
}
2024-04-08 14:43:30 +02:00
const size_t token_count = slot - > cache_tokens . size ( ) ;
const int64_t t_start = ggml_time_us ( ) ;
2024-05-08 21:53:08 +02:00
std : : string filename = task . data . at ( " filename " ) ;
std : : string filepath = task . data . at ( " filepath " ) ;
2024-04-08 14:43:30 +02:00
const size_t nwrite = llama_state_seq_save_file ( ctx , filepath . c_str ( ) , slot - > id + 1 , slot - > cache_tokens . data ( ) , token_count ) ;
const int64_t t_end = ggml_time_us ( ) ;
const double t_save_ms = ( t_end - t_start ) / 1000.0 ;
server_task_result result ;
result . id = task . id ;
result . stop = true ;
result . error = false ;
result . data = json {
{ " id_slot " , id_slot } ,
{ " filename " , filename } ,
{ " n_saved " , token_count } , // tokens saved
{ " n_written " , nwrite } , // bytes written
{ " timings " , {
{ " save_ms " , t_save_ms }
} }
} ;
queue_results . send ( result ) ;
} break ;
case SERVER_TASK_TYPE_SLOT_RESTORE :
{
2024-05-08 21:53:08 +02:00
int id_slot = task . data . at ( " id_slot " ) ;
2024-06-08 09:50:31 +02:00
server_slot * slot = get_slot_by_id ( id_slot ) ;
2024-04-08 14:43:30 +02:00
if ( slot = = nullptr ) {
send_error ( task , " Invalid slot ID " , ERROR_TYPE_INVALID_REQUEST ) ;
break ;
}
2024-09-06 23:21:29 +02:00
if ( slot - > is_processing ( ) ) {
2024-06-08 09:50:31 +02:00
// if requested slot is unavailable, we defer this task for processing later
LOG_VERBOSE ( " requested slot is unavailable " , { { " id_task " , task . id } } ) ;
queue_tasks . defer ( task ) ;
break ;
}
2024-04-08 14:43:30 +02:00
const int64_t t_start = ggml_time_us ( ) ;
2024-05-08 21:53:08 +02:00
std : : string filename = task . data . at ( " filename " ) ;
std : : string filepath = task . data . at ( " filepath " ) ;
2024-04-08 14:43:30 +02:00
slot - > cache_tokens . resize ( slot - > n_ctx ) ;
size_t token_count = 0 ;
size_t nread = llama_state_seq_load_file ( ctx , filepath . c_str ( ) , slot - > id + 1 , slot - > cache_tokens . data ( ) , slot - > cache_tokens . size ( ) , & token_count ) ;
if ( nread = = 0 ) {
slot - > cache_tokens . resize ( 0 ) ;
send_error ( task , " Unable to restore slot, no available space in KV cache or invalid slot save file " , ERROR_TYPE_INVALID_REQUEST ) ;
break ;
}
slot - > cache_tokens . resize ( token_count ) ;
const int64_t t_end = ggml_time_us ( ) ;
const double t_restore_ms = ( t_end - t_start ) / 1000.0 ;
server_task_result result ;
result . id = task . id ;
result . stop = true ;
result . error = false ;
result . data = json {
{ " id_slot " , id_slot } ,
{ " filename " , filename } ,
{ " n_restored " , token_count } , // tokens restored
{ " n_read " , nread } , // bytes read
{ " timings " , {
{ " restore_ms " , t_restore_ms }
} }
} ;
queue_results . send ( result ) ;
} break ;
case SERVER_TASK_TYPE_SLOT_ERASE :
{
2024-05-08 21:53:08 +02:00
int id_slot = task . data . at ( " id_slot " ) ;
2024-06-08 09:50:31 +02:00
server_slot * slot = get_slot_by_id ( id_slot ) ;
2024-04-08 14:43:30 +02:00
if ( slot = = nullptr ) {
send_error ( task , " Invalid slot ID " , ERROR_TYPE_INVALID_REQUEST ) ;
break ;
}
2024-09-06 23:21:29 +02:00
if ( slot - > is_processing ( ) ) {
2024-06-08 09:50:31 +02:00
// if requested slot is unavailable, we defer this task for processing later
LOG_VERBOSE ( " requested slot is unavailable " , { { " id_task " , task . id } } ) ;
queue_tasks . defer ( task ) ;
break ;
}
2024-04-08 14:43:30 +02:00
// Erase token cache
const size_t n_erased = slot - > cache_tokens . size ( ) ;
llama_kv_cache_seq_rm ( ctx , slot - > id + 1 , - 1 , - 1 ) ;
slot - > cache_tokens . clear ( ) ;
server_task_result result ;
result . id = task . id ;
result . stop = true ;
result . error = false ;
result . data = json {
{ " id_slot " , id_slot } ,
{ " n_erased " , n_erased }
} ;
queue_results . send ( result ) ;
} break ;
2024-08-06 17:33:39 +02:00
case SERVER_TASK_TYPE_SET_LORA :
{
llama_lora_adapters_apply ( ctx , lora_adapters ) ;
server_task_result result ;
result . id = task . id ;
2024-08-15 08:21:57 +02:00
result . stop = true ;
result . error = false ;
2024-08-06 17:33:39 +02:00
result . data = json { { " success " , true } } ;
queue_results . send ( result ) ;
} break ;
2023-07-02 23:38:44 +02:00
}
2024-01-26 13:42:20 +01:00
}
2023-11-30 23:25:04 +01:00
2024-03-11 10:56:41 +01:00
void update_slots ( ) {
2024-03-07 10:41:53 +01:00
if ( system_need_update ) {
2024-02-29 21:42:11 +01:00
system_prompt_update ( ) ;
2023-07-05 22:51:13 +02:00
}
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
// check if all slots are idle
{
bool all_idle = true ;
for ( auto & slot : slots ) {
2024-09-06 23:21:29 +02:00
if ( slot . is_processing ( ) ) {
2024-03-07 10:41:53 +01:00
all_idle = false ;
break ;
}
}
if ( all_idle ) {
LOG_INFO ( " all slots are idle " , { } ) ;
if ( system_prompt . empty ( ) & & clean_kv_cache ) {
kv_cache_clear ( ) ;
}
2024-03-11 10:56:41 +01:00
return ;
2024-03-07 10:41:53 +01:00
}
}
2024-01-30 19:17:30 +01:00
2023-10-22 21:53:08 +02:00
{
2024-03-07 10:41:53 +01:00
LOG_VERBOSE ( " posting NEXT_RESPONSE " , { } ) ;
server_task task ;
task . type = SERVER_TASK_TYPE_NEXT_RESPONSE ;
task . id_target = - 1 ;
queue_tasks . post ( task ) ;
}
// apply context-shift if needed
// TODO: simplify and improve
for ( server_slot & slot : slots ) {
if ( slot . ga_n = = 1 ) {
if ( slot . is_processing ( ) & & ( int ) system_tokens . size ( ) + slot . n_past > = slot . n_ctx - 1 ) {
2024-01-27 14:38:05 +01:00
// Shift context
2024-02-21 16:33:54 +01:00
const int n_keep = slot . params . n_keep + add_bos_token ;
2024-02-25 13:50:32 +01:00
const int n_left = ( int ) system_tokens . size ( ) + slot . n_past - n_keep ;
2024-03-26 09:47:43 +01:00
const int n_discard = slot . params . n_discard ? slot . params . n_discard : ( n_left / 2 ) ;
2023-10-22 21:53:08 +02:00
2024-02-25 13:50:32 +01:00
LOG_INFO ( " slot context shift " , {
2024-03-07 10:41:53 +01:00
{ " id_slot " , slot . id } ,
{ " id_task " , slot . id_task } ,
2024-02-25 13:50:32 +01:00
{ " n_keep " , n_keep } ,
{ " n_left " , n_left } ,
{ " n_discard " , n_discard } ,
{ " n_ctx " , n_ctx } ,
{ " n_past " , slot . n_past } ,
{ " n_system_tokens " , system_tokens . size ( ) } ,
{ " n_cache_tokens " , slot . cache_tokens . size ( ) }
} ) ;
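As a worked example of the context-shift arithmetic logged above, the three quantities can be computed by a small standalone function. This is a sketch, not the server code: `shift_plan` and `plan_context_shift` are hypothetical names, and the slot and parameter fields are passed in as plain integers.

```cpp
#include <cassert>

// Hypothetical container for the three quantities computed before a context shift.
struct shift_plan {
    int n_keep;
    int n_left;
    int n_discard;
};

// Mirrors the arithmetic of the context-shift branch:
//   n_keep    = params.n_keep + add_bos_token
//   n_left    = n_system_tokens + n_past - n_keep
//   n_discard = params.n_discard if non-zero, otherwise n_left / 2
static shift_plan plan_context_shift(int n_system_tokens, int n_past, int params_n_keep, int params_n_discard, bool add_bos_token) {
    shift_plan p;
    p.n_keep    = params_n_keep + (add_bos_token ? 1 : 0);
    p.n_left    = n_system_tokens + n_past - p.n_keep;
    p.n_discard = params_n_discard != 0 ? params_n_discard : p.n_left / 2;
    return p;
}
```

With no system prompt, n_past = 511, n_keep = 10 and a BOS token, this gives n_keep = 11, n_left = 500 and, when n_discard is left at 0, a discard of 250 tokens.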
2024-03-07 10:41:53 +01:00
llama : support Mamba Selective State Space Models (#5328)
* mamba : begin working on support for Mamba SSM
* mamba : begin figuring out how to (ab)use the kv cache for Mamba
* mamba : recurrent inference almost works, but incoherent
* mamba : recurrent inference WORKS!!!
* convert : optionally use d_conv and d_state from config.json for Mamba
* mamba : refactor recurrent conv, resulting in 20% perf increase
It's still slower than I'd like, but I did not really optimize `ggml_exp` yet.
I also refactored `ggml_exp` to work with tensors with more than 2 dimensions.
* ggml : parallelize ggml_exp
This results in 8% faster token generation for Mamba-130M.
* mamba : simplify the conv step with a self-overlapping view
Turns out the conv_state can be made smaller by one column.
Note that this breaks existing GGUFs of Mamba,
because the key_value_length field is tied to the conv_state size.
Convolution with a self-overlapping view is cool!
And it's much simpler than what I initially thought would be necessary
to make the convolution step work with more than 1 token at a time.
Next step is to make the SSM step work on batches of tokens too,
and thus I need to figure out a way to make a parallel selective scan
which will keep the ssm_state small and won't make it bigger
by a factor of (n_layer * batch_size).
* llama : fix Mamba KV self size wrongly displaying as f16 instead of f32
Relatedly, I also tried to see if other types than f32 worked for the states,
but they don't, because of the operators used.
It's probably better anyway to keep lots of precision there,
since the states are small anyway.
* mamba : fix self-overlapping view depth stride
* mamba : handle batches of more than 1 token
This means running Mamba no longer crashes when using the default settings!
And probably also slightly faster prompt processing.
Both batched and non-batched processing yield the same output.
Previously, the state was not cleared when starting a sequence.
Next step is to make the KV cache API work as expected for Mamba models.
* ggml: add ggml_ssm_scan to help with parallel selective scan
If the selective scan was implemented without a custom operator,
there would be waaay too many nodes in the graph. For example,
for Mamba-130M, with a batch size of 512 (the default),
a naive selective scan could add at least 24*512=12288 nodes,
which is more than LLAMA_MAX_NODES (8192),
and that's only for the smallest Mamba model.
So it's much cleaner with a custom operator.
Not sure about the name, though.
* ggml : in ggml_ssm_scan, merge multiple rows in the same vec operation
This will help with performance on CPU if ggml_vec_mul_f32
and ggml_vec_add_f32 are ever optimized with SIMD.
* mamba : very basic quantization support
Mostly works, but there is currently no difference
between the variants of a k-quant (e.g. Q4_K_S and Q4_K_M are the same).
Most of the SSM-specific weights can be kept in f32 without affecting
the size that much, since they are relatively small.
(the linear projection weights are responsible for most of Mamba's size)
Too much quantization seems to make the state degrade quite fast, and
the model begins to output gibberish.
It seems to affect bigger models to a lesser extent than small models,
but I'm not sure by how much.
Experimentation will be needed to figure out which weights are more important
for the _M (and _L?) variants of k-quants for Mamba.
* convert : fix wrong name for layer norm weight of official Mamba models
I was using Q-bert/Mamba-* models before, which have a slightly different
naming scheme for the weights.
(they start with "model.layers" instead of "backbone.layers")
* mamba : fuse more steps of the SSM scan in the ggml_ssm_scan operator
This increases performance on CPU by around 30% for prompt processing,
and by around 20% for text generation.
However, it also makes the ggml_exp and ggml_soft_plus operators unused.
Whether or not they should be kept will be decided later.
* convert : for Mamba, also consider the "MambaLMHeadModel" arch name
It's the name of the class of the official implementation,
though they don't use it (yet) in the "architectures" field of config.json
* mamba : fix vocab size problems with official models
The perplexity was waaaay too high for models with a non-round vocab size.
Not sure why, but it needed to be fixed in the metadata.
Note that this breaks existing GGUF-converted Mamba models,
but **only if** the vocab size was not already rounded.
* ggml : remove ggml_exp and ggml_soft_plus
They did not exist anyway outside of this branch,
and since ggml_ssm_scan fused operations together, they are unused.
It's always possible to bring them back if needed.
* mamba : remove some useless comments
No code change.
* convert : fix flake8 linter errors
* mamba : apply suggestions from code review
* mamba : remove unnecessary branch for row-wise ssm_state and C multiplication
It was previously done to avoid permuting when only one token is processed
at a time (like when generating text), but permuting is cheap,
and dynamically changing the compute graph is not future-proof.
* ggml : in ggml_ssm_scan, use more appropriate asserts
* ggml : rename the destination pointer in ggml_compute_forward_ssm_scan_f32
* mamba : multiple sequences, but one at a time
This is a step towards making this Mamba implementation usable
with the server example (the way the system prompt is kept when clearing
the client slots will need to be changed before this can work, though).
The KV cache size for this kind of model is tied to the maximum number
of sequences kept at any single time.
For now, this number is obtained from n_parallel (plus one,
to have an extra sequence to dedicate to the system prompt),
but there might be a better way to do this which won't also
make the main example use 2 cells even if only 1 is really used.
(for this specific case, --parallel 0 helps)
Simultaneous sequence processing will probably require changes to
ggml_ssm_scan, and possibly a new operator for the conv step.
* mamba : support llama_kv_cache_seq_cp
This (mis)uses the logic around K shifts, because tokens in a state
can't be shifted anyway, and because inp_K_shift has the right shape and type.
Using ggml_get_rows is a nice way to do copies, but copy chains can't work.
Fortunately, copy chains don't really seem to be used in the examples.
Each KV cell is dedicated to the sequence ID corresponding to its own index.
* mamba : use a state mask
It's cleaner than the previous heuristic of
checking for the pos of the first token in the batch.
inp_KQ_mask could not be re-used for this, because it has the wrong shape
and because it seems more suited to the next step of
simultaneous sequence processing (helping with the problem of
remembering which token belongs to which sequence(s)/state(s)).
* llama : replace the usage of n_ctx with kv_self.size in many places
* mamba : use n_tokens directly instead of n_tok
* mamba : in comments, properly refer to KV cells instead of slots
* mamba : reduce memory usage of ggml_ssm_scan
From 290.37 MiB to 140.68 MiB of CPU compute buffer size
with Mamba 3B with a batch size of 512.
The result tensor of ggml_ssm_scan was previously a big part
of the CPU compute buffer size. To make it smaller,
it does not contain the intermediate ssm states anymore.
Both y and the last ssm state are combined in the result tensor,
because it seems only a single tensor can be returned by an operator
with the way the graph is built.
* mamba : simultaneous sequence processing
A batch can now contain tokens from multiple sequences.
This is necessary for at least the parallel example, the server example,
and the HellaSwag test in the perplexity example.
However, for this to be useful, uses of llama_kv_cache_seq_rm/cp
will need to be changed to work on whole sequences.
* ggml : add ggml_ssm_conv as a new operator for the conv step of Mamba
This operator makes it possible to use and update the correct states
for each token of the batch in the same way as ggml_ssm_scan.
Other solutions which use existing operators would need loops which would
add too many nodes to the graph (at least the ones I thought of).
Using this operator further reduces the size of the CPU compute buffer
from 140.68 MiB to 103.20 MiB with Mamba 3B with a batch size of 512.
And (at least on CPU), it's a bit faster than before.
Note that "ggml_ssm_conv" is probably not the most appropriate name,
and it could be changed if a better one is found.
* llama : add inp_s_seq as a new input tensor
The most convenient implementation to select the correct state (for Mamba)
for each token is to directly get the correct index from a tensor.
This is why inp_s_seq is storing int32_t and not floats.
The other, less convenient way to select the correct state would be
to have inp_KQ_mask contain 1.0f for each state used by a token
and 0.0f otherwise. This complicates quickly fetching the first used
state of a token, and is also less efficient because a whole row
of the mask would always need to be read for each token.
Using indexes makes it easy to stop searching when there are
no more sequences for a token, and the first sequence assigned
is always very quickly available (it's the first element of each row).
* mamba : support llama_kv_cache_seq_cp copy chains
* mamba : support shifting and dividing the kv cache pos
* mamba : make the server and parallel examples work with whole sequences
A seq_id is dedicated to the system prompt in both cases.
* llama : make llama_kv_cache_seq_rm return whether it succeeded or not
* mamba : dedicate an input tensor for state copy indices
This is cleaner and makes it easier to adapt when/if token positions
(and by extension, inp_K_shift) are no longer integers.
* mamba : adapt perplexity, batched, and batched-bench examples
* perplexity : limit the max number of sequences
This adapts to what the loaded model can provide.
* llama : add llama_n_max_seq to get the upper limit for seq_ids
Used by the perplexity example.
* batched : pass n_parallel to the model's context params
This should have been there already, but it wasn't.
* batched-bench : reserve sequences to support Mamba
* batched-bench : fix tokens being put in wrong sequences
Generation quality isn't what's measured in there anyway,
but at least using the correct sequences avoids using non-consecutive
token positions.
* mamba : stop abusing attention metadata
This breaks existing converted-to-GGUF Mamba models,
but will allow supporting mixed architectures like MambaFormer
without needing to break Mamba models.
This will also allow changing the size of Mamba's states
without having to reconvert models in the future.
(e.g. using something else than d_conv - 1 columns for the conv_states
will not require breaking existing converted Mamba models again)
* gguf-py : add new KV metadata key-value pairs for Mamba
* llama : add new metadata key-value pairs for Mamba
* llama : guard against divisions by zero when n_head is 0
* mamba : rename "unlimited" KV cache property to "recurrent"
* mamba : more correctly update the "used" field of the KV cache
* ggml : in ggml_ssm_scan, use a threshold for soft_plus
This is how the official Mamba implementation does it,
and it's also what torch.nn.Softplus does.
* convert : for Mamba, fallback to internal NeoX tokenizer
The resulting models are exactly the same
as if the tokenizer.json and tokenizer_config.json of GPT-NeoX were there.
* mamba : support state saving and restoring
* ggml : implicitly pass src tensors through dst for Mamba-related ops
* mamba : clarify some comments
* server : fix cache_tokens not getting correctly resized
Otherwise, when the "we have to evaluate at least 1 token" special case
was triggered, an extra token was kept in cache_tokens even if it was
removed from the KV cache.
For Mamba, this caused useless prompt reprocessing when the previous
request triggered the above case.
* convert-hf : support new metadata keys for Mamba
For the models available at
https://huggingface.co/collections/state-spaces/transformers-compatible-mamba-65e7b40ab87e5297e45ae406
* mamba : rename metadata to be more similar to transformers library
This breaks existing converted-to-GGUF models,
but the metadata names are more "standard".
* mamba : support mamba-*-hf models
These models share their token_embd.weight with their output.weight
* mamba : add missing spaces
This is purely a formatting change.
* convert-hf : omit output.weight when identical with token_embd.weight
Only for Mamba for now, but it might be relevant for other models eventually.
Most Mamba models actually share these two tensors, albeit implicitly.
* readme : add Mamba to supported models, and add recent API changes
* mamba : move state_seq and state_mask views outside layer loop
A few tensors were also missing `struct` in front of `ggml_tensor`.
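The soft_plus threshold mentioned in the commit notes above can be sketched roughly as follows. This is only an illustration, not the ggml code: the name `softplus_thresholded` is hypothetical, and the cutoff of 20 is the default of torch.nn.Softplus, assumed here to be the value of interest.

```cpp
#include <cmath>

// Hypothetical helper, not part of ggml.
// softplus(x) = log(1 + exp(x)). For large x the result is numerically
// indistinguishable from x, and exp(x) would overflow a float, so above
// the threshold the input is returned unchanged.
// Assumption: threshold = 20 mirrors the torch.nn.Softplus default.
inline float softplus_thresholded(float x, float threshold = 20.0f) {
    if (x > threshold) {
        return x;
    }
    return std::log1p(std::exp(x));
}
```

Without the cutoff, an input such as 1000.0f would make `std::exp` return infinity; with it, the function simply returns the input.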
2024-03-08 23:31:00 +01:00
llama_kv_cache_seq_rm ( ctx , slot . id + 1 , n_keep , n_keep + n_discard ) ;
llama_kv_cache_seq_add ( ctx , slot . id + 1 , n_keep + n_discard , system_tokens . size ( ) + slot . n_past , - n_discard ) ;
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
if ( slot . params . cache_prompt ) {
for ( size_t i = n_keep + n_discard ; i < slot . cache_tokens . size ( ) ; i++ ) {
slot . cache_tokens [ i - n_discard ] = slot . cache_tokens [ i ] ;
}
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
slot . cache_tokens . resize ( slot . cache_tokens . size ( ) - n_discard ) ;
}
2023-10-22 21:53:08 +02:00
2024-01-27 14:38:05 +01:00
slot . n_past -= n_discard ;
2023-10-22 21:53:08 +02:00
2024-01-27 14:38:05 +01:00
slot . truncated = true ;
}
2023-07-05 22:51:13 +02:00
}
2023-10-22 21:53:08 +02:00
}
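The context shift above removes n_discard tokens that follow the first n_keep tokens, both in the KV cache (for sequence slot.id + 1, since one seq_id is dedicated to the system prompt) and in cache_tokens. The cache_tokens part can be sketched in isolation as below. The helper name `discard_after_keep` and the use of plain `int` tokens are assumptions made for the example, not part of the server code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: drop the n_discard tokens that come right after the
// first n_keep tokens, shifting the remaining tail to the left.
void discard_after_keep(std::vector<int> & tokens, std::size_t n_keep, std::size_t n_discard) {
    assert(n_keep + n_discard <= tokens.size());

    // move every token after the discarded range n_discard places to the left
    for (std::size_t i = n_keep + n_discard; i < tokens.size(); i++) {
        tokens[i - n_discard] = tokens[i];
    }

    // the last n_discard entries are now stale copies
    tokens.resize(tokens.size() - n_discard);
}
```

For ten tokens 0..9 with n_keep = 2 and n_discard = 3, tokens 2, 3 and 4 are dropped and the vector becomes 0, 1, 5, 6, 7, 8, 9.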
2024-03-07 10:41:53 +01:00
// start populating the batch for this iteration
llama_batch_clear ( batch ) ;
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
// first, add sampled tokens from any ongoing sequences
for ( auto & slot : slots ) {
2024-09-06 23:21:29 +02:00
if ( slot . state != SLOT_STATE_GENERATING ) {
2023-10-22 21:53:08 +02:00
continue ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
}
2023-10-22 21:53:08 +02:00
slot . i_batch = batch . n_tokens ;
2024-01-27 14:38:05 +01:00
const int32_t slot_npast = slot . n_past_se > 0 ? slot . n_past_se : slot . n_past ;
2023-10-22 21:53:08 +02:00
2024-01-30 19:17:30 +01:00
// TODO: we always have to take into account the "system_tokens"
// this is not great and needs to be improved somehow
2024-03-08 23:31:00 +01:00
llama_batch_add ( batch , slot . sampled , system_tokens . size ( ) + slot_npast , { slot . id + 1 } , true ) ;
2024-03-07 10:41:53 +01:00
2023-10-22 21:53:08 +02:00
slot . n_past += 1 ;
2024-03-07 10:41:53 +01:00
if ( slot . params . cache_prompt ) {
slot . cache_tokens . push_back ( slot . sampled ) ;
}
LOG_VERBOSE ( " slot decode token " , {
{ " id_slot " , slot . id } ,
{ " id_task " , slot . id_task } ,
{ " n_ctx " , n_ctx } ,
{ " n_past " , slot . n_past } ,
{ " n_system_tokens " , system_tokens . size ( ) } ,
{ " n_cache_tokens " , slot . cache_tokens . size ( ) } ,
{ " truncated " , slot . truncated }
} ) ;
2023-05-21 19:51:18 +02:00
}
2023-10-22 21:53:08 +02:00
// process in chunks of params.n_batch
2024-03-22 12:08:28 +01:00
int32_t n_batch = llama_n_batch ( ctx ) ;
2024-03-13 18:54:21 +01:00
int32_t n_ubatch = llama_n_ubatch ( ctx ) ;
2023-10-22 21:53:08 +02:00
2024-07-12 10:14:12 +02:00
// track if this is an embedding or non-embedding batch
// if we've added sampled tokens above, we are in non-embedding mode
// -1: none, 0: non-embedding, 1: embedding
int32_t batch_type = batch . n_tokens > 0 ? 0 : - 1 ;
2024-03-07 10:41:53 +01:00
// next, batch any pending prompts without exceeding n_batch
if ( params . cont_batching || batch . n_tokens == 0 ) {
for ( auto & slot : slots ) {
// this slot still has a prompt to be processed
2024-09-06 23:21:29 +02:00
if ( slot . state == SLOT_STATE_PROCESSING_PROMPT ) {
2024-03-07 10:41:53 +01:00
auto & prompt_tokens = slot . prompt_tokens ;
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
// we haven't tokenized the prompt yet - do it now:
if ( prompt_tokens . empty ( ) ) {
LOG_VERBOSE ( " tokenizing prompt " , {
{ " id_slot " , slot . id } ,
{ " id_task " , slot . id_task }
} ) ;
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
slot . t_start_process_prompt = ggml_time_us ( ) ;
slot . t_start_generation = 0 ;
2024-09-02 17:11:51 +02:00
if ( slot . cmpl_type == SERVER_TASK_CMPL_TYPE_INFILL ) {
2024-08-15 09:23:23 +02:00
const bool add_bos = llama_add_bos_token ( model ) ;
2024-03-07 10:41:53 +01:00
bool suff_rm_leading_spc = true ;
if ( params . input_suffix . find_first_of ( ' ' ) == 0 && params . input_suffix . size ( ) > 1 ) {
params . input_suffix . erase ( 0 , 1 ) ;
suff_rm_leading_spc = false ;
}
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
auto prefix_tokens = tokenize ( slot . params . input_prefix , false ) ;
auto suffix_tokens = tokenize ( slot . params . input_suffix , false ) ;
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
const int space_token = 29871 ; // TODO: this should not be hardcoded
if ( suff_rm_leading_spc && ! suffix_tokens . empty ( ) && suffix_tokens [ 0 ] == space_token ) {
suffix_tokens . erase ( suffix_tokens . begin ( ) ) ;
}
2023-11-11 06:48:21 +01:00
2024-03-07 10:41:53 +01:00
prefix_tokens . insert ( prefix_tokens . begin ( ) , llama_token_prefix ( model ) ) ;
2024-06-28 12:53:43 +02:00
suffix_tokens . insert ( suffix_tokens . begin ( ) , llama_token_suffix ( model ) ) ;
auto embd_inp = params . spm_infill ? suffix_tokens : prefix_tokens ;
auto embd_end = params . spm_infill ? prefix_tokens : suffix_tokens ;
if ( add_bos ) {
embd_inp . insert ( embd_inp . begin ( ) , llama_token_bos ( model ) ) ;
}
embd_inp . insert ( embd_inp . end ( ) , embd_end . begin ( ) , embd_end . end ( ) ) ;
2024-06-18 14:19:45 +02:00
const llama_token middle_token = llama_token_middle ( model ) ;
if ( middle_token >= 0 ) {
2024-06-28 12:53:43 +02:00
embd_inp . push_back ( middle_token ) ;
2024-06-18 14:19:45 +02:00
}
2024-06-28 12:53:43 +02:00
prompt_tokens = embd_inp ;
2024-03-07 10:41:53 +01:00
} else {
2024-04-09 19:44:08 +02:00
prompt_tokens = tokenize ( slot . prompt , system_prompt . empty ( ) ) ; // add BOS only if there is no system prompt
2024-03-07 10:41:53 +01:00
}
slot . n_past = 0 ;
2024-02-29 21:42:11 +01:00
slot . n_prompt_tokens = prompt_tokens . size ( ) ;
2023-11-11 06:48:21 +01:00
2024-03-09 10:30:04 +01:00
LOG_VERBOSE ( " prompt tokenized " , {
{ " id_slot " , slot . id } ,
{ " id_task " , slot . id_task } ,
{ " n_ctx " , slot . n_ctx } ,
{ " n_keep " , slot . params . n_keep } ,
{ " n_prompt_tokens " , slot . n_prompt_tokens } ,
{ " prompt_tokens " , tokens_to_str ( ctx , prompt_tokens . cbegin ( ) , prompt_tokens . cend ( ) ) } ,
} ) ;
2024-03-09 11:34:18 +01:00
// empty prompt passed -> release the slot and send empty response
if ( prompt_tokens . empty ( ) ) {
LOG_INFO ( " empty prompt - releasing slot " , {
{ " id_slot " , slot . id } ,
{ " id_task " , slot . id_task }
} ) ;
slot . release ( ) ;
slot . print_timings ( ) ;
send_final_response ( slot ) ;
continue ;
}
2024-09-02 17:11:51 +02:00
if ( slot . cmpl_type == SERVER_TASK_CMPL_TYPE_EMBEDDING ) {
2024-03-07 10:41:53 +01:00
// this prompt is too large to process - discard it
2024-03-13 18:54:21 +01:00
if ( slot . n_prompt_tokens > n_ubatch ) {
2024-03-07 10:41:53 +01:00
slot . release ( ) ;
2024-05-20 07:56:05 +02:00
send_error ( slot , " input is too large to process. increase the physical batch size " , ERROR_TYPE_SERVER ) ;
2024-03-07 10:41:53 +01:00
continue ;
}
} else {
if ( slot . params . n_keep < 0 ) {
slot . params . n_keep = slot . n_prompt_tokens ;
}
slot . params . n_keep = std::min ( slot . n_ctx - 4 , slot . params . n_keep ) ;
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
// if input prompt is too big, truncate it (if group attention self-extend is disabled)
if ( slot . ga_n == 1 && slot . n_prompt_tokens >= slot . n_ctx ) {
const int n_left = slot . n_ctx - slot . params . n_keep ;
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
const int n_block_size = n_left / 2 ;
const int erased_blocks = ( slot . n_prompt_tokens - slot . params . n_keep - n_block_size ) / n_block_size ;
2024-02-25 19:43:50 +01:00
2024-03-07 10:41:53 +01:00
std::vector < llama_token > new_tokens (
prompt_tokens . begin ( ) ,
prompt_tokens . begin ( ) + slot . params . n_keep ) ;
2024-02-25 19:43:50 +01:00
2024-03-07 10:41:53 +01:00
new_tokens . insert (
new_tokens . end ( ) ,
prompt_tokens . begin ( ) + slot . params . n_keep + erased_blocks * n_block_size ,
prompt_tokens . end ( ) ) ;
prompt_tokens = std::move ( new_tokens ) ;
slot . truncated = true ;
slot . n_prompt_tokens = prompt_tokens . size ( ) ;
LOG_VERBOSE ( " input truncated " , {
2024-03-09 10:30:04 +01:00
{ " id_slot " , slot . id } ,
{ " id_task " , slot . id_task } ,
{ " n_ctx " , slot . n_ctx } ,
{ " n_keep " , slot . params . n_keep } ,
{ " n_left " , n_left } ,
{ " n_prompt_tokens " , slot . n_prompt_tokens } ,
{ " prompt_tokens " , tokens_to_str ( ctx , prompt_tokens . cbegin ( ) , prompt_tokens . cend ( ) ) } ,
2024-03-07 10:41:53 +01:00
} ) ;
GGML_ASSERT ( slot . n_prompt_tokens < slot . n_ctx ) ;
}
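The truncation arithmetic above keeps the first n_keep tokens, then drops whole blocks of n_left / 2 tokens from the middle until the prompt fits in the context. A minimal standalone sketch of the same computation follows. The function name `truncate_prompt` and the use of `int` tokens are assumptions for illustration, and the precondition mirrors the clamping of n_keep to n_ctx - 4 done before this branch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper reproducing the block-wise prompt truncation.
// Precondition: the prompt does not fit (size >= n_ctx) and n_keep <= n_ctx - 4.
std::vector<int> truncate_prompt(const std::vector<int> & prompt, int n_ctx, int n_keep) {
    const int n_prompt = (int) prompt.size();
    assert(n_keep >= 0 && n_keep <= n_ctx - 4 && n_prompt >= n_ctx);

    const int n_left        = n_ctx - n_keep;   // room left after the kept prefix
    const int n_block_size  = n_left / 2;       // tokens are dropped in blocks of this size
    const int erased_blocks = (n_prompt - n_keep - n_block_size) / n_block_size;

    // kept prefix
    std::vector<int> out(prompt.begin(), prompt.begin() + n_keep);

    // tail that remains after skipping the erased blocks
    out.insert(out.end(), prompt.begin() + n_keep + erased_blocks * n_block_size, prompt.end());

    return out;
}
```

As a worked case: with n_ctx = 16, n_keep = 4 and a 40-token prompt, n_left = 12, n_block_size = 6 and erased_blocks = (40 - 4 - 6) / 6 = 5, so 30 tokens are dropped and 10 remain (tokens 0..3 followed by tokens 34..39). Because the integer division rounds down, the tail left after the prefix is always shorter than two blocks, which is why the result fits below n_ctx.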
2024-09-07 14:16:19 +02:00
gpt_sampler_reset ( slot . smpl ) ;
2024-03-07 10:41:53 +01:00
if ( ! slot . params . cache_prompt ) {
slot . n_past_se = 0 ;
slot . ga_i = 0 ;
} else {
GGML_ASSERT ( slot . ga_n == 1 ) ;
// reuse any previously computed tokens that are common with the new prompt
slot . n_past = common_part ( slot . cache_tokens , prompt_tokens ) ;
// push the prompt into the sampling context (do not apply grammar)
for ( int i = 0 ; i < slot . n_past ; ++i ) {
2024-09-07 14:16:19 +02:00
gpt_sampler_accept ( slot . smpl , slot . cache_tokens [ i ] , false ) ;
2024-01-27 14:38:05 +01:00
}
}
}
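Prompt caching relies on finding how many leading tokens of the new prompt are already in cache_tokens. The actual common_part helper lives elsewhere in the server sources and is not shown here; the sketch below is only an assumed equivalent, with the hypothetical name `common_prefix_len` and `int` tokens.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a common-prefix helper: returns the number of
// leading elements that are identical in both token sequences.
std::size_t common_prefix_len(const std::vector<int> & a, const std::vector<int> & b) {
    std::size_t i = 0;
    while (i < a.size() && i < b.size() && a[i] == b[i]) {
        i++;
    }
    return i;
}
```

When the whole prompt is already cached, the returned length equals the prompt length, which is the situation the "we have to evaluate at least 1 token" check handles by stepping n_past back by one.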
2024-03-07 10:41:53 +01:00
if ( slot . n_past == slot . n_prompt_tokens && slot . n_past > 0 ) {
// we have to evaluate at least 1 token to generate logits.
LOG_INFO ( " we have to evaluate at least 1 token to generate logits " , {
{ " id_slot " , slot . id } ,
{ " id_task " , slot . id_task }
} ) ;
slot . n_past-- ;
if ( slot . ga_i > 0 ) {
slot . n_past_se-- ;
}
}
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
slot . n_prompt_tokens_processed = 0 ;
}
2023-10-22 21:53:08 +02:00
2024-09-02 17:11:51 +02:00
if ( slot . cmpl_type == SERVER_TASK_CMPL_TYPE_EMBEDDING ) {
2024-03-07 10:41:53 +01:00
// cannot fit the prompt in the current batch - will try next iter
if ( batch . n_tokens + slot . n_prompt_tokens > n_batch ) {
continue ;
2024-01-27 14:38:05 +01:00
}
2023-10-22 21:53:08 +02:00
}
2024-07-12 10:14:12 +02:00
// check that we are in the right batch_type, if not defer the slot
2024-09-02 17:11:51 +02:00
bool slot_type = slot . cmpl_type == SERVER_TASK_CMPL_TYPE_EMBEDDING ? 1 : 0 ;
2024-07-12 10:14:12 +02:00
if ( batch_type == -1 ) {
batch_type = slot_type ;
} else if ( batch_type != slot_type ) {
continue ;
}
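The batch_type check above means a single batch never mixes embedding and non-embedding prompts: the first slot admitted fixes the type, and slots of the other type are deferred to a later iteration. A small sketch of that selection rule, using the same -1 / 0 / 1 encoding, is given below. The function `pick_slots` and the boolean per-slot flag are hypothetical simplifications, not server code.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical illustration of the batch-type rule.
// batch_type: -1 = none yet, 0 = non-embedding, 1 = embedding.
// Returns the indices of the slots admitted into the current batch.
std::vector<int> pick_slots(const std::vector<bool> & is_embedding, int32_t batch_type) {
    std::vector<int> picked;
    for (int i = 0; i < (int) is_embedding.size(); ++i) {
        const int32_t slot_type = is_embedding[i] ? 1 : 0;
        if (batch_type == -1) {
            batch_type = slot_type;   // first admitted slot decides the batch type
        } else if (batch_type != slot_type) {
            continue;                 // mismatched slot waits for a later batch
        }
        picked.push_back(i);
    }
    return picked;
}
```

Starting with batch_type = 0 models the case where sampled tokens were already added to the batch, so only non-embedding slots are admitted.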
It seems to affect bigger models to a lesser extent than small models,
but I'm not sure by how much.
Experimentation will be needed to figure out which weights are more important
for the _M (and _L?) variants of k-quants for Mamba.
* convert : fix wrong name for layer norm weight of official Mamba models
I was using Q-bert/Mamba-* models before, which have a slightly different
naming scheme for the weights.
(they start with "model.layers" instead of "backbone.layers")
* mamba : fuse more steps of the SSM scan in the ggml_ssm_scan operator
This increases performance on CPU by around 30% for prompt processing,
and by around 20% for text generation.
However, it also makes the ggml_exp and ggml_soft_plus operators unused.
Whether or not they should be kept will be decided later.
* convert : for Mamba, also consider the "MambaLMHeadModel" arch name
It's the name of the class of the official implementation,
though they don't use it (yet) in the "architectures" field of config.json
* mamba : fix vocab size problems with official models
The perplexity was waaaay too high for models with a non-round vocab size.
Not sure why, but it needed to be fixed in the metadata.
Note that this breaks existing GGUF-converted Mamba models,
but **only if** the vocab size was not already rounded.
* ggml : remove ggml_exp and ggml_soft_plus
They did not exist anyway outside of this branch,
and since ggml_ssm_scan fused operations together, they are unused.
It's always possible to bring them back if needed.
* mamba : remove some useless comments
No code change.
* convert : fix flake8 linter errors
* mamba : apply suggestions from code review
* mamba : remove unnecessary branch for row-wise ssm_state and C multiplication
It was previously done to avoid permuting when only one token is processed
at a time (like when generating text), but permuting is cheap,
and dynamically changing the compute graph is not future-proof.
* ggml : in ggml_ssm_scan, use more appropriate asserts
* ggml : rename the destination pointer in ggml_compute_forward_ssm_scan_f32
* mamba : multiple sequences, but one at a time
This is a step towards making this Mamba implementation usable
with the server example (the way the system prompt is kept when clearing
the client slots will need to be changed before this can work, though).
The KV cache size for this kind of model is tied to the maximum number
of sequences kept at any single time.
For now, this number is obtained from n_parallel (plus one,
to have an extra sequence to dedicate to the system prompt),
but there might be a better way to do this which won't also
make the main example use 2 cells even if only 1 is really used.
(for this specific case, --parallel 0 helps)
Simultaneous sequence processing will probably require changes to
ggml_ssm_scan, and possibly a new operator for the conv step.
* mamba : support llama_kv_cache_seq_cp
This (mis)uses the logic around K shifts, because tokens in a state
can't be shifted anyway, and because inp_K_shift has the right shape and type.
Using ggml_get_rows is a nice way to do copies, but copy chains can't work.
Fortunately, copy chains don't really seem to be used in the examples.
Each KV cell is dedicated to the sequence ID corresponding to its own index.
* mamba : use a state mask
It's cleaner than the previous heuristic of
checking for the pos of the first token in the batch.
inp_KQ_mask could not be re-used for this, because it has the wrong shape
and because it seems more suited to the next step of
simultaneous sequence processing (helping with the problem of
remembering which token belongs to which sequence(s)/state(s)).
* llama : replace the usage of n_ctx with kv_self.size in many places
* mamba : use n_tokens directly instead of n_tok
* mamba : in comments, properly refer to KV cells instead of slots
* mamba : reduce memory usage of ggml_ssm_scan
From 290.37 MiB to 140.68 MiB of CPU compute buffer size
with Mamba 3B with a batch size of 512.
The result tensor of ggml_ssm_scan was previously a big part
of the CPU compute buffer size. To make it smaller,
it does not contain the intermediate ssm states anymore.
Both y and the last ssm state are combined in the result tensor,
because it seems only a single tensor can be returned by an operator
with the way the graph is built.
* mamba : simultaneous sequence processing
A batch can now contain tokens from multiple sequences.
This is necessary for at least the parallel example, the server example,
and the HellaSwag test in the perplexity example.
However, for this to be useful, uses of llama_kv_cache_seq_rm/cp
will need to be changed to work on whole sequences.
* ggml : add ggml_ssm_conv as a new operator for the conv step of Mamba
This operator makes it possible to use and update the correct states
for each token of the batch in the same way as ggml_ssm_scan.
Other solutions which use existing operators would need loops which would
add too many nodes to the graph (at least the ones I thought of).
Using this operator further reduces the size of the CPU compute buffer
from 140.68 MiB to 103.20 MiB with Mamba 3B with a batch size of 512.
And (at least on CPU), it's a bit faster than before.
Note that "ggml_ssm_conv" is probably not the most appropriate name,
and it could be changed if a better one is found.
* llama : add inp_s_seq as a new input tensor
The most convenient implementation to select the correct state (for Mamba)
for each token is to directly get the correct index from a tensor.
This is why inp_s_seq is storing int32_t and not floats.
The other, less convenient way to select the correct state would be
to have inp_KQ_mask contain 1.0f for each state used by a token
and 0.0f otherwise. This complicates quickly fetching the first used
state of a token, and is also less efficient because a whole row
of the mask would always need to be read for each token.
Using indexes makes it easy to stop searching when there are
no more sequences for a token, and the first sequence assigned
is always very quickly available (it's the first element of each row).
* mamba : support llama_kv_cache_seq_cp copy chains
* mamba : support shifting and dividing the kv cache pos
* mamba : make the server and parallel examples work with whole sequences
A seq_id is dedicated to the system prompt in both cases.
* llama : make llama_kv_cache_seq_rm return whether it succeeded or not
* mamba : dedicate an input tensor for state copy indices
This is cleaner and makes it easier to adapt when/if token positions
(and by extension, inp_K_shift) are no longer integers.
* mamba : adapt perplexity, batched, and batched-bench examples
* perplexity : limit the max number of sequences
This adapts to what the loaded model can provide.
* llama : add llama_n_max_seq to get the upper limit for seq_ids
Used by the perplexity example.
* batched : pass n_parallel to the model's context params
This should have been there already, but it wasn't.
* batched-bench : reserve sequences to support Mamba
* batched-bench : fix tokens being put in wrong sequences
Generation quality isn't what's measured in there anyway,
but at least using the correct sequences avoids using non-consecutive
token positions.
* mamba : stop abusing attention metadata
This breaks existing converted-to-GGUF Mamba models,
but will allow supporting mixed architectures like MambaFormer
without needing to break Mamba models.
This will also allow changing the size of Mamba's states
without having to reconvert models in the future.
(e.g. using something else than d_conv - 1 columns for the conv_states
will not require breaking existing converted Mamba models again)
* gguf-py : add new KV metadata key-value pairs for Mamba
* llama : add new metadata key-value pairs for Mamba
* llama : guard against divisions by zero when n_head is 0
* mamba : rename "unlimited" KV cache property to "recurrent"
* mamba : more correctly update the "used" field of the KV cache
* ggml : in ggml_ssm_scan, use a threshold for soft_plus
This is how the official Mamba implementation does it,
and it's also what torch.nn.Softplus does.
* convert : for Mamba, fallback to internal NeoX tokenizer
The resulting models are exactly the same
as if the tokenizer.json and tokenizer_config.json of GPT-NeoX were there.
* mamba : support state saving and restoring
* ggml : implicitly pass src tensors through dst for Mamba-related ops
* mamba : clarify some comments
* server : fix cache_tokens not getting correctly resized
Otherwise, when the "we have to evaluate at least 1 token" special case
was triggered, an extra token was kept in cache_tokens even if it was
removed from the KV cache.
For Mamba, this caused useless prompt reprocessing when the previous
request triggered the above case.
* convert-hf : support new metadata keys for Mamba
For the models available at
https://huggingface.co/collections/state-spaces/transformers-compatible-mamba-65e7b40ab87e5297e45ae406
* mamba : rename metadata to be more similar to transformers library
This breaks existing converted-to-GGUF models,
but the metadata names are more "standard".
* mamba : support mamba-*-hf models
These models share their token_embd.weight with their output.weight
* mamba : add missing spaces
This is purely a formatting change.
* convert-hf : omit output.weight when identical with token_embd.weight
Only for Mamba for now, but it might be relevant for other models eventually.
Most Mamba models actually share these two tensors, albeit implicitly.
* readme : add Mamba to supported models, and add recent API changes
* mamba : move state_seq and state_mask views outside layer loop
A few tensors were also missing `struct` in front of `ggml_tensor`.
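The `soft_plus` threshold mentioned in the commit message above ("ggml : in ggml_ssm_scan, use a threshold for soft_plus") can be illustrated with a minimal sketch. It assumes a threshold of 20, which is the default of `torch.nn.Softplus`; the function name `softplus_thresholded` is hypothetical and this is not the actual ggml code.

```cpp
#include <cmath>

// Hypothetical helper, not taken from ggml: softplus(x) = log(1 + exp(x)).
// Above the threshold, log1p(exp(x)) is indistinguishable from x in single
// precision, so the input is returned unchanged and the exp() call (which
// would eventually overflow for large x) is skipped.
// The default of 20 is an assumption borrowed from torch.nn.Softplus.
inline float softplus_thresholded(float x, float threshold = 20.0f) {
    if (x > threshold) {
        return x;
    }
    return std::log1p(std::exp(x));
}
```

Below the threshold the function follows the exact definition; above it, it degenerates to the identity, which keeps the scan numerically stable for large inputs.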
2024-03-08 23:31:00 +01:00
// keep only the common part
int p0 = (int) system_tokens.size() + slot.n_past;
if (!llama_kv_cache_seq_rm(ctx, slot.id + 1, p0, -1)) {
    // could not partially delete (likely using a non-Transformer model)
    llama_kv_cache_seq_rm(ctx, slot.id + 1, -1, -1);
    p0 = (int) system_tokens.size();
    if (p0 != 0) {
        // copy over the system prompt when there is one
        llama_kv_cache_seq_cp(ctx, 0, slot.id + 1, -1, -1);
    }
    // there is no common part left (except for the system prompt)
    slot.n_past = 0;
    slot.n_past_se = 0;
    slot.ga_i = 0;
    // TODO: is the system prompt ever in the sampling context?
2024-09-07 14:16:19 +02:00
    gpt_sampler_reset(slot.smpl);
2024-03-08 23:31:00 +01:00
}
// remove the non-common part from the cache
slot.cache_tokens.resize(slot.n_past);
2024-03-07 10:41:53 +01:00
2024-02-25 13:50:32 +01:00
LOG_INFO("kv cache rm [p0, end)", {
2024-03-07 10:41:53 +01:00
    {"id_slot", slot.id},
    {"id_task", slot.id_task},
2024-02-25 13:50:32 +01:00
    {"p0", p0}
});
2024-01-30 19:17:30 +01:00
2024-01-27 14:38:05 +01:00
int32_t slot_npast = slot.n_past_se > 0 ? slot.n_past_se : slot.n_past;
2024-01-30 19:17:30 +01:00
int32_t ga_i = slot.ga_i;
2024-01-27 14:38:05 +01:00
int32_t ga_n = slot.ga_n;
int32_t ga_w = slot.ga_w;
2024-01-30 19:17:30 +01:00
2024-03-07 10:41:53 +01:00
// add prompt tokens for processing in the current batch
// TODO: the self-extend stuff here is a mess - simplify and/or abstract it somehow
for (; slot.n_past < slot.n_prompt_tokens && batch.n_tokens < n_batch; ++slot.n_past) {
    if (slot.ga_n != 1) {
2024-01-27 14:38:05 +01:00
        while (slot_npast >= ga_i + ga_w) {
            const int bd = (ga_w / ga_n) * (ga_n - 1);
            slot_npast -= bd;
            ga_i += ga_w / ga_n;
        }
    }
2024-03-07 10:41:53 +01:00
However, for this to be useful, uses of llama_kv_cache_seq_rm/cp
will need to be changed to work on whole sequences.
* ggml : add ggml_ssm_conv as a new operator for the conv step of Mamba
This operator makes it possible to use and update the correct states
for each token of the batch in the same way as ggml_ssm_scan.
Other solutions which use existing operators would need loops which would
add too many nodes to the graph (at least the ones I thought of).
Using this operator further reduces the size of the CPU compute buffer
from 140.68 MiB to 103.20 MiB with Mamba 3B with a batch size of 512.
And (at least on CPU), it's a bit faster than before.
Note that "ggml_ssm_conv" is probably not the most appropriate name,
and it could be changed if a better one is found.
* llama : add inp_s_seq as a new input tensor
The most convenient implementation to select the correct state (for Mamba)
for each token is to directly get the correct index from a tensor.
This is why inp_s_seq is storing int32_t and not floats.
The other, less convenient way to select the correct state would be
to have inp_KQ_mask contain 1.0f for each state used by a token
and 0.0f otherwise. This complicates quickly fetching the first used
state of a token, and is also less efficient because a whole row
of the mask would always need to be read for each token.
Using indexes makes it easy to stop searching when there are
no more sequences for a token, and the first sequence assigned
is always very quickly available (it's the first element of each row).
* mamba : support llama_kv_cache_seq_cp copy chains
* mamba : support shifting and dividing the kv cache pos
* mamba : make the server and parallel examples work with whole sequences
A seq_id is dedicated to the system prompt in both cases.
* llama : make llama_kv_cache_seq_rm return whether it succeeded or not
* mamba : dedicate an input tensor for state copy indices
This is cleaner and makes it easier to adapt when/if token positions
(and by extension, inp_K_shift) are no longer integers.
* mamba : adapt perplexity, batched, and batched-bench examples
* perplexity : limit the max number of sequences
This adapts to what the loaded model can provide.
* llama : add llama_n_max_seq to get the upper limit for seq_ids
Used by the perplexity example.
* batched : pass n_parallel to the model's context params
This should have been there already, but it wasn't.
* batched-bench : reserve sequences to support Mamba
* batched-bench : fix tokens being put in wrong sequences
Generation quality isn't what's measured in there anyway,
but at least using the correct sequences avoids using non-consecutive
token positions.
* mamba : stop abusing attention metadata
This breaks existing converted-to-GGUF Mamba models,
but will allow supporting mixed architectures like MambaFormer
without needing to break Mamba models.
This will also allow changing the size of Mamba's states
without having to reconvert models in the future.
(e.g. using something other than d_conv - 1 columns for the conv_states
will not require breaking existing converted Mamba models again)
* gguf-py : add new KV metadata key-value pairs for Mamba
* llama : add new metadata key-value pairs for Mamba
* llama : guard against divisions by zero when n_head is 0
* mamba : rename "unlimited" KV cache property to "recurrent"
* mamba : more correctly update the "used" field of the KV cache
* ggml : in ggml_ssm_scan, use a threshold for soft_plus
This is how the official Mamba implementation does it,
and it's also what torch.nn.Softplus does.
* convert : for Mamba, fallback to internal NeoX tokenizer
The resulting models are exactly the same
as if the tokenizer.json and tokenizer_config.json of GPT-NeoX were there.
* mamba : support state saving and restoring
* ggml : implicitly pass src tensors through dst for Mamba-related ops
* mamba : clarify some comments
* server : fix cache_tokens not getting correctly resized
Otherwise, when the "we have to evaluate at least 1 token" special case
was triggered, an extra token was kept in cache_tokens even if it was
removed from the KV cache.
For Mamba, this caused useless prompt reprocessing when the previous
request triggered the above case.
* convert-hf : support new metadata keys for Mamba
For the models available at
https://huggingface.co/collections/state-spaces/transformers-compatible-mamba-65e7b40ab87e5297e45ae406
* mamba : rename metadata to be more similar to transformers library
This breaks existing converted-to-GGUF models,
but the metadata names are more "standard".
* mamba : support mamba-*-hf models
These models share their token_embd.weight with their output.weight
* mamba : add missing spaces
This is purely a formatting change.
* convert-hf : omit output.weight when identical with token_embd.weight
Only for Mamba for now, but it might be relevant for other models eventually.
Most Mamba models actually share these two tensors, albeit implicitly.
* readme : add Mamba to supported models, and add recent API changes
* mamba : move state_seq and state_mask views outside layer loop
A few tensors were also missing `struct` in front of `ggml_tensor`.
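
The "use a threshold for soft_plus" item above can be illustrated with a minimal sketch. This is not the ggml_ssm_scan code; the function name is hypothetical, and the threshold of 20 is the documented default of torch.nn.Softplus, assumed here to be the value used.

```cpp
#include <cmath>

// Softplus with a linear fallback above a threshold (hypothetical helper).
// Assumption: threshold = 20, the default of torch.nn.Softplus; the exact
// value used inside ggml_ssm_scan is not stated in the commit message.
static float softplus_thresholded(float x, float threshold = 20.0f) {
    if (x > threshold) {
        // For large x, log(1 + exp(x)) is numerically equal to x,
        // and skipping exp() avoids overflow to infinity.
        return x;
    }
    return std::log1p(std::exp(x));
}
```

Without the threshold, an input such as 100.0f overflows exp() in single precision and yields infinity; with it, the input is returned unchanged.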
2024-03-08 23:31:00 +01:00
                        llama_batch_add(batch, prompt_tokens[slot.n_past], system_tokens.size() + slot_npast, { slot.id + 1 }, false);
2024-03-07 10:41:53 +01:00
                        if (slot.params.cache_prompt) {
                            slot.cache_tokens.push_back(prompt_tokens[slot.n_past]);
                        }
                        slot.n_prompt_tokens_processed++;
2024-01-30 19:17:30 +01:00
                        slot_npast++;
2023-10-22 21:53:08 +02:00
                    }
2024-03-07 10:41:53 +01:00
                    LOG_VERBOSE("prompt processing progress", {
                        {"id_slot", slot.id},
                        {"n_past", slot.n_past},
                        {"n_ctx", n_ctx},
                        {"n_tokens", batch.n_tokens},
                        {"progress", (float) slot.n_prompt_tokens_processed / slot.n_prompt_tokens},
                    });
2024-09-06 23:21:29 +02:00
                    // entire prompt has been processed
2024-03-07 10:41:53 +01:00
                    if (slot.n_past == slot.n_prompt_tokens) {
2024-09-06 23:21:29 +02:00
                        slot.state = SLOT_STATE_DONE_PROMPT;
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
                        GGML_ASSERT(batch.n_tokens > 0);
                        // extract the logits only for the last token
2023-10-22 21:53:08 +02:00
                        batch.logits[batch.n_tokens - 1] = true;
2024-03-07 10:41:53 +01:00
                        slot.n_decoded = 0;
                        slot.i_batch = batch.n_tokens - 1;
                        LOG_VERBOSE("prompt done", {
                            {"id_slot", slot.id},
                            {"n_past", slot.n_past},
                            {"n_ctx", n_ctx},
                            {"n_tokens", batch.n_tokens},
                        });
2023-10-22 21:53:08 +02:00
                    }
2024-03-07 10:41:53 +01:00
                }
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
                if (batch.n_tokens >= n_batch) {
                    break;
2023-10-22 21:53:08 +02:00
                }
            }
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
}
2023-05-21 19:51:18 +02:00
2024-03-07 10:41:53 +01:00
        if (batch.n_tokens == 0) {
            LOG_VERBOSE("no tokens to decode", {});
2024-03-11 10:56:41 +01:00
            return;
2023-06-17 13:53:04 +02:00
}
2023-05-21 19:51:18 +02:00
2024-03-07 10:41:53 +01:00
        LOG_VERBOSE("decoding batch", {
            {"n_tokens", batch.n_tokens},
        });
2024-07-12 10:14:12 +02:00
        // make sure we're in the right embedding mode
        llama_set_embeddings(ctx, batch_type == 1);
2024-03-07 10:41:53 +01:00
        // process the created batch of tokens
2024-04-26 12:15:30 +02:00
        for (int32_t i = 0; i < batch.n_tokens; i += n_batch) {
2024-03-04 21:31:20 +01:00
            const int32_t n_tokens = std::min(n_batch, batch.n_tokens - i);
2024-01-27 14:38:05 +01:00
2024-03-07 10:41:53 +01:00
            for (auto & slot : slots) {
                if (slot.ga_n != 1) {
2024-01-27 14:38:05 +01:00
                    // context extension via Self-Extend
2024-03-07 10:41:53 +01:00
                    // TODO: simplify and/or abstract this
                    while (slot.n_past_se >= slot.ga_i + slot.ga_w) {
2024-01-27 14:38:05 +01:00
                        const int ib = (slot.ga_n * slot.ga_i) / slot.ga_w;
                        const int bd = (slot.ga_w / slot.ga_n) * (slot.ga_n - 1);
                        const int dd = (slot.ga_w / slot.ga_n) - ib * bd - slot.ga_w;
                        LOG_TEE("\n");
                        LOG_TEE("shift: [%6d, %6d] + %6d -> [%6d, %6d]\n", slot.ga_i, slot.n_past_se, ib * bd, slot.ga_i + ib * bd, slot.n_past_se + ib * bd);
                        LOG_TEE("div: [%6d, %6d] / %6d -> [%6d, %6d]\n", slot.ga_i + ib * bd, slot.ga_i + ib * bd + slot.ga_w, slot.ga_n, (slot.ga_i + ib * bd) / slot.ga_n, (slot.ga_i + ib * bd + slot.ga_w) / slot.ga_n);
                        LOG_TEE("shift: [%6d, %6d] + %6d -> [%6d, %6d]\n", slot.ga_i + ib * bd + slot.ga_w, slot.n_past_se + ib * bd, dd, slot.ga_i + ib * bd + slot.ga_w + dd, slot.n_past_se + ib * bd + dd);
llama : support Mamba Selective State Space Models (#5328)
* mamba : begin working on support for Mamba SSM
* mamba : begin figuring out how to (ab)use the kv cache for Mamba
* mamba : recurrent inference almost works, but incoherent
* mamba : recurrent inference WORKS!!!
* convert : optionally use d_conv and d_state from config.json for Mamba
* mamba : refactor recurrent conv, resulting in 20% perf increase
It's still slower than I'd like, but I did not really optimize `ggml_exp` yet.
I also refactored `ggml_exp` to work with tensors with more than 2 dimensions.
* ggml : parallelize ggml_exp
This results in 8% faster token generation for Mamba-130M.
* mamba : simplify the conv step with a self-overlapping view
Turns out the conv_state can be made smaller by one column.
Note that this breaks existing GGUFs of Mamba,
because the key_value_length field is tied to the conv_state size.
Convolution with a self-overlapping view is cool!
And it's much simpler than what I initially thought would be necessary
to make the convolution step work with more than 1 token at a time.
Next step is to make the SSM step work on batches of tokens too,
and thus I need to figure out a way to make a parallel selective scan
which will keep the ssm_state small and won't make it bigger
by a factor of (n_layer * batch_size).
* llama : fix Mamba KV self size wrongly displaying as f16 instead of f32
Relatedly, I also tried to see if other types than f32 worked for the states,
but they don't, because of the operators used.
It's probably better anyway to keep lots of precision there,
since the states are small anyway.
* mamba : fix self-overlapping view depth stride
* mamba : handle batches of more than 1 token
This means running Mamba no longer crashes when using the default settings!
And probably also slightly faster prompt processing.
Both batched and non-batched processing yield the same output.
Previously, the state was not cleared when starting a sequence.
Next step is to make the KV cache API work as expected for Mamba models.
* ggml: add ggml_ssm_scan to help with parallel selective scan
If the selective scan was implemented without a custom operator,
there would be waaay too many nodes in the graph. For example,
for Mamba-130M, with a batch size of 512 (the default),
a naive selective scan could add at least 24*512=12288 nodes,
which is more than LLAMA_MAX_NODES (8192),
and that's only for the smallest Mamba model.
So it's much cleaner with a custom operator.
Not sure about the name, though.
* ggml : in ggml_ssm_scan, merge multiple rows in the same vec operation
This will help with performance on CPU if ggml_vec_mul_f32
and ggml_vec_add_f32 are ever optimized with SIMD.
* mamba : very basic quantization support
Mostly works, but there is currently no difference
between the variants of a k-quant (e.g. Q4_K_S and Q4_K_M are the same).
Most of the SSM-specific weights can be kept in f32 without affecting
the size that much, since they are relatively small.
(the linear projection weights are responsible for most of Mamba's size)
Too much quantization seems to make the state degrade quite fast, and
the model begins to output gibberish.
It seems to affect bigger models to a lesser extent than small models,
but I'm not sure by how much.
Experimentation will be needed to figure out which weights are more important
for the _M (and _L?) variants of k-quants for Mamba.
* convert : fix wrong name for layer norm weight of official Mamba models
I was using Q-bert/Mamba-* models before, which have a slightly different
naming scheme for the weights.
(they start with "model.layers" instead of "backbone.layers")
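
The node-count argument in the ggml_ssm_scan item above can be checked with a few lines of arithmetic. This is only a sketch: the helper name is hypothetical, and reading the factor 24 as the layer count of Mamba-130M is an assumption; the numbers 24, 512 and 8192 are the ones quoted in the commit message.

```cpp
#include <cstdint>

// Lower bound on graph nodes for a naive (unfused) selective scan,
// assuming at least one node per layer per token in the batch.
constexpr int64_t naive_scan_nodes(int64_t n_layer, int64_t n_batch) {
    return n_layer * n_batch;
}

// Values quoted in the commit message (24 assumed to be the layer count).
constexpr int64_t k_n_layer_mamba_130m = 24;
constexpr int64_t k_default_n_batch    = 512;
constexpr int64_t k_llama_max_nodes    = 8192; // LLAMA_MAX_NODES as quoted

static_assert(naive_scan_nodes(k_n_layer_mamba_130m, k_default_n_batch) == 12288,
              "24 * 512 nodes for the smallest model");
static_assert(naive_scan_nodes(k_n_layer_mamba_130m, k_default_n_batch) > k_llama_max_nodes,
              "a naive scan would exceed the graph node limit");
```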
2024-03-08 23:31:00 +01:00
                        llama_kv_cache_seq_add(ctx, slot.id + 1, slot.ga_i, slot.n_past_se, ib * bd);
                        llama_kv_cache_seq_div(ctx, slot.id + 1, slot.ga_i + ib * bd, slot.ga_i + ib * bd + slot.ga_w, slot.ga_n);
                        llama_kv_cache_seq_add(ctx, slot.id + 1, slot.ga_i + ib * bd + slot.ga_w, slot.n_past_se + ib * bd, dd);
2024-01-27 14:38:05 +01:00
                        slot.n_past_se -= bd;
                        slot.ga_i += slot.ga_w / slot.ga_n;
                        LOG_TEE("\nn_past_old = %d, n_past = %d, ga_i = %d\n\n", slot.n_past_se + bd, slot.n_past_se, slot.ga_i);
                    }
2024-03-07 10:41:53 +01:00
2024-01-27 14:38:05 +01:00
                    slot.n_past_se += n_tokens;
                }
            }
2024-01-30 19:17:30 +01:00
2024-03-07 10:41:53 +01:00
            llama_batch batch_view = {
2023-10-22 21:53:08 +02:00
                n_tokens,
                batch.token + i,
                nullptr,
                batch.pos + i,
                batch.n_seq_id + i,
                batch.seq_id + i,
                batch.logits + i,
                0, 0, 0, // unused
            };
2023-05-21 19:51:18 +02:00
2023-10-22 21:53:08 +02:00
            const int ret = llama_decode(ctx, batch_view);
2024-09-06 23:21:29 +02:00
            metrics.on_decoded(slots);
2024-01-27 14:38:05 +01:00
2024-03-07 10:41:53 +01:00
            if (ret != 0) {
                if (n_batch == 1 || ret < 0) {
2023-10-22 21:53:08 +02:00
                    // if you get here, it means the KV cache is full - try increasing it via the context size
2024-04-12 13:49:21 +02:00
                    LOG_ERROR("failed to decode the batch: KV cache is full - try increasing it via the context size", {
2024-09-06 23:21:29 +02:00
                        {"i", i},
                        {"n_batch", n_batch},
                        {"ret", ret},
2024-04-12 13:49:21 +02:00
                    });
2024-03-11 10:56:41 +01:00
                    for (auto & slot : slots) {
                        slot.release();
                        send_error(slot, "Input prompt is too big compared to KV size. Please try increasing KV size.");
                    }
                    break; // break loop of n_batch
2023-10-22 21:53:08 +02:00
                }
2023-06-20 00:12:39 +02:00
2023-10-22 21:53:08 +02:00
                // retry with half the batch size to try to find a free slot in the KV cache
                n_batch /= 2;
                i -= n_batch;
2024-03-07 10:41:53 +01:00
2024-04-12 13:49:21 +02:00
                LOG_WARNING("failed to find free space in the KV cache, retrying with smaller batch size - try increasing it via the context size or enable defragmentation", {
2024-09-06 23:21:29 +02:00
                    {"i", i},
                    {"n_batch", n_batch},
                    {"ret", ret},
2024-04-12 13:49:21 +02:00
                });
2024-03-11 10:56:41 +01:00
                continue; // continue loop of n_batch
2023-10-22 21:53:08 +02:00
            }
2024-03-07 10:41:53 +01:00
for ( auto & slot : slots ) {
2024-09-06 23:21:29 +02:00
if ( slot . i_batch < ( int ) i | | slot . i_batch > = ( int ) ( i + n_tokens ) ) {
2024-03-11 10:56:41 +01:00
continue ; // continue loop of slots
2023-10-22 21:53:08 +02:00
}
2024-09-06 23:21:29 +02:00
if ( slot . state = = SLOT_STATE_DONE_PROMPT ) {
if ( slot . cmpl_type = = SERVER_TASK_CMPL_TYPE_EMBEDDING ) {
// prompt evaluated for embedding
send_embedding ( slot , batch_view ) ;
slot . release ( ) ;
slot . i_batch = - 1 ;
continue ; // continue loop of slots
}
2024-09-07 14:16:19 +02:00
// prompt evaluated for next-token prediction
slot . state = SLOT_STATE_GENERATING ;
2024-09-06 23:21:29 +02:00
} else if ( slot . state ! = SLOT_STATE_GENERATING ) {
2024-03-11 10:56:41 +01:00
continue ; // continue loop of slots
2023-10-22 21:53:08 +02:00
}
completion_token_output result;
2024-09-07 14:16:19 +02:00
const llama_token id = gpt_sampler_sample(slot.smpl, ctx, slot.i_batch - i);
2023-10-22 21:53:08 +02:00
2024-09-07 14:16:19 +02:00
gpt_sampler_accept(slot.smpl, id, true);
2023-10-22 21:53:08 +02:00
2024-01-07 07:45:26 +01:00
slot.n_decoded += 1;
2024-03-07 10:41:53 +01:00
if (slot.n_decoded == 1) {
slot.t_start_generation = ggml_time_us();
slot.t_prompt_processing = (slot.t_start_generation - slot.t_start_process_prompt) / 1e3;
2024-02-25 13:49:43 +01:00
metrics.on_prompt_eval(slot);
2023-10-22 21:53:08 +02:00
}
result.tok = id;
2024-09-07 14:16:19 +02:00
const auto * cur_p = gpt_sampler_get_candidates(slot.smpl);
2023-10-22 21:53:08 +02:00
2024-09-07 14:16:19 +02:00
for (size_t i = 0; i < (size_t) slot.sparams.n_probs; ++i) {
result.probs.push_back({
cur_p->data[i].id,
i >= cur_p->size ? 0.0f : cur_p->data[i].p,
});
2023-10-22 21:53:08 +02:00
}
2024-03-07 10:41:53 +01:00
if (!process_token(result, slot)) {
2024-09-06 23:21:29 +02:00
// release slot because of stop condition
2023-10-22 21:53:08 +02:00
slot.release();
slot.print_timings();
2023-10-24 22:08:20 +02:00
send_final_response(slot);
2024-02-25 13:49:43 +01:00
metrics.on_prediction(slot);
2023-10-22 21:53:08 +02:00
}
slot.i_batch = -1;
}
2023-06-20 00:12:39 +02:00
}
2024-02-25 13:50:32 +01:00
2024-03-11 10:56:41 +01:00
LOG_VERBOSE("run slots completed", {});
2023-06-20 00:12:39 +02:00
}
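The retry path above halves `n_batch` whenever decoding fails to find room in the KV cache. A minimal standalone sketch of that backoff idea follows. The names `decode_with_backoff` and `decode_fn` are hypothetical stand-ins for the server's batch loop and its decode call, and the return-code convention (0 for success, positive for a recoverable failure, negative for a hard error) is an assumption made for illustration. The sketch also uses a `while` loop instead of rewinding the loop index as the original does.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>

// Hypothetical stand-in for the decode call: returns 0 on success,
// > 0 when no free KV-cache space was found, < 0 on a hard error (assumed convention).
using decode_fn = std::function<int(int32_t offset, int32_t n_tokens)>;

// Processes n_total tokens in chunks of at most n_batch tokens.
// On a recoverable failure the chunk size is halved and the same offset is retried.
// Returns true only when every token was decoded.
bool decode_with_backoff(int32_t n_total, int32_t n_batch, const decode_fn & decode) {
    int32_t i = 0;
    while (i < n_total) {
        const int32_t n_tokens = std::min(n_batch, n_total - i);
        const int ret = decode(i, n_tokens);
        if (ret != 0) {
            if (n_batch == 1 || ret < 0) {
                return false; // cannot shrink further, or the error is not recoverable
            }
            n_batch /= 2; // retry the same offset with half the batch size
            continue;
        }
        i += n_tokens; // advance only after a successful decode
    }
    return true;
}
```

Keeping the offset fixed on failure means no token is skipped, and the loop terminates because the batch size shrinks on every recoverable failure until it reaches 1.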
2024-03-02 22:00:14 +01:00
2024-03-07 10:41:53 +01:00
json model_meta() const {
return json {
{"vocab_type", llama_vocab_type(model)},
{"n_vocab", llama_n_vocab(model)},
{"n_ctx_train", llama_n_ctx_train(model)},
{"n_embd", llama_n_embd(model)},
{"n_params", llama_model_n_params(model)},
{"size", llama_model_size(model)},
2024-03-02 22:00:14 +01:00
};
}
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
};
2024-03-07 10:41:53 +01:00
static void log_server_request(const httplib::Request & req, const httplib::Response & res) {
2024-02-25 13:50:32 +01:00
// skip GH copilot requests when using default port
2024-03-07 10:41:53 +01:00
if (req.path == "/v1/health" || req.path == "/v1/completions") {
2024-02-25 13:50:32 +01:00
return;
}
2023-06-17 13:53:04 +02:00
LOG_INFO("request", {
2024-02-25 13:50:32 +01:00
{"remote_addr", req.remote_addr},
{"remote_port", req.remote_port},
{"status", res.status},
{"method", req.method},
{"path", req.path},
{"params", req.params},
});
2023-07-04 16:05:27 +02:00
LOG_VERBOSE("request", {
2024-02-25 13:50:32 +01:00
{"request", req.body},
{"response", res.body},
});
2023-06-17 13:53:04 +02:00
}
2023-05-21 19:51:18 +02:00
2024-02-18 17:23:16 +01:00
std::function<void(int)> shutdown_handler;
2024-02-28 09:55:37 +01:00
std::atomic_flag is_terminating = ATOMIC_FLAG_INIT;
2024-03-07 10:41:53 +01:00
2024-02-28 09:55:37 +01:00
inline void signal_handler(int signal) {
if (is_terminating.test_and_set()) {
// in case it hangs, we can force terminate the server by hitting Ctrl+C twice
// this is for better developer experience, we can remove when the server is stable enough
fprintf(stderr, "Received second interrupt, terminating immediately.\n");
exit(1);
}
2024-03-07 10:41:53 +01:00
2024-02-28 09:55:37 +01:00
shutdown_handler(signal);
}
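The handler above relies on `std::atomic_flag::test_and_set` returning the previous value of the flag, so the first interrupt requests a graceful shutdown and any later one forces an exit. A small sketch that isolates this decision is given here so it can be exercised without raising real signals. The names `interrupt_gate` and `interrupt_action` are hypothetical and do not appear in the server code.

```cpp
#include <atomic>

// Hypothetical result type: what the caller should do for this interrupt.
enum class interrupt_action { graceful_shutdown, force_exit };

// Hypothetical helper wrapping the "second Ctrl+C terminates immediately" rule.
class interrupt_gate {
public:
    interrupt_action on_signal() {
        // test_and_set atomically sets the flag and returns its previous value:
        // false on the first call, true on every call after that.
        return flag.test_and_set() ? interrupt_action::force_exit
                                   : interrupt_action::graceful_shutdown;
    }

private:
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
};
```

A real handler would map `graceful_shutdown` to the shutdown callback and `force_exit` to an immediate process exit, as the original function does.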
2024-02-18 17:23:16 +01:00
2024-03-07 10:41:53 +01:00
int main(int argc, char ** argv) {
2023-12-17 16:02:16 +01:00
#if SERVER_VERBOSE != 1
log_disable();
#endif
2023-06-17 13:53:04 +02:00
// own arguments required by this example
2024-06-04 20:23:39 +02:00
gpt_params params;
if (!gpt_params_parse(argc, argv, params)) {
gpt_params_print_usage(argc, argv, params);
return 1;
}
2024-08-21 11:04:34 +02:00
// parse arguments from environment variables
gpt_params_parse_from_env(params);
2024-06-04 20:23:39 +02:00
// TODO: not great to use extern vars
server_log_json = params.log_json;
2024-06-06 15:30:58 +02:00
server_verbose = params.verbosity > 0;
2023-06-17 13:53:04 +02:00
// struct that contains llama context and inference
2024-03-07 10:41:53 +01:00
server_context ctx_server;
2023-06-17 13:53:04 +02:00
2024-06-04 20:23:39 +02:00
if (!params.system_prompt.empty()) {
ctx_server.system_prompt_set(params.system_prompt);
2024-03-07 10:41:53 +01:00
}
if (params.model_alias == "unknown") {
2023-06-17 13:53:04 +02:00
params.model_alias = params.model;
}
2024-02-16 10:31:07 +01:00
llama_backend_init();
llama_numa_init(params.numa);
2023-06-17 13:53:04 +02:00
2024-03-07 10:41:53 +01:00
LOG_INFO("build info", {
{"build", LLAMA_BUILD_NUMBER},
{"commit", LLAMA_COMMIT}
});
2023-10-22 21:53:08 +02:00
2023-06-17 13:53:04 +02:00
LOG_INFO("system info", {
Threadpool: take 2 (#8672)
* Introduce ggml_compute_threadpool
- OpenMP functional: check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems
* Minor fixes
* fixed use after release bug
* fixed a harmless race condition
* Fix Android build issue
* fix more race conditions
* fix deadlock for cases where cgraph.n_nodes == 1
and fix --poll case
* threadpool: use cpu_get_num_math to set the default number of threadpool threads
This way we avoid using E-Cores and Hyperthreaded siblings.
* bench: create fresh threadpool for each test
For benchmarking it's better to start a fresh pool for each test with the exact number of threads
needed for that test. Having larger pools is suboptimal (causes more load, etc).
* atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier
This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.
* threadpool: make polling the default to match openmp behavior
All command line args now allow for setting poll to 0 (false).
* threadpool: do not wakeup threads in already paused threadpool
* fix potential race condition in check_for_work
* threadpool: do not create two threadpools if their params are identical
* threadpool: reduce pause/resume/wakeup overhead in common cases
We now start threadpool in paused state only if we have two.
The resume is now implicit (ie new work) which allows for reduced locking and context-switch overhead.
* threadpool: add support for hybrid polling
poll params (--poll, ...) now specify "polling level", i.e. how aggressively we poll before waiting on cond.var.
poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ...
The default value of 50 (ie 50x128K rounds) seems like a decent default across modern platforms.
We can tune this further as things evolve.
* threadpool: reduce the number of barrier required
New work is now indicated with an atomic counter that is incremented for
each new graph that needs to be computed.
This removes the need for extra barrier for clearing the "new_work" and
removes the special case for trivial graphs.
* threadpool: remove special-casing for disposable threadpools
With the efficient hybrid polling there is no need to make disposable pools any different.
This simplifies the overall logic and reduces branching.
Include n_threads in debug print for disposable threadpool.
Declare pause and stop flags as atomic_bool
This doesn't actually generate any memory barriers and simply informs
the thread sanitizer that these flags can be written & read by different
threads without locking.
* threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs)
This fixes the race condition with very small graphs where the main thread happens to
start a new graph while the workers are just about to exit from barriers.
* threadpool: use relaxed order for chunk sync
Full memory barrier is an overkill for this since each thread works on different chunk
* threadpool: remove abort_callback from threadpool state
* threadpool: better naming for thread/cpumask related functions
* threadpool: consistent use of int type for n_threads params
* threadpool: add support for ggml_threadpool_params_default/init
Also removes the need for explicit mask_specified param.
all-zero cpumask means use default (usually inherited) cpu affinity mask.
* threadpool: move typedef into ggml.h
* threadpool: fix apply_priority() function name
* threadpool: fix swift wrapper errors due to n_threads int type cleanup
* threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled
* threadpool: replace checks for compute_thread ret code with proper status check
* threadpool: simplify threadpool init logic and fix main thread affinity application
Most of the init code is now exactly the same between threadpool and openmp.
* threadpool: update threadpool resume/pause function names
* threadpool: enable openmp by default for now
* threadpool: don't forget to free workers state when omp is enabled
* threadpool: avoid updating process priority on the platforms that do not require it
On Windows we need to change overall process priority class in order to set thread priorities,
but on Linux, Mac, etc we do not need to touch the overall process settings.
* threadpool: update calling thread prio and affinity only at start/resume
This avoids extra syscalls for each graph_compute()
* llama-bench: turn threadpool params into vectors, add output headers, etc
* llama-bench: add support for cool off between tests --delay
This helps for long running tests on platforms that are thermally limited (phones, laptops, etc).
--delay (disabled by default) introduces the sleep for N seconds before starting each test.
* threadpool: move process priority setting into the apps (bench and cli)
This avoids changing the overall process priority on Windows for the apps
that use ggml/llama.cpp directly.
* threadpool: move all pause/resume logic into ggml
* threadpool: further api cleanup and prep for future refactoring
All threadpool related functions and structs use ggml_threadpool prefix.
* threadpool: minor indent fixes
* threadpool: improve setpriority error message
* Update examples/llama-bench/llama-bench.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* threadpool: fix indent in set_threadpool call
* use int32_t for n_thread type in public llama.cpp API
* threadpool: use _new and _free instead of _create and _release
* fix two more public APIs to use int32_t for n_threads
* build: set _GNU_SOURCE for Android
---------
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
Co-authored-by: fmz <quic_fzaghlou@quic.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-08-30 01:20:53 +02:00
{ " n_threads " , params . cpuparams . n_threads } ,
{ " n_threads_batch " , params . cpuparams_batch . n_threads } ,
2024-03-07 10:41:53 +01:00
{ " total_threads " , std : : thread : : hardware_concurrency ( ) } ,
{ " system_info " , llama_print_system_info ( ) } ,
} ) ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort: there are 24 PRs closed in the submitter's repo alone, over 160 commits, and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2024-03-09 10:57:09 +01:00
std : : unique_ptr < httplib : : Server > svr ;
# ifdef CPPHTTPLIB_OPENSSL_SUPPORT
2024-06-04 20:23:39 +02:00
if ( params . ssl_file_key ! = " " & & params . ssl_file_cert ! = " " ) {
LOG_INFO ( " Running with SSL " , { { " key " , params . ssl_file_key } , { " cert " , params . ssl_file_cert } } ) ;
2024-03-09 10:57:09 +01:00
svr . reset (
2024-06-04 20:23:39 +02:00
new httplib : : SSLServer ( params . ssl_file_cert . c_str ( ) , params . ssl_file_key . c_str ( ) )
2024-03-09 10:57:09 +01:00
) ;
} else {
LOG_INFO ( " Running without SSL " , { } ) ;
svr . reset ( new httplib : : Server ( ) ) ;
}
# else
svr . reset ( new httplib : : Server ( ) ) ;
# endif
2024-01-10 20:56:05 +01:00
2024-01-11 08:10:34 +01:00
std : : atomic < server_state > state { SERVER_STATE_LOADING_MODEL } ;
2024-01-10 20:56:05 +01:00
2024-03-09 10:57:09 +01:00
svr - > set_default_headers ( { { " Server " , " llama.cpp " } } ) ;
2024-01-11 19:02:48 +01:00
// CORS preflight
2024-08-16 17:19:05 +02:00
svr - > Options ( R " (.*) " , [ ] ( const httplib : : Request & , httplib : : Response & res ) {
// Access-Control-Allow-Origin is already set by middleware
2024-01-11 19:02:48 +01:00
res . set_header ( " Access-Control-Allow-Credentials " , " true " ) ;
2024-03-07 10:41:53 +01:00
res . set_header ( " Access-Control-Allow-Methods " , " POST " ) ;
res . set_header ( " Access-Control-Allow-Headers " , " * " ) ;
2024-08-16 17:19:05 +02:00
return res . set_content ( " " , " text/html " ) ; // blank response, no data
2024-01-11 19:02:48 +01:00
} ) ;
2024-01-10 20:56:05 +01:00
2024-03-09 10:57:09 +01:00
svr - > set_logger ( log_server_request ) ;
2024-01-10 20:56:05 +01:00
2024-03-11 10:56:41 +01:00
auto res_error = [ ] ( httplib : : Response & res , json error_data ) {
json final_response { { " error " , error_data } } ;
2024-08-27 12:28:06 +02:00
res . set_content ( final_response . dump ( - 1 , ' ' , false , json : : error_handler_t : : replace ) , MIMETYPE_JSON ) ;
2024-03-11 10:56:41 +01:00
res . status = json_value ( error_data , " code " , 500 ) ;
} ;
2024-01-10 20:56:05 +01:00
2024-09-02 17:11:51 +02:00
auto res_ok = [ ] ( httplib : : Response & res , json data ) {
res . set_content ( data . dump ( - 1 , ' ' , false , json : : error_handler_t : : replace ) , MIMETYPE_JSON ) ;
res . status = 200 ;
} ;
2024-03-11 10:56:41 +01:00
svr - > set_exception_handler ( [ & res_error ] ( const httplib : : Request & , httplib : : Response & res , std : : exception_ptr ep ) {
std : : string message ;
2024-03-07 10:41:53 +01:00
try {
std : : rethrow_exception ( std : : move ( ep ) ) ;
2024-03-11 10:56:41 +01:00
} catch ( std : : exception & e ) {
message = e . what ( ) ;
2024-03-07 10:41:53 +01:00
} catch ( . . . ) {
2024-03-11 10:56:41 +01:00
message = " Unknown Exception " ;
2024-03-07 10:41:53 +01:00
}
2024-03-11 10:56:41 +01:00
json formatted_error = format_error_response ( message , ERROR_TYPE_SERVER ) ;
LOG_VERBOSE ( " Got exception " , formatted_error ) ;
res_error ( res , formatted_error ) ;
2024-03-07 10:41:53 +01:00
} ) ;
2024-03-11 10:56:41 +01:00
svr - > set_error_handler ( [ & res_error ] ( const httplib : : Request & , httplib : : Response & res ) {
2024-03-07 10:41:53 +01:00
if ( res . status = = 404 ) {
2024-03-11 10:56:41 +01:00
res_error ( res , format_error_response ( " File Not Found " , ERROR_TYPE_NOT_FOUND ) ) ;
2024-03-07 10:41:53 +01:00
}
2024-03-11 10:56:41 +01:00
// for other error codes, we skip processing here because it's already done by res_error()
2024-03-07 10:41:53 +01:00
} ) ;
2024-01-10 20:56:05 +01:00
// set timeouts and change hostname and port
2024-06-04 20:23:39 +02:00
svr - > set_read_timeout ( params . timeout_read ) ;
svr - > set_write_timeout ( params . timeout_write ) ;
2024-01-10 20:56:05 +01:00
std : : unordered_map < std : : string , std : : string > log_data ;
2024-03-07 10:41:53 +01:00
2024-06-04 20:23:39 +02:00
log_data [ " hostname " ] = params . hostname ;
log_data [ " port " ] = std : : to_string ( params . port ) ;
2024-01-10 20:56:05 +01:00
2024-06-04 20:23:39 +02:00
if ( params . api_keys . size ( ) = = 1 ) {
auto key = params . api_keys [ 0 ] ;
2024-03-09 11:27:53 +01:00
log_data [ " api_key " ] = " api_key: **** " + key . substr ( std : : max ( ( int ) ( key . length ( ) - 4 ) , 0 ) ) ;
2024-06-04 20:23:39 +02:00
} else if ( params . api_keys . size ( ) > 1 ) {
log_data [ " api_key " ] = " api_key: " + std : : to_string ( params . api_keys . size ( ) ) + " keys loaded " ;
2024-01-10 20:56:05 +01:00
}
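The key-masking step above (log only the last four characters of a single configured key) can be sketched in isolation. This is a minimal sketch, not the server's code: `mask_api_key` and its `visible` parameter are hypothetical names, and the size check replaces the signed cast used in the original expression.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the server): return "****" followed by
// at most the last `visible` characters of the key.
std::string mask_api_key(const std::string & key, std::size_t visible = 4) {
    // Compare sizes before subtracting so the unsigned value never wraps.
    const std::size_t start = key.size() > visible ? key.size() - visible : 0;
    return "****" + key.substr(start);
}
```

Keys shorter than `visible` are shown in full after the mask, which matches what the original `std::max(..., 0)` guard does.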
2024-06-08 09:50:31 +02:00
// Necessary similarity of prompt for slot selection
ctx_server . slot_prompt_similarity = params . slot_prompt_similarity ;
2024-03-09 11:27:53 +01:00
//
// Middlewares
//
2024-06-04 20:23:39 +02:00
auto middleware_validate_api_key = [ & params , & res_error ] ( const httplib : : Request & req , httplib : : Response & res ) {
2024-03-09 11:27:53 +01:00
// TODO: should we apply API key to all endpoints, including "/health" and "/models"?
2024-09-02 17:11:51 +02:00
static const std : : unordered_set < std : : string > protected_endpoints = {
2024-03-09 11:27:53 +01:00
" /props " ,
" /completion " ,
" /completions " ,
" /v1/completions " ,
" /chat/completions " ,
" /v1/chat/completions " ,
" /infill " ,
" /tokenize " ,
" /detokenize " ,
" /embedding " ,
" /embeddings " ,
" /v1/embeddings " ,
} ;
2023-12-15 12:49:01 +01:00
// If API key is not set, skip validation
2024-06-04 20:23:39 +02:00
if ( params . api_keys . empty ( ) ) {
2023-12-15 12:49:01 +01:00
return true ;
}
2024-03-09 11:27:53 +01:00
// If path is not in protected_endpoints list, skip validation
if ( protected_endpoints . find ( req . path ) = = protected_endpoints . end ( ) ) {
return true ;
}
2023-12-15 12:49:01 +01:00
// Check for API key in the header
auto auth_header = req . get_header_value ( " Authorization " ) ;
2024-03-07 10:41:53 +01:00
2023-12-15 12:49:01 +01:00
std : : string prefix = " Bearer " ;
if ( auth_header . substr ( 0 , prefix . size ( ) ) = = prefix ) {
std : : string received_api_key = auth_header . substr ( prefix . size ( ) ) ;
2024-06-04 20:23:39 +02:00
if ( std : : find ( params . api_keys . begin ( ) , params . api_keys . end ( ) , received_api_key ) ! = params . api_keys . end ( ) ) {
2023-12-15 12:49:01 +01:00
return true ; // API key is valid
}
}
// API key is invalid or not provided
2024-03-11 10:56:41 +01:00
res_error ( res , format_error_response ( " Invalid API Key " , ERROR_TYPE_AUTHENTICATION ) ) ;
2023-12-15 12:49:01 +01:00
LOG_WARNING ( " Unauthorized: Invalid API Key " , { } ) ;
return false ;
} ;
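The header check performed by the middleware above can be sketched without httplib. This is a minimal sketch under stated assumptions: `is_authorized` is a hypothetical free function, the prefix is assumed to be `Bearer` followed by one space, and the protected-endpoint lookup is left out.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper (not part of the server): decide whether an
// Authorization header value carries one of the configured API keys.
bool is_authorized(const std::string & auth_header, const std::vector<std::string> & api_keys) {
    // No keys configured: validation is skipped, as in the middleware.
    if (api_keys.empty()) {
        return true;
    }
    static const std::string prefix = "Bearer ";
    // compare() yields non-zero when the header is shorter than the prefix.
    if (auth_header.compare(0, prefix.size(), prefix) != 0) {
        return false;
    }
    const std::string received = auth_header.substr(prefix.size());
    return std::find(api_keys.begin(), api_keys.end(), received) != api_keys.end();
}
```

A caller would pass the raw value of the `Authorization` header and reject the request when the function returns false.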
2024-08-16 17:19:05 +02:00
auto middleware_server_state = [ & res_error , & state ] ( const httplib : : Request & , httplib : : Response & res ) {
server_state current_state = state . load ( ) ;
if ( current_state = = SERVER_STATE_LOADING_MODEL ) {
res_error ( res , format_error_response ( " Loading model " , ERROR_TYPE_UNAVAILABLE ) ) ;
return false ;
}
return true ;
} ;
2024-03-09 11:27:53 +01:00
// register server middlewares
2024-08-16 17:19:05 +02:00
svr - > set_pre_routing_handler ( [ & middleware_validate_api_key , & middleware_server_state ] ( const httplib : : Request & req , httplib : : Response & res ) {
res . set_header ( " Access-Control-Allow-Origin " , req . get_header_value ( " Origin " ) ) ;
if ( ! middleware_server_state ( req , res ) ) {
return httplib : : Server : : HandlerResponse : : Handled ;
}
2024-03-09 11:27:53 +01:00
if ( ! middleware_validate_api_key ( req , res ) ) {
return httplib : : Server : : HandlerResponse : : Handled ;
}
return httplib : : Server : : HandlerResponse : : Unhandled ;
2024-03-07 10:41:53 +01:00
} ) ;
2023-07-05 22:51:13 +02:00
2024-03-09 11:27:53 +01:00
//
// Route handlers (or controllers)
//
2023-07-04 16:05:27 +02:00
2024-08-16 17:19:05 +02:00
const auto handle_health = [ & ] ( const httplib : : Request & , httplib : : Response & res ) {
// error and loading states are handled by middleware
json health = { { " status " , " ok " } } ;
2024-09-02 17:11:51 +02:00
res_ok ( res , health ) ;
2024-03-09 11:27:53 +01:00
} ;
2024-08-16 17:19:05 +02:00
const auto handle_slots = [ & ] ( const httplib : : Request & req , httplib : : Response & res ) {
2024-06-04 20:23:39 +02:00
if ( ! params . endpoint_slots ) {
2024-08-16 17:19:05 +02:00
res_error ( res , format_error_response ( " This server does not support slots endpoint. Start it without `--no-slots` " , ERROR_TYPE_NOT_SUPPORTED ) ) ;
2024-03-09 11:27:53 +01:00
return ;
}
// request slots data using task queue
server_task task ;
task . id = ctx_server . queue_tasks . get_new_id ( ) ;
task . type = SERVER_TASK_TYPE_METRICS ;
ctx_server . queue_results . add_waiting_task_id ( task . id ) ;
2024-09-06 23:21:29 +02:00
ctx_server . queue_tasks . post ( task , true ) ; // high-priority task
2024-03-09 11:27:53 +01:00
// get the result
server_task_result result = ctx_server . queue_results . recv ( task . id ) ;
ctx_server . queue_results . remove_waiting_task_id ( task . id ) ;
2024-08-16 17:19:05 +02:00
// optionally return "fail_on_no_slot" error
const int n_idle_slots = result . data . at ( " idle " ) ;
if ( req . has_param ( " fail_on_no_slot " ) ) {
if ( n_idle_slots = = 0 ) {
res_error ( res , format_error_response ( " no slot available " , ERROR_TYPE_UNAVAILABLE ) ) ;
return ;
}
}
2024-09-02 17:11:51 +02:00
res_ok ( res , result . data . at ( " slots " ) ) ;
2024-03-09 11:27:53 +01:00
} ;
const auto handle_metrics = [ & ] ( const httplib : : Request & , httplib : : Response & res ) {
2024-06-04 20:23:39 +02:00
if ( ! params . endpoint_metrics ) {
2024-08-16 17:19:05 +02:00
res_error ( res , format_error_response ( " This server does not support metrics endpoint. Start it with `--metrics` " , ERROR_TYPE_NOT_SUPPORTED ) ) ;
2024-03-09 11:27:53 +01:00
return ;
}
// request slots data using task queue
server_task task ;
task . id = ctx_server . queue_tasks . get_new_id ( ) ;
task . id_target = - 1 ;
task . type = SERVER_TASK_TYPE_METRICS ;
task . data . push_back ( { { " reset_bucket " , true } } ) ;
ctx_server . queue_results . add_waiting_task_id ( task . id ) ;
2024-09-06 23:21:29 +02:00
ctx_server . queue_tasks . post ( task , true ) ; // high-priority task
2024-03-09 11:27:53 +01:00
// get the result
server_task_result result = ctx_server . queue_results . recv ( task . id ) ;
ctx_server . queue_results . remove_waiting_task_id ( task . id ) ;
json data = result . data ;
2024-05-08 21:53:08 +02:00
const uint64_t n_prompt_tokens_processed = data . at ( " n_prompt_tokens_processed " ) ;
const uint64_t t_prompt_processing = data . at ( " t_prompt_processing " ) ;
2024-03-09 11:27:53 +01:00
2024-05-08 21:53:08 +02:00
const uint64_t n_tokens_predicted = data . at ( " n_tokens_predicted " ) ;
const uint64_t t_tokens_generation = data . at ( " t_tokens_generation " ) ;
2024-03-09 11:27:53 +01:00
2024-09-06 23:21:29 +02:00
const uint64_t n_decode_total = data . at ( " n_decode_total " ) ;
const uint64_t n_busy_slots_total = data . at ( " n_busy_slots_total " ) ;
2024-05-08 21:53:08 +02:00
const int32_t kv_cache_used_cells = data . at ( " kv_cache_used_cells " ) ;
2024-03-09 11:27:53 +01:00
// metrics definition: https://prometheus.io/docs/practices/naming/#metric-names
json all_metrics_def = json {
{ " counter " , { {
{ " name " , " prompt_tokens_total " } ,
{ " help " , " Number of prompt tokens processed. " } ,
2024-05-08 21:53:08 +02:00
{ " value " , ( uint64_t ) data . at ( " n_prompt_tokens_processed_total " ) }
2024-03-09 11:27:53 +01:00
} , {
{ " name " , " prompt_seconds_total " } ,
{ " help " , " Prompt process time " } ,
2024-05-08 21:53:08 +02:00
{ " value " , ( uint64_t ) data . at ( " t_prompt_processing_total " ) / 1.e3 }
2024-03-09 11:27:53 +01:00
} , {
{ " name " , " tokens_predicted_total " } ,
{ " help " , " Number of generation tokens processed. " } ,
2024-05-08 21:53:08 +02:00
{ " value " , ( uint64_t ) data . at ( " n_tokens_predicted_total " ) }
2024-03-09 11:27:53 +01:00
} , {
{ " name " , " tokens_predicted_seconds_total " } ,
{ " help " , " Predict process time " } ,
2024-05-08 21:53:08 +02:00
{ " value " , ( uint64_t ) data . at ( " t_tokens_generation_total " ) / 1.e3 }
2024-09-06 23:21:29 +02:00
} , {
{ " name " , " n_decode_total " } ,
{ " help " , " Total number of llama_decode() calls " } ,
{ " value " , n_decode_total }
} , {
{ " name " , " n_busy_slots_per_decode " } ,
{ " help " , " Average number of busy slots per llama_decode() call " } ,
{ " value " , ( float ) n_busy_slots_total / ( float ) n_decode_total }
2024-03-09 11:27:53 +01:00
} } } ,
{ " gauge " , { {
{ " name " , " prompt_tokens_seconds " } ,
{ " help " , " Average prompt throughput in tokens/s. " } ,
{ " value " , n_prompt_tokens_processed ? 1.e3 / t_prompt_processing * n_prompt_tokens_processed : 0. }
} , {
{ " name " , " predicted_tokens_seconds " } ,
{ " help " , " Average generation throughput in tokens/s. " } ,
{ " value " , n_tokens_predicted ? 1.e3 / t_tokens_generation * n_tokens_predicted : 0. }
} , {
{ " name " , " kv_cache_usage_ratio " } ,
{ " help " , " KV-cache usage. 1 means 100 percent usage. " } ,
{ " value " , 1. * kv_cache_used_cells / params . n_ctx }
} , {
{ " name " , " kv_cache_tokens " } ,
{ " help " , " KV-cache tokens. " } ,
2024-05-08 21:53:08 +02:00
{ " value " , ( uint64_t ) data . at ( " kv_cache_tokens_count " ) }
2024-03-09 11:27:53 +01:00
} , {
{ " name " , " requests_processing " } ,
{ " help " , " Number of request processing. " } ,
2024-05-08 21:53:08 +02:00
{ " value " , ( uint64_t ) data . at ( " processing " ) }
2024-03-09 11:27:53 +01:00
} , {
{ " name " , " requests_deferred " } ,
{ " help " , " Number of request deferred. " } ,
2024-05-08 21:53:08 +02:00
{ " value " , ( uint64_t ) data . at ( " deferred " ) }
2024-03-09 11:27:53 +01:00
} } }
} ;
std : : stringstream prometheus ;
for ( const auto & el : all_metrics_def . items ( ) ) {
const auto & type = el . key ( ) ;
const auto & metrics_def = el . value ( ) ;
for ( const auto & metric_def : metrics_def ) {
2024-05-08 21:53:08 +02:00
const std : : string name = metric_def . at ( " name " ) ;
const std : : string help = metric_def . at ( " help " ) ;
2024-03-09 11:27:53 +01:00
auto value = json_value ( metric_def , " value " , 0. ) ;
prometheus < < " # HELP llamacpp: " < < name < < " " < < help < < " \n "
< < " # TYPE llamacpp: " < < name < < " " < < type < < " \n "
< < " llamacpp: " < < name < < " " < < value < < " \n " ;
}
}
2024-05-08 21:53:08 +02:00
const int64_t t_start = data . at ( " t_start " ) ;
2024-03-09 11:27:53 +01:00
res . set_header ( " Process-Start-Time-Unix " , std : : to_string ( t_start ) ) ;
res . set_content ( prometheus . str ( ) , " text/plain; version=0.0.4 " ) ;
res . status = 200 ; // HTTP OK
} ;
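The loop that builds the metrics body above emits three lines per metric: a HELP line, a TYPE line, and the sample itself. A minimal sketch of that rendering, assuming a hypothetical `metric_def` struct in place of the JSON definitions and the same `llamacpp:` name prefix:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical plain struct standing in for one JSON metric definition.
struct metric_def {
    std::string type;  // "counter" or "gauge"
    std::string name;
    std::string help;
    double      value;
};

// Render metrics as HELP / TYPE / sample triplets in the text format.
std::string render_prometheus(const std::vector<metric_def> & metrics) {
    std::ostringstream out;
    for (const auto & m : metrics) {
        out << "# HELP llamacpp:" << m.name << " " << m.help  << "\n"
            << "# TYPE llamacpp:" << m.name << " " << m.type  << "\n"
            << "llamacpp:"        << m.name << " " << m.value << "\n";
    }
    return out.str();
}
```

Values are written with the stream's default formatting, so whole numbers print without a decimal point.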
2024-09-02 17:11:51 +02:00
const auto handle_slots_save = [ & ctx_server , & res_error , & res_ok , & params ] ( const httplib : : Request & req , httplib : : Response & res , int id_slot ) {
2024-04-08 14:43:30 +02:00
json request_data = json : : parse ( req . body ) ;
2024-05-08 21:53:08 +02:00
std : : string filename = request_data . at ( " filename " ) ;
2024-05-22 19:04:20 +02:00
if ( ! fs_validate_filename ( filename ) ) {
2024-04-08 14:43:30 +02:00
res_error ( res , format_error_response ( " Invalid filename " , ERROR_TYPE_INVALID_REQUEST ) ) ;
return ;
}
2024-06-04 20:23:39 +02:00
std : : string filepath = params . slot_save_path + filename ;
2024-04-08 14:43:30 +02:00
server_task task ;
task . type = SERVER_TASK_TYPE_SLOT_SAVE ;
task . data = {
{ " id_slot " , id_slot } ,
{ " filename " , filename } ,
2024-09-06 23:21:29 +02:00
{ " filepath " , filepath } ,
2024-04-08 14:43:30 +02:00
} ;
const int id_task = ctx_server . queue_tasks . post ( task ) ;
ctx_server . queue_results . add_waiting_task_id ( id_task ) ;
server_task_result result = ctx_server . queue_results . recv ( id_task ) ;
ctx_server . queue_results . remove_waiting_task_id ( id_task ) ;
if ( result . error ) {
res_error ( res , result . data ) ;
} else {
2024-09-02 17:11:51 +02:00
res_ok ( res , result . data ) ;
2024-04-08 14:43:30 +02:00
}
} ;
2024-09-02 17:11:51 +02:00
const auto handle_slots_restore = [ & ctx_server , & res_error , & res_ok , & params ] ( const httplib : : Request & req , httplib : : Response & res , int id_slot ) {
2024-04-08 14:43:30 +02:00
json request_data = json : : parse ( req . body ) ;
2024-05-08 21:53:08 +02:00
std : : string filename = request_data . at ( " filename " ) ;
2024-05-22 19:04:20 +02:00
if ( ! fs_validate_filename ( filename ) ) {
2024-04-08 14:43:30 +02:00
res_error ( res , format_error_response ( " Invalid filename " , ERROR_TYPE_INVALID_REQUEST ) ) ;
return ;
}
2024-06-04 20:23:39 +02:00
std : : string filepath = params . slot_save_path + filename ;
2024-04-08 14:43:30 +02:00
server_task task ;
task . type = SERVER_TASK_TYPE_SLOT_RESTORE ;
task . data = {
{ " id_slot " , id_slot } ,
{ " filename " , filename } ,
2024-09-06 23:21:29 +02:00
{ " filepath " , filepath } ,
2024-04-08 14:43:30 +02:00
} ;
const int id_task = ctx_server . queue_tasks . post ( task ) ;
ctx_server . queue_results . add_waiting_task_id ( id_task ) ;
server_task_result result = ctx_server . queue_results . recv ( id_task ) ;
ctx_server . queue_results . remove_waiting_task_id ( id_task ) ;
if ( result . error ) {
res_error ( res , result . data ) ;
} else {
2024-09-02 17:11:51 +02:00
res_ok ( res , result . data ) ;
2024-04-08 14:43:30 +02:00
}
} ;
2024-09-02 17:11:51 +02:00
const auto handle_slots_erase = [ & ctx_server , & res_error , & res_ok ] ( const httplib : : Request & /* req */ , httplib : : Response & res , int id_slot ) {
2024-04-08 14:43:30 +02:00
server_task task ;
task . type = SERVER_TASK_TYPE_SLOT_ERASE ;
task . data = {
{ " id_slot " , id_slot } ,
} ;
const int id_task = ctx_server . queue_tasks . post ( task ) ;
ctx_server . queue_results . add_waiting_task_id ( id_task ) ;
server_task_result result = ctx_server . queue_results . recv ( id_task ) ;
ctx_server . queue_results . remove_waiting_task_id ( id_task ) ;
if ( result . error ) {
res_error ( res , result . data ) ;
} else {
2024-09-02 17:11:51 +02:00
res_ok ( res , result . data ) ;
2024-04-08 14:43:30 +02:00
}
} ;
2024-09-02 17:11:51 +02:00
const auto handle_slots_action = [ & params , & res_error , & handle_slots_save , & handle_slots_restore , & handle_slots_erase ] ( const httplib : : Request & req , httplib : : Response & res ) {
if ( params . slot_save_path . empty ( ) ) {
res_error ( res , format_error_response ( " This server does not support slots action. Start it with `--slot-save-path` " , ERROR_TYPE_NOT_SUPPORTED ) ) ;
return ;
}
2024-04-08 14:43:30 +02:00
std : : string id_slot_str = req . path_params . at ( " id_slot " ) ;
int id_slot ;
try {
id_slot = std : : stoi ( id_slot_str ) ;
} catch ( const std : : exception & ) {
res_error ( res , format_error_response ( " Invalid slot ID " , ERROR_TYPE_INVALID_REQUEST ) ) ;
return ;
}
std : : string action = req . get_param_value ( " action " ) ;
if ( action = = " save " ) {
handle_slots_save ( req , res , id_slot ) ;
} else if ( action = = " restore " ) {
handle_slots_restore ( req , res , id_slot ) ;
} else if ( action = = " erase " ) {
handle_slots_erase ( req , res , id_slot ) ;
} else {
res_error ( res , format_error_response ( " Invalid action " , ERROR_TYPE_INVALID_REQUEST ) ) ;
}
} ;
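The slot ID parsing in the handler above relies on `std::stoi` throwing on bad input. A minimal sketch of that step as a standalone function, assuming a hypothetical `parse_slot_id` helper; unlike the handler, this sketch also rejects trailing characters and negative values, which is a deliberate tightening and not the original behavior.

```cpp
#include <cstddef>
#include <exception>
#include <string>

// Hypothetical helper (not part of the server): parse a slot ID from a
// path parameter. Returns false when the text is not a non-negative integer.
bool parse_slot_id(const std::string & text, int & id_slot) {
    try {
        std::size_t pos = 0;
        const int value = std::stoi(text, &pos);
        // std::stoi stops at the first non-digit; require full consumption.
        if (pos != text.size() || value < 0) {
            return false;
        }
        id_slot = value;
        return true;
    } catch (const std::exception &) {
        // std::invalid_argument or std::out_of_range
        return false;
    }
}
```

On failure the output parameter is left untouched, so the caller can respond with an invalid-request error without inspecting it.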
2024-09-02 17:11:51 +02:00
const auto handle_props = [ & ctx_server , & res_ok ] ( const httplib : : Request & , httplib : : Response & res ) {
2024-07-07 11:10:38 +02:00
std : : string template_key = " tokenizer.chat_template " , curr_tmpl ;
int32_t tlen = llama_model_meta_val_str ( ctx_server . model , template_key . c_str ( ) , nullptr , 0 ) ;
if ( tlen > 0 ) {
std : : vector < char > curr_tmpl_buf ( tlen + 1 , 0 ) ;
if ( llama_model_meta_val_str ( ctx_server . model , template_key . c_str ( ) , curr_tmpl_buf . data ( ) , curr_tmpl_buf . size ( ) ) = = tlen ) {
curr_tmpl = std : : string ( curr_tmpl_buf . data ( ) , tlen ) ;
}
}
2024-03-07 10:41:53 +01:00
json data = {
2024-05-11 17:28:10 +02:00
{ " system_prompt " , ctx_server . system_prompt . c_str ( ) } ,
2024-03-07 10:41:53 +01:00
{ " default_generation_settings " , ctx_server . default_generation_settings_for_props } ,
2024-07-07 11:10:38 +02:00
{ " total_slots " , ctx_server . params . n_parallel } ,
2024-09-06 23:21:29 +02:00
{ " chat_template " , curr_tmpl . c_str ( ) } ,
2024-03-07 10:41:53 +01:00
} ;
2024-09-02 17:11:51 +02:00
res_ok ( res , data ) ;
2024-03-09 11:27:53 +01:00
} ;
2024-03-07 10:41:53 +01:00
2024-09-02 17:11:51 +02:00
const auto handle_completions_generic = [ & ctx_server , & res_error , & res_ok ] ( server_task_cmpl_type cmpl_type , json & data , httplib : : Response & res ) {
2024-07-12 10:14:12 +02:00
if ( ctx_server . params . embedding ) {
res_error ( res , format_error_response ( " This server does not support completions. Start it without `--embeddings` " , ERROR_TYPE_NOT_SUPPORTED ) ) ;
return ;
}
2024-09-02 17:11:51 +02:00
std : : vector < server_task > tasks = ctx_server . create_tasks_cmpl ( data , cmpl_type ) ;
ctx_server . queue_results . add_waiting_tasks ( tasks ) ;
ctx_server . queue_tasks . post ( tasks ) ;
2024-09-02 17:11:51 +02:00
bool stream = json_value ( data , " stream " , false ) ;
const auto task_ids = server_task : : get_list_id ( tasks ) ;
2024-03-07 10:41:53 +01:00
2024-09-02 17:11:51 +02:00
if ( ! stream ) {
ctx_server . receive_cmpl_results ( task_ids , [ & ] ( std : : vector < server_task_result > & results ) {
if ( results . size ( ) = = 1 ) {
// single result
res_ok ( res , results [ 0 ] . data ) ;
} else {
// multiple results (multitask)
json arr = json : : array ( ) ;
for ( const auto & res : results ) {
arr . push_back ( res . data ) ;
2024-03-07 10:41:53 +01:00
}
2024-09-02 17:11:51 +02:00
res_ok ( res , arr ) ;
2023-10-22 21:53:08 +02:00
}
2024-09-02 17:11:51 +02:00
} , [ & ] ( json error_data ) {
res_error ( res , error_data ) ;
} ) ;
} else {
const auto chunked_content_provider = [ task_ids , & ctx_server ] ( size_t , httplib : : DataSink & sink ) {
ctx_server . receive_cmpl_results_stream ( task_ids , [ & ] ( server_task_result result ) - > bool {
return server_sent_event ( sink , " data " , result . data ) ;
} , [ & ] ( json error_data ) {
server_sent_event ( sink , " error " , error_data ) ;
} ) ;
2024-03-07 10:41:53 +01:00
sink . done ( ) ;
2024-09-02 17:11:51 +02:00
return false ;
2024-03-07 10:41:53 +01:00
} ;
2024-09-02 17:11:51 +02:00
res . set_chunked_content_provider ( " text/event-stream " , chunked_content_provider ) ;
2024-03-07 10:41:53 +01:00
}
2024-03-07 11:42:39 +01:00
} ;
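The streaming branch above passes each result to `server_sent_event` with a field name (`data` or `error`) and a JSON payload; the body of that function is not shown in this section. A minimal sketch of the framing, assuming the standard server-sent events wire format (one `field: value` line followed by a blank line) and a hypothetical `format_sse` helper that takes an already serialized payload:

```cpp
#include <string>

// Hypothetical helper: frame one event as "<field>: <payload>\n\n".
// The payload is assumed to be single-line serialized JSON.
std::string format_sse(const std::string & field, const std::string & payload) {
    return field + ": " + payload + "\n\n";
}
```

The blank line terminates the event, so a client can dispatch each frame as soon as it arrives.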
2024-09-02 17:11:51 +02:00
const auto handle_completions = [ & handle_completions_generic ] ( const httplib : : Request & req , httplib : : Response & res ) {
json data = json : : parse ( req . body ) ;
return handle_completions_generic ( SERVER_TASK_CMPL_TYPE_NORMAL , data , res ) ;
} ;
2024-03-07 10:41:53 +01:00
2024-09-02 17:11:51 +02:00
const auto handle_infill = [ & handle_completions_generic ] ( const httplib : : Request & req , httplib : : Response & res ) {
json data = json : : parse ( req . body ) ;
return handle_completions_generic ( SERVER_TASK_CMPL_TYPE_INFILL , data , res ) ;
2024-03-09 11:27:53 +01:00
} ;
2024-03-07 10:41:53 +01:00
2024-09-02 17:11:51 +02:00
// TODO: maybe merge this function with "handle_completions_generic"
const auto handle_chat_completions = [ & ctx_server , & params , & res_error , & res_ok ] ( const httplib : : Request & req , httplib : : Response & res ) {
2024-07-12 10:14:12 +02:00
if ( ctx_server . params . embedding ) {
2024-09-02 17:11:51 +02:00
res_error ( res , format_error_response ( " This server does not support completions. Start it without `--embeddings` " , ERROR_TYPE_NOT_SUPPORTED ) ) ;
2024-07-12 10:14:12 +02:00
return ;
}
2024-03-07 10:41:53 +01:00
2024-09-02 17:11:51 +02:00
json data = oaicompat_completion_params_parse ( ctx_server . model , json : : parse ( req . body ) , params . chat_template ) ;
2024-03-07 10:41:53 +01:00
2024-09-02 17:11:51 +02:00
std : : vector < server_task > tasks = ctx_server . create_tasks_cmpl ( data , SERVER_TASK_CMPL_TYPE_NORMAL ) ;
ctx_server . queue_results . add_waiting_tasks ( tasks ) ;
ctx_server . queue_tasks . post ( tasks ) ;
2023-11-25 10:29:06 +01:00
2024-09-02 17:11:51 +02:00
bool stream = json_value ( data , " stream " , false ) ;
const auto task_ids = server_task : : get_list_id ( tasks ) ;
2024-03-11 09:09:32 +01:00
const auto completion_id = gen_chatcmplid ( ) ;
2023-11-25 10:29:06 +01:00
2024-09-02 17:11:51 +02:00
if ( ! stream ) {
ctx_server . receive_cmpl_results ( task_ids , [ & ] ( std : : vector < server_task_result > & results ) {
// multitask is never supported in chat completion, so there is only one result
json result_oai = format_final_response_oaicompat ( data , results [ 0 ] . data , completion_id ) ;
res_ok ( res , result_oai ) ;
} , [ & ] ( json error_data ) {
res_error ( res , error_data ) ;
} ) ;
2024-02-28 09:39:15 +01:00
} else {
2024-09-02 17:11:51 +02:00
const auto chunked_content_provider = [ task_ids , & ctx_server , completion_id ] ( size_t , httplib : : DataSink & sink ) {
ctx_server . receive_cmpl_results_stream ( task_ids , [ & ] ( server_task_result result ) - > bool {
std : : vector < json > result_array = format_partial_response_oaicompat ( result . data , completion_id ) ;
for ( auto & event_data : result_array ) {
if ( event_data . empty ( ) ) {
continue ; // skip the stop token
2023-11-25 10:29:06 +01:00
}
2024-09-02 17:11:51 +02:00
if ( ! server_sent_event ( sink , " data " , event_data ) ) {
return false ; // connection is closed
2024-02-28 09:39:15 +01:00
}
}
2024-09-02 17:11:51 +02:00
return true ; // ok
} , [ & ] ( json error_data ) {
server_sent_event ( sink , " error " , error_data ) ;
} ) ;
2024-02-28 09:39:15 +01:00
sink . done ( ) ;
return true ;
} ;
2024-09-02 17:11:51 +02:00
res . set_chunked_content_provider ( " text/event-stream " , chunked_content_provider ) ;
2024-02-28 09:39:15 +01:00
}
} ;
2024-09-02 17:11:51 +02:00
const auto handle_models = [ & params , & ctx_server ] ( const httplib : : Request & , httplib : : Response & res ) {
json models = {
{ " object " , " list " } ,
{ " data " , {
2024-09-06 23:21:29 +02:00
{
{ " id " , params . model_alias } ,
{ " object " , " model " } ,
{ " created " , std : : time ( 0 ) } ,
{ " owned_by " , " llamacpp " } ,
{ " meta " , ctx_server . model_meta ( ) }
} ,
2024-09-02 17:11:51 +02:00
} }
} ;
2024-01-26 13:42:20 +01:00
2024-09-02 17:11:51 +02:00
res . set_content ( models . dump ( ) , MIMETYPE_JSON ) ;
2024-03-09 11:27:53 +01:00
} ;
2024-01-29 14:48:10 +01:00
2024-09-02 17:11:51 +02:00
const auto handle_tokenize = [ & ctx_server , & res_ok ] ( const httplib : : Request & req , httplib : : Response & res ) {
2024-03-07 10:41:53 +01:00
const json body = json : : parse ( req . body ) ;
2024-01-29 14:48:10 +01:00
2024-03-07 10:41:53 +01:00
std : : vector < llama_token > tokens ;
if ( body . count ( " content " ) ! = 0 ) {
2024-05-08 14:27:58 +02:00
const bool add_special = json_value ( body , " add_special " , false ) ;
2024-05-08 21:53:08 +02:00
tokens = ctx_server . tokenize ( body . at ( " content " ) , add_special ) ;
2024-03-07 10:41:53 +01:00
}
const json data = format_tokenizer_response ( tokens ) ;
2024-09-02 17:11:51 +02:00
res_ok ( res , data ) ;
2024-03-09 11:27:53 +01:00
} ;
2024-01-29 14:48:10 +01:00
2024-09-02 17:11:51 +02:00
const auto handle_detokenize = [ & ctx_server , & res_ok ] ( const httplib : : Request & req , httplib : : Response & res ) {
2024-03-07 10:41:53 +01:00
const json body = json : : parse ( req . body ) ;
2024-01-29 14:48:10 +01:00
2024-03-07 10:41:53 +01:00
std : : string content ;
if ( body . count ( " tokens " ) ! = 0 ) {
2024-05-08 21:53:08 +02:00
const std : : vector < llama_token > tokens = body . at ( " tokens " ) ;
2024-03-07 10:41:53 +01:00
content = tokens_to_str ( ctx_server . ctx , tokens . cbegin ( ) , tokens . cend ( ) ) ;
}
2024-01-29 14:48:10 +01:00
2024-03-07 10:41:53 +01:00
const json data = format_detokenized_response ( content ) ;
2024-09-02 17:11:51 +02:00
res_ok ( res , data ) ;
2024-03-09 11:27:53 +01:00
} ;
2024-01-29 14:48:10 +01:00
2024-09-02 17:11:51 +02:00
const auto handle_embeddings = [ & ctx_server , & res_error , & res_ok ] ( const httplib : : Request & req , httplib : : Response & res ) {
2024-03-07 10:41:53 +01:00
const json body = json : : parse ( req . body ) ;
2024-03-09 11:27:53 +01:00
bool is_openai = false ;
2024-03-07 10:41:53 +01:00
2024-03-13 11:39:11 +01:00
// an input prompt can be a string or a list of tokens (integer)
json prompt ;
2024-03-09 11:27:53 +01:00
if ( body . count ( " input " ) ! = 0 ) {
is_openai = true ;
2024-05-08 21:53:08 +02:00
prompt = body . at ( " input " ) ;
2024-03-09 11:27:53 +01:00
} else if ( body . count ( " content " ) ! = 0 ) {
2024-03-13 11:39:11 +01:00
// with "content", we only support a single prompt
2024-05-08 21:53:08 +02:00
prompt = std : : vector < std : : string > { body . at ( " content " ) } ;
2024-03-07 10:41:53 +01:00
} else {
2024-03-11 10:56:41 +01:00
res_error ( res , format_error_response ( " \" input \" or \" content \" must be provided " , ERROR_TYPE_INVALID_REQUEST ) ) ;
return ;
2024-03-07 10:41:53 +01:00
}
2024-03-13 11:39:11 +01:00
// create and queue the task
2024-09-02 17:11:51 +02:00
json responses = json : : array ( ) ;
bool error = false ;
2024-03-13 11:39:11 +01:00
{
2024-09-02 17:11:51 +02:00
std : : vector < server_task > tasks = ctx_server . create_tasks_cmpl ( { { " prompt " , prompt } } , SERVER_TASK_CMPL_TYPE_EMBEDDING ) ;
ctx_server . queue_results . add_waiting_tasks ( tasks ) ;
ctx_server . queue_tasks . post ( tasks ) ;
2024-03-07 10:41:53 +01:00
2024-03-09 11:27:53 +01:00
// get the result
2024-09-02 17:11:51 +02:00
std : : unordered_set < int > task_ids = server_task : : get_list_id ( tasks ) ;
ctx_server . receive_cmpl_results ( task_ids , [ & ] ( std : : vector < server_task_result > & results ) {
for ( const auto & result : results ) { // named "result" to avoid shadowing the httplib response "res"
responses . push_back ( result . data ) ;
2024-03-13 11:39:11 +01:00
}
2024-09-02 17:11:51 +02:00
} , [ & ] ( json error_data ) {
res_error ( res , error_data ) ;
error = true ;
} ) ;
}
if ( error ) {
return ;
2024-03-09 11:27:53 +01:00
}
// write JSON response
2024-03-13 11:39:11 +01:00
json root = is_openai
? format_embeddings_response_oaicompat ( body , responses )
: responses [ 0 ] ;
2024-09-02 17:11:51 +02:00
res_ok ( res , root ) ;
2024-03-09 11:27:53 +01:00
} ;
2024-03-07 10:41:53 +01:00
2024-08-16 17:19:05 +02:00
const auto handle_lora_adapters_list = [ & ] ( const httplib : : Request & , httplib : : Response & res ) {
2024-08-06 17:33:39 +02:00
json result = json : : array ( ) ;
for ( size_t i = 0 ; i < ctx_server . lora_adapters . size ( ) ; + + i ) {
auto & la = ctx_server . lora_adapters [ i ] ;
result . push_back ( {
{ " id " , i } ,
{ " path " , la . path } ,
{ " scale " , la . scale } ,
} ) ;
}
2024-09-02 17:11:51 +02:00
res_ok ( res , result ) ;
2024-08-06 17:33:39 +02:00
res . status = 200 ; // HTTP OK
} ;
const auto handle_lora_adapters_apply = [ & ] ( const httplib : : Request & req , httplib : : Response & res ) {
const std : : vector < json > body = json : : parse ( req . body ) ;
int max_idx = ctx_server . lora_adapters . size ( ) ;
// clear existing value
for ( auto & la : ctx_server . lora_adapters ) {
la . scale = 0.0f ;
}
// set value
for ( const auto & entry : body ) {
int id = entry . at ( " id " ) ;
float scale = entry . at ( " scale " ) ;
if ( 0 < = id & & id < max_idx ) {
ctx_server . lora_adapters [ id ] . scale = scale ;
} else {
throw std : : runtime_error ( " invalid adapter id " ) ;
}
}
server_task task ;
task . type = SERVER_TASK_TYPE_SET_LORA ;
const int id_task = ctx_server . queue_tasks . post ( task ) ;
ctx_server . queue_results . add_waiting_task_id ( id_task ) ;
server_task_result result = ctx_server . queue_results . recv ( id_task ) ;
ctx_server . queue_results . remove_waiting_task_id ( id_task ) ;
2024-09-02 17:11:51 +02:00
res_ok ( res , result . data ) ;
2024-08-06 17:33:39 +02:00
res . status = 200 ; // HTTP OK
} ;
2024-03-13 11:39:11 +01:00
auto handle_static_file = [ ] ( unsigned char * content , size_t len , const char * mime_type ) {
return [ content , len , mime_type ] ( const httplib : : Request & , httplib : : Response & res ) {
res . set_content ( reinterpret_cast < const char * > ( content ) , len , mime_type ) ;
return false ;
} ;
} ;
2024-03-09 11:27:53 +01:00
//
// Router
//
2024-03-07 10:41:53 +01:00
2024-03-09 11:27:53 +01:00
// register static assets routes
2024-06-04 20:23:39 +02:00
if ( ! params . public_path . empty ( ) ) {
2024-03-09 11:27:53 +01:00
// Set the base directory for serving static files
2024-06-04 20:23:39 +02:00
svr - > set_base_dir ( params . public_path ) ;
2024-03-09 11:27:53 +01:00
}
2024-06-04 20:23:39 +02:00
2024-03-09 11:27:53 +01:00
// using embedded static files
2024-06-04 20:23:39 +02:00
svr - > Get ( " / " , handle_static_file ( index_html , index_html_len , " text/html; charset=utf-8 " ) ) ;
svr - > Get ( " /index.js " , handle_static_file ( index_js , index_js_len , " text/javascript; charset=utf-8 " ) ) ;
svr - > Get ( " /completion.js " , handle_static_file ( completion_js , completion_js_len , " text/javascript; charset=utf-8 " ) ) ;
svr - > Get ( " /json-schema-to-grammar.mjs " , handle_static_file ( json_schema_to_grammar_mjs , json_schema_to_grammar_mjs_len , " text/javascript; charset=utf-8 " ) ) ;
2024-06-01 21:31:48 +02:00
// add new-ui files
2024-06-04 20:23:39 +02:00
svr - > Get ( " /colorthemes.css " , handle_static_file ( colorthemes_css , colorthemes_css_len , " text/css; charset=utf-8 " ) ) ;
svr - > Get ( " /style.css " , handle_static_file ( style_css , style_css_len , " text/css; charset=utf-8 " ) ) ;
2024-06-01 21:31:48 +02:00
svr - > Get ( " /theme-beeninorder.css " , handle_static_file ( theme_beeninorder_css , theme_beeninorder_css_len , " text/css; charset=utf-8 " ) ) ;
2024-06-04 20:23:39 +02:00
svr - > Get ( " /theme-ketivah.css " , handle_static_file ( theme_ketivah_css , theme_ketivah_css_len , " text/css; charset=utf-8 " ) ) ;
svr - > Get ( " /theme-mangotango.css " , handle_static_file ( theme_mangotango_css , theme_mangotango_css_len , " text/css; charset=utf-8 " ) ) ;
svr - > Get ( " /theme-playground.css " , handle_static_file ( theme_playground_css , theme_playground_css_len , " text/css; charset=utf-8 " ) ) ;
svr - > Get ( " /theme-polarnight.css " , handle_static_file ( theme_polarnight_css , theme_polarnight_css_len , " text/css; charset=utf-8 " ) ) ;
svr - > Get ( " /theme-snowstorm.css " , handle_static_file ( theme_snowstorm_css , theme_snowstorm_css_len , " text/css; charset=utf-8 " ) ) ;
svr - > Get ( " /index-new.html " , handle_static_file ( index_new_html , index_new_html_len , " text/html; charset=utf-8 " ) ) ;
svr - > Get ( " /system-prompts.js " , handle_static_file ( system_prompts_js , system_prompts_js_len , " text/javascript; charset=utf-8 " ) ) ;
svr - > Get ( " /prompt-formats.js " , handle_static_file ( prompt_formats_js , prompt_formats_js_len , " text/javascript; charset=utf-8 " ) ) ;
2024-03-09 11:27:53 +01:00
// register API routes
svr - > Get ( " /health " , handle_health ) ;
svr - > Get ( " /metrics " , handle_metrics ) ;
svr - > Get ( " /props " , handle_props ) ;
svr - > Get ( " /v1/models " , handle_models ) ;
svr - > Post ( " /completion " , handle_completions ) ; // legacy
svr - > Post ( " /completions " , handle_completions ) ;
svr - > Post ( " /v1/completions " , handle_completions ) ;
svr - > Post ( " /chat/completions " , handle_chat_completions ) ;
svr - > Post ( " /v1/chat/completions " , handle_chat_completions ) ;
svr - > Post ( " /infill " , handle_infill ) ;
svr - > Post ( " /embedding " , handle_embeddings ) ; // legacy
svr - > Post ( " /embeddings " , handle_embeddings ) ;
svr - > Post ( " /v1/embeddings " , handle_embeddings ) ;
svr - > Post ( " /tokenize " , handle_tokenize ) ;
svr - > Post ( " /detokenize " , handle_detokenize ) ;
2024-08-06 17:33:39 +02:00
// LoRA adapters hotswap
svr - > Get ( " /lora-adapters " , handle_lora_adapters_list ) ;
svr - > Post ( " /lora-adapters " , handle_lora_adapters_apply ) ;
// Save & load slots
svr - > Get ( " /slots " , handle_slots ) ;
2024-09-02 17:11:51 +02:00
svr - > Post ( " /slots/:id_slot " , handle_slots_action ) ;
2023-05-21 19:51:18 +02:00
2024-03-09 11:27:53 +01:00
//
// Start the server
//
2024-06-04 20:23:39 +02:00
if ( params . n_threads_http < 1 ) {
2024-03-03 08:48:36 +01:00
// +2 threads for monitoring endpoints
2024-06-04 20:23:39 +02:00
params . n_threads_http = std : : max ( params . n_parallel + 2 , ( int32_t ) std : : thread : : hardware_concurrency ( ) - 1 ) ;
2024-03-01 10:08:08 +01:00
}
2024-06-04 20:23:39 +02:00
log_data [ " n_threads_http " ] = std : : to_string ( params . n_threads_http ) ;
svr - > new_task_queue = [ & params ] { return new httplib : : ThreadPool ( params . n_threads_http ) ; } ;
2024-03-01 10:08:08 +01:00
2024-08-16 17:19:05 +02:00
// clean up function, to be called before exit
auto clean_up = [ & svr ] ( ) {
svr - > stop ( ) ;
llama_backend_free ( ) ;
} ;
2024-03-07 10:41:53 +01:00
2024-08-16 17:19:05 +02:00
// bind HTTP listen port, run the HTTP server in a thread
if ( ! svr - > bind_to_port ( params . hostname , params . port ) ) {
LOG_ERROR ( " couldn't bind HTTP server socket " , {
{ " hostname " , params . hostname } ,
{ " port " , params . port } ,
} ) ;
clean_up ( ) ;
LOG_ERROR ( " exiting due to HTTP server error " , { } ) ;
return 1 ;
}
std : : thread t ( [ & ] ( ) { svr - > listen_after_bind ( ) ; } ) ;
svr - > wait_until_ready ( ) ;
LOG_INFO ( " HTTP server is listening " , log_data ) ;
// load the model
LOG_INFO ( " loading model " , log_data ) ;
if ( ! ctx_server . load_model ( params ) ) {
clean_up ( ) ;
t . join ( ) ;
LOG_ERROR ( " exiting due to model loading error " , { } ) ;
return 1 ;
} else {
ctx_server . init ( ) ;
state . store ( SERVER_STATE_READY ) ;
LOG_INFO ( " model loaded " , { } ) ;
// if a custom chat template is not supplied, we will use the one that comes with the model (if any)
if ( params . chat_template . empty ( ) ) {
if ( ! ctx_server . validate_model_chat_template ( ) ) {
LOG_WARNING ( " The chat template that comes with this model is not yet supported, falling back to chatml. This may cause the model to output suboptimal responses " , { } ) ;
params . chat_template = " chatml " ;
}
2024-03-07 10:41:53 +01:00
}
2024-02-24 12:28:55 +01:00
2024-08-16 17:19:05 +02:00
// print sample chat example to make it clear which template is used
{
LOG_INFO ( " chat template " , {
{ " chat_example " , llama_chat_format_example ( ctx_server . model , params . chat_template ) } ,
{ " built_in " , params . chat_template . empty ( ) } ,
} ) ;
}
2024-02-24 12:28:55 +01:00
2024-08-16 17:19:05 +02:00
ctx_server . queue_tasks . on_new_task ( std : : bind (
& server_context : : process_single_task , & ctx_server , std : : placeholders : : _1 ) ) ;
ctx_server . queue_tasks . on_update_slots ( std : : bind (
& server_context : : update_slots , & ctx_server ) ) ;
shutdown_handler = [ & ] ( int ) {
ctx_server . queue_tasks . terminate ( ) ;
} ;
ctx_server . queue_tasks . start_loop ( ) ;
}
2024-02-18 17:23:16 +01:00
# if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__))
struct sigaction sigint_action ;
sigint_action . sa_handler = signal_handler ;
sigemptyset ( & sigint_action . sa_mask ) ;
sigint_action . sa_flags = 0 ;
sigaction ( SIGINT , & sigint_action , NULL ) ;
2024-03-28 09:50:48 +01:00
sigaction ( SIGTERM , & sigint_action , NULL ) ;
2024-02-18 17:23:16 +01:00
# elif defined (_WIN32)
auto console_ctrl_handler = + [ ] ( DWORD ctrl_type ) - > BOOL {
return ( ctrl_type = = CTRL_C_EVENT ) ? ( signal_handler ( SIGINT ) , true ) : false ;
} ;
SetConsoleCtrlHandler ( reinterpret_cast < PHANDLER_ROUTINE > ( console_ctrl_handler ) , true ) ;
# endif
2024-03-07 10:41:53 +01:00
2024-08-16 17:19:05 +02:00
clean_up ( ) ;
2023-10-22 21:53:08 +02:00
t . join ( ) ;
2023-07-10 17:49:56 +02:00
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort: there are 24 PRs closed in the submitter's repo alone, over 160 commits, and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on the EOS token and after the number of tokens specified by n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplifies the logic and removes a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
return 0 ;
2023-05-21 19:51:18 +02:00
}