2024-03-07 10:41:53 +01:00
#include "utils.hpp"
2023-05-21 19:51:18 +02:00
#include "common.h"
json-schema-to-grammar improvements (+ added to server) (#5978)
* json: fix arrays (disallow `[,1]`)
* json: support tuple types (`[number, string]`)
* json: support additionalProperties (`{[k: string]: [string,number][]}`)
* json: support required / optional properties
* json: add support for pattern
* json: resolve $ref (and support https schema urls)
* json: fix $ref resolution
* join: support union types (mostly for nullable types I think)
* json: support allOf + nested anyOf
* json: support any (`{}` or `{type: object}`)
* json: fix merge
* json: temp fix for escapes
* json: spaces in output and unrestricted output spaces
* json: add typings
* json:fix typo
* Create ts-type-to-grammar.sh
* json: fix _format_literal (json.dumps already escapes quotes)
* json: merge lit sequences and handle negatives
{"type": "string", "pattern": "^({\"question\": \"[^\"]+\", \"response\": \"[^\"]+\"}\\n)+$"}
* json: handle pattern repetitions
* Update json-schema-to-grammar.mjs
* Create regex-to-grammar.py
* json: extract repeated regexp patterns to subrule
* Update json-schema-to-grammar.py
* Update json-schema-to-grammar.py
* Update json-schema-to-grammar.py
* json: handle schema from pydantic Optional fields
* Update json-schema-to-grammar.py
* Update json-schema-to-grammar.py
* Update ts-type-to-grammar.sh
* Update ts-type-to-grammar.sh
* json: simplify nullable fields handling
* json: accept duplicate identical rules
* json: revert space to 1 at most
* json: reuse regexp pattern subrules
* json: handle uuid string format
* json: fix literal escapes
* json: add --allow-fetch
* json: simplify range escapes
* json: support negative ranges in patterns
* Delete commit.txt
* json: custom regex parser, adds dot support & JS-portable
* json: rm trailing spaces
* Update json-schema-to-grammar.mjs
* json: updated server & chat `( cd examples/server && ./deps.sh )`
* json: port fixes from mjs to python
* Update ts-type-to-grammar.sh
* json: support prefixItems alongside array items
* json: add date format + fix uuid
* json: add date, time, date-time formats
* json: preserve order of props from TS defs
* json: port schema converter to C++, wire in ./server
* json: nits
* Update json-schema-to-grammar.cpp
* Update json-schema-to-grammar.cpp
* Update json-schema-to-grammar.cpp
* json: fix mjs implementation + align outputs
* Update json-schema-to-grammar.mjs.hpp
* json: test C++, JS & Python versions
* json: nits + regen deps
* json: cleanup test
* json: revert from c++17 to 11
* json: nit fixes
* json: dirty include for test
* json: fix zig build
* json: pass static command to std::system in tests (fixed temp files)
* json: fix top-level $refs
* json: don't use c++20 designated initializers
* nit
* json: basic support for reserved names `{number:{number:{root:number}}}`
* Revamp test cmake to allow args (WORKING_DIRECTORY needed for JSON test)
* json: re-ran server deps.sh
* json: simplify test
* json: support mix of additional props & required/optional
* json: add tests for some expected failures
* json: fix type=const in c++, add failure expectations for non-str const&enum
* json: test (& simplify output of) empty schema
* json: check parsing in test + fix value & string refs
* json: add server tests for OAI JSON response_format
* json: test/fix top-level anyOf
* json: improve grammar parsing failures
* json: test/fix additional props corner cases
* json: fix string patterns (was missing quotes)
* json: ws nit
* json: fix json handling in server when there's no response_format
* json: catch schema conversion errors in server
* json: don't complain about unknown format type in server if unset
* json: cleaner build of test
* json: create examples/json-schema-pydantic-example.py
* json: fix date pattern
* json: move json.hpp & json-schema-to-grammar.{cpp,h} to common
* json: indent 4 spaces
* json: fix naming of top-level c++ function (+ drop unused one)
* json: avoid using namespace std
* json: fix zig build
* Update server.feature
* json: iostream -> fprintf
* json: space before & refs for consistency
* json: nits
2024-03-21 12:50:43 +01:00
#include "json-schema-to-grammar.h"
2023-05-21 19:51:18 +02:00
#include "llama.h"
2023-10-22 21:53:08 +02:00
#include "grammar-parser.h"
2023-05-21 19:51:18 +02:00
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
#ifndef NDEBUG
// crash the server in debug mode, otherwise send an http 500 error
#define CPPHTTPLIB_NO_EXCEPTIONS 1
#endif
2023-12-17 15:54:37 +01:00
// increase max payload length to allow use of larger context size
#define CPPHTTPLIB_FORM_URL_ENCODED_PAYLOAD_MAX_LENGTH 1048576
2023-06-17 13:53:04 +02:00
#include "httplib.h"
#include "json.hpp"
2023-07-04 16:05:27 +02:00
// auto generated files (update with ./deps.sh)
#include "index.html.hpp"
#include "index.js.hpp"
#include "completion.js.hpp"
2023-08-15 00:14:14 +02:00
#include "json-schema-to-grammar.mjs.hpp"
2023-07-04 16:05:27 +02:00
2024-03-07 10:41:53 +01:00
#include <atomic>
2023-10-22 21:53:08 +02:00
#include <chrono>
2023-12-29 15:24:12 +01:00
#include <condition_variable>
2024-03-07 10:41:53 +01:00
#include <cstddef>
#include <set>
#include <mutex>
#include <thread>
2024-02-18 17:23:16 +01:00
#include <signal.h>
2024-03-09 10:57:09 +01:00
#include <memory>
2023-09-01 15:34:50 +02:00
2024-03-22 14:07:44 +01:00
using json = nlohmann::ordered_json;
2023-06-17 13:53:04 +02:00
2024-01-26 13:42:20 +01:00
bool server_verbose = false;
2024-02-25 13:50:32 +01:00
bool server_log_json = true;
2023-07-02 23:38:44 +02:00
2024-02-29 21:42:11 +01:00
enum stop_type {
2024-03-07 10:41:53 +01:00
    STOP_TYPE_FULL,
    STOP_TYPE_PARTIAL,
2023-06-17 13:53:04 +02:00
} ;
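The two `stop_type` values correspond to the stop-string handling mentioned in the #1570 summary: a full match ends generation, while a partial match at the end of the streamed text means output should be held back until more tokens arrive. A minimal sketch of that distinction follows. The helper name `find_stop` and its signature are hypothetical and are not taken from this file; the enum is redeclared only so the sketch compiles on its own.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Redeclared from the listing above so this sketch is self-contained.
enum stop_type {
    STOP_TYPE_FULL,
    STOP_TYPE_PARTIAL,
};

// Hypothetical helper (an assumption, not the server's own function).
// Returns the offset in `text` where a match of `stop` begins, or
// std::string::npos when there is none.
//  - STOP_TYPE_FULL:    the whole stop string occurs somewhere in `text`.
//  - STOP_TYPE_PARTIAL: `text` ends with a non-empty proper prefix of `stop`.
static size_t find_stop(const std::string & text, const std::string & stop, stop_type type) {
    if (stop.empty()) {
        return std::string::npos;
    }
    if (type == STOP_TYPE_FULL) {
        return text.find(stop);
    }
    // Try the longest candidate prefix first so the earliest offset wins.
    const size_t max_len = std::min(stop.size() - 1, text.size());
    for (size_t len = max_len; len > 0; --len) {
        if (text.compare(text.size() - len, len, stop, 0, len) == 0) {
            return text.size() - len;
        }
    }
    return std::string::npos;
}
```

While streaming, a caller could emit only the text before the returned offset on a partial match and wait for the next token before deciding whether the stop string completes.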
2024-02-29 21:42:11 +01:00
enum slot_state {
2024-03-07 10:41:53 +01:00
    SLOT_STATE_IDLE,
    SLOT_STATE_PROCESSING,
2024-02-29 21:42:11 +01:00
};
2023-06-17 13:53:04 +02:00
2024-02-29 21:42:11 +01:00
enum slot_command {
2024-03-07 10:41:53 +01:00
    SLOT_COMMAND_NONE,
    SLOT_COMMAND_LOAD_PROMPT,
    SLOT_COMMAND_RELEASE,
};
enum server_state {
    SERVER_STATE_LOADING_MODEL, // Server is starting up, model not fully loaded yet
    SERVER_STATE_READY,         // Server is ready and model is loaded
    SERVER_STATE_ERROR          // An error occurred, load_model failed
} ;
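The three `server_state` values lend themselves to a simple health report. The sketch below shows one plausible mapping from state to an HTTP status code and a short status string. The `health_reply` struct, the `health_for` function, and the specific codes and strings are assumptions chosen for illustration, not something this listing confirms.

```cpp
#include <string>

// Redeclared from the listing above so this sketch is self-contained.
enum server_state {
    SERVER_STATE_LOADING_MODEL,
    SERVER_STATE_READY,
    SERVER_STATE_ERROR
};

// Hypothetical reply type for a health check (an assumption).
struct health_reply {
    int         http_status;
    std::string status;
};

// Assumed mapping: ready -> 200, still loading -> 503, load failure -> 500.
static health_reply health_for(server_state state) {
    switch (state) {
        case SERVER_STATE_READY:         return { 200, "ok" };
        case SERVER_STATE_LOADING_MODEL: return { 503, "loading model" };
        case SERVER_STATE_ERROR:         return { 500, "error" };
    }
    return { 500, "error" }; // unreachable for valid enum values
}
```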
enum server_task_type {
    SERVER_TASK_TYPE_COMPLETION,
    SERVER_TASK_TYPE_CANCEL,
    SERVER_TASK_TYPE_NEXT_RESPONSE,
2024-04-08 14:43:30 +02:00
    SERVER_TASK_TYPE_METRICS,
    SERVER_TASK_TYPE_SLOT_SAVE,
    SERVER_TASK_TYPE_SLOT_RESTORE,
    SERVER_TASK_TYPE_SLOT_ERASE,
2024-03-07 10:41:53 +01:00
};
struct server_task {
    int id        = -1; // to be filled by server_queue
    int id_multi  = -1;
    int id_target = -1;
    server_task_type type;
    json data;
    bool infill    = false;
    bool embedding = false;
};
struct server_task_result {
    int id       = -1;
    int id_multi = -1;
    json data;
    bool stop;
    bool error;
};
struct server_task_multi {
    int id = -1;
    std::set<int> subtasks_remaining;
    std::vector<server_task_result> results;
2024-02-29 21:42:11 +01:00
} ;
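`server_task_multi` pairs a set of outstanding subtask ids with a vector of collected results, which suggests a fan-in pattern: each finished subtask is removed from the set, and the multitask is complete once the set is empty. The sketch below illustrates that idea under stated assumptions. The structs are simplified stand-ins (`json data` is replaced by `std::string`), and `record_subtask_result` is a hypothetical helper, not a function from this file.

```cpp
#include <set>
#include <string>
#include <vector>

// Simplified stand-ins for the structs above; `data` is a plain string here.
struct server_task_result {
    int id       = -1;
    int id_multi = -1;
    std::string data;
    bool stop  = false;
    bool error = false;
};

struct server_task_multi {
    int id = -1;
    std::set<int> subtasks_remaining;
    std::vector<server_task_result> results;
};

// Hypothetical helper: stores the result of one subtask and reports whether
// every subtask of the multitask has now finished. Results for unknown or
// already-recorded subtask ids are ignored.
static bool record_subtask_result(server_task_multi & multi, const server_task_result & result) {
    if (multi.subtasks_remaining.erase(result.id) == 0) {
        return false;
    }
    multi.results.push_back(result);
    return multi.subtasks_remaining.empty();
}
```

Using a `std::set` for the remaining ids makes duplicate deliveries harmless, since erasing an id that is no longer present is a no-op.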
2023-06-17 13:53:04 +02:00
2024-02-29 21:42:11 +01:00
struct slot_params {
    bool stream       = true;
    bool cache_prompt = false; // remember the prompt to avoid reprocessing the entire prompt
2023-06-17 13:53:04 +02:00
2024-02-29 21:42:11 +01:00
    uint32_t seed      = -1; // RNG seed
    int32_t  n_keep    = 0;  // number of tokens to keep from initial prompt
2024-03-26 09:47:43 +01:00
    int32_t  n_discard = 0;  // number of tokens after n_keep that may be discarded when shifting context, 0 defaults to half
2024-02-29 21:42:11 +01:00
    int32_t  n_predict = -1; // new tokens to predict
2023-07-02 23:38:44 +02:00
2024-02-29 21:42:11 +01:00
    std::vector<std::string> antiprompt;
2023-07-02 23:38:44 +02:00
2024-02-29 21:42:11 +01:00
    json input_prefix;
    json input_suffix;
} ;
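The comments on `n_keep` and `n_discard` describe a context shift: the first `n_keep` tokens stay, and of the tokens after them either `n_discard` are dropped or, when `n_discard` is 0, half of them. A minimal sketch of that arithmetic follows. The `shift_plan` struct and `plan_context_shift` function are hypothetical names, and clamping `n_discard` to the number of available tokens is an added assumption rather than documented behavior.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: how many tokens to drop and how many remain.
struct shift_plan {
    int32_t n_discard;    // tokens removed right after the kept prefix
    int32_t n_past_after; // tokens left in the context after the shift
};

// Sketch of the rule in the comments above: keep the first n_keep tokens,
// then discard n_discard of the rest, or half of the rest when n_discard is 0.
static shift_plan plan_context_shift(int32_t n_past, int32_t n_keep, int32_t n_discard) {
    assert(n_keep >= 0 && n_past >= n_keep);
    const int32_t n_left = n_past - n_keep;
    int32_t n = n_discard > 0 ? n_discard : n_left / 2;
    if (n > n_left) {
        n = n_left; // assumption: never discard more than is available
    }
    return { n, n_past - n };
}
```

For example, with `n_past = 512`, `n_keep = 64` and `n_discard = 0`, there are 448 tokens after the kept prefix, so 224 are discarded and 288 remain.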
2024-03-07 10:41:53 +01:00
struct server_params {
    int32_t port           = 8080;
    int32_t read_timeout   = 600;
    int32_t write_timeout  = 600;
    int32_t n_threads_http = -1;
2024-02-29 21:42:11 +01:00
2024-03-07 10:41:53 +01:00
    std::string hostname      = "127.0.0.1";
2024-03-09 11:27:53 +01:00
    std::string public_path   = "";
2024-03-07 10:41:53 +01:00
    std::string chat_template = "";
    std::string system_prompt = "";
2024-02-29 21:42:11 +01:00
2024-03-07 10:41:53 +01:00
    std::vector<std::string> api_keys;
2024-02-29 21:42:11 +01:00
2024-03-09 10:57:09 +01:00
#ifdef CPPHTTPLIB_OPENSSL_SUPPORT
    std::string ssl_key_file  = "";
    std::string ssl_cert_file = "";
#endif
2024-03-07 10:41:53 +01:00
    bool slots_endpoint   = true;
    bool metrics_endpoint = false;
2024-04-08 14:43:30 +02:00
    std::string slot_save_path;
2024-02-29 21:42:11 +01:00
};
struct server_slot {
2023-10-22 21:53:08 +02:00
    int id;
2024-03-07 10:41:53 +01:00
    int id_task  = -1;
    int id_multi = -1;
2023-06-17 13:53:04 +02:00
2023-10-22 21:53:08 +02:00
    struct slot_params params;
2023-06-17 13:53:04 +02:00
2024-03-07 10:41:53 +01:00
    slot_state   state   = SLOT_STATE_IDLE;
    slot_command command = SLOT_COMMAND_NONE;
2023-06-17 13:53:04 +02:00
2023-10-22 21:53:08 +02:00
    // used to determine the slot that has been used the longest
    int64_t t_last_used = -1;

    // generation props
    int32_t n_ctx = 0; // context size per slot
    int32_t n_past = 0;
    int32_t n_decoded = 0;
    int32_t n_remaining = -1;
    int32_t i_batch = -1;
    int32_t n_predict = -1; // TODO: disambiguate from params.n_predict

    int32_t n_prompt_tokens = 0;
    int32_t n_prompt_tokens_processed = 0;

    json prompt;

    // when a task is submitted, we first tokenize the prompt and store it here
    std::vector<llama_token> prompt_tokens;

    std::string generated_text;
    std::vector<llama_token> cache_tokens;
    std::vector<completion_token_output> generated_token_probs;
    bool infill = false;
    bool embedding = false;
    bool has_next_token = true;
    bool truncated = false;
    bool stopped_eos = false;
    bool stopped_word = false;
    bool stopped_limit = false;

    bool oaicompat = false;

    std::string oaicompat_model;
    std::string stopping_word;

    // sampling
    llama_token sampled;
    struct llama_sampling_params sparams;
    llama_sampling_context * ctx_sampling = nullptr;
    json json_schema;

    int32_t ga_i = 0; // group-attention state
    int32_t ga_n = 1; // group-attention factor
    int32_t ga_w = 512; // group-attention width

    int32_t n_past_se = 0; // self-extend

    // stats
    size_t n_sent_text = 0; // number of text characters sent so far
    size_t n_sent_token_probs = 0;

    int64_t t_start_process_prompt;
    int64_t t_start_generation;

    double t_prompt_processing; // ms
    double t_token_generation; // ms
    void reset() {
        n_prompt_tokens = 0;
        generated_text = "";
        truncated = false;
        stopped_eos = false;
        stopped_word = false;
        stopped_limit = false;
        stopping_word = "";
        n_past = 0;
        n_sent_text = 0;
        n_sent_token_probs = 0;
        infill = false;
        ga_i = 0;
        n_past_se = 0;

        generated_token_probs.clear();
    }
    bool has_budget(gpt_params & global_params) {
        if (params.n_predict == -1 && global_params.n_predict == -1) {
            return true; // limitless
        }

        n_remaining = -1;

        // a per-request n_predict takes precedence over the global one
        if (params.n_predict != -1) {
            n_remaining = params.n_predict - n_decoded;
        } else if (global_params.n_predict != -1) {
            n_remaining = global_params.n_predict - n_decoded;
        }

        return n_remaining > 0; // false once the budget is used up
    }
    bool available() const {
        return state == SLOT_STATE_IDLE && command == SLOT_COMMAND_NONE;
    }

    bool is_processing() const {
        return (state == SLOT_STATE_IDLE && command == SLOT_COMMAND_LOAD_PROMPT) || state == SLOT_STATE_PROCESSING;
    }

    void add_token_string(const completion_token_output & token) {
        if (command == SLOT_COMMAND_RELEASE) {
            return;
        }
        generated_token_probs.push_back(token);
    }

    void release() {
        if (state == SLOT_STATE_PROCESSING) {
            t_token_generation = (ggml_time_us() - t_start_generation) / 1e3;
            command = SLOT_COMMAND_RELEASE;
        }
    }
    json get_formated_timings() const {
        return json {
            {"prompt_n", n_prompt_tokens_processed},
            {"prompt_ms", t_prompt_processing},
            {"prompt_per_token_ms", t_prompt_processing / n_prompt_tokens_processed},
            {"prompt_per_second", 1e3 / t_prompt_processing * n_prompt_tokens_processed},

            {"predicted_n", n_decoded},
            {"predicted_ms", t_token_generation},
            {"predicted_per_token_ms", t_token_generation / n_decoded},
            {"predicted_per_second", 1e3 / t_token_generation * n_decoded},
        };
    }
    size_t find_stopping_strings(const std::string & text, const size_t last_token_size, const stop_type type) {
        size_t stop_pos = std::string::npos;

        for (const std::string & word : params.antiprompt) {
            size_t pos;

            if (type == STOP_TYPE_FULL) {
                const size_t tmp = word.size() + last_token_size;
                const size_t from_pos = text.size() > tmp ? text.size() - tmp : 0;

                pos = text.find(word, from_pos);
            } else {
                pos = find_partial_stop_string(word, text);
            }

            if (pos != std::string::npos && (stop_pos == std::string::npos || pos < stop_pos)) {
                if (type == STOP_TYPE_FULL) {
                    stopped_word = true;
                    stopping_word = word;
                    has_next_token = false;
                }
                stop_pos = pos;
            }
        }

        return stop_pos;
    }
    void print_timings() const {
        char buffer[512];

        double t_token = t_prompt_processing / n_prompt_tokens_processed;
        double n_tokens_second = 1e3 / t_prompt_processing * n_prompt_tokens_processed;

        snprintf(buffer, 512, "prompt eval time = %10.2f ms / %5d tokens (%8.2f ms per token, %8.2f tokens per second)",
                t_prompt_processing, n_prompt_tokens_processed,
                t_token, n_tokens_second);

        LOG_INFO(buffer, {
            {"id_slot", id},
            {"id_task", id_task},
            {"t_prompt_processing", t_prompt_processing},
            {"n_prompt_tokens_processed", n_prompt_tokens_processed},
            {"t_token", t_token},
            {"n_tokens_second", n_tokens_second},
        });

        t_token = t_token_generation / n_decoded;
        n_tokens_second = 1e3 / t_token_generation * n_decoded;

        snprintf(buffer, 512, "generation eval time = %10.2f ms / %5d runs (%8.2f ms per token, %8.2f tokens per second)",
                t_token_generation, n_decoded,
                t_token, n_tokens_second);

        LOG_INFO(buffer, {
            {"id_slot", id},
            {"id_task", id_task},
            {"t_token_generation", t_token_generation},
            {"n_decoded", n_decoded},
            {"t_token", t_token},
            {"n_tokens_second", n_tokens_second},
        });

        snprintf(buffer, 512, "total time = %10.2f ms", t_prompt_processing + t_token_generation);

        LOG_INFO(buffer, {
            {"id_slot", id},
            {"id_task", id_task},
            {"t_prompt_processing", t_prompt_processing},
            {"t_token_generation", t_token_generation},
            {"t_total", t_prompt_processing + t_token_generation},
        });
    }
};
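The stop-word search in `find_stopping_strings` separates a full match (the stop word appears completely in the tail of the text) from a partial match (the text ends with the beginning of a stop word, so the caller can hold that part back). The following is a minimal sketch of both checks. `find_partial_stop_string` is not defined in this part of the listing, so `partial_stop_pos` is only an assumption about its behaviour; `full_stop_pos`, `partial_stop_pos` and `stop_examples` are hypothetical names used for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Full match: search only the tail of `text` that the newest token could have
// completed (stop word length + size of the last token), like the FULL branch.
size_t full_stop_pos(const std::string & text, const std::string & word, size_t last_token_size) {
    const size_t tmp      = word.size() + last_token_size;
    const size_t from_pos = text.size() > tmp ? text.size() - tmp : 0;
    return text.find(word, from_pos);
}

// Partial match (assumed behaviour): start of the longest prefix of `word`
// that is also a suffix of `text`, or npos if there is none.
size_t partial_stop_pos(const std::string & text, const std::string & word) {
    const size_t max_len = std::min(word.size(), text.size());
    for (size_t len = max_len; len > 0; --len) {
        if (text.compare(text.size() - len, len, word, 0, len) == 0) {
            return text.size() - len;
        }
    }
    return std::string::npos;
}

// Usage: "</s>" is an illustrative stop word, not one taken from the source.
void stop_examples() {
    assert(full_stop_pos("Hello</s>", "</s>", 1) == 5);              // complete stop word
    assert(partial_stop_pos("Hello <", "</s>") == 6);                // text ends with "<"
    assert(partial_stop_pos("Hello", "</s>") == std::string::npos);  // nothing to hold back
}
```

Limiting the full search to the tail keeps the cost per token proportional to the stop word length rather than to the length of the generated text.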
struct server_metrics {
    int64_t t_start = 0;

    uint64_t n_prompt_tokens_processed_total = 0;
    uint64_t t_prompt_processing_total = 0;
    uint64_t n_tokens_predicted_total = 0;
    uint64_t t_tokens_generation_total = 0;

    uint64_t n_prompt_tokens_processed = 0;
    uint64_t t_prompt_processing = 0;

    uint64_t n_tokens_predicted = 0;
    uint64_t t_tokens_generation = 0;

    void init() {
        t_start = ggml_time_us();
    }

    void on_prompt_eval(const server_slot & slot) {
        n_prompt_tokens_processed_total += slot.n_prompt_tokens_processed;
        n_prompt_tokens_processed += slot.n_prompt_tokens_processed;
        t_prompt_processing += slot.t_prompt_processing;
        t_prompt_processing_total += slot.t_prompt_processing;
    }

    void on_prediction(const server_slot & slot) {
        n_tokens_predicted_total += slot.n_decoded;
        n_tokens_predicted += slot.n_decoded;
        t_tokens_generation += slot.t_token_generation;
        t_tokens_generation_total += slot.t_token_generation;
    }

    void reset_bucket() {
        n_prompt_tokens_processed = 0;
        t_prompt_processing = 0;
        n_tokens_predicted = 0;
        t_tokens_generation = 0;
    }
};
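`server_metrics` keeps two sets of counters: running totals that are never cleared, and a "bucket" that `reset_bucket()` zeroes. A reduced sketch of how a consumer might turn the bucket into a throughput figure is given below. The struct `metrics_sketch`, the method `predicted_tokens_per_second` and the zero-guard are assumptions made for illustration; the listing itself does not show where the bucket is read.

```cpp
#include <cassert>
#include <cstdint>

// Reduced model of the prediction counters only (hypothetical type).
struct metrics_sketch {
    uint64_t n_tokens_predicted_total = 0; // never reset
    uint64_t n_tokens_predicted       = 0; // since the last reset_bucket()
    uint64_t t_tokens_generation      = 0; // ms, since the last reset_bucket()

    void on_prediction(uint64_t n_decoded, double t_generation_ms) {
        n_tokens_predicted_total += n_decoded;
        n_tokens_predicted       += n_decoded;
        // fractional milliseconds are dropped, as with the uint64_t fields above
        t_tokens_generation      += static_cast<uint64_t>(t_generation_ms);
    }

    // tokens per second over the current bucket; 0 when nothing was measured
    double predicted_tokens_per_second() const {
        return t_tokens_generation ? 1e3 * n_tokens_predicted / t_tokens_generation : 0.0;
    }

    void reset_bucket() {
        n_tokens_predicted  = 0;
        t_tokens_generation = 0;
    }
};

// Usage: two completions of 50 tokens, one second each.
void metrics_example() {
    metrics_sketch m;
    m.on_prediction(50, 1000.0);
    m.on_prediction(50, 1000.0);
    assert(m.predicted_tokens_per_second() == 50.0);
    m.reset_bucket();
    assert(m.n_tokens_predicted_total == 100); // totals survive the reset
}
```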
struct server_queue {
    int id = 0;
    bool running;

    // queues
    std::vector<server_task> queue_tasks;
    std::vector<server_task> queue_tasks_deferred;

    std::vector<server_task_multi> queue_multitasks;

    std::mutex mutex_tasks;
    std::condition_variable condition_tasks;

    // callback functions
    std::function<void(server_task &)> callback_new_task;
    std::function<void(server_task_multi &)> callback_finish_multitask;
    std::function<void(void)> callback_update_slots;

    // Add a new task to the end of the queue
    int post(server_task task) {
        std::unique_lock<std::mutex> lock(mutex_tasks);
        if (task.id == -1) {
            task.id = id++;
            LOG_VERBOSE("new task id", {{"new_id", task.id}});
        }
        const int id_task = task.id; // read the id before the task is moved from
        queue_tasks.push_back(std::move(task));
        condition_tasks.notify_one();
        return id_task;
    }

    // Add a new task, but defer until one slot is available
    void defer(server_task task) {
        std::unique_lock<std::mutex> lock(mutex_tasks);
        queue_tasks_deferred.push_back(std::move(task));
    }
    // Get the next id for creating a new task
    int get_new_id() {
        std::unique_lock<std::mutex> lock(mutex_tasks);
        int new_id = id++;
        LOG_VERBOSE("new task id", {{"new_id", new_id}});
        return new_id;
    }

    // Register function to process a new task
    void on_new_task(std::function<void(server_task &)> callback) {
        callback_new_task = std::move(callback);
    }

    // Register function to process a multitask when it is finished
    void on_finish_multitask(std::function<void(server_task_multi &)> callback) {
        callback_finish_multitask = std::move(callback);
    }

    // Register the function to be called when all slots data is ready to be processed
    void on_update_slots(std::function<void(void)> callback) {
        callback_update_slots = std::move(callback);
    }

    // Call when the state of one slot is changed
    void notify_slot_changed() {
        // move deferred tasks back to main loop
        std::unique_lock<std::mutex> lock(mutex_tasks);
        for (auto & task : queue_tasks_deferred) {
            queue_tasks.push_back(std::move(task));
        }
        queue_tasks_deferred.clear();
    }

    // end the start_loop routine
    void terminate() {
        std::unique_lock<std::mutex> lock(mutex_tasks);
        running = false;
        condition_tasks.notify_all();
    }
    /**
     * Main loop consists of these steps:
     * - Wait until a new task arrives
     * - Process the task (i.e. maybe copy data into slot)
     * - Check if multitask is finished
     * - Update all slots
     */
    void start_loop() {
        running = true;

        while (true) {
            LOG_VERBOSE("new task may arrive", {});

            while (true) {
                std::unique_lock<std::mutex> lock(mutex_tasks);
                if (queue_tasks.empty()) {
                    lock.unlock();
                    break;
                }
                server_task task = queue_tasks.front();
                queue_tasks.erase(queue_tasks.begin());
                lock.unlock();
                LOG_VERBOSE("callback_new_task", {{"id_task", task.id}});
                callback_new_task(task);
            }

            LOG_VERBOSE("update_multitasks", {});

            // check if we have any finished multitasks
            auto queue_iterator = queue_multitasks.begin();
            while (queue_iterator != queue_multitasks.end()) {
                if (queue_iterator->subtasks_remaining.empty()) {
                    // all subtasks done == multitask is done
                    server_task_multi current_multitask = *queue_iterator;
                    callback_finish_multitask(current_multitask);
                    // remove this multitask
                    queue_iterator = queue_multitasks.erase(queue_iterator);
                } else {
                    ++queue_iterator;
                }
            }

            // all tasks in the current loop have been processed, slots data is now ready
            LOG_VERBOSE("callback_update_slots", {});

            callback_update_slots();

            LOG_VERBOSE("wait for new task", {});
            {
                std::unique_lock<std::mutex> lock(mutex_tasks);
                if (queue_tasks.empty()) {
                    if (!running) {
                        LOG_VERBOSE("ending start_loop", {});
                        return;
                    }
                    condition_tasks.wait(lock, [&] {
                        return (!queue_tasks.empty() || !running);
                    });
                }
            }
        }
    }
    //
    // functions to manage multitasks
    //

    // add a multitask by specifying the ids of all subtasks (a subtask is a server_task)
    void add_multitask(int id_multi, std::vector<int> & sub_ids) {
        std::lock_guard<std::mutex> lock(mutex_tasks);
        server_task_multi multi;
        multi.id = id_multi;
        std::copy(sub_ids.begin(), sub_ids.end(), std::inserter(multi.subtasks_remaining, multi.subtasks_remaining.end()));
        queue_multitasks.push_back(multi);
    }

    // update the remaining subtasks, while appending results to multitask
    void update_multitask(int id_multi, int id_sub, server_task_result & result) {
        std::lock_guard<std::mutex> lock(mutex_tasks);
        for (auto & multitask : queue_multitasks) {
            if (multitask.id == id_multi) {
                multitask.subtasks_remaining.erase(id_sub);
                multitask.results.push_back(result);
            }
        }
    }
};
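The interplay of `post`, `defer` and `notify_slot_changed` can be illustrated with a reduced, single-threaded model. This is only a sketch: `task_sketch`, `queue_sketch` and `drain` are hypothetical names, locking and the condition variable are left out on purpose, and the convention that the handler returns `false` when no slot is free is an assumption (in the listing above the new-task callback is opaque, so how it decides to defer is not shown here).

```cpp
#include <cassert>
#include <deque>
#include <utility>
#include <vector>

struct task_sketch {
    int id = -1;
};

struct queue_sketch {
    int next_id = 0;
    std::deque<task_sketch>  tasks;
    std::vector<task_sketch> deferred;

    // assign an id if needed and append to the main queue
    int post(task_sketch task) {
        if (task.id == -1) {
            task.id = next_id++;
        }
        const int id = task.id; // read the id before moving the task
        tasks.push_back(std::move(task));
        return id;
    }

    // park a task until a slot becomes available
    void defer(task_sketch task) {
        deferred.push_back(std::move(task));
    }

    // a slot changed state: move deferred tasks back to the main queue
    void notify_slot_changed() {
        for (auto & t : deferred) {
            tasks.push_back(std::move(t));
        }
        deferred.clear();
    }

    // one pass over the main queue; handler returns false when no slot is free
    template <typename F>
    void drain(F && handler) {
        while (!tasks.empty()) {
            task_sketch task = std::move(tasks.front());
            tasks.pop_front();
            if (!handler(task)) {
                defer(std::move(task));
            }
        }
    }
};

// Usage: two tasks are deferred while no slot is free, then run in order.
void queue_example() {
    queue_sketch q;
    q.post(task_sketch{});
    q.post(task_sketch{});
    bool slot_free = false;
    std::vector<int> done;
    auto handler = [&](task_sketch & t) {
        if (!slot_free) { return false; }
        done.push_back(t.id);
        return true;
    };
    q.drain(handler);
    assert(done.empty() && q.deferred.size() == 2);
    slot_free = true;
    q.notify_slot_changed();
    q.drain(handler);
    assert(done.size() == 2 && done[0] == 0 && done[1] == 1);
}
```

Deferred tasks keep their relative order because they are appended to the main queue in the order they were parked.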
struct server_response {
    typedef std::function<void(int, int, server_task_result &)> callback_multitask_t;
    callback_multitask_t callback_update_multitask;

    // for keeping track of all tasks waiting for the result
    std::set<int> waiting_task_ids;

    // the main result queue
    std::vector<server_task_result> queue_results;

    std::mutex mutex_results;
    std::condition_variable condition_results;

    // add the id_task to the list of tasks waiting for response
    void add_waiting_task_id(int id_task) {
        LOG_VERBOSE("waiting for task id", {{"id_task", id_task}});

        std::unique_lock<std::mutex> lock(mutex_results);
        waiting_task_ids.insert(id_task);
    }

    // when the request is finished, we can remove task associated with it
    void remove_waiting_task_id(int id_task) {
        LOG_VERBOSE("remove waiting for task id", {{"id_task", id_task}});

        std::unique_lock<std::mutex> lock(mutex_results);
        waiting_task_ids.erase(id_task);
    }

    // This function blocks the thread until there is a response for this id_task
    server_task_result recv(int id_task) {
        while (true) {
            std::unique_lock<std::mutex> lock(mutex_results);
            condition_results.wait(lock, [&] {
                return !queue_results.empty();
            });

            for (int i = 0; i < (int) queue_results.size(); i++) {
                if (queue_results[i].id == id_task) {
                    assert(queue_results[i].id_multi == -1);
                    server_task_result res = queue_results[i];
                    queue_results.erase(queue_results.begin() + i);
                    return res;
                }
            }
        }

        // should never reach here
    }
    // Register the function to update multitask
    void on_multitask_update(callback_multitask_t callback) {
        callback_update_multitask = std::move(callback);
    }

    // Send a new result to a waiting id_task
    void send(server_task_result result) {
        LOG_VERBOSE("send new result", {{"id_task", result.id}});

        std::unique_lock<std::mutex> lock(mutex_results);
        for (const auto & id_task : waiting_task_ids) {
            // LOG_TEE("waiting task id %i \n", id_task);
            // for now, tasks that have associated parent multitasks just get erased once multitask picks up the result
            if (result.id_multi == id_task) {
                LOG_VERBOSE("callback_update_multitask", {{"id_task", id_task}});
                callback_update_multitask(id_task, result.id, result);
                continue;
            }

            if (result.id == id_task) {
                LOG_VERBOSE("queue_results.push_back", {{"id_task", id_task}});
                queue_results.push_back(result);
                condition_results.notify_all();
                return;
            }
        }
    }
};
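As written, `recv` waits on "the result queue is non-empty", so a waiter whose result has not arrived yet appears to wake up and rescan whenever results for other tasks are queued. A possible variant, which is not what the source does, waits on a predicate specific to `id_task` instead. The sketch below uses hypothetical types (`result_sketch`, `response_sketch`) and leaves out the waiting-id set and the multitask callback.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <vector>

struct result_sketch {
    int id;
    int payload;
};

struct response_sketch {
    std::vector<result_sketch> results;
    std::mutex mutex_results;
    std::condition_variable condition_results;

    // queue a result and wake every waiter so each can re-check its own id
    void send(result_sketch r) {
        {
            std::lock_guard<std::mutex> lock(mutex_results);
            results.push_back(r);
        }
        condition_results.notify_all();
    }

    // block until a result with this id is queued, then remove and return it
    result_sketch recv(int id_task) {
        std::unique_lock<std::mutex> lock(mutex_results);
        auto match = [&] {
            return std::find_if(results.begin(), results.end(),
                                [&](const result_sketch & r) { return r.id == id_task; });
        };
        condition_results.wait(lock, [&] { return match() != results.end(); });
        auto it = match();
        assert(it != results.end());
        result_sketch res = *it;
        results.erase(it);
        return res;
    }
};
```

With this predicate a waiter sleeps until its own result is present, at the cost of one linear scan per wake-up; `notify_all` is still needed because several threads may wait for different ids.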
struct server_context {
    llama_model * model = nullptr;
    llama_context * ctx = nullptr;

    gpt_params params;

    llama_batch batch;

    bool clean_kv_cache = true;
    bool add_bos_token = true;

    int32_t n_ctx; // total context for all clients / slots

    // system prompt
    bool system_need_update = false;

    std::string system_prompt;
    std::vector<llama_token> system_tokens;

    std::string name_user; // this should be the antiprompt
    std::string name_assistant;

    // slots / clients
    std::vector<server_slot> slots;
    json default_generation_settings_for_props;

    server_queue queue_tasks;
    server_response queue_results;

    server_metrics metrics;

    ~server_context() {
        if (ctx) {
            llama_free(ctx);
            ctx = nullptr;
        }

        if (model) {
            llama_free_model(model);
            model = nullptr;
        }
}
2024-03-07 10:41:53 +01:00
bool load_model(const gpt_params & params_) {
2023-10-22 21:53:08 +02:00
params = params_;
2023-06-17 13:53:04 +02:00
llama : support Mamba Selective State Space Models (#5328)
* mamba : begin working on support for Mamba SSM
* mamba : begin figuring out how to (ab)use the kv cache for Mamba
* mamba : recurrent inference almost works, but incoherent
* mamba : recurrent inference WORKS!!!
* convert : optionally use d_conv and d_state from config.json for Mamba
* mamba : refactor recurrent conv, resulting in 20% perf increase
It's still slower than I'd like, but I did not really optimize `ggml_exp` yet.
I also refactored `ggml_exp` to work with tensors with more than 2 dimensions.
* ggml : parallelize ggml_exp
This results in 8% faster token generation for Mamba-130M.
* mamba : simplify the conv step with a self-overlapping view
Turns out the conv_state can be made smaller by one column.
Note that this breaks existing GGUFs of Mamba,
because the key_value_length field is tied to the conv_state size.
Convolution with a self-overlapping view is cool!
And it's much simpler than what I initially thought would be necessary
to make the convolution step work with more than 1 token at a time.
Next step is to make the SSM step work on batches of tokens too,
and thus I need to figure out a way to make a parallel selective scan
which will keep the ssm_state small and won't make it bigger
by a factor of (n_layer * batch_size).
* llama : fix Mamba KV self size wrongly displaying as f16 instead of f32
Relatedly, I also tried to see if other types than f32 worked for the states,
but they don't, because of the operators used.
It's probably better anyway to keep lots of precision there,
since the states are small anyway.
* mamba : fix self-overlapping view depth stride
* mamba : handle batches of more than 1 token
This means running Mamba no longer crashes when using the default settings!
And probably also slightly faster prompt processing.
Both batched and non-batched processing yield the same output.
Previously, the state was not cleared when starting a sequence.
Next step is to make the KV cache API work as expected for Mamba models.
* ggml: add ggml_ssm_scan to help with parallel selective scan
If the selective scan was implemented without a custom operator,
there would be waaay too many nodes in the graph. For example,
for Mamba-130M, with a batch size of 512 (the default),
a naive selective scan could add at least 24*512=12288 nodes,
which is more than LLAMA_MAX_NODES (8192),
and that's only for the smallest Mamba model.
So it's much cleaner with a custom operator.
Not sure about the name, though.
* ggml : in ggml_ssm_scan, merge multiple rows in the same vec operation
This will help with performance on CPU if ggml_vec_mul_f32
and ggml_vec_add_f32 are ever optimized with SIMD.
* mamba : very basic quantization support
Mostly works, but there is currently no difference
between the variants of a k-quant (e.g. Q4_K_S and Q4_K_M are the same).
Most of the SSM-specific weights can be kept in f32 without affecting
the size that much, since they are relatively small.
(the linear projection weights are responsible for most of Mamba's size)
Too much quantization seems to make the state degrade quite fast, and
the model begins to output gibberish.
It seems to affect bigger models to a lesser extent than small models,
but I'm not sure by how much.
Experimentation will be needed to figure out which weights are more important
for the _M (and _L?) variants of k-quants for Mamba.
* convert : fix wrong name for layer norm weight of official Mamba models
I was using Q-bert/Mamba-* models before, which have a slightly different
naming scheme for the weights.
(they start with "model.layers" instead of "backbone.layers")
* mamba : fuse more steps of the SSM scan in the ggml_ssm_scan operator
This increases performance on CPU by around 30% for prompt processing,
and by around 20% for text generation.
However, it also makes the ggml_exp and ggml_soft_plus operators unused.
Whether or not they should be kept will be decided later.
* convert : for Mamba, also consider the "MambaLMHeadModel" arch name
It's the name of the class of the official implementation,
though they don't use it (yet) in the "architectures" field of config.json
* mamba : fix vocab size problems with official models
The perplexity was waaaay too high for models with a non-round vocab size.
Not sure why, but it needed to be fixed in the metadata.
Note that this breaks existing GGUF-converted Mamba models,
but **only if** the vocab size was not already rounded.
* ggml : remove ggml_exp and ggml_soft_plus
They did not exist anyway outside of this branch,
and since ggml_ssm_scan fused operations together, they are unused.
It's always possible to bring them back if needed.
* mamba : remove some useless comments
No code change.
* convert : fix flake8 linter errors
* mamba : apply suggestions from code review
* mamba : remove unnecessary branch for row-wise ssm_state and C multiplication
It was previously done to avoid permuting when only one token is processed
at a time (like when generating text), but permuting is cheap,
and dynamically changing the compute graph is not future-proof.
* ggml : in ggml_ssm_scan, use more appropriate asserts
* ggml : rename the destination pointer in ggml_compute_forward_ssm_scan_f32
* mamba : multiple sequences, but one at a time
This is a step towards making this Mamba implementation usable
with the server example (the way the system prompt is kept when clearing
the client slots will need to be changed before this can work, though).
The KV cache size for this kind of model is tied to the maximum number
of sequences kept at any single time.
For now, this number is obtained from n_parallel (plus one,
to have an extra sequence to dedicate to the system prompt),
but there might be a better way to do this which won't also
make the main example use 2 cells even if only 1 is really used.
(for this specific case, --parallel 0 helps)
Simultaneous sequence processing will probably require changes to
ggml_ssm_scan, and possibly a new operator for the conv step.
* mamba : support llama_kv_cache_seq_cp
This (mis)uses the logic around K shifts, because tokens in a state
can't be shifted anyway, and because inp_K_shift has the right shape and type.
Using ggml_get_rows is a nice way to do copies, but copy chains can't work.
Fortunately, copy chains don't really seem to be used in the examples.
Each KV cell is dedicated to the sequence ID corresponding to its own index.
* mamba : use a state mask
It's cleaner than the previous heuristic of
checking for the pos of the first token in the batch.
inp_KQ_mask could not be re-used for this, because it has the wrong shape
and because it seems more suited to the next step of
simultaneous sequence processing (helping with the problem of
remembering which token belongs to which sequence(s)/state(s)).
* llama : replace the usage of n_ctx with kv_self.size in many places
* mamba : use n_tokens directly instead of n_tok
* mamba : in comments, properly refer to KV cells instead of slots
* mamba : reduce memory usage of ggml_ssm_scan
From 290.37 MiB to 140.68 MiB of CPU compute buffer size
with Mamba 3B with a batch size of 512.
The result tensor of ggml_ssm_scan was previously a big part
of the CPU compute buffer size. To make it smaller,
it does not contain the intermediate ssm states anymore.
Both y and the last ssm state are combined in the result tensor,
because it seems only a single tensor can be returned by an operator
with the way the graph is built.
* mamba : simultaneous sequence processing
A batch can now contain tokens from multiple sequences.
This is necessary for at least the parallel example, the server example,
and the HellaSwag test in the perplexity example.
However, for this to be useful, uses of llama_kv_cache_seq_rm/cp
will need to be changed to work on whole sequences.
* ggml : add ggml_ssm_conv as a new operator for the conv step of Mamba
This operator makes it possible to use and update the correct states
for each token of the batch in the same way as ggml_ssm_scan.
Other solutions which use existing operators would need loops which would
add too many nodes to the graph (at least the ones I thought of).
Using this operator further reduces the size of the CPU compute buffer
from 140.68 MiB to 103.20 MiB with Mamba 3B with a batch size of 512.
And (at least on CPU), it's a bit faster than before.
Note that "ggml_ssm_conv" is probably not the most appropriate name,
and it could be changed if a better one is found.
* llama : add inp_s_seq as a new input tensor
The most convenient implementation to select the correct state (for Mamba)
for each token is to directly get the correct index from a tensor.
This is why inp_s_seq is storing int32_t and not floats.
The other, less convenient way to select the correct state would be
to have inp_KQ_mask contain 1.0f for each state used by a token
and 0.0f otherwise. This complicates quickly fetching the first used
state of a token, and is also less efficient because a whole row
of the mask would always need to be read for each token.
Using indexes makes it easy to stop searching when there are
no more sequences for a token, and the first sequence assigned
is always very quickly available (it's the first element of each row).
* mamba : support llama_kv_cache_seq_cp copy chains
* mamba : support shifting and dividing the kv cache pos
* mamba : make the server and parallel examples work with whole sequences
A seq_id is dedicated to the system prompt in both cases.
* llama : make llama_kv_cache_seq_rm return whether it succeeded or not
* mamba : dedicate an input tensor for state copy indices
This is cleaner and makes it easier to adapt when/if token positions
(and by extension, inp_K_shift) are no longer integers.
* mamba : adapt perplexity, batched, and batched-bench examples
* perplexity : limit the max number of sequences
This adapts to what the loaded model can provide.
* llama : add llama_n_max_seq to get the upper limit for seq_ids
Used by the perplexity example.
* batched : pass n_parallel to the model's context params
This should have been there already, but it wasn't.
* batched-bench : reserve sequences to support Mamba
* batched-bench : fix tokens being put in wrong sequences
Generation quality isn't what's measured in there anyway,
but at least using the correct sequences avoids using non-consecutive
token positions.
* mamba : stop abusing attention metadata
This breaks existing converted-to-GGUF Mamba models,
but will allow supporting mixed architectures like MambaFormer
without needing to break Mamba models.
This will also allow changing the size of Mamba's states
without having to reconvert models in the future.
(e.g. using something else than d_conv - 1 columns for the conv_states
will not require breaking existing converted Mamba models again)
* gguf-py : add new KV metadata key-value pairs for Mamba
* llama : add new metadata key-value pairs for Mamba
* llama : guard against divisions by zero when n_head is 0
* mamba : rename "unlimited" KV cache property to "recurrent"
* mamba : more correctly update the "used" field of the KV cache
* ggml : in ggml_ssm_scan, use a threshold for soft_plus
This is how the official Mamba implementation does it,
and it's also what torch.nn.Softplus does.
* convert : for Mamba, fallback to internal NeoX tokenizer
The resulting models are exactly the same
as if the tokenizer.json and tokenizer_config.json of GPT-NeoX were there.
* mamba : support state saving and restoring
* ggml : implicitly pass src tensors through dst for Mamba-related ops
* mamba : clarify some comments
* server : fix cache_tokens not getting correctly resized
Otherwise, when the "we have to evaluate at least 1 token" special case
was triggered, an extra token was kept in cache_tokens even if it was
removed from the KV cache.
For Mamba, this caused useless prompt reprocessing when the previous
request triggered the above case.
* convert-hf : support new metadata keys for Mamba
For the models available at
https://huggingface.co/collections/state-spaces/transformers-compatible-mamba-65e7b40ab87e5297e45ae406
* mamba : rename metadata to be more similar to transformers library
This breaks existing converted-to-GGUF models,
but the metadata names are more "standard".
* mamba : support mamba-*-hf models
These models share their token_embd.weight with their output.weight
* mamba : add missing spaces
This is purely a formatting change.
* convert-hf : omit output.weight when identical with token_embd.weight
Only for Mamba for now, but it might be relevant for other models eventually.
Most Mamba models actually share these two tensors, albeit implicitly.
* readme : add Mamba to supported models, and add recent API changes
* mamba : move state_seq and state_mask views outside layer loop
A few tensors were also missing `struct` in front of `ggml_tensor`.
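The "use a threshold for soft_plus" item above can be illustrated with a minimal standalone sketch. It is not the code from `ggml_ssm_scan`; the function name `softplus_thresholded` is hypothetical, and the default threshold of 20 is an assumption taken from the default of `torch.nn.Softplus`, which the commit message cites as the reference behavior. Above the threshold the function returns its input unchanged, which avoids overflowing `exp` for large inputs.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper, not part of ggml: softplus(x) = log(1 + exp(x)),
// switching to the identity once x exceeds the threshold.
// The default of 20 mirrors torch.nn.Softplus and is an assumption here.
inline float softplus_thresholded(float x, float threshold = 20.0f) {
    assert(threshold > 0.0f);
    // For large x, log(1 + exp(x)) is numerically equal to x,
    // while exp(x) itself would overflow a float.
    return x > threshold ? x : std::log1p(std::exp(x));
}
```

Without the threshold, an input such as 100 would overflow `std::exp` in single precision and yield infinity instead of roughly 100.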
2024-03-08 23:31:00 +01:00
// dedicate one sequence to the system prompt
params.n_parallel += 1;
2023-06-24 10:47:58 +02:00
std::tie(model, ctx) = llama_init_from_gpt_params(params);
2024-03-08 23:31:00 +01:00
params.n_parallel -= 1; // but be sneaky about it
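The increment and decrement of `n_parallel` around context creation can be sketched in isolation. This is a simplified illustration under stated assumptions, not the server's code: `params_sketch`, `init_context_sketch`, and `load_with_system_sequence` are hypothetical stand-ins, and the real initialization does far more than return a count.

```cpp
#include <cassert>

// Hypothetical stand-in for gpt_params; only the field used here.
struct params_sketch {
    int n_parallel = 1;
};

// Hypothetical stand-in for context creation: it just reports how many
// sequences the context would be sized for.
inline int init_context_sketch(const params_sketch & p) {
    assert(p.n_parallel >= 1);
    return p.n_parallel;
}

// Reserve one extra sequence for the system prompt while the context is
// created, then restore the user-visible value afterwards.
inline int load_with_system_sequence(params_sketch & p) {
    p.n_parallel += 1;
    const int n_seq_reserved = init_context_sketch(p);
    p.n_parallel -= 1;
    return n_seq_reserved;
}
```

The effect is that the context is sized for one more sequence than the number of client slots, while the rest of the program still sees the value the user asked for.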
2024-03-07 10:41:53 +01:00
if (model == nullptr) {
2023-10-22 21:53:08 +02:00
LOG_ERROR("unable to load model", {{"model", params.model}});
            return false;
        }

        n_ctx = llama_n_ctx(ctx);

        add_bos_token = llama_should_add_bos_token(model);
        GGML_ASSERT(llama_add_eos_token(model) != 1);

Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort: 24 PRs were closed in the submitter's repo alone, along with over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
        return true;
    }

    bool validate_model_chat_template() const {
        llama_chat_message chat[] = {{"user", "test"}};
        const int res = llama_chat_apply_template(model, nullptr, chat, 1, true, nullptr, 0);
        return res > 0;
    }

    void init() {
        const int32_t n_ctx_slot = n_ctx / params.n_parallel;
        LOG_INFO("initializing slots", {{"n_slots", params.n_parallel}});

        for (int i = 0; i < params.n_parallel; i++) {
            server_slot slot;
            slot.id = i;
            slot.n_ctx = n_ctx_slot;
            slot.n_predict = params.n_predict;

            LOG_INFO("new slot", {
                {"id_slot", slot.id},
                {"n_ctx_slot", slot.n_ctx}
            });
            const int ga_n = params.grp_attn_n;
            const int ga_w = params.grp_attn_w;
            if (ga_n != 1) {
                GGML_ASSERT(ga_n > 0 && "ga_n must be positive"); // NOLINT
                GGML_ASSERT(ga_w % ga_n == 0 && "ga_w must be a multiple of ga_n"); // NOLINT
                //GGML_ASSERT(n_ctx_train % ga_w == 0 && "n_ctx_train must be a multiple of ga_w"); // NOLINT
                //GGML_ASSERT(n_ctx >= n_ctx_train * ga_n && "n_ctx must be at least n_ctx_train * ga_n"); // NOLINT
                LOG_INFO("slot self-extend", {
                    {"id_slot", slot.id},
                    {"ga_n", ga_n},
                    {"ga_w", ga_w}
                });
            }
            slot.ga_i = 0;
            slot.ga_n = ga_n;
            slot.ga_w = ga_w;
            slot.reset();
            slots.push_back(slot);
        }
        default_generation_settings_for_props = get_formated_generation(slots.front());
        default_generation_settings_for_props["seed"] = -1;
        // the update_slots() logic will always submit a maximum of n_batch tokens
        // note that n_batch can be > n_ctx (e.g. for non-causal attention models such as BERT where the KV cache is not used)
        {
            const int32_t n_batch = llama_n_batch(ctx);
llama : greatly reduce output buffer memory usage (#6122)
* llama : greatly reduce logits memory usage
* llama : more compact state saving and reloading
* llama : fix lctx.n_outputs not being set before building graph
* perplexity : adapt to the logits API changes
* perplexity : fix Winogrande, use correct logits for second choice start
The first logits used to evaluate the second choice were not from
the end of the common prefix; instead, they were the logits from the end
of the first choice. This has been corrected.
The previous implementation sometimes had outliers in the scores of
choices for some tasks, and the logic to skip choices words
in the log-likelihood evaluation probably was an attempt to reduce those,
but it was complex and didn't quite seem to be the right thing.
This is simpler now, and the outlier scores aren't there anymore.
* perplexity : normalize spaces and punctuation in Winogrande sentences
* llama : fix embedding conditions
* llama : fix llama_get_embeddings_ith when the resulting id is 0
* llama : fix wrong n_outputs in llama_set_inputs
A mismatch happened when using a smaller n_ubatch than n_batch and then using
llama_batch_get_one(). The decision of what n_outputs should be now almost
fully depends on how lctx.n_outputs is set in llama_decode_internal.
The conditions are simpler this way.
* llama : when saving the state, recalculate n_outputs
This ensures the correct number of outputs for the entire previous batch
is stored in the session file, even when n_ubatch is smaller than n_batch.
* llama : fix not-skipping outputs of non-causal models
* llama : fix running a batch with n_outputs == 0
It previously worked because lctx.inp_out_ids was not initialized,
so it pointed to some garbage address which was somehow still valid when I
ran my tests.
* llama : keep same graph topology even when n_outputs == 0
* ggml : saner ggml_can_repeat with empty tensors
* ggml : future-proof ggml_is_empty by using GGML_MAX_DIMS - 1
* ggml : do not multi-thread ops returning empty tensors
* ggml : make ggml_is_empty public and work with views
* llama : use a vector for ctx->output_ids
* llama : rework reallocation logic for llama_output_reserve
Now comparing the actual size with the new total size of the output buffer
to allow more efficient enabling and disabling of the embeddings
and/or logits output in the future.
* ggml : skip empty tensors in all backends
* llama : fix llama_output_reserve nullptr deref when new_size is 0
* perplexity : make Winogrande work as it does on master
The problems with the Winogrande implementation will
need to be fixed in a separate PR to ease review.
* llama : clearer error messages for invalid logits or embeddings ids
* llama : assert all models that can have inp_out_ids
Since the graph topology is now constant, this presence check
can be done even when there are no outputs.
* llama : assert logits and embd buffers exist before writing to them
* llama : handle errors from llama_output_reserve at call sites
* perplexity : make hellaswag and multiple-choice outputs identical to master
Due to how the KV cache is updated, the logprobs for tokens in a batch
are very slightly affected by the other tokens present in the batch,
so to make hellaswag and multiple-choice return exactly the same results
as on master, the last token of each sequence needs to be evaluated
even though its output is not used at all.
This will probably be changed back in the future to make these benchmarks
a tiny bit faster.
* perplexity : fix division by zero when using less than 100 multiple-choice tasks
* llama : allow loading state saved with a different ctx size
When loading a session file, the context size is now only required to be
at least enough to load the KV cells contained in that session file,
instead of requiring to use exactly the same context size as when saving.
Doing this enables the use-case of extending or shrinking the context size
of a saved session.
This breaks existing session files because the meaning of kv_buf_size
is slightly changed (previously it was the size of the whole KV cache,
now it's only the size of the saved part of it). This allows for
finer-grained sanity checks when loading in an effort to keep kv_buf_size
useful even when the kv_size is changed.
* llama : minor
ggml-ci
* readme : update recent API changes, and warn about Vulkan
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
            // only a single seq_id per token is needed
            batch = llama_batch_init(n_batch, 0, 1);
        }
        metrics.init();
    }
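
The slot setup in init() can be sketched in isolation. The following is a minimal, self-contained illustration and not the server's actual code: `slot_sketch` and `make_slots` are hypothetical names, and the sketch throws exceptions where the original aborts through GGML_ASSERT.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for server_slot: only the fields this sketch needs.
struct slot_sketch {
    int     id    = -1;
    int32_t n_ctx = 0;
    int     ga_n  = 1;
    int     ga_w  = 0;
};

// Split the total context evenly across n_parallel slots and apply the same
// group-attention sanity checks as init() (exceptions instead of GGML_ASSERT).
std::vector<slot_sketch> make_slots(int32_t n_ctx, int n_parallel, int ga_n, int ga_w) {
    if (n_parallel <= 0) {
        throw std::invalid_argument("n_parallel must be positive");
    }
    if (ga_n != 1) {
        if (ga_n <= 0) {
            throw std::invalid_argument("ga_n must be positive");
        }
        if (ga_w % ga_n != 0) {
            throw std::invalid_argument("ga_w must be a multiple of ga_n");
        }
    }

    // integer division: any remainder of the context is left unused
    const int32_t n_ctx_slot = n_ctx / n_parallel;

    std::vector<slot_sketch> slots;
    for (int i = 0; i < n_parallel; i++) {
        slot_sketch slot;
        slot.id    = i;
        slot.n_ctx = n_ctx_slot;
        slot.ga_n  = ga_n;
        slot.ga_w  = ga_w;
        slots.push_back(slot);
    }
    return slots;
}
```

Under these assumptions, a 4096-token context shared by 4 slots gives each slot 1024 tokens, so raising the number of parallel slots shrinks the context available to each request.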

    std::vector<llama_token> tokenize(const json & json_prompt, bool add_special) const {
        // TODO: currently, we tokenize using special tokens by default
        //       this is not always correct (see https://github.com/ggerganov/llama.cpp/pull/4160#issuecomment-1824826216)
        //       but it's better compared to completely ignoring ChatML and other chat templates
        const bool TMP_FORCE_SPECIAL = true;

        // If `add_special` is true, we only add special tokens (e.g. BOS) when json_prompt is a string,
        // or when the first element of the json_prompt array is a string.
        std::vector<llama_token> prompt_tokens;

        if (json_prompt.is_array()) {
            bool first = true;
            for (const auto & p : json_prompt) {
                if (p.is_string()) {
                    auto s = p.template get<std::string>();

                    std::vector<llama_token> p;
                    if (first) {
                        p = ::llama_tokenize(ctx, s, add_special, TMP_FORCE_SPECIAL);
                        first = false;
                    } else {
                        p = ::llama_tokenize(ctx, s, false, TMP_FORCE_SPECIAL);
                    }

                    prompt_tokens.insert(prompt_tokens.end(), p.begin(), p.end());
                } else {
                    if (first) {
                        first = false;
                    }

                    prompt_tokens.push_back(p.template get<llama_token>());
                }
            }
        } else {
            auto s = json_prompt.template get<std::string>();
            prompt_tokens = ::llama_tokenize(ctx, s, add_special, TMP_FORCE_SPECIAL);
        }

        return prompt_tokens;
    }
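
The "special tokens only on a leading string" rule in tokenize() is easier to see with a toy tokenizer. This sketch assumes C++17 and is only an illustration: `toy_tokenize`, `tokenize_mixed`, `prompt_piece` and `TOY_BOS` are hypothetical names, and the one-token-per-byte tokenizer stands in for llama_tokenize.

```cpp
#include <cassert>
#include <string>
#include <variant>
#include <vector>

using token        = int;
using prompt_piece = std::variant<std::string, token>; // a text fragment or a raw token id

constexpr token TOY_BOS = 1; // hypothetical BOS id used by the toy tokenizer

// Toy tokenizer (assumption): one token per byte, optionally prefixed with BOS.
std::vector<token> toy_tokenize(const std::string & text, bool add_special) {
    std::vector<token> out;
    if (add_special) {
        out.push_back(TOY_BOS);
    }
    for (unsigned char c : text) {
        out.push_back(static_cast<token>(c));
    }
    return out;
}

// Mirrors the array branch of tokenize(): only the very first piece may get
// special tokens, and only if that piece is a string.
std::vector<token> tokenize_mixed(const std::vector<prompt_piece> & pieces, bool add_special) {
    std::vector<token> out;
    bool first = true;
    for (const auto & piece : pieces) {
        if (const auto * text = std::get_if<std::string>(&piece)) {
            const auto toks = toy_tokenize(*text, first && add_special);
            out.insert(out.end(), toks.begin(), toks.end());
        } else {
            out.push_back(std::get<token>(piece)); // raw token ids pass through unchanged
        }
        first = false;
    }
    return out;
}
```

With this toy setup, a prompt that starts with a raw token id never receives a BOS token, even when `add_special` is true, which matches the behavior of the function above.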

    server_slot * get_slot(int id) {
        int64_t t_last = ggml_time_us();

        server_slot * last_used = nullptr;

        for (server_slot & slot : slots) {
            if (slot.id == id && slot.available()) {
                return &slot;
            }

            // among all available slots, find the one that has been least recently used
            if (slot.available() && slot.t_last_used < t_last) {
                last_used = &slot;
                t_last = slot.t_last_used;
            }
        }

        return last_used;
    }
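
The selection policy in get_slot() (exact id match first, otherwise the least recently used available slot) can be exercised on its own. This is a minimal sketch under stated assumptions: `lru_slot` and `pick_slot` are hypothetical names, and the current time is passed in as `now` instead of being read from ggml_time_us().

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for server_slot with just the fields get_slot() reads.
struct lru_slot {
    int     id;
    bool    available;
    int64_t t_last_used; // timestamp of last use, in microseconds
};

// Return the requested slot if it is free; otherwise the free slot with the
// oldest t_last_used; nullptr if no slot is free.
lru_slot * pick_slot(std::vector<lru_slot> & slots, int id, int64_t now) {
    int64_t    t_last    = now;
    lru_slot * last_used = nullptr;

    for (auto & slot : slots) {
        if (slot.id == id && slot.available) {
            return &slot;
        }
        if (slot.available && slot.t_last_used < t_last) {
            last_used = &slot;
            t_last    = slot.t_last_used;
        }
    }
    return last_used;
}
```

Passing an id that matches no slot (for example -1) falls through to the least-recently-used choice, so callers that do not care which slot they get still receive a free one when available.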

    bool launch_slot_with_task(server_slot & slot, const server_task & task) {
        slot_params default_params;
        llama_sampling_params default_sparams;
        auto & data = task.data;

        if (data.count("__oaicompat") != 0) {
            slot.oaicompat = true;
            slot.oaicompat_model = json_value(data, "model", std::string(DEFAULT_OAICOMPAT_MODEL));
        } else {
            slot.oaicompat = false;
            slot.oaicompat_model = "";
        }
        slot.params.stream = json_value(data, "stream", false);
        slot.params.cache_prompt = json_value(data, "cache_prompt", false);
        slot.params.n_predict = json_value(data, "n_predict", default_params.n_predict);
        slot.sparams.top_k = json_value(data, "top_k", default_sparams.top_k);
        slot.sparams.top_p = json_value(data, "top_p", default_sparams.top_p);
        slot.sparams.min_p = json_value(data, "min_p", default_sparams.min_p);
        slot.sparams.tfs_z = json_value(data, "tfs_z", default_sparams.tfs_z);
        slot.sparams.typical_p = json_value(data, "typical_p", default_sparams.typical_p);
        slot.sparams.temp = json_value(data, "temperature", default_sparams.temp);
        slot.sparams.dynatemp_range = json_value(data, "dynatemp_range", default_sparams.dynatemp_range);
        slot.sparams.dynatemp_exponent = json_value(data, "dynatemp_exponent", default_sparams.dynatemp_exponent);
        slot.sparams.penalty_last_n = json_value(data, "repeat_last_n", default_sparams.penalty_last_n);
        slot.sparams.penalty_repeat = json_value(data, "repeat_penalty", default_sparams.penalty_repeat);
        slot.sparams.penalty_freq = json_value(data, "frequency_penalty", default_sparams.penalty_freq);
        slot.sparams.penalty_present = json_value(data, "presence_penalty", default_sparams.penalty_present);
        slot.sparams.mirostat = json_value(data, "mirostat", default_sparams.mirostat);
        slot.sparams.mirostat_tau = json_value(data, "mirostat_tau", default_sparams.mirostat_tau);
        slot.sparams.mirostat_eta = json_value(data, "mirostat_eta", default_sparams.mirostat_eta);
        slot.sparams.penalize_nl = json_value(data, "penalize_nl", default_sparams.penalize_nl);
        slot.params.n_keep = json_value(data, "n_keep", slot.params.n_keep);
        slot.params.n_discard = json_value(data, "n_discard", default_params.n_discard);
        slot.sparams.seed = json_value(data, "seed", default_sparams.seed);
        slot.sparams.n_probs = json_value(data, "n_probs", default_sparams.n_probs);
        slot.sparams.min_keep = json_value(data, "min_keep", default_sparams.min_keep);

        // process "json_schema" and "grammar"
        if (data.contains("json_schema") && !data["json_schema"].is_null() && data.contains("grammar") && !data["grammar"].is_null()) {
            send_error(task, "Either \"json_schema\" or \"grammar\" can be specified, but not both", ERROR_TYPE_INVALID_REQUEST);
            return false;
        } else if (data.contains("json_schema") && !data.contains("grammar")) {
            try {
                auto schema = json_value(data, "json_schema", json::object());
                slot.sparams.grammar = json_schema_to_grammar(schema);
            } catch (const std::exception & e) {
                send_error(task, std::string("\"json_schema\": ") + e.what(), ERROR_TYPE_INVALID_REQUEST);
                return false;
            }
        } else {
            slot.sparams.grammar = json_value(data, "grammar", default_sparams.grammar);
        }

        if (slot.params.cache_prompt && slot.ga_n != 1) {
            LOG_WARNING("cache_prompt is not supported with group-attention", {});
            slot.params.cache_prompt = false;
        }

        if (slot.n_predict > 0 && slot.params.n_predict > slot.n_predict) {
            // Might be better to reject the request with a 400 ?
            LOG_WARNING("Max tokens to predict exceeds server configuration", {
                {"params.n_predict", slot.params.n_predict},
                {"slot.n_predict", slot.n_predict},
            });
            slot.params.n_predict = slot.n_predict;
        }

        // infill
        slot.params.input_prefix = json_value(data, "input_prefix", default_params.input_prefix);
        slot.params.input_suffix = json_value(data, "input_suffix", default_params.input_suffix);

        // get prompt
        {
            const auto & prompt = data.find("prompt");
            if (prompt == data.end()) {
                send_error(task, "Either \"prompt\" or \"messages\" must be provided", ERROR_TYPE_INVALID_REQUEST);
                return false;
            } else {
                slot.prompt = *prompt;
            }
            if (slot.prompt.is_array() && slot.prompt.size() == 0) {
                send_error(task, "\"prompt\" cannot be an empty array", ERROR_TYPE_INVALID_REQUEST);
                return false;
            }
        }

        // penalize user-provided tokens
        {
            slot.sparams.penalty_prompt_tokens.clear();
            slot.sparams.use_penalty_prompt_tokens = false;

            const auto & penalty_prompt = data.find("penalty_prompt");

            if (penalty_prompt != data.end()) {
                if (penalty_prompt->is_string()) {
                    const auto penalty_prompt_string = penalty_prompt->get<std::string>();
                    slot.sparams.penalty_prompt_tokens = llama_tokenize(model, penalty_prompt_string, false);
                    if (slot.params.n_predict > 0) {
                        slot.sparams.penalty_prompt_tokens.reserve(slot.sparams.penalty_prompt_tokens.size() + slot.params.n_predict);
                    }
                    slot.sparams.use_penalty_prompt_tokens = true;
                    LOG_VERBOSE("penalty_prompt_tokens", {
                        {"id_slot", slot.id},
                        {"tokens", slot.sparams.penalty_prompt_tokens},
                    });
                }
                else if (penalty_prompt->is_array()) {
                    const auto n_tokens = penalty_prompt->size();
                    slot.sparams.penalty_prompt_tokens.reserve(n_tokens + std::max(0, slot.params.n_predict));
                    const int n_vocab = llama_n_vocab(model);
                    for (const auto & penalty_token : *penalty_prompt) {
                        if (penalty_token.is_number_integer()) {
                            const auto tok = penalty_token.get<llama_token>();
                            if (tok >= 0 && tok < n_vocab) {
                                slot.sparams.penalty_prompt_tokens.push_back(tok);
                            }
                        }
                    }
                    slot.sparams.use_penalty_prompt_tokens = true;
                    LOG_VERBOSE("penalty_prompt_tokens", {
                        {"id_slot", slot.id},
                        {"tokens", slot.sparams.penalty_prompt_tokens},
                    });
                }
            }
        }

        {
            slot.sparams.logit_bias.clear();

            if (json_value(data, "ignore_eos", false)) {
                slot.sparams.logit_bias[llama_token_eos(model)] = -INFINITY;
            }

            const auto & logit_bias = data.find("logit_bias");
            if (logit_bias != data.end() && logit_bias->is_array()) {
                const int n_vocab = llama_n_vocab(model);
                for (const auto & el : *logit_bias) {
                    // TODO: we may want to throw errors here, in case "el" is incorrect
                    if (el.is_array() && el.size() == 2) {
                        float bias;
                        if (el[1].is_number()) {
                            bias = el[1].get<float>();
                        } else if (el[1].is_boolean() && !el[1].get<bool>()) {
                            bias = -INFINITY;
                        } else {
                            continue;
                        }

                        if (el[0].is_number_integer()) {
                            llama_token tok = el[0].get<llama_token>();
                            if (tok >= 0 && tok < n_vocab) {
                                slot.sparams.logit_bias[tok] = bias;
                            }
                        } else if (el[0].is_string()) {
                            auto toks = llama_tokenize(model, el[0].get<std::string>(), false);
                            for (auto tok : toks) {
                                slot.sparams.logit_bias[tok] = bias;
                            }
                        }
                    }
                }
            }
        }
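/*
Illustrative sketch, not part of the server: the logit_bias block above accepts
[token, bias] pairs where the bias is either a number or the boolean false
(which bans the token). The standalone version below mirrors that rule without
the JSON library. The names bias_value, bias_entry and build_logit_bias are
hypothetical, and the string-token branch (which tokenizes the text first) is
left out on purpose.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical stand-in for one request entry: a token id paired with either
// a numeric bias or a boolean (only `false` is meaningful).
using bias_value = std::variant<float, bool>;
using bias_entry = std::pair<int, bias_value>;

// Builds a token -> bias map, skipping entries the server code would also skip.
std::map<int, float> build_logit_bias(const std::vector<bias_entry> & entries, int n_vocab) {
    std::map<int, float> out;
    for (const auto & entry : entries) {
        const int          tok   = entry.first;
        const bias_value & value = entry.second;

        float bias;
        if (std::holds_alternative<float>(value)) {
            bias = std::get<float>(value);
        } else if (!std::get<bool>(value)) {
            bias = -INFINITY; // `false` bans the token outright
        } else {
            continue; // `true` carries no meaning, so the entry is ignored
        }

        // Only token ids inside the vocabulary are accepted.
        if (tok >= 0 && tok < n_vocab) {
            out[tok] = bias;
        }
    }
    return out;
}
```

Keeping the range check next to the assignment means an out-of-vocabulary id is
dropped silently, which matches the TODO above about possibly reporting such
entries as errors later.
*/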

        {
            slot.params.antiprompt.clear();

            const auto & stop = data.find("stop");
            if (stop != data.end() && stop->is_array()) {
                for (const auto & word : *stop) {
                    if (!word.empty()) {
                        slot.params.antiprompt.push_back(word);
                    }
                }
            }
        }

        {
            const auto & samplers_sequence = data.find("samplers");
            if (samplers_sequence != data.end() && samplers_sequence->is_array()) {
                std::vector<std::string> sampler_names;
                for (const auto & sampler_name : *samplers_sequence) {
                    if (sampler_name.is_string()) {
                        sampler_names.emplace_back(sampler_name);
                    }
                }
                slot.sparams.samplers_sequence = sampler_types_from_names(sampler_names, false);
            } else {
                slot.sparams.samplers_sequence = default_sparams.samplers_sequence;
            }
        }

        {
            if (slot.ctx_sampling != nullptr) {
                llama_sampling_free(slot.ctx_sampling);
            }
            slot.ctx_sampling = llama_sampling_init(slot.sparams);
            if (slot.ctx_sampling == nullptr) {
                // for now, the only error that may happen here is invalid grammar
                send_error(task, "Failed to parse grammar", ERROR_TYPE_INVALID_REQUEST);
                return false;
            }
        }

        slot.command = SLOT_COMMAND_LOAD_PROMPT;
        slot.prompt_tokens.clear();

        LOG_INFO("slot is processing task", {
            {"id_slot", slot.id},
            {"id_task", slot.id_task},
        });

        return true;
    }

    void kv_cache_clear() {
        LOG_VERBOSE("clearing KV cache", {});

        // clear the entire KV cache
        llama_kv_cache_clear(ctx);
        clean_kv_cache = false;
    }

    void system_prompt_update() {
        LOG_VERBOSE("system prompt update", {
            {"system_prompt", system_prompt},
        });

        kv_cache_clear();
        system_tokens.clear();

        if (!system_prompt.empty()) {
            system_tokens = ::llama_tokenize(ctx, system_prompt, true);

            llama_batch_clear(batch);

            for (int i = 0; i < (int) system_tokens.size(); ++i) {
                llama_batch_add(batch, system_tokens[i], i, { 0 }, false);
            }

            const int32_t n_batch = llama_n_batch(ctx);

            for (int32_t i = 0; i < batch.n_tokens; i += n_batch) {
                // use the context's batch size here as well, so the chunk length matches the loop stride
                const int32_t n_tokens = std::min(n_batch, batch.n_tokens - i);

                llama_batch batch_view = {
                    n_tokens,
                    batch.token    + i,
                    nullptr,
                    batch.pos      + i,
                    batch.n_seq_id + i,
                    batch.seq_id   + i,
                    batch.logits   + i,
                    0, 0, 0, // unused
                };

                if (llama_decode(ctx, batch_view) != 0) {
                    LOG_ERROR("llama_decode() failed", {});
                    return;
                }
            }
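/*
Illustrative sketch, not part of the server: the loop above walks the system
prompt batch in strides of n_batch and decodes one view per stride. The
standalone helper below shows just the offset/length arithmetic. The name
split_into_chunks is hypothetical and no llama.cpp calls are involved.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Returns (offset, length) pairs that cover n_tokens in steps of at most n_batch.
std::vector<std::pair<int32_t, int32_t>> split_into_chunks(int32_t n_tokens, int32_t n_batch) {
    std::vector<std::pair<int32_t, int32_t>> chunks;
    if (n_batch <= 0) {
        return chunks; // a non-positive stride would never terminate, so bail out
    }
    for (int32_t i = 0; i < n_tokens; i += n_batch) {
        // The last chunk may be shorter than n_batch.
        const int32_t len = std::min(n_batch, n_tokens - i);
        chunks.emplace_back(i, len);
    }
    return chunks;
}
```

The stride and the length cap come from the same value, so consecutive chunks
neither overlap nor leave gaps.
*/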

            // assign the system KV cache to all parallel sequences
            for (int32_t i = 1; i <= params.n_parallel; ++i) {
                llama_kv_cache_seq_cp(ctx, 0, i, -1, -1);
            }
        }

        system_need_update = false;
    }

    void system_prompt_set(const json & sys_props) {
        system_prompt  = sys_props.value("prompt", "");
        name_user      = sys_props.value("anti_prompt", "");
        name_assistant = sys_props.value("assistant_name", "");

        LOG_VERBOSE("system prompt process", {
            {"system_prompt",  system_prompt},
            {"name_user",      name_user},
            {"name_assistant", name_assistant},
        });

        // release all slots
        for (server_slot & slot : slots) {
            slot.release();
        }

        system_need_update = true;
    }

    bool process_token(completion_token_output & result, server_slot & slot) {
        // remember which tokens were sampled - used for repetition penalties during sampling
        const std::string token_str = llama_token_to_piece(ctx, result.tok, false);
        slot.sampled = result.tok;

        // search stop word and delete it
        slot.generated_text += token_str;
        slot.has_next_token = true;

        if (slot.ctx_sampling->params.use_penalty_prompt_tokens && result.tok != -1) {
            // we can change penalty_prompt_tokens because it is always created from scratch each request
            slot.ctx_sampling->params.penalty_prompt_tokens.push_back(result.tok);
        }

        // check if there is incomplete UTF-8 character at the end
        bool incomplete = false;
        for (unsigned i = 1; i < 5 && i <= slot.generated_text.size(); ++i) {
            unsigned char c = slot.generated_text[slot.generated_text.size() - i];
            if ((c & 0xC0) == 0x80) {
                // continuation byte: 10xxxxxx
                continue;
            }
            if ((c & 0xE0) == 0xC0) {
                // 2-byte character: 110xxxxx ...
                incomplete = i < 2;
            } else if ((c & 0xF0) == 0xE0) {
                // 3-byte character: 1110xxxx ...
                incomplete = i < 3;
            } else if ((c & 0xF8) == 0xF0) {
                // 4-byte character: 11110xxx ...
                incomplete = i < 4;
            }
            // else 1-byte character or invalid byte
            break;
        }
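/*
Illustrative sketch, not part of the server: the loop above scans at most four
bytes backwards from the end of the generated text, skips continuation bytes,
and compares the lead byte's expected length with the number of bytes actually
present. The standalone function below repeats that check on a plain
std::string so it can be tried in isolation. The name
ends_with_incomplete_utf8 is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true when the text ends in a truncated multi-byte UTF-8 sequence.
bool ends_with_incomplete_utf8(const std::string & text) {
    bool incomplete = false;
    for (std::size_t i = 1; i < 5 && i <= text.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(text[text.size() - i]);
        if ((c & 0xC0) == 0x80) {
            continue; // continuation byte 10xxxxxx: keep walking backwards
        }
        if ((c & 0xE0) == 0xC0) {
            incomplete = i < 2; // lead byte of a 2-byte sequence
        } else if ((c & 0xF0) == 0xE0) {
            incomplete = i < 3; // lead byte of a 3-byte sequence
        } else if ((c & 0xF8) == 0xF0) {
            incomplete = i < 4; // lead byte of a 4-byte sequence
        }
        // Any other byte is ASCII or invalid, so the scan stops either way.
        break;
    }
    return incomplete;
}
```

For example, the first two bytes of a three-byte character are reported as
incomplete, while the full three bytes are not, which is why the server holds
back output until the remaining bytes arrive.
*/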

        if (!incomplete) {
            size_t pos = std::min(slot.n_sent_text, slot.generated_text.size());

            const std::string str_test = slot.generated_text.substr(pos);
            bool is_stop_full = false;

            size_t stop_pos = slot.find_stopping_strings(str_test, token_str.size(), STOP_TYPE_FULL);
            if (stop_pos != std::string::npos) {
                is_stop_full = true;
                slot.generated_text.erase(
                    slot.generated_text.begin() + pos + stop_pos,
                    slot.generated_text.end());
                pos = std::min(slot.n_sent_text, slot.generated_text.size());
            } else {
                is_stop_full = false;
                stop_pos = slot.find_stopping_strings(str_test, token_str.size(), STOP_TYPE_PARTIAL);
2023-06-17 13:53:04 +02:00
            }
2023-09-28 18:04:36 +02:00
2023-10-22 21:53:08 +02:00
            // check if there is any token to predict
2024-03-07 10:41:53 +01:00
            if (stop_pos == std::string::npos || (!slot.has_next_token && !is_stop_full && stop_pos > 0)) {
2023-10-22 21:53:08 +02:00
                // do not send the stop word in the response
                result.text_to_send = slot.generated_text.substr(pos, std::string::npos);
2024-02-29 21:42:11 +01:00
                slot.n_sent_text += result.text_to_send.size();
2023-10-22 21:53:08 +02:00
                // add the token to slot queue and cache
            }
2024-03-07 10:41:53 +01:00
2023-10-22 21:53:08 +02:00
            slot.add_token_string(result);
2024-03-07 10:41:53 +01:00
            if (slot.params.stream) {
2023-10-22 21:53:08 +02:00
                send_partial_response(slot, result);
2023-06-17 13:53:04 +02:00
}
2023-05-21 19:51:18 +02:00
}
2023-06-17 13:53:04 +02:00
2024-03-07 10:41:53 +01:00
        if (incomplete) {
2023-10-22 21:53:08 +02:00
            slot.has_next_token = true;
2023-06-20 00:12:39 +02:00
        }
2023-10-22 21:53:08 +02:00
        // check the limits
2024-03-07 10:41:53 +01:00
        if (slot.n_decoded > 0 && slot.has_next_token && !slot.has_budget(params)) {
            slot.stopped_limit = true;
2023-10-22 21:53:08 +02:00
            slot.has_next_token = false;
2024-03-07 10:41:53 +01:00
            LOG_VERBOSE("stopped by limit", {
                {"id_slot", slot.id},
2024-03-09 10:30:04 +01:00
                {"id_task", slot.id_task},
2024-03-07 10:41:53 +01:00
                {"n_decoded", slot.n_decoded},
                {"n_predict", slot.params.n_predict},
            });
2023-10-22 21:53:08 +02:00
        }
2023-06-17 13:53:04 +02:00
2024-04-21 13:50:41 +02:00
        if (llama_token_is_eog(model, result.tok)) {
2024-03-07 10:41:53 +01:00
            slot.stopped_eos = true;
2023-10-22 21:53:08 +02:00
            slot.has_next_token = false;
2024-03-07 10:41:53 +01:00
2023-10-22 21:53:08 +02:00
            LOG_VERBOSE("eos token found", {});
        }
2024-04-26 12:15:30 +02:00
        auto n_ctx_train = llama_n_ctx_train(model);
2024-04-27 17:50:48 +02:00
        if (slot.params.n_predict < 1 && slot.n_predict < 1 && slot.ga_n == 1
2024-04-26 12:15:30 +02:00
                && slot.n_prompt_tokens + slot.n_decoded >= n_ctx_train) {
            LOG_WARNING("n_predict is not set and self-context extend is disabled."
                        " Limiting generated tokens to n_ctx_train to avoid EOS-less generation infinite loop", {
                {"id_slot", slot.id},
                {"params.n_predict", slot.params.n_predict},
                {"slot.n_prompt_tokens", slot.n_prompt_tokens},
                {"slot.n_decoded", slot.n_decoded},
                {"slot.n_predict", slot.n_predict},
                {"n_slots", params.n_parallel},
                {"slot.n_ctx", slot.n_ctx},
                {"n_ctx", n_ctx},
                {"n_ctx_train", n_ctx_train},
                {"ga_n", slot.ga_n},
            });
            slot.truncated = true;
            slot.stopped_limit = true;
            slot.has_next_token = false; // stop prediction
        }
2023-10-22 21:53:08 +02:00
        LOG_VERBOSE("next token", {
2024-03-09 10:30:04 +01:00
            {"id_slot", slot.id},
            {"id_task", slot.id_task},
2024-03-07 10:41:53 +01:00
            {"token", result.tok},
            {"token_text", tokens_to_output_formatted_string(ctx, result.tok)},
            {"has_next_token", slot.has_next_token},
            {"n_remain", slot.n_remaining},
            {"n_decoded", slot.n_decoded},
            {"stopped_eos", slot.stopped_eos},
            {"stopped_word", slot.stopped_word},
            {"stopped_limit", slot.stopped_limit},
            {"stopping_word", slot.stopping_word},
        });
2023-10-22 21:53:08 +02:00
        return slot.has_next_token; // continue
    }
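The stop-string handling in the function above distinguishes a full match (the stop word is cut from the generated text) from a partial match (the tail of the text could still grow into a stop word, so it is held back). The following is a minimal, simplified sketch of that idea, not the server's own `find_stopping_strings`: the helper names `find_full_stop`, `find_partial_stop`, and `sendable_length` are hypothetical, and unlike the code above, the sketch releases the text that precedes a partial match instead of holding back the whole pending chunk.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: earliest position at which any stop string occurs in full,
// or std::string::npos if none does.
inline size_t find_full_stop(const std::string & text, const std::vector<std::string> & stops) {
    size_t best = std::string::npos;
    for (const auto & stop : stops) {
        if (stop.empty()) {
            continue;
        }
        // npos is the largest size_t, so std::min keeps the earliest real match
        best = std::min(best, text.find(stop));
    }
    return best;
}

// Hypothetical helper: earliest position at which the tail of `text` is a proper,
// non-empty prefix of some stop string, or std::string::npos if there is none.
inline size_t find_partial_stop(const std::string & text, const std::vector<std::string> & stops) {
    size_t best = std::string::npos;
    for (const auto & stop : stops) {
        if (stop.empty()) {
            continue;
        }
        const size_t max_len = std::min(text.size(), stop.size() - 1);
        // try the longest candidate prefix first
        for (size_t len = max_len; len > 0; --len) {
            if (text.compare(text.size() - len, len, stop, 0, len) == 0) {
                best = std::min(best, text.size() - len);
                break;
            }
        }
    }
    return best;
}

// Number of leading characters of `pending` that can be streamed to the client now.
inline size_t sendable_length(const std::string & pending, const std::vector<std::string> & stops) {
    const size_t full = find_full_stop(pending, stops);
    if (full != std::string::npos) {
        return full; // everything before the stop word
    }
    const size_t partial = find_partial_stop(pending, stops);
    return partial != std::string::npos ? partial : pending.size();
}
```

With the stop strings `</s>` and `User:`, the pending text `Hello Us` yields a sendable length of 6, because the trailing `Us` might still become `User:`.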
2023-08-08 15:29:19 +02:00
2024-03-07 10:41:53 +01:00
    json get_formated_generation(const server_slot & slot) const {
2023-10-23 21:40:03 +02:00
        const auto eos_bias = slot.sparams.logit_bias.find(llama_token_eos(model));
2024-03-07 10:41:53 +01:00
        const bool ignore_eos = eos_bias != slot.sparams.logit_bias.end() && eos_bias->second < 0.0f && std::isinf(eos_bias->second);
2024-02-16 12:33:25 +01:00
        std::vector<std::string> samplers_sequence;
2024-03-07 10:41:53 +01:00
        samplers_sequence.reserve(slot.sparams.samplers_sequence.size());
        for (const auto & sampler_type : slot.sparams.samplers_sequence) {
2024-02-16 12:33:25 +01:00
            samplers_sequence.emplace_back(sampler_type_to_name_string(sampler_type));
        }
2023-10-22 21:53:08 +02:00
        return json {
2024-03-07 10:41:53 +01:00
            {"n_ctx", slot.n_ctx},
            {"n_predict", slot.n_predict},
            {"model", params.model_alias},
            {"seed", slot.params.seed},
            {"temperature", slot.sparams.temp},
            {"dynatemp_range", slot.sparams.dynatemp_range},
            {"dynatemp_exponent", slot.sparams.dynatemp_exponent},
            {"top_k", slot.sparams.top_k},
            {"top_p", slot.sparams.top_p},
            {"min_p", slot.sparams.min_p},
            {"tfs_z", slot.sparams.tfs_z},
            {"typical_p", slot.sparams.typical_p},
            {"repeat_last_n", slot.sparams.penalty_last_n},
            {"repeat_penalty", slot.sparams.penalty_repeat},
            {"presence_penalty", slot.sparams.penalty_present},
            {"frequency_penalty", slot.sparams.penalty_freq},
            {"penalty_prompt_tokens", slot.sparams.penalty_prompt_tokens},
2023-12-23 10:31:49 +01:00
            {"use_penalty_prompt_tokens", slot.sparams.use_penalty_prompt_tokens},
2024-03-07 10:41:53 +01:00
            {"mirostat", slot.sparams.mirostat},
            {"mirostat_tau", slot.sparams.mirostat_tau},
            {"mirostat_eta", slot.sparams.mirostat_eta},
            {"penalize_nl", slot.sparams.penalize_nl},
            {"stop", slot.params.antiprompt},
2024-03-13 18:54:21 +01:00
            {"n_predict", slot.params.n_predict}, // TODO: fix duplicate key n_predict
2024-03-22 12:12:05 +01:00
            {"n_keep", slot.params.n_keep},
2024-03-26 09:47:43 +01:00
            {"n_discard", slot.params.n_discard},
2024-03-07 10:41:53 +01:00
            {"ignore_eos", ignore_eos},
            {"stream", slot.params.stream},
            {"logit_bias", slot.sparams.logit_bias},
            {"n_probs", slot.sparams.n_probs},
            {"min_keep", slot.sparams.min_keep},
            {"grammar", slot.sparams.grammar},
            {"samplers", samplers_sequence}
2023-10-22 21:53:08 +02:00
        };
    }
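The `ignore_eos` flag reported above is not stored directly; it is derived from the logit bias table by checking whether the EOS token carries a bias of negative infinity. A small standalone sketch of that check follows. The alias `logit_bias_map` and the function `eos_is_ignored` are hypothetical names, and the table is assumed to be a map from token id to bias.

```cpp
#include <cmath>
#include <unordered_map>

// Hypothetical stand-in for the per-request logit bias table (token id -> bias).
using logit_bias_map = std::unordered_map<int, float>;

// Returns true only when the EOS token has a bias of negative infinity,
// which makes it impossible to sample. A finite negative bias merely
// discourages EOS and therefore does not count.
inline bool eos_is_ignored(const logit_bias_map & bias, int eos_token) {
    const auto it = bias.find(eos_token);
    return it != bias.end() && it->second < 0.0f && std::isinf(it->second);
}
```

The sign test matters because `std::isinf` is also true for positive infinity, which would force EOS rather than suppress it.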
2024-03-11 10:56:41 +01:00
    void send_error(const server_task & task, const std::string & error, const enum error_type type = ERROR_TYPE_SERVER) {
        send_error(task.id, task.id_multi, error, type);
    }
    void send_error(const server_slot & slot, const std::string & error, const enum error_type type = ERROR_TYPE_SERVER) {
        send_error(slot.id_task, slot.id_multi, error, type);
    }
    void send_error(const int id_task, const int id_multi, const std::string & error, const enum error_type type = ERROR_TYPE_SERVER) {
2024-04-12 13:49:21 +02:00
        LOG_ERROR("task error", {
            {"id_multi", id_multi},
            {"id_task", id_task},
            {"error", error},
        });
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
        server_task_result res;
2024-03-11 10:56:41 +01:00
        res.id = id_task;
        res.id_multi = id_multi;
2024-03-07 10:41:53 +01:00
        res.stop = false;
        res.error = true;
2024-03-11 10:56:41 +01:00
        res.data = format_error_response(error, type);
2024-03-07 10:41:53 +01:00
        queue_results.send(res);
    }
    void send_partial_response(server_slot & slot, completion_token_output tkn) {
        server_task_result res;
        res.id = slot.id_task;
        res.id_multi = slot.id_multi;
        res.error = false;
        res.stop = false;
        res.data = json {
2023-10-22 21:53:08 +02:00
            {"content", tkn.text_to_send},
            {"stop", false},
2024-03-07 10:41:53 +01:00
            {"id_slot", slot.id},
            {"multimodal", false}
2023-10-22 21:53:08 +02:00
        };
2024-03-07 10:41:53 +01:00
        if (slot.sparams.n_probs > 0) {
2023-10-22 21:53:08 +02:00
            const std::vector<llama_token> to_send_toks = llama_tokenize(ctx, tkn.text_to_send, false);
2024-03-07 10:41:53 +01:00
            const size_t probs_pos = std::min(slot.n_sent_token_probs, slot.generated_token_probs.size());
            const size_t probs_stop_pos = std::min(slot.n_sent_token_probs + to_send_toks.size(), slot.generated_token_probs.size());
            std::vector<completion_token_output> probs_output;
            if (probs_pos < probs_stop_pos) {
                probs_output = std::vector<completion_token_output>(
                    slot.generated_token_probs.begin() + probs_pos,
                    slot.generated_token_probs.begin() + probs_stop_pos);
2023-07-02 23:38:44 +02:00
            }
2024-02-29 21:42:11 +01:00
            slot.n_sent_token_probs = probs_stop_pos;
2024-03-07 10:41:53 +01:00
            res.data["completion_probabilities"] = probs_vector_to_json(ctx, probs_output);
2023-10-22 21:53:08 +02:00
        }
2023-08-08 15:29:19 +02:00
2024-03-07 10:41:53 +01:00
        if (slot.oaicompat) {
            res.data["oaicompat_token_ctr"] = slot.n_decoded;
            res.data["model"] = slot.oaicompat_model;
2023-11-25 10:29:06 +01:00
        }
2024-01-26 13:42:20 +01:00
        queue_results.send(res);
2023-10-22 21:53:08 +02:00
    }
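The probability window in `send_partial_response` clamps both ends of the range to the size of the stored vector before slicing, so a chunk that tokenizes to more entries than are available never reads past the end. The pattern can be sketched in isolation as below; `take_unsent` is a hypothetical helper written for illustration and is not part of the server code.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the next `n_new` not-yet-sent elements of `all`,
// clamped to what is actually stored, and advances the `n_sent` cursor.
template <typename T>
std::vector<T> take_unsent(const std::vector<T> & all, size_t & n_sent, size_t n_new) {
    const size_t first = std::min(n_sent, all.size());
    const size_t last  = std::min(n_sent + n_new, all.size());
    std::vector<T> out;
    if (first < last) {
        out.assign(all.begin() + first, all.begin() + last);
    }
    n_sent = last; // mirrors slot.n_sent_token_probs = probs_stop_pos
    return out;
}
```

Calling it again once the cursor has reached the end returns an empty vector and leaves the cursor unchanged.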
2024-03-07 10:41:53 +01:00
    void send_final_response(const server_slot & slot) {
        server_task_result res;
        res.id = slot.id_task;
        res.id_multi = slot.id_multi;
        res.error = false;
        res.stop = true;
        res.data = json {
2023-10-22 21:53:08 +02:00
            {"content", !slot.params.stream ? slot.generated_text : ""},
2024-03-07 10:41:53 +01:00
            {"id_slot", slot.id},
2023-10-22 21:53:08 +02:00
            {"stop", true},
            {"model", params.model_alias},
            {"tokens_predicted", slot.n_decoded},
2024-02-29 21:42:11 +01:00
            {"tokens_evaluated", slot.n_prompt_tokens},
2023-10-22 21:53:08 +02:00
            {"generation_settings", get_formated_generation(slot)},
            {"prompt", slot.prompt},
            {"truncated", slot.truncated},
            {"stopped_eos", slot.stopped_eos},
            {"stopped_word", slot.stopped_word},
            {"stopped_limit", slot.stopped_limit},
            {"stopping_word", slot.stopping_word},
            {"tokens_cached", slot.n_past},
            {"timings", slot.get_formated_timings()}
        };
2024-03-07 10:41:53 +01:00
        if (slot.sparams.n_probs > 0) {
            std::vector<completion_token_output> probs;
            if (!slot.params.stream && slot.stopped_word) {
2023-10-22 21:53:08 +02:00
                const std::vector<llama_token> stop_word_toks = llama_tokenize(ctx, slot.stopping_word, false);
2024-03-07 10:41:53 +01:00
2023-10-22 21:53:08 +02:00
                probs = std::vector<completion_token_output>(
2024-03-07 10:41:53 +01:00
                    slot.generated_token_probs.begin(),
                    slot.generated_token_probs.end() - stop_word_toks.size());
            } else {
                probs = std::vector<completion_token_output>(
                    slot.generated_token_probs.begin(),
                    slot.generated_token_probs.end());
2023-10-05 16:02:55 +02:00
            }
2024-03-07 10:41:53 +01:00
            res.data["completion_probabilities"] = probs_vector_to_json(ctx, probs);
2023-05-21 19:51:18 +02:00
        }
2024-03-07 10:41:53 +01:00
        if (slot.oaicompat) {
            res.data["oaicompat_token_ctr"] = slot.n_decoded;
            res.data["model"] = slot.oaicompat_model;
2023-11-25 10:29:06 +01:00
        }
2024-01-26 13:42:20 +01:00
        queue_results.send(res);
2023-10-22 21:53:08 +02:00
    }
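When generation ends on a stop word in non-streaming mode, `send_final_response` drops the trailing probability entries that belong to the stop word by subtracting its token count from `end()`. A defensive variant of that trimming step might look like the sketch below. It assumes, purely for illustration, that the stop word could tokenize to more entries than are stored; `drop_trailing` is a hypothetical name and not part of the server code.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: copy of `all` without its last `n_drop` elements.
// The count is clamped so the end iterator never moves before begin().
template <typename T>
std::vector<T> drop_trailing(const std::vector<T> & all, size_t n_drop) {
    const size_t keep = all.size() - std::min(n_drop, all.size());
    return std::vector<T>(all.begin(), all.begin() + keep);
}
```

Dropping more elements than the vector holds simply yields an empty result.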
2023-06-17 13:53:04 +02:00
2024-03-07 10:41:53 +01:00
    void send_embedding(const server_slot & slot, const llama_batch & batch) {
        server_task_result res;
        res.id = slot.id_task;
        res.id_multi = slot.id_multi;
        res.error = false;
        res.stop = true;
2023-10-22 21:53:08 +02:00
        const int n_embd = llama_n_embd(model);
2024-03-04 21:31:20 +01:00
2024-03-09 13:27:58 +01:00
        std::vector<float> embd_res(n_embd, 0.0f);
2024-03-07 10:41:53 +01:00
        for (int i = 0; i < batch.n_tokens; ++i) {
llama : support Mamba Selective State Space Models (#5328)
* mamba : begin working on support for Mamba SSM
* mamba : begin figuring out how to (ab)use the kv cache for Mamba
* mamba : recurrent inference almost works, but incoherent
* mamba : recurrent inference WORKS!!!
* convert : optionally use d_conv and d_state from config.json for Mamba
* mamba : refactor recurrent conv, resulting in 20% perf increase
It's still slower than I'd like, but I did not really optimize `ggml_exp` yet.
I also refactored `ggml_exp` to work with tensors with more than 2 dimensions.
* ggml : parallelize ggml_exp
This results in 8% faster token generation for Mamba-130M.
* mamba : simplify the conv step with a self-overlapping view
Turns out the conv_state can be made smaller by one column.
Note that this breaks existing GGUFs of Mamba,
because the key_value_length field is tied to the conv_state size.
Convolution with a self-overlapping view is cool!
And it's much simpler than what I initially thought would be necessary
to make the convolution step work with more than 1 token at a time.
Next step is to make the SSM step work on batches of tokens too,
and thus I need to figure out a way to make a parallel selective scan
which will keep the ssm_state small and won't make it bigger
by a factor of (n_layer * batch_size).
* llama : fix Mamba KV self size wrongly displaying as f16 instead of f32
Relatedly, I also tried to see if other types than f32 worked for the states,
but they don't, because of the operators used.
It's probably better anyway to keep lots of precision there,
since the states are small anyway.
* mamba : fix self-overlapping view depth stride
* mamba : handle batches of more than 1 token
This means running Mamba no longer crashes when using the default settings!
And probably also slightly faster prompt processing.
Both batched and non-batched processing yield the same output.
Previously, the state was not cleared when starting a sequence.
Next step is to make the KV cache API work as expected for Mamba models.
* ggml: add ggml_ssm_scan to help with parallel selective scan
If the selective scan was implemented without a custom operator,
there would be waaay too many nodes in the graph. For example,
for Mamba-130M, with a batch size of 512 (the default),
a naive selective scan could add at least 24*512=12288 nodes,
which is more than LLAMA_MAX_NODES (8192),
and that's only for the smallest Mamba model.
So it's much cleaner with a custom operator.
Not sure about the name, though.
* ggml : in ggml_ssm_scan, merge multiple rows in the same vec operation
This will help with performance on CPU if ggml_vec_mul_f32
and ggml_vec_add_f32 are ever optimized with SIMD.
* mamba : very basic quantization support
Mostly works, but there is currently no difference
between the variants of a k-quant (e.g. Q4_K_S and Q4_K_M are the same).
Most of the SSM-specific weights can be kept in f32 without affecting
the size that much, since they are relatively small.
(the linear projection weights are responsible for most of Mamba's size)
Too much quantization seems to make the state degrade quite fast, and
the model begins to output gibberish.
It seems to affect bigger models to a lesser extent than small models,
but I'm not sure by how much.
Experimentation will be needed to figure out which weights are more important
for the _M (and _L?) variants of k-quants for Mamba.
* convert : fix wrong name for layer norm weight of official Mamba models
I was using Q-bert/Mamba-* models before, which have a slightly different
naming scheme for the weights.
(they start with "model.layers" instead of "backbone.layers")
* mamba : fuse more steps of the SSM scan in the ggml_ssm_scan operator
This increases performance on CPU by around 30% for prompt processing,
and by around 20% for text generation.
However, it also makes the ggml_exp and ggml_soft_plus operators unused.
Whether or not they should be kept will be decided later.
* convert : for Mamba, also consider the "MambaLMHeadModel" arch name
It's the name of the class of the official implementation,
though they don't use it (yet) in the "architectures" field of config.json
* mamba : fix vocab size problems with official models
The perplexity was waaaay too high for models with a non-round vocab size.
Not sure why, but it needed to be fixed in the metadata.
Note that this breaks existing GGUF-converted Mamba models,
but **only if** the vocab size was not already rounded.
* ggml : remove ggml_exp and ggml_soft_plus
They did not exist anyway outside of this branch,
and since ggml_ssm_scan fused operations together, they are unused.
It's always possible to bring them back if needed.
* mamba : remove some useless comments
No code change.
* convert : fix flake8 linter errors
* mamba : apply suggestions from code review
* mamba : remove unnecessary branch for row-wise ssm_state and C multiplication
It was previously done to avoid permuting when only one token is processed
at a time (like when generating text), but permuting is cheap,
and dynamically changing the compute graph is not future-proof.
* ggml : in ggml_ssm_scan, use more appropriate asserts
* ggml : rename the destination pointer in ggml_compute_forward_ssm_scan_f32
* mamba : multiple sequences, but one at a time
This is a step towards making this Mamba implementation usable
with the server example (the way the system prompt is kept when clearing
the client slots will need to be changed before this can work, though).
The KV cache size for this kind of model is tied to the maximum number
of sequences kept at any single time.
For now, this number is obtained from n_parallel (plus one,
to have an extra sequence to dedicate to the system prompt),
but there might be a better way to do this which won't also
make the main example use 2 cells even if only 1 is really used.
(for this specific case, --parallel 0 helps)
Simultaneous sequence processing will probably require changes to
ggml_ssm_scan, and possibly a new operator for the conv step.
* mamba : support llama_kv_cache_seq_cp
This (mis)uses the logic around K shifts, because tokens in a state
can't be shifted anyway, and because inp_K_shift has the right shape and type.
Using ggml_get_rows is a nice way to do copies, but copy chains can't work.
Fortunately, copy chains don't really seem to be used in the examples.
Each KV cell is dedicated to the sequence ID corresponding to its own index.
* mamba : use a state mask
It's cleaner than the previous heuristic of
checking for the pos of the first token in the batch.
inp_KQ_mask could not be re-used for this, because it has the wrong shape
and because it seems more suited to the next step of
simultaneous sequence processing (helping with the problem of
remembering which token belongs to which sequence(s)/state(s)).
* llama : replace the usage of n_ctx with kv_self.size in many places
* mamba : use n_tokens directly instead of n_tok
* mamba : in comments, properly refer to KV cells instead of slots
* mamba : reduce memory usage of ggml_ssm_scan
From 290.37 MiB to 140.68 MiB of CPU compute buffer size
with Mamba 3B with a batch size of 512.
The result tensor of ggml_ssm_scan was previously a big part
of the CPU compute buffer size. To make it smaller,
it does not contain the intermediate ssm states anymore.
Both y and the last ssm state are combined in the result tensor,
because it seems only a single tensor can be returned by an operator
with the way the graph is built.
* mamba : simultaneous sequence processing
A batch can now contain tokens from multiple sequences.
This is necessary for at least the parallel example, the server example,
and the HellaSwag test in the perplexity example.
However, for this to be useful, uses of llama_kv_cache_seq_rm/cp
will need to be changed to work on whole sequences.
* ggml : add ggml_ssm_conv as a new operator for the conv step of Mamba
This operator makes it possible to use and update the correct states
for each token of the batch in the same way as ggml_ssm_scan.
Other solutions which use existing operators would need loops which would
add too many nodes to the graph (at least the ones I thought of).
Using this operator further reduces the size of the CPU compute buffer
from 140.68 MiB to 103.20 MiB with Mamba 3B with a batch size of 512.
And (at least on CPU), it's a bit faster than before.
Note that "ggml_ssm_conv" is probably not the most appropriate name,
and it could be changed if a better one is found.
* llama : add inp_s_seq as a new input tensor
The most convenient implementation to select the correct state (for Mamba)
for each token is to directly get the correct index from a tensor.
This is why inp_s_seq is storing int32_t and not floats.
The other, less convenient way to select the correct state would be
to have inp_KQ_mask contain 1.0f for each state used by a token
and 0.0f otherwise. This complicates quickly fetching the first used
state of a token, and is also less efficient because a whole row
of the mask would always need to be read for each token.
Using indexes makes it easy to stop searching when there are
no more sequences for a token, and the first sequence assigned
is always very quickly available (it's the first element of each row).
* mamba : support llama_kv_cache_seq_cp copy chains
* mamba : support shifting and dividing the kv cache pos
* mamba : make the server and parallel examples work with whole sequences
A seq_id is dedicated to the system prompt in both cases.
* llama : make llama_kv_cache_seq_rm return whether it succeeded or not
* mamba : dedicate an input tensor for state copy indices
This is cleaner and makes it easier to adapt when/if token positions
(and by extension, inp_K_shift) are no longer integers.
* mamba : adapt perplexity, batched, and batched-bench examples
* perplexity : limit the max number of sequences
This adapts to what the loaded model can provide.
* llama : add llama_n_max_seq to get the upper limit for seq_ids
Used by the perplexity example.
* batched : pass n_parallel to the model's context params
This should have been there already, but it wasn't.
* batched-bench : reserve sequences to support Mamba
* batched-bench : fix tokens being put in wrong sequences
Generation quality isn't what's measured in there anyway,
but at least using the correct sequences avoids using non-consecutive
token positions.
* mamba : stop abusing attention metadata
This breaks existing converted-to-GGUF Mamba models,
but will allow supporting mixed architectures like MambaFormer
without needing to break Mamba models.
This will also allow changing the size of Mamba's states
without having to reconvert models in the future.
(e.g. using something other than d_conv - 1 columns for the conv_states
will not require breaking existing converted Mamba models again)
* gguf-py : add new KV metadata key-value pairs for Mamba
* llama : add new metadata key-value pairs for Mamba
* llama : guard against divisions by zero when n_head is 0
* mamba : rename "unlimited" KV cache property to "recurrent"
* mamba : more correctly update the "used" field of the KV cache
* ggml : in ggml_ssm_scan, use a threshold for soft_plus
This is how the official Mamba implementation does it,
and it's also what torch.nn.Softplus does.
* convert : for Mamba, fallback to internal NeoX tokenizer
The resulting models are exactly the same
as if the tokenizer.json and tokenizer_config.json of GPT-NeoX were there.
* mamba : support state saving and restoring
* ggml : implicitly pass src tensors through dst for Mamba-related ops
* mamba : clarify some comments
* server : fix cache_tokens not getting correctly resized
Otherwise, when the "we have to evaluate at least 1 token" special case
was triggered, an extra token was kept in cache_tokens even if it was
removed from the KV cache.
For Mamba, this caused useless prompt reprocessing when the previous
request triggered the above case.
* convert-hf : support new metadata keys for Mamba
For the models available at
https://huggingface.co/collections/state-spaces/transformers-compatible-mamba-65e7b40ab87e5297e45ae406
* mamba : rename metadata to be more similar to transformers library
This breaks existing converted-to-GGUF models,
but the metadata names are more "standard".
* mamba : support mamba-*-hf models
These models share their token_embd.weight with their output.weight
* mamba : add missing spaces
This is purely a formatting change.
* convert-hf : omit output.weight when identical with token_embd.weight
Only for Mamba for now, but it might be relevant for other models eventually.
Most Mamba models actually share these two tensors, albeit implicitly.
* readme : add Mamba to supported models, and add recent API changes
* mamba : move state_seq and state_mask views outside layer loop
A few tensors were also missing `struct` in front of `ggml_tensor`.
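The soft_plus threshold mentioned in the `ggml_ssm_scan` item above can be sketched as follows. This is a minimal illustration and not the ggml source: the function name `softplus_thresholded` is hypothetical, and the cutoff of 20 is assumed from the default `threshold` of torch.nn.Softplus. Above the cutoff, log(1 + exp(x)) differs from x by less than float precision, so x is returned directly and the exp() call is skipped.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of ggml). The default cutoff of 20 is an
// assumption taken from the default `threshold` of torch.nn.Softplus.
inline float softplus_thresholded(float x, float threshold = 20.0f) {
    assert(threshold > 0.0f);
    // For large x, log1p(exp(x)) equals x to within float precision,
    // so the linear branch avoids evaluating exp() on large inputs.
    if (x > threshold) {
        return x;
    }
    return std::log1p(std::exp(x));
}
```

For small inputs the function behaves like the plain softplus; for inputs above the cutoff it is the identity.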
            if (!batch.logits[i] || batch.seq_id[i][0] != slot.id + 1) {
                continue;
            }

            const float * embd = llama_get_embeddings_seq(ctx, batch.seq_id[i][0]);
            if (embd == NULL) {
                embd = llama_get_embeddings_ith(ctx, i);
            }

            if (embd == NULL) {
                LOG_ERROR("failed to get embeddings", {
                    {"token", batch.token[i]},
                    {"seq_id", batch.seq_id[i][0]}
                });

                res.data = json {
                    {"embedding", std::vector<float>(n_embd, 0.0f)},
                };

                continue;
            }

            llama_embd_normalize(embd, embd_res.data(), n_embd);

            res.data = json {
                {"embedding", embd_res},
            };
        }

        queue_results.send(res);
    }

    void request_completion(int id_task, int id_multi, json data, bool infill, bool embedding) {
        server_task task;
        task.id = id_task;
        task.id_multi = id_multi;
        task.id_target = 0;
        task.data = std::move(data);
        task.infill = infill;
        task.embedding = embedding;
        task.type = SERVER_TASK_TYPE_COMPLETION;

        // when a completion task's prompt array is not a singleton, we split it into multiple requests
        // otherwise, it's a single-prompt task, we actually queue it
        // if there are numbers in the prompt array, it is treated as an array of tokens
        if (task.data.count("prompt") != 0 && task.data.at("prompt").size() > 1) {
            bool numbers = false;
            for (const auto & e : task.data.at("prompt")) {
                if (e.is_number()) {
                    numbers = true;
                    break;
                }
            }

            // NOTE: split_multiprompt_task() does not handle a mix of strings and numbers,
            // it will completely stall the server. I don't know where the bug for this is.
            //
            // if there are numbers, it needs to be treated like a single prompt,
            // queue_tasks handles a mix of strings and numbers just fine.
            if (numbers) {
                queue_tasks.post(task);
            } else {
                split_multiprompt_task(id_task, task);
            }
        } else {
            queue_tasks.post(task);
        }
    }
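
The routing decision made in `request_completion` can be isolated into a small standalone sketch. The names `prompt_elem`, `prompt_route`, and `route_prompt` are hypothetical and exist only for illustration; a `std::variant` stands in for the JSON array elements so that the example compiles without the JSON library, and it models only the case where `prompt` is an array.

```cpp
#include <cassert>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-in for one element of the JSON "prompt" array:
// either a text fragment or a token id.
using prompt_elem = std::variant<std::string, int>;

enum class prompt_route { single, split };

// An array with more than one element is split into subtasks only when
// it contains no numbers; everything else is queued as a single prompt.
inline prompt_route route_prompt(const std::vector<prompt_elem> & prompt) {
    if (prompt.size() <= 1) {
        return prompt_route::single;
    }
    for (const auto & e : prompt) {
        if (std::holds_alternative<int>(e)) {
            return prompt_route::single; // token ids, or a mix of strings and ids
        }
    }
    return prompt_route::split; // all strings: one subtask per string
}
```

Under these assumptions, `{"a", "b"}` is split into two subtasks, while `{"a", 42}` and `{1, 2, 3}` are each queued as one prompt.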

    void request_cancel(int id_task) {
        server_task task;
        task.type = SERVER_TASK_TYPE_CANCEL;
        task.id_target = id_task;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
        queue_tasks.post(task);
    }

    void split_multiprompt_task(int id_multi, const server_task & multiprompt_task) {
        const int prompt_count = multiprompt_task.data.at("prompt").size();
        if (prompt_count <= 1) {
            send_error(multiprompt_task, "error while handling multiple prompts");
            return;
        }

        // generate the IDs for all subtasks
        std::vector<int> subtask_ids(prompt_count);
        for (int i = 0; i < prompt_count; i++) {
            subtask_ids[i] = queue_tasks.get_new_id();
        }

        // queue up the multitask so we can track its subtask progression
        queue_tasks.add_multitask(id_multi, subtask_ids);

        // add subtasks
        for (int i = 0; i < prompt_count; i++) {
            json subtask_data = multiprompt_task.data;
            subtask_data["prompt"] = subtask_data["prompt"][i];

            // subtasks inherit everything else (infill mode, embedding mode, etc.)
            request_completion(subtask_ids[i], id_multi, subtask_data, multiprompt_task.infill, multiprompt_task.embedding);
        }
    }

    void process_single_task(const server_task & task) {
        switch (task.type) {
            case SERVER_TASK_TYPE_COMPLETION:
                {
                    server_slot * slot = get_slot(json_value(task.data, "id_slot", -1));
                    if (slot == nullptr) {
                        // if no slot is available, we defer this task for processing later
                        LOG_VERBOSE("no slot is available", {{"id_task", task.id}});
                        queue_tasks.defer(task);
                        break;
                    }

                    if (task.data.contains("system_prompt")) {
                        system_prompt_set(task.data["system_prompt"]);

                        for (server_slot & slot : slots) {
                            slot.n_past = 0;
                            slot.n_past_se = 0;
                        }
                    }

                    slot->reset();

                    slot->id_task = task.id;
                    slot->id_multi = task.id_multi;
                    slot->infill = task.infill;
                    slot->embedding = task.embedding;

                    if (!launch_slot_with_task(*slot, task)) {
                        LOG_ERROR("error while launching slot", task.data);
                        break;
                    }
                } break;
            case SERVER_TASK_TYPE_CANCEL:
                {
                    // release slot linked with the task id
                    for (auto & slot : slots) {
                        if (slot.id_task == task.id_target) {
                            slot.release();
                            break;
                        }
                    }
                } break;
            case SERVER_TASK_TYPE_NEXT_RESPONSE:
                {
                    // do nothing
                } break;
            case SERVER_TASK_TYPE_METRICS:
                {
                    json slots_data = json::array();

                    int n_idle_slots = 0;
                    int n_processing_slots = 0;

                    for (server_slot & slot : slots) {
                        json slot_data = get_formated_generation(slot);
                        slot_data["id"] = slot.id;
                        slot_data["id_task"] = slot.id_task;
                        slot_data["state"] = slot.state;
                        slot_data["prompt"] = slot.prompt;
                        slot_data["next_token"] = {
                            {"has_next_token", slot.has_next_token},
                            {"n_remain", slot.n_remaining},
                            {"n_decoded", slot.n_decoded},
                            {"stopped_eos", slot.stopped_eos},
                            {"stopped_word", slot.stopped_word},
                            {"stopped_limit", slot.stopped_limit},
                            {"stopping_word", slot.stopping_word},
                        };

                        if (slot_data["state"] == SLOT_STATE_IDLE) {
                            n_idle_slots++;
                        } else {
                            n_processing_slots++;
                        }

                        slots_data.push_back(slot_data);
                    }

                    LOG_INFO("slot data", {
                        {"id_task", task.id},
                        {"n_idle_slots", n_idle_slots},
                        {"n_processing_slots", n_processing_slots}
                    });

                    LOG_VERBOSE("slot data", {
                        {"id_task", task.id},
                        {"n_idle_slots", n_idle_slots},
                        {"n_processing_slots", n_processing_slots},
                        {"slots", slots_data}
                    });

                    server_task_result res;
                    res.id = task.id;
                    res.id_multi = task.id_multi;
                    res.stop = true;
                    res.error = false;
                    res.data = {
                        {"idle", n_idle_slots},
                        {"processing", n_processing_slots},
                        {"deferred", queue_tasks.queue_tasks_deferred.size()},
                        {"t_start", metrics.t_start},

                        {"n_prompt_tokens_processed_total", metrics.n_prompt_tokens_processed_total},
                        {"t_tokens_generation_total", metrics.t_tokens_generation_total},
                        {"n_tokens_predicted_total", metrics.n_tokens_predicted_total},
                        {"t_prompt_processing_total", metrics.t_prompt_processing_total},

                        {"n_prompt_tokens_processed", metrics.n_prompt_tokens_processed},
                        {"t_prompt_processing", metrics.t_prompt_processing},
                        {"n_tokens_predicted", metrics.n_tokens_predicted},
                        {"t_tokens_generation", metrics.t_tokens_generation},

                        {"kv_cache_tokens_count", llama_get_kv_cache_token_count(ctx)},
                        {"kv_cache_used_cells", llama_get_kv_cache_used_cells(ctx)},

                        {"slots", slots_data},
                    };

                    if (json_value(task.data, "reset_bucket", false)) {
                        metrics.reset_bucket();
                    }
                    queue_results.send(res);
                } break;
            case SERVER_TASK_TYPE_SLOT_SAVE:
                {
                    int id_slot = task.data["id_slot"];
                    server_slot * slot = get_slot(id_slot);
                    if (slot == nullptr) {
                        send_error(task, "Invalid slot ID", ERROR_TYPE_INVALID_REQUEST);
                        break;
                    }

                    const size_t token_count = slot->cache_tokens.size();
                    const int64_t t_start = ggml_time_us();

                    std::string filename = task.data["filename"];
                    std::string filepath = task.data["filepath"];

                    const size_t nwrite = llama_state_seq_save_file(ctx, filepath.c_str(), slot->id + 1, slot->cache_tokens.data(), token_count);

                    const int64_t t_end = ggml_time_us();
                    const double t_save_ms = (t_end - t_start) / 1000.0;

                    server_task_result result;
                    result.id = task.id;
                    result.stop = true;
                    result.error = false;
                    result.data = json {
                        {"id_slot", id_slot},
                        {"filename", filename},
                        {"n_saved", token_count}, // tokens saved
                        {"n_written", nwrite},    // bytes written
                        {"timings", {
                            {"save_ms", t_save_ms}
                        }}
                    };
                    queue_results.send(result);
                } break;
            case SERVER_TASK_TYPE_SLOT_RESTORE:
                {
                    int id_slot = task.data["id_slot"];
                    server_slot * slot = get_slot(id_slot);
                    if (slot == nullptr) {
                        send_error(task, "Invalid slot ID", ERROR_TYPE_INVALID_REQUEST);
                        break;
                    }

                    const int64_t t_start = ggml_time_us();

                    std::string filename = task.data["filename"];
                    std::string filepath = task.data["filepath"];

                    slot->cache_tokens.resize(slot->n_ctx);
                    size_t token_count = 0;
                    size_t nread = llama_state_seq_load_file(ctx, filepath.c_str(), slot->id + 1, slot->cache_tokens.data(), slot->cache_tokens.size(), &token_count);
                    if (nread == 0) {
                        slot->cache_tokens.resize(0);
                        send_error(task, "Unable to restore slot, no available space in KV cache or invalid slot save file", ERROR_TYPE_INVALID_REQUEST);
                        break;
                    }
                    slot->cache_tokens.resize(token_count);

                    const int64_t t_end = ggml_time_us();
                    const double t_restore_ms = (t_end - t_start) / 1000.0;

                    server_task_result result;
                    result.id = task.id;
                    result.stop = true;
                    result.error = false;
                    result.data = json {
                        {"id_slot", id_slot},
                        {"filename", filename},
                        {"n_restored", token_count}, // tokens restored
                        {"n_read", nread},           // bytes read
                        {"timings", {
                            {"restore_ms", t_restore_ms}
                        }}
                    };
                    queue_results.send(result);
                } break;
            case SERVER_TASK_TYPE_SLOT_ERASE:
                {
                    int id_slot = task.data["id_slot"];
                    server_slot * slot = get_slot(id_slot);
                    if (slot == nullptr) {
                        send_error(task, "Invalid slot ID", ERROR_TYPE_INVALID_REQUEST);
                        break;
                    }

                    // Erase token cache
                    const size_t n_erased = slot->cache_tokens.size();
                    llama_kv_cache_seq_rm(ctx, slot->id + 1, -1, -1);
                    slot->cache_tokens.clear();

                    server_task_result result;
                    result.id = task.id;
                    result.stop = true;
                    result.error = false;
                    result.data = json {
                        {"id_slot", id_slot},
                        {"n_erased", n_erased}
                    };
                    queue_results.send(result);
                } break;
        }
    }

    void on_finish_multitask(const server_task_multi & multitask) {
        // all subtasks done == multitask is done
        server_task_result result;
        result.id = multitask.id;
        result.stop = true;
        result.error = false;

        // collect json results into one json result
        std::vector<json> result_jsons;
        for (const auto & subres : multitask.results) {
            result_jsons.push_back(subres.data);
            // the multitask is in error if any of its subtasks is in error
            result.error = result.error || subres.error;
        }

        result.data = json {
            {"results", result_jsons}
        };

        queue_results.send(result);
    }
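
Folding per-subtask error flags into one flag needs a logical OR when the accumulator starts at `false`: with a logical AND, an accumulator that starts at `false` can never become `true`. The sketch below shows the OR fold in isolation; `sub_result` and `any_error` are hypothetical names that reduce a subtask result to its error flag only.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, reduced version of a subtask result: only the error flag.
struct sub_result {
    bool error;
};

// Returns true if at least one subtask reported an error.
inline bool any_error(const std::vector<sub_result> & results) {
    bool error = false; // identity element for logical OR
    for (const auto & r : results) {
        error = error || r.error;
    }
    return error;
}
```

An empty result list yields `false`, and a single failing subtask is enough to mark the aggregate as failed.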

    void update_slots() {
        if (system_need_update) {
            system_prompt_update();
        }

        // release slots
        for (auto & slot : slots) {
            if (slot.command == SLOT_COMMAND_RELEASE) {
                slot.state = SLOT_STATE_IDLE;
                slot.command = SLOT_COMMAND_NONE;
                slot.t_last_used = ggml_time_us();

                LOG_INFO("slot released", {
                    {"id_slot", slot.id},
                    {"id_task", slot.id_task},
                    {"n_ctx", n_ctx},
                    {"n_past", slot.n_past},
                    {"n_system_tokens", system_tokens.size()},
                    {"n_cache_tokens", slot.cache_tokens.size()},
                    {"truncated", slot.truncated}
                });

                queue_tasks.notify_slot_changed();
            }
        }

        // check if all slots are idle
        {
            bool all_idle = true;

            for (auto & slot : slots) {
                if (slot.state != SLOT_STATE_IDLE || slot.command != SLOT_COMMAND_NONE) {
                    all_idle = false;
                    break;
                }
            }

            if (all_idle) {
                LOG_INFO("all slots are idle", {});
                if (system_prompt.empty() && clean_kv_cache) {
                    kv_cache_clear();
                }

                return;
            }
        }

        {
            LOG_VERBOSE("posting NEXT_RESPONSE", {});

            server_task task;
            task.type = SERVER_TASK_TYPE_NEXT_RESPONSE;
            task.id_target = -1;

            queue_tasks.post(task);
        }

        // apply context-shift if needed
        // TODO: simplify and improve
        for (server_slot & slot : slots) {
            if (slot.ga_n == 1) {
                if (slot.is_processing() && (int) system_tokens.size() + slot.n_past >= slot.n_ctx - 1) {
                    // Shift context
                    const int n_keep = slot.params.n_keep + add_bos_token;
                    const int n_left = (int) system_tokens.size() + slot.n_past - n_keep;
                    const int n_discard = slot.params.n_discard ? slot.params.n_discard : (n_left / 2);

                    LOG_INFO("slot context shift", {
                        {"id_slot", slot.id},
                        {"id_task", slot.id_task},
                        {"n_keep", n_keep},
                        {"n_left", n_left},
                        {"n_discard", n_discard},
                        {"n_ctx", n_ctx},
                        {"n_past", slot.n_past},
                        {"n_system_tokens", system_tokens.size()},
                        {"n_cache_tokens", slot.cache_tokens.size()}
                    });

for each token is to directly get the correct index from a tensor.
This is why inp_s_seq is storing int32_t and not floats.
The other, less convenient way to select the correct state would be
to have inp_KQ_mask contain 1.0f for each state used by a token
and 0.0f otherwise. This complicates quickly fetching the first used
state of a token, and is also less efficient because a whole row
of the mask would always need to be read for each token.
Using indexes makes it easy to stop searching when there are
no more sequences for a token, and the first sequence assigned
is always very quickly available (it's the first element of each row).
* mamba : support llama_kv_cache_seq_cp copy chains
* mamba : support shifting and dividing the kv cache pos
* mamba : make the server and parallel examples work with whole sequences
A seq_id is dedicated to the system prompt in both cases.
* llama : make llama_kv_cache_seq_rm return whether it succeeded or not
* mamba : dedicate an input tensor for state copy indices
This is cleaner and makes it easier to adapt when/if token positions
(and by extension, inp_K_shift) are no longer integers.
* mamba : adapt perplexity, batched, and batched-bench examples
* perplexity : limit the max number of sequences
This adapts to what the loaded model can provide.
* llama : add llama_n_max_seq to get the upper limit for seq_ids
Used by the perplexity example.
* batched : pass n_parallel to the model's context params
This should have been there already, but it wasn't.
* batched-bench : reserve sequences to support Mamba
* batched-bench : fix tokens being put in wrong sequences
Generation quality isn't what's measured in there anyway,
but at least using the correct sequences avoids using non-consecutive
token positions.
* mamba : stop abusing attention metadata
This breaks existing converted-to-GGUF Mamba models,
but will allow supporting mixed architectures like MambaFormer
without needing to break Mamba models.
This will also allow changing the size of Mamba's states
without having to reconvert models in the future.
(e.g. using something else than d_conv - 1 columns for the conv_states
will not require breaking existing converted Mamba models again)
* gguf-py : add new KV metadata key-value pairs for Mamba
* llama : add new metadata key-value pairs for Mamba
* llama : guard against divisions by zero when n_head is 0
* mamba : rename "unlimited" KV cache property to "recurrent"
* mamba : more correctly update the "used" field of the KV cache
* ggml : in ggml_ssm_scan, use a threshold for soft_plus
This is how the official Mamba implementation does it,
and it's also what torch.nn.Softplus does.
* convert : for Mamba, fallback to internal NeoX tokenizer
The resulting models are exactly the same
as if the tokenizer.json and tokenizer_config.json of GPT-NeoX were there.
* mamba : support state saving and restoring
* ggml : implicitly pass src tensors through dst for Mamba-related ops
* mamba : clarify some comments
* server : fix cache_tokens not getting correctly resized
Otherwise, when the "we have to evaluate at least 1 token" special case
was triggered, an extra token was kept in cache_tokens even if it was
removed from the KV cache.
For Mamba, this caused useless prompt reprocessing when the previous
request triggered the above case.
* convert-hf : support new metadata keys for Mamba
For the models available at
https://huggingface.co/collections/state-spaces/transformers-compatible-mamba-65e7b40ab87e5297e45ae406
* mamba : rename metadata to be more similar to transformers library
This breaks existing converted-to-GGUF models,
but the metadata names are more "standard".
* mamba : support mamba-*-hf models
These models share their token_embd.weight with their output.weight
* mamba : add missing spaces
This is purely a formatting change.
* convert-hf : omit output.weight when identical with token_embd.weight
Only for Mamba for now, but it might be relevant for other models eventually.
Most Mamba models actually share these two tensors, albeit implicitly.
* readme : add Mamba to supported models, and add recent API changes
* mamba : move state_seq and state_mask views outside layer loop
A few tensors were also missing `struct` in front of `ggml_tensor`.
2024-03-08 23:31:00 +01:00
llama_kv_cache_seq_rm(ctx, slot.id + 1, n_keep, n_keep + n_discard);
llama_kv_cache_seq_add(ctx, slot.id + 1, n_keep + n_discard, system_tokens.size() + slot.n_past, -n_discard);
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
if (slot.params.cache_prompt) {
for (size_t i = n_keep + n_discard; i < slot.cache_tokens.size(); i++) {
slot.cache_tokens[i - n_discard] = slot.cache_tokens[i];
}
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
slot.cache_tokens.resize(slot.cache_tokens.size() - n_discard);
}
2023-10-22 21:53:08 +02:00
2024-01-27 14:38:05 +01:00
slot.n_past -= n_discard;
2023-10-22 21:53:08 +02:00
2024-01-27 14:38:05 +01:00
slot.truncated = true;
}
2023-07-05 22:51:13 +02:00
}
2023-10-22 21:53:08 +02:00
}
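The token bookkeeping of the context shift above can be looked at in isolation. The following is a minimal sketch, assuming a plain `std::vector<int>` in place of `slot.cache_tokens`; the function name `discard_from_cache` is hypothetical and is not part of the server code. It covers only the `cache_tokens` update, not the `llama_kv_cache_seq_rm` / `llama_kv_cache_seq_add` calls.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not in server.cpp): drop the n_discard tokens that
// follow the first n_keep tokens, shifting the tail left in place.
// Assumes n_keep + n_discard <= cache_tokens.size().
void discard_from_cache(std::vector<int> & cache_tokens, std::size_t n_keep, std::size_t n_discard) {
    // move every token after the discarded window n_discard positions to the left
    for (std::size_t i = n_keep + n_discard; i < cache_tokens.size(); i++) {
        cache_tokens[i - n_discard] = cache_tokens[i];
    }
    // the last n_discard entries are now stale copies, so drop them
    cache_tokens.resize(cache_tokens.size() - n_discard);
}
```

With ten tokens `0..9`, `n_keep = 2` and `n_discard = 3`, the tokens `2, 3, 4` are dropped and the remaining seven keep their relative order.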
2024-03-07 10:41:53 +01:00
// start populating the batch for this iteration
llama_batch_clear(batch);
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
// first, add sampled tokens from any ongoing sequences
for (auto & slot : slots) {
if (slot.state == SLOT_STATE_IDLE) {
2023-10-22 21:53:08 +02:00
continue ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort: there are 24 PRs closed in the submitter's repo alone, over 160 commits, and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
}
2023-10-22 21:53:08 +02:00
slot.i_batch = batch.n_tokens;
2024-01-27 14:38:05 +01:00
const int32_t slot_npast = slot.n_past_se > 0 ? slot.n_past_se : slot.n_past;
2023-10-22 21:53:08 +02:00
2024-01-30 19:17:30 +01:00
// TODO: we always have to take into account the "system_tokens"
// this is not great and needs to be improved somehow
2024-03-08 23:31:00 +01:00
llama_batch_add(batch, slot.sampled, system_tokens.size() + slot_npast, { slot.id + 1 }, true);
2024-03-07 10:41:53 +01:00
2023-10-22 21:53:08 +02:00
slot.n_past += 1;
2024-03-07 10:41:53 +01:00
if (slot.params.cache_prompt) {
slot.cache_tokens.push_back(slot.sampled);
}
LOG_VERBOSE("slot decode token", {
{"id_slot", slot.id},
{"id_task", slot.id_task},
{"n_ctx", n_ctx},
{"n_past", slot.n_past},
{"n_system_tokens", system_tokens.size()},
{"n_cache_tokens", slot.cache_tokens.size()},
{"truncated", slot.truncated}
});
2023-05-21 19:51:18 +02:00
}
2023-10-22 21:53:08 +02:00
// process in chunks of params.n_batch
2024-03-22 12:08:28 +01:00
int32_t n_batch = llama_n_batch(ctx);
2024-03-13 18:54:21 +01:00
int32_t n_ubatch = llama_n_ubatch(ctx);
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
// next, batch any pending prompts without exceeding n_batch
if (params.cont_batching || batch.n_tokens == 0) {
for (auto & slot : slots) {
// this slot still has a prompt to be processed
if (slot.state == SLOT_STATE_IDLE && slot.command == SLOT_COMMAND_LOAD_PROMPT) {
auto & prompt_tokens = slot.prompt_tokens;
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
// we haven't tokenized the prompt yet - do it now:
if (prompt_tokens.empty()) {
LOG_VERBOSE("tokenizing prompt", {
{"id_slot", slot.id},
{"id_task", slot.id_task}
});
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
slot.t_start_process_prompt = ggml_time_us();
slot.t_start_generation = 0;
if (slot.infill) {
bool suff_rm_leading_spc = true;
if (params.input_suffix.find_first_of(' ') == 0 && params.input_suffix.size() > 1) {
params.input_suffix.erase(0, 1);
suff_rm_leading_spc = false;
}
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
auto prefix_tokens = tokenize(slot.params.input_prefix, false);
auto suffix_tokens = tokenize(slot.params.input_suffix, false);
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
const int space_token = 29871; // TODO: this should not be hardcoded
if (suff_rm_leading_spc && !suffix_tokens.empty() && suffix_tokens[0] == space_token) {
suffix_tokens.erase(suffix_tokens.begin());
}
2023-11-11 06:48:21 +01:00
2024-03-07 10:41:53 +01:00
prefix_tokens.insert(prefix_tokens.begin(), llama_token_prefix(model));
prefix_tokens.insert(prefix_tokens.begin(), llama_token_bos(model)); // always add BOS
prefix_tokens.insert(prefix_tokens.end(), llama_token_suffix(model));
prefix_tokens.insert(prefix_tokens.end(), suffix_tokens.begin(), suffix_tokens.end());
prefix_tokens.push_back(llama_token_middle(model));
prompt_tokens = prefix_tokens;
} else {
2024-04-09 19:44:08 +02:00
prompt_tokens = tokenize(slot.prompt, system_prompt.empty()); // add BOS if there is no system prompt
2024-03-07 10:41:53 +01:00
}
slot.n_past = 0;
2024-02-29 21:42:11 +01:00
slot.n_prompt_tokens = prompt_tokens.size();
2023-11-11 06:48:21 +01:00
2024-03-09 10:30:04 +01:00
LOG_VERBOSE("prompt tokenized", {
{"id_slot", slot.id},
{"id_task", slot.id_task},
{"n_ctx", slot.n_ctx},
{"n_keep", slot.params.n_keep},
{"n_prompt_tokens", slot.n_prompt_tokens},
{"prompt_tokens", tokens_to_str(ctx, prompt_tokens.cbegin(), prompt_tokens.cend())},
});
2024-03-09 11:34:18 +01:00
// empty prompt passed -> release the slot and send empty response
if (prompt_tokens.empty()) {
LOG_INFO("empty prompt - releasing slot", {
{"id_slot", slot.id},
{"id_task", slot.id_task}
});
slot.state = SLOT_STATE_PROCESSING;
slot.command = SLOT_COMMAND_NONE;
slot.release();
slot.print_timings();
send_final_response(slot);
continue;
}
2024-03-07 10:41:53 +01:00
if (slot.embedding) {
// this prompt is too large to process - discard it
2024-03-13 18:54:21 +01:00
if (slot.n_prompt_tokens > n_ubatch) {
2024-03-07 10:41:53 +01:00
slot.state = SLOT_STATE_PROCESSING;
slot.command = SLOT_COMMAND_NONE;
slot.release();
slot.print_timings();
send_final_response(slot);
continue;
}
} else {
if (slot.params.n_keep < 0) {
slot.params.n_keep = slot.n_prompt_tokens;
}
slot.params.n_keep = std::min(slot.n_ctx - 4, slot.params.n_keep);
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
// if input prompt is too big, truncate it (if group attention self-extend is disabled)
if (slot.ga_n == 1 && slot.n_prompt_tokens >= slot.n_ctx) {
const int n_left = slot.n_ctx - slot.params.n_keep;
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
const int n_block_size = n_left / 2;
const int erased_blocks = (slot.n_prompt_tokens - slot.params.n_keep - n_block_size) / n_block_size;
2024-02-25 19:43:50 +01:00
2024-03-07 10:41:53 +01:00
std::vector<llama_token> new_tokens(
prompt_tokens.begin(),
prompt_tokens.begin() + slot.params.n_keep);
2024-02-25 19:43:50 +01:00
2024-03-07 10:41:53 +01:00
new_tokens.insert(
new_tokens.end(),
prompt_tokens.begin() + slot.params.n_keep + erased_blocks * n_block_size,
prompt_tokens.end());
prompt_tokens = std::move(new_tokens);
slot.truncated = true;
slot.n_prompt_tokens = prompt_tokens.size();
LOG_VERBOSE("input truncated", {
2024-03-09 10:30:04 +01:00
{"id_slot", slot.id},
{"id_task", slot.id_task},
{"n_ctx", slot.n_ctx},
{"n_keep", slot.params.n_keep},
{"n_left", n_left},
{"n_prompt_tokens", slot.n_prompt_tokens},
{"prompt_tokens", tokens_to_str(ctx, prompt_tokens.cbegin(), prompt_tokens.cend())},
2024-03-07 10:41:53 +01:00
});
GGML_ASSERT(slot.n_prompt_tokens < slot.n_ctx);
}
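The truncation arithmetic above keeps the first `n_keep` tokens and then removes whole blocks of `n_left / 2` tokens from the middle until the prompt fits. The following is a minimal stand-alone sketch of that arithmetic, assuming plain `int` tokens and a hypothetical function name `truncate_prompt` (neither is part of the server code). It also assumes `0 <= n_keep <= n_ctx - 4`, which the `std::min` clamp above provides once `n_keep` is non-negative.

```cpp
#include <vector>

// Hypothetical stand-alone version of the prompt truncation arithmetic.
// Assumes 0 <= n_keep <= n_ctx - 4, so n_block_size is at least 2.
std::vector<int> truncate_prompt(const std::vector<int> & prompt, int n_ctx, int n_keep) {
    const int n_prompt = static_cast<int>(prompt.size());
    if (n_prompt < n_ctx) {
        return prompt; // already fits, nothing to truncate
    }

    const int n_left        = n_ctx - n_keep;
    const int n_block_size  = n_left / 2;
    const int erased_blocks = (n_prompt - n_keep - n_block_size) / n_block_size;

    // keep the first n_keep tokens ...
    std::vector<int> out(prompt.begin(), prompt.begin() + n_keep);
    // ... then skip erased_blocks whole blocks and keep the rest
    out.insert(out.end(),
               prompt.begin() + n_keep + erased_blocks * n_block_size,
               prompt.end());
    return out;
}
```

For a 30-token prompt with `n_ctx = 16` and `n_keep = 4`: `n_left = 12`, `n_block_size = 6`, `erased_blocks = (30 - 4 - 6) / 6 = 3`, so 18 tokens are skipped and 12 remain. The kept tail always holds between `n_block_size` and `2 * n_block_size - 1` tokens, which is why the result stays below `n_ctx`.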
llama_sampling_reset(slot.ctx_sampling);
if (!slot.params.cache_prompt) {
slot.n_past_se = 0;
slot.ga_i = 0;
} else {
GGML_ASSERT(slot.ga_n == 1);
// reuse any previously computed tokens that are common with the new prompt
slot.n_past = common_part(slot.cache_tokens, prompt_tokens);
// push the prompt into the sampling context (do not apply grammar)
for (int i = 0; i < slot.n_past; ++i) {
llama_sampling_accept(slot.ctx_sampling, ctx, slot.cache_tokens[i], false);
2024-01-27 14:38:05 +01:00
}
}
}
2024-03-07 10:41:53 +01:00
if (slot.n_past == slot.n_prompt_tokens && slot.n_past > 0) {
// we have to evaluate at least 1 token to generate logits.
LOG_INFO("we have to evaluate at least 1 token to generate logits", {
{"id_slot", slot.id},
{"id_task", slot.id_task}
});
slot.n_past--;
if (slot.ga_i > 0) {
slot.n_past_se--;
}
}
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
slot.n_prompt_tokens_processed = 0;
}
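The prompt-cache reuse above can be sketched on its own. The implementation of `common_part` is not shown in this section, so the following assumes it returns the length of the common prefix of the two token lists, as the comment next to the call suggests. The names `common_prefix_len` and `reusable_tokens` are hypothetical, and tokens are plain `int` values.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: length of the common prefix of two token lists
// (an assumption about what common_part computes).
std::size_t common_prefix_len(const std::vector<int> & a, const std::vector<int> & b) {
    std::size_t i = 0;
    while (i < a.size() && i < b.size() && a[i] == b[i]) {
        i++;
    }
    return i;
}

// Number of cached tokens that can be reused for a new prompt.
// If the whole prompt is already cached, one token is given back so that
// at least one token is evaluated and logits are produced.
std::size_t reusable_tokens(const std::vector<int> & cache, const std::vector<int> & prompt) {
    std::size_t n_past = common_prefix_len(cache, prompt);
    if (n_past == prompt.size() && n_past > 0) {
        n_past--;
    }
    return n_past;
}
```

A prompt that diverges from the cache after three tokens reuses three, while a prompt identical to the cache reuses all but its last token.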
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
if (slot.embedding) {
// cannot fit the prompt in the current batch - will try next iter
if (batch.n_tokens + slot.n_prompt_tokens > n_batch) {
continue;
2024-01-27 14:38:05 +01:00
}
2023-10-22 21:53:08 +02:00
}
llama : support Mamba Selective State Space Models (#5328)
* mamba : begin working on support for Mamba SSM
* mamba : begin figuring out how to (ab)use the kv cache for Mamba
* mamba : recurrent inference almost works, but incoherent
* mamba : recurrent inference WORKS!!!
* convert : optionally use d_conv and d_state from config.json for Mamba
* mamba : refactor recurrent conv, resulting in 20% perf increase
It's still slower than I'd like, but I did not really optimize `ggml_exp` yet.
I also refactored `ggml_exp` to work with tensors with more than 2 dimensions.
* ggml : parallelize ggml_exp
This results in 8% faster token generation for Mamba-130M.
* mamba : simplify the conv step with a self-overlapping view
Turns out the conv_state can be made smaller by one column.
Note that this breaks existing GGUFs of Mamba,
because the key_value_length field is tied to the conv_state size.
Convolution with a self-overlapping view is cool!
And it's much simpler than what I initially thought would be necessary
to make the convolution step work with more than 1 token at a time.
Next step is to make the SSM step work on batches of tokens too,
and thus I need to figure out a way to make a parallel selective scan
which will keep the ssm_state small and won't make it bigger
by a factor of (n_layer * batch_size).
* llama : fix Mamba KV self size wrongly displaying as f16 instead of f32
Relatedly, I also tried to see if other types than f32 worked for the states,
but they don't, because of the operators used.
It's probably better anyway to keep lots of precision there,
since the states are small anyway.
* mamba : fix self-overlapping view depth stride
* mamba : handle batches of more than 1 token
This means running Mamba no longer crashes when using the default settings!
And probably also slightly faster prompt processing.
Both batched and non-batched processing yield the same output.
Previously, the state was not cleared when starting a sequence.
Next step is to make the KV cache API work as expected for Mamba models.
* ggml: add ggml_ssm_scan to help with parallel selective scan
If the selective scan was implemented without a custom operator,
there would be waaay too many nodes in the graph. For example,
for Mamba-130M, with a batch size of 512 (the default),
a naive selective scan could add at least 24*512=12288 nodes,
which is more than LLAMA_MAX_NODES (8192),
and that's only for the smallest Mamba model.
So it's much cleaner with a custom operator.
Not sure about the name, though.
* ggml : in ggml_ssm_scan, merge multiple rows in the same vec operation
This will help with performance on CPU if ggml_vec_mul_f32
and ggml_vec_add_f32 are ever optimized with SIMD.
* mamba : very basic quantization support
Mostly works, but there is currently no difference
between the variants of a k-quant (e.g. Q4_K_S and Q4_K_M are the same).
Most of the SSM-specific weights can be kept in f32 without affecting
the size that much, since they are relatively small.
(the linear projection weights are responsible for most of Mamba's size)
Too much quantization seems to make the state degrade quite fast, and
the model begins to output gibberish.
It seems to affect bigger models to a lesser extent than small models,
but I'm not sure by how much.
Experimentation will be needed to figure out which weights are more important
for the _M (and _L?) variants of k-quants for Mamba.
* convert : fix wrong name for layer norm weight of official Mamba models
I was using Q-bert/Mamba-* models before, which have a slightly different
naming scheme for the weights.
(they start with "model.layers" instead of "backbone.layers")
* mamba : fuse more steps of the SSM scan in the ggml_ssm_scan operator
This increases performance on CPU by around 30% for prompt processing,
and by around 20% for text generation.
However, it also makes the ggml_exp and ggml_soft_plus operators unused.
Whether or not they should be kept will be decided later.
* convert : for Mamba, also consider the "MambaLMHeadModel" arch name
It's the name of the class of the official implementation,
though they don't use it (yet) in the "architectures" field of config.json
* mamba : fix vocab size problems with official models
The perplexity was waaaay too high for models with a non-round vocab size.
Not sure why, but it needed to be fixed in the metadata.
Note that this breaks existing GGUF-converted Mamba models,
but **only if** the vocab size was not already rounded.
* ggml : remove ggml_exp and ggml_soft_plus
They did not exist anyway outside of this branch,
and since ggml_ssm_scan fused operations together, they are unused.
It's always possible to bring them back if needed.
* mamba : remove some useless comments
No code change.
* convert : fix flake8 linter errors
* mamba : apply suggestions from code review
* mamba : remove unnecessary branch for row-wise ssm_state and C multiplication
It was previously done to avoid permuting when only one token is processed
at a time (like when generating text), but permuting is cheap,
and dynamically changing the compute graph is not future-proof.
* ggml : in ggml_ssm_scan, use more appropriate asserts
* ggml : rename the destination pointer in ggml_compute_forward_ssm_scan_f32
* mamba : multiple sequences, but one at a time
This is a step towards making this Mamba implementation usable
with the server example (the way the system prompt is kept when clearing
the client slots will need to be changed before this can work, though).
The KV cache size for this kind of model is tied to the maximum number
of sequences kept at any single time.
For now, this number is obtained from n_parallel (plus one,
to have an extra sequence to dedicate to the system prompt),
but there might be a better way to do this which won't also
make the main example use 2 cells even if only 1 is really used.
(for this specific case, --parallel 0 helps)
Simultaneous sequence processing will probably require changes to
ggml_ssm_scan, and possibly a new operator for the conv step.
* mamba : support llama_kv_cache_seq_cp
This (mis)uses the logic around K shifts, because tokens in a state
can't be shifted anyway, and because inp_K_shift has the right shape and type.
Using ggml_get_rows is a nice way to do copies, but copy chains can't work.
Fortunately, copy chains don't really seem to be used in the examples.
Each KV cell is dedicated to the sequence ID corresponding to its own index.
* mamba : use a state mask
It's cleaner than the previous heuristic of
checking for the pos of the first token in the batch.
inp_KQ_mask could not be re-used for this, because it has the wrong shape
and because it seems more suited to the next step of
simultaneous sequence processing (helping with the problem of
remembering which token belongs to which sequence(s)/state(s)).
* llama : replace the usage of n_ctx with kv_self.size in many places
* mamba : use n_tokens directly instead of n_tok
* mamba : in comments, properly refer to KV cells instead of slots
* mamba : reduce memory usage of ggml_ssm_scan
From 290.37 MiB to 140.68 MiB of CPU compute buffer size
with Mamba 3B with a batch size of 512.
The result tensor of ggml_ssm_scan was previously a big part
of the CPU compute buffer size. To make it smaller,
it does not contain the intermediate ssm states anymore.
Both y and the last ssm state are combined in the result tensor,
because it seems only a single tensor can be returned by an operator
with the way the graph is built.
* mamba : simultaneous sequence processing
A batch can now contain tokens from multiple sequences.
This is necessary for at least the parallel example, the server example,
and the HellaSwag test in the perplexity example.
However, for this to be useful, uses of llama_kv_cache_seq_rm/cp
will need to be changed to work on whole sequences.
* ggml : add ggml_ssm_conv as a new operator for the conv step of Mamba
This operator makes it possible to use and update the correct states
for each token of the batch in the same way as ggml_ssm_scan.
Other solutions which use existing operators would need loops which would
add too many nodes to the graph (at least the ones I thought of).
Using this operator further reduces the size of the CPU compute buffer
from 140.68 MiB to 103.20 MiB with Mamba 3B with a batch size of 512.
And (at least on CPU), it's a bit faster than before.
Note that "ggml_ssm_conv" is probably not the most appropriate name,
and it could be changed if a better one is found.
* llama : add inp_s_seq as a new input tensor
The most convenient implementation to select the correct state (for Mamba)
for each token is to directly get the correct index from a tensor.
This is why inp_s_seq is storing int32_t and not floats.
The other, less convenient way to select the correct state would be
to have inp_KQ_mask contain 1.0f for each state used by a token
and 0.0f otherwise. This complicates quickly fetching the first used
state of a token, and is also less efficient because a whole row
of the mask would always need to be read for each token.
Using indexes makes it easy to stop searching when there are
no more sequences for a token, and the first sequence assigned
is always very quickly available (it's the first element of each row).
* mamba : support llama_kv_cache_seq_cp copy chains
* mamba : support shifting and dividing the kv cache pos
* mamba : make the server and parallel examples work with whole sequences
A seq_id is dedicated to the system prompt in both cases.
* llama : make llama_kv_cache_seq_rm return whether it succeeded or not
* mamba : dedicate an input tensor for state copy indices
This is cleaner and makes it easier to adapt when/if token positions
(and by extension, inp_K_shift) are no longer integers.
* mamba : adapt perplexity, batched, and batched-bench examples
* perplexity : limit the max number of sequences
This adapts to what the loaded model can provide.
* llama : add llama_n_max_seq to get the upper limit for seq_ids
Used by the perplexity example.
* batched : pass n_parallel to the model's context params
This should have been there already, but it wasn't.
* batched-bench : reserve sequences to support Mamba
* batched-bench : fix tokens being put in wrong sequences
Generation quality isn't what's measured in there anyway,
but at least using the correct sequences avoids using non-consecutive
token positions.
* mamba : stop abusing attention metadata
This breaks existing converted-to-GGUF Mamba models,
but will allow supporting mixed architectures like MambaFormer
without needing to break Mamba models.
This will also allow changing the size of Mamba's states
without having to reconvert models in the future.
(e.g. using something else than d_conv - 1 columns for the conv_states
will not require breaking existing converted Mamba models again)
* gguf-py : add new KV metadata key-value pairs for Mamba
* llama : add new metadata key-value pairs for Mamba
* llama : guard against divisions by zero when n_head is 0
* mamba : rename "unlimited" KV cache property to "recurrent"
* mamba : more correctly update the "used" field of the KV cache
* ggml : in ggml_ssm_scan, use a threshold for soft_plus
This is how the official Mamba implementation does it,
and it's also what torch.nn.Softplus does.
* convert : for Mamba, fallback to internal NeoX tokenizer
The resulting models are exactly the same
as if the tokenizer.json and tokenizer_config.json of GPT-NeoX were there.
* mamba : support state saving and restoring
* ggml : implicitly pass src tensors through dst for Mamba-related ops
* mamba : clarify some comments
* server : fix cache_tokens not getting correctly resized
Otherwise, when the "we have to evaluate at least 1 token" special case
was triggered, an extra token was kept in cache_tokens even if it was
removed from the KV cache.
For Mamba, this caused useless prompt reprocessing when the previous
request triggered the above case.
* convert-hf : support new metadata keys for Mamba
For the models available at
https://huggingface.co/collections/state-spaces/transformers-compatible-mamba-65e7b40ab87e5297e45ae406
* mamba : rename metadata to be more similar to transformers library
This breaks existing converted-to-GGUF models,
but the metadata names are more "standard".
* mamba : support mamba-*-hf models
These models share their token_embd.weight with their output.weight
* mamba : add missing spaces
This is purely a formatting change.
* convert-hf : omit output.weight when identical with token_embd.weight
Only for Mamba for now, but it might be relevant for other models eventually.
Most Mamba models actually share these two tensors, albeit implicitly.
* readme : add Mamba to supported models, and add recent API changes
* mamba : move state_seq and state_mask views outside layer loop
A few tensors were also missing `struct` in front of `ggml_tensor`.
2024-03-08 23:31:00 +01:00
// keep only the common part
int p0 = (int) system_tokens.size() + slot.n_past;
if (!llama_kv_cache_seq_rm(ctx, slot.id + 1, p0, -1)) {
    // could not partially delete (likely using a non-Transformer model)
    llama_kv_cache_seq_rm(ctx, slot.id + 1, -1, -1);
    p0 = (int) system_tokens.size();
    if (p0 != 0) {
        // copy over the system prompt when there is one
        llama_kv_cache_seq_cp(ctx, 0, slot.id + 1, -1, -1);
    }
    // there is no common part left (except for the system prompt)
    slot.n_past = 0;
    slot.n_past_se = 0;
    slot.ga_i = 0;
    // TODO: is the system prompt ever in the sampling context?
    llama_sampling_reset(slot.ctx_sampling);
}
// remove the non-common part from the cache
slot.cache_tokens.resize(slot.n_past);
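A minimal sketch of the "try a partial removal, fall back to clearing the whole sequence" pattern used above. It assumes a toy cache, not the real llama.cpp KV cache: `ToyCache`, `remove_from`, and `keep_common_prefix` are hypothetical names introduced only for illustration, and the rule that a recurrent state cannot be truncated in the middle is a simplification of what the comments above describe.

```cpp
#include <vector>

// Hypothetical stand-in for a per-sequence cache; NOT the llama.cpp API.
struct ToyCache {
    bool recurrent;           // true: the state cannot be truncated in the middle
    std::vector<int> tokens;  // tokens currently held for one sequence

    // Remove positions [p0, end). A negative p0 means "the whole sequence".
    // Returns false when a partial removal is requested but not supported.
    bool remove_from(int p0) {
        if (p0 < 0) {
            tokens.clear();
            return true;
        }
        const int n = (int) tokens.size();
        if (recurrent && p0 > 0 && p0 < n) {
            return false; // cannot keep only a prefix of a recurrent state
        }
        if (p0 < n) {
            tokens.resize(p0);
        }
        return true;
    }
};

// Keep the first n_common tokens when the cache allows it; otherwise drop
// everything. Returns how many cached tokens remain usable.
int keep_common_prefix(ToyCache & cache, int n_common) {
    if (!cache.remove_from(n_common)) {
        cache.remove_from(-1); // full removal always succeeds in this sketch
        n_common = 0;          // nothing can be reused, reprocess from scratch
    }
    return n_common;
}
```

With a non-recurrent cache holding five tokens, `keep_common_prefix(cache, 3)` returns 3 and leaves three tokens; with `recurrent = true` it returns 0 and leaves the cache empty, which corresponds to resetting `slot.n_past` to 0 in the code above.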
2024-03-07 10:41:53 +01:00
2024-02-25 13:50:32 +01:00
LOG_INFO("kv cache rm [p0, end)", {
2024-03-07 10:41:53 +01:00
    { "id_slot", slot.id },
    { "id_task", slot.id_task },
2024-02-25 13:50:32 +01:00
    { "p0", p0 }
});
2024-01-30 19:17:30 +01:00
2024-01-27 14:38:05 +01:00
int32_t slot_npast = slot.n_past_se > 0 ? slot.n_past_se : slot.n_past;
2024-01-30 19:17:30 +01:00
int32_t ga_i = slot.ga_i;
2024-01-27 14:38:05 +01:00
int32_t ga_n = slot.ga_n;
int32_t ga_w = slot.ga_w;
2024-01-30 19:17:30 +01:00
2024-03-07 10:41:53 +01:00
// add prompt tokens for processing in the current batch
// TODO: the self-extend stuff here is a mess - simplify and/or abstract it somehow
for (; slot.n_past < slot.n_prompt_tokens && batch.n_tokens < n_batch; ++slot.n_past) {
    if (slot.ga_n != 1) {
2024-01-27 14:38:05 +01:00
while (slot_npast >= ga_i + ga_w) {
    const int bd = (ga_w / ga_n) * (ga_n - 1);
    slot_npast -= bd;
    ga_i += ga_w / ga_n;
}
}
2024-03-07 10:41:53 +01:00
2024-03-08 23:31:00 +01:00
llama_batch_add(batch, prompt_tokens[slot.n_past], system_tokens.size() + slot_npast, { slot.id + 1 }, false);
2024-03-07 10:41:53 +01:00
if (slot.params.cache_prompt) {
    slot.cache_tokens.push_back(prompt_tokens[slot.n_past]);
}
slot.n_prompt_tokens_processed++;
2024-01-30 19:17:30 +01:00
slot_npast++;
2023-10-22 21:53:08 +02:00
}
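The position remapping done inside the loop above can be isolated as a small function. This is a sketch under stated assumptions, not the server's code: `self_extend_positions` is a hypothetical helper, it starts from position 0 with `ga_i = 0`, and it assumes `ga_n >= 1` and that `ga_w` is a positive multiple of `ga_n` (otherwise the inner loop would not make progress).

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: returns the position assigned to each of n_tokens
// prompt tokens when group attention (Self-Extend) is active.
// Assumes n_tokens >= 0, ga_n >= 1 and ga_w a positive multiple of ga_n.
std::vector<int32_t> self_extend_positions(int32_t n_tokens, int32_t ga_n, int32_t ga_w) {
    std::vector<int32_t> positions;
    positions.reserve(n_tokens);

    int32_t pos  = 0; // plays the role of slot_npast
    int32_t ga_i = 0; // start of the current group-attention window

    for (int32_t i = 0; i < n_tokens; ++i) {
        if (ga_n != 1) {
            // once the position leaves the current window, pull it back by
            // the amount the window will be compressed and advance the window
            while (pos >= ga_i + ga_w) {
                const int32_t bd = (ga_w / ga_n) * (ga_n - 1);
                pos  -= bd;
                ga_i += ga_w / ga_n;
            }
        }
        positions.push_back(pos);
        ++pos;
    }
    return positions;
}
```

For example, `self_extend_positions(6, 2, 4)` yields `0, 1, 2, 3, 2, 3`, while `ga_n = 1` leaves positions untouched. The repeated positions only make sense together with a companion shift/divide pass over the cache, which this sketch does not model.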
2024-03-07 10:41:53 +01:00
LOG_VERBOSE("prompt processing progress", {
    { "id_slot", slot.id },
    { "n_past", slot.n_past },
    { "n_ctx", n_ctx },
    { "n_tokens", batch.n_tokens },
    { "progress", (float) slot.n_prompt_tokens_processed / slot.n_prompt_tokens },
});
// entire prompt has been processed - start decoding new tokens
if (slot.n_past == slot.n_prompt_tokens) {
    slot.state = SLOT_STATE_PROCESSING;
    slot.command = SLOT_COMMAND_NONE;
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
GGML_ASSERT(batch.n_tokens > 0);
// extract the logits only for the last token
2023-10-22 21:53:08 +02:00
batch.logits[batch.n_tokens - 1] = true;
2024-03-07 10:41:53 +01:00
slot.n_decoded = 0;
slot.i_batch = batch.n_tokens - 1;
LOG_VERBOSE("prompt done", {
    { "id_slot", slot.id },
    { "n_past", slot.n_past },
    { "n_ctx", n_ctx },
    { "n_tokens", batch.n_tokens },
});
2023-10-22 21:53:08 +02:00
}
2024-03-07 10:41:53 +01:00
}
2023-10-22 21:53:08 +02:00
2024-03-07 10:41:53 +01:00
if (batch.n_tokens >= n_batch) {
    break;
2023-10-22 21:53:08 +02:00
}
}
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort: there are 24 PRs closed in the submitter's repo alone, over 160 commits, and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
}
2023-05-21 19:51:18 +02:00
2024-03-07 10:41:53 +01:00
if (batch.n_tokens == 0) {
    LOG_VERBOSE("no tokens to decode", {});
2024-03-11 10:56:41 +01:00
return;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
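The partial stop string handling mentioned in the summary above can be illustrated with a small sketch. It assumes the check amounts to asking whether the tail of the generated text matches the start of a stop word; the helper name `find_partial_stop` and its signature are hypothetical and are not taken from the server code.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper (not the server's actual function): returns the index in
// `text` at which a trailing fragment begins that is also a prefix of `stop`,
// or std::string::npos when the tail of `text` cannot be the start of `stop`.
inline std::size_t find_partial_stop(const std::string & stop, const std::string & text) {
    if (stop.empty() || text.empty()) {
        return std::string::npos;
    }
    const std::size_t max_len = std::min(stop.size(), text.size());
    // try the longest possible overlap first
    for (std::size_t len = max_len; len > 0; --len) {
        if (text.compare(text.size() - len, len, stop, 0, len) == 0) {
            return text.size() - len;
        }
    }
    return std::string::npos;
}
```

While streaming, text from the returned index onward would be held back until later tokens either complete the stop string or rule it out.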
2023-06-17 13:53:04 +02:00
}
2023-05-21 19:51:18 +02:00
2024-03-07 10:41:53 +01:00
LOG_VERBOSE("decoding batch", {
{"n_tokens", batch.n_tokens},
});
// process the created batch of tokens
2024-04-26 12:15:30 +02:00
for (int32_t i = 0; i < batch.n_tokens; i += n_batch) {
2024-03-04 21:31:20 +01:00
const int32_t n_tokens = std::min(n_batch, batch.n_tokens - i);
2024-01-27 14:38:05 +01:00
2024-03-07 10:41:53 +01:00
for (auto & slot : slots) {
if (slot.ga_n != 1) {
2024-01-27 14:38:05 +01:00
// context extension via Self-Extend
2024-03-07 10:41:53 +01:00
// TODO: simplify and/or abstract this
while (slot.n_past_se >= slot.ga_i + slot.ga_w) {
2024-01-27 14:38:05 +01:00
const int ib = (slot.ga_n * slot.ga_i) / slot.ga_w;
const int bd = (slot.ga_w / slot.ga_n) * (slot.ga_n - 1);
const int dd = (slot.ga_w / slot.ga_n) - ib * bd - slot.ga_w;
LOG_TEE("\n");
LOG_TEE("shift: [%6d, %6d] + %6d -> [%6d, %6d]\n", slot.ga_i, slot.n_past_se, ib * bd, slot.ga_i + ib * bd, slot.n_past_se + ib * bd);
LOG_TEE("div: [%6d, %6d] / %6d -> [%6d, %6d]\n", slot.ga_i + ib * bd, slot.ga_i + ib * bd + slot.ga_w, slot.ga_n, (slot.ga_i + ib * bd) / slot.ga_n, (slot.ga_i + ib * bd + slot.ga_w) / slot.ga_n);
LOG_TEE("shift: [%6d, %6d] + %6d -> [%6d, %6d]\n", slot.ga_i + ib * bd + slot.ga_w, slot.n_past_se + ib * bd, dd, slot.ga_i + ib * bd + slot.ga_w + dd, slot.n_past_se + ib * bd + dd);
llama : support Mamba Selective State Space Models (#5328)
* mamba : begin working on support for Mamba SSM
* mamba : begin figuring out how to (ab)use the kv cache for Mamba
* mamba : recurrent inference almost works, but incoherent
* mamba : recurrent inference WORKS!!!
* convert : optionally use d_conv and d_state from config.json for Mamba
* mamba : refactor recurrent conv, resulting in 20% perf increase
It's still slower than I'd like, but I did not really optimize `ggml_exp` yet.
I also refactored `ggml_exp` to work with tensors with more than 2 dimensions.
* ggml : parallelize ggml_exp
This results in 8% faster token generation for Mamba-130M.
* mamba : simplify the conv step with a self-overlapping view
Turns out the conv_state can be made smaller by one column.
Note that this breaks existing GGUFs of Mamba,
because the key_value_length field is tied to the conv_state size.
Convolution with a self-overlapping view is cool!
And it's much simpler than what I initially thought would be necessary
to make the convolution step work with more than 1 token at a time.
Next step is to make the SSM step work on batches of tokens too,
and thus I need to figure out a way to make a parallel selective scan
which will keep the ssm_state small and won't make it bigger
by a factor of (n_layer * batch_size).
* llama : fix Mamba KV self size wrongly displaying as f16 instead of f32
Relatedly, I also tried to see if other types than f32 worked for the states,
but they don't, because of the operators used.
It's probably better anyway to keep lots of precision there,
since the states are small anyway.
* mamba : fix self-overlapping view depth stride
* mamba : handle batches of more than 1 token
This means running Mamba no longer crashes when using the default settings!
And probably also slightly faster prompt processing.
Both batched and non-batched processing yield the same output.
Previously, the state was not cleared when starting a sequence.
Next step is to make the KV cache API work as expected for Mamba models.
* ggml: add ggml_ssm_scan to help with parallel selective scan
If the selective scan was implemented without a custom operator,
there would be waaay too many nodes in the graph. For example,
for Mamba-130M, with a batch size of 512 (the default),
a naive selective scan could add at least 24*512=12288 nodes,
which is more than LLAMA_MAX_NODES (8192),
and that's only for the smallest Mamba model.
So it's much cleaner with a custom operator.
Not sure about the name, though.
* ggml : in ggml_ssm_scan, merge multiple rows in the same vec operation
This will help with performance on CPU if ggml_vec_mul_f32
and ggml_vec_add_f32 are ever optimized with SIMD.
* mamba : very basic quantization support
Mostly works, but there is currently no difference
between the variants of a k-quant (e.g. Q4_K_S and Q4_K_M are the same).
Most of the SSM-specific weights can be kept in f32 without affecting
the size that much, since they are relatively small.
(the linear projection weights are responsible for most of Mamba's size)
Too much quantization seems to make the state degrade quite fast, and
the model begins to output gibberish.
It seems to affect bigger models to a lesser extent than small models,
but I'm not sure by how much.
Experimentation will be needed to figure out which weights are more important
for the _M (and _L?) variants of k-quants for Mamba.
* convert : fix wrong name for layer norm weight of official Mamba models
I was using Q-bert/Mamba-* models before, which have a slightly different
naming scheme for the weights.
(they start with "model.layers" instead of "backbone.layers")
* mamba : fuse more steps of the SSM scan in the ggml_ssm_scan operator
This increases performance on CPU by around 30% for prompt processing,
and by around 20% for text generation.
However, it also makes the ggml_exp and ggml_soft_plus operators unused.
Whether or not they should be kept will be decided later.
* convert : for Mamba, also consider the "MambaLMHeadModel" arch name
It's the name of the class of the official implementation,
though they don't use it (yet) in the "architectures" field of config.json
* mamba : fix vocab size problems with official models
The perplexity was waaaay too high for models with a non-round vocab size.
Not sure why, but it needed to be fixed in the metadata.
Note that this breaks existing GGUF-converted Mamba models,
but **only if** the vocab size was not already rounded.
* ggml : remove ggml_exp and ggml_soft_plus
They did not exist anyway outside of this branch,
and since ggml_ssm_scan fused operations together, they are unused.
It's always possible to bring them back if needed.
* mamba : remove some useless comments
No code change.
* convert : fix flake8 linter errors
* mamba : apply suggestions from code review
* mamba : remove unnecessary branch for row-wise ssm_state and C multiplication
It was previously done to avoid permuting when only one token is processed
at a time (like when generating text), but permuting is cheap,
and dynamically changing the compute graph is not future-proof.
* ggml : in ggml_ssm_scan, use more appropriate asserts
* ggml : rename the destination pointer in ggml_compute_forward_ssm_scan_f32
* mamba : multiple sequences, but one at a time
This is a step towards making this Mamba implementation usable
with the server example (the way the system prompt is kept when clearing
the client slots will need to be changed before this can work, though).
The KV cache size for this kind of model is tied to the maximum number
of sequences kept at any single time.
For now, this number is obtained from n_parallel (plus one,
to have an extra sequence to dedicate to the system prompt),
but there might be a better way to do this which won't also
make the main example use 2 cells even if only 1 is really used.
(for this specific case, --parallel 0 helps)
Simultaneous sequence processing will probably require changes to
ggml_ssm_scan, and possibly a new operator for the conv step.
* mamba : support llama_kv_cache_seq_cp
This (mis)uses the logic around K shifts, because tokens in a state
can't be shifted anyway, and because inp_K_shift has the right shape and type.
Using ggml_get_rows is a nice way to do copies, but copy chains can't work.
Fortunately, copy chains don't really seem to be used in the examples.
Each KV cell is dedicated to the sequence ID corresponding to its own index.
* mamba : use a state mask
It's cleaner than the previous heuristic of
checking for the pos of the first token in the batch.
inp_KQ_mask could not be re-used for this, because it has the wrong shape
and because it seems more suited to the next step of
simultaneous sequence processing (helping with the problem of
remembering which token belongs to which sequence(s)/state(s)).
* llama : replace the usage of n_ctx with kv_self.size in many places
* mamba : use n_tokens directly instead of n_tok
* mamba : in comments, properly refer to KV cells instead of slots
* mamba : reduce memory usage of ggml_ssm_scan
From 290.37 MiB to 140.68 MiB of CPU compute buffer size
with Mamba 3B with a batch size of 512.
The result tensor of ggml_ssm_scan was previously a big part
of the CPU compute buffer size. To make it smaller,
it does not contain the intermediate ssm states anymore.
Both y and the last ssm state are combined in the result tensor,
because it seems only a single tensor can be returned by an operator
with the way the graph is built.
* mamba : simultaneous sequence processing
A batch can now contain tokens from multiple sequences.
This is necessary for at least the parallel example, the server example,
and the HellaSwag test in the perplexity example.
However, for this to be useful, uses of llama_kv_cache_seq_rm/cp
will need to be changed to work on whole sequences.
* ggml : add ggml_ssm_conv as a new operator for the conv step of Mamba
This operator makes it possible to use and update the correct states
for each token of the batch in the same way as ggml_ssm_scan.
Other solutions which use existing operators would need loops which would
add too many nodes to the graph (at least the ones I thought of).
Using this operator further reduces the size of the CPU compute buffer
from 140.68 MiB to 103.20 MiB with Mamba 3B with a batch size of 512.
And (at least on CPU), it's a bit faster than before.
Note that "ggml_ssm_conv" is probably not the most appropriate name,
and it could be changed if a better one is found.
* llama : add inp_s_seq as a new input tensor
The most convenient implementation to select the correct state (for Mamba)
for each token is to directly get the correct index from a tensor.
This is why inp_s_seq is storing int32_t and not floats.
The other, less convenient way to select the correct state would be
to have inp_KQ_mask contain 1.0f for each state used by a token
and 0.0f otherwise. This complicates quickly fetching the first used
state of a token, and is also less efficient because a whole row
of the mask would always need to be read for each token.
Using indexes makes it easy to stop searching when there are
no more sequences for a token, and the first sequence assigned
is always very quickly available (it's the first element of each row).
* mamba : support llama_kv_cache_seq_cp copy chains
* mamba : support shifting and dividing the kv cache pos
* mamba : make the server and parallel examples work with whole sequences
A seq_id is dedicated to the system prompt in both cases.
* llama : make llama_kv_cache_seq_rm return whether it succeeded or not
* mamba : dedicate an input tensor for state copy indices
This is cleaner and makes it easier to adapt when/if token positions
(and by extension, inp_K_shift) are no longer integers.
* mamba : adapt perplexity, batched, and batched-bench examples
* perplexity : limit the max number of sequences
This adapts to what the loaded model can provide.
* llama : add llama_n_max_seq to get the upper limit for seq_ids
Used by the perplexity example.
* batched : pass n_parallel to the model's context params
This should have been there already, but it wasn't.
* batched-bench : reserve sequences to support Mamba
* batched-bench : fix tokens being put in wrong sequences
Generation quality isn't what's measured in there anyway,
but at least using the correct sequences avoids using non-consecutive
token positions.
* mamba : stop abusing attention metadata
This breaks existing converted-to-GGUF Mamba models,
but will allow supporting mixed architectures like MambaFormer
without needing to break Mamba models.
This will also allow changing the size of Mamba's states
without having to reconvert models in the future.
(e.g. using something else than d_conv - 1 columns for the conv_states
will not require breaking existing converted Mamba models again)
* gguf-py : add new KV metadata key-value pairs for Mamba
* llama : add new metadata key-value pairs for Mamba
* llama : guard against divisions by zero when n_head is 0
* mamba : rename "unlimited" KV cache property to "recurrent"
* mamba : more correctly update the "used" field of the KV cache
* ggml : in ggml_ssm_scan, use a threshold for soft_plus
This is how the official Mamba implementation does it,
and it's also what torch.nn.Softplus does.
* convert : for Mamba, fallback to internal NeoX tokenizer
The resulting models are exactly the same
as if the tokenizer.json and tokenizer_config.json of GPT-NeoX were there.
* mamba : support state saving and restoring
* ggml : implicitly pass src tensors through dst for Mamba-related ops
* mamba : clarify some comments
* server : fix cache_tokens not getting correctly resized
Otherwise, when the "we have to evaluate at least 1 token" special case
was triggered, an extra token was kept in cache_tokens even if it was
removed from the KV cache.
For Mamba, this caused useless prompt reprocessing when the previous
request triggered the above case.
* convert-hf : support new metadata keys for Mamba
For the models available at
https://huggingface.co/collections/state-spaces/transformers-compatible-mamba-65e7b40ab87e5297e45ae406
* mamba : rename metadata to be more similar to transformers library
This breaks existing converted-to-GGUF models,
but the metadata names are more "standard".
* mamba : support mamba-*-hf models
These models share their token_embd.weight with their output.weight
* mamba : add missing spaces
This is purely a formatting change.
* convert-hf : omit output.weight when identical with token_embd.weight
Only for Mamba for now, but it might be relevant for other models eventually.
Most Mamba models actually share these two tensors, albeit implicitly.
* readme : add Mamba to supported models, and add recent API changes
* mamba : move state_seq and state_mask views outside layer loop
A few tensors were also missing `struct` in front of `ggml_tensor`.
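The thresholded soft_plus mentioned in the list above ("in ggml_ssm_scan, use a threshold for soft_plus") can be sketched as follows. This is a minimal illustration, not the ggml code itself: the function name `softplus_thresholded` is hypothetical, and the default threshold of 20 is assumed from the default of `torch.nn.Softplus`, which the commit message cites as the reference behavior.

```cpp
#include <cmath>

// Hypothetical helper: softplus(x) = log(1 + exp(x)), with a cut-off above
// which the function is treated as the identity. For large x, exp(x) can
// overflow in float while softplus(x) is already numerically equal to x.
// The default threshold of 20 is an assumption taken from torch.nn.Softplus.
inline float softplus_thresholded(float x, float threshold = 20.0f) {
    if (x > threshold) {
        return x; // linear region: avoids computing exp(x) for large inputs
    }
    return std::log1p(std::exp(x));
}
```

Below the threshold the exact formula is used, so the two branches agree to within float precision at the switch-over point.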
2024-03-08 23:31:00 +01:00
llama_kv_cache_seq_add(ctx, slot.id + 1, slot.ga_i, slot.n_past_se, ib * bd);
llama_kv_cache_seq_div(ctx, slot.id + 1, slot.ga_i + ib * bd, slot.ga_i + ib * bd + slot.ga_w, slot.ga_n);
llama_kv_cache_seq_add(ctx, slot.id + 1, slot.ga_i + ib * bd + slot.ga_w, slot.n_past_se + ib * bd, dd);
2024-01-27 14:38:05 +01:00
slot.n_past_se -= bd;
slot.ga_i += slot.ga_w / slot.ga_n;
LOG_TEE("\nn_past_old = %d, n_past = %d, ga_i = %d\n\n", slot.n_past_se + bd, slot.n_past_se, slot.ga_i);
}
2024-03-07 10:41:53 +01:00
2024-01-27 14:38:05 +01:00
slot.n_past_se += n_tokens;
}
}
2024-01-30 19:17:30 +01:00
2024-03-07 10:41:53 +01:00
llama_batch batch_view = {
2023-10-22 21:53:08 +02:00
n_tokens,
batch.token + i,
nullptr,
batch.pos + i,
batch.n_seq_id + i,
batch.seq_id + i,
batch.logits + i,
0, 0, 0, // unused
};
2023-05-21 19:51:18 +02:00
2023-10-22 21:53:08 +02:00
const int ret = llama_decode(ctx, batch_view);
2024-01-27 14:38:05 +01:00
2024-03-07 10:41:53 +01:00
if (ret != 0) {
if (n_batch == 1 || ret < 0) {
2023-10-22 21:53:08 +02:00
// if you get here, it means the KV cache is full - try increasing it via the context size
2024-04-12 13:49:21 +02:00
LOG_ERROR("failed to decode the batch: KV cache is full - try increasing it via the context size", {
{"i", i},
{"n_batch", n_batch},
{"ret", ret},
});
2024-03-11 10:56:41 +01:00
for (auto & slot : slots) {
slot.state = SLOT_STATE_PROCESSING;
slot.command = SLOT_COMMAND_NONE;
slot.release();
send_error(slot, "Input prompt is too big compared to KV size. Please try increasing KV size.");
}
break; // break loop of n_batch
2023-10-22 21:53:08 +02:00
}
2023-06-20 00:12:39 +02:00
2023-10-22 21:53:08 +02:00
// retry with half the batch size to try to find a free slot in the KV cache
n_batch /= 2;
i -= n_batch;
2024-03-07 10:41:53 +01:00
2024-04-12 13:49:21 +02:00
LOG_WARNING("failed to find free space in the KV cache, retrying with smaller batch size - try increasing it via the context size or enable defragmentation", {
{"i", i},
{"n_batch", n_batch},
{"ret", ret},
});
2024-03-11 10:56:41 +01:00
continue; // continue loop of n_batch
2023-10-22 21:53:08 +02:00
}
2024-03-07 10:41:53 +01:00
for (auto & slot : slots) {
if (slot.state != SLOT_STATE_PROCESSING || slot.i_batch < (int) i || slot.i_batch >= (int) (i + n_tokens)) {
2024-03-11 10:56:41 +01:00
continue; // continue loop of slots
2023-10-22 21:53:08 +02:00
}
// prompt evaluated for embedding
2024-03-07 10:41:53 +01:00
if (slot.embedding) {
2024-03-04 21:31:20 +01:00
send_embedding(slot, batch_view);
2023-10-22 21:53:08 +02:00
slot.release();
slot.i_batch = -1;
2024-03-11 10:56:41 +01:00
continue; // continue loop of slots
2023-10-22 21:53:08 +02:00
}
completion_token_output result;
const llama_token id = llama_sampling_sample(slot.ctx_sampling, ctx, NULL, slot.i_batch - i);
llama_sampling_accept(slot.ctx_sampling, ctx, id, true);
2024-01-07 07:45:26 +01:00
slot.n_decoded += 1;
2024-03-07 10:41:53 +01:00
if (slot.n_decoded == 1) {
slot.t_start_generation = ggml_time_us();
slot.t_prompt_processing = (slot.t_start_generation - slot.t_start_process_prompt) / 1e3;
2024-02-25 13:49:43 +01:00
metrics.on_prompt_eval(slot);
2023-10-22 21:53:08 +02:00
}
llama_token_data_array cur_p = { slot.ctx_sampling->cur.data(), slot.ctx_sampling->cur.size(), false };
result.tok = id;
const int32_t n_probs = slot.sparams.n_probs;
2024-03-07 10:41:53 +01:00
if (slot.sparams.temp <= 0 && n_probs > 0) {
2023-10-22 21:53:08 +02:00
// for llama_sample_token_greedy we need to sort candidates
llama_sample_softmax(ctx, &cur_p);
}
2024-03-07 10:41:53 +01:00
for (size_t i = 0; i < std::min(cur_p.size, (size_t) n_probs); ++i) {
result.probs.push_back({
cur_p.data[i].id,
cur_p.data[i].p
});
2023-10-22 21:53:08 +02:00
}
2024-03-07 10:41:53 +01:00
if (!process_token(result, slot)) {
2023-10-22 21:53:08 +02:00
slot.release();
slot.print_timings();
2023-10-24 22:08:20 +02:00
send_final_response(slot);
2024-02-25 13:49:43 +01:00
metrics.on_prediction(slot);
2023-10-22 21:53:08 +02:00
}
slot.i_batch = -1;
}
2023-06-20 00:12:39 +02:00
}
2024-02-25 13:50:32 +01:00
2024-03-11 10:56:41 +01:00
LOG_VERBOSE("run slots completed", {});
2023-06-20 00:12:39 +02:00
}
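The retry logic in the decode loop above (halve `n_batch`, step `i` back, `continue`) can be isolated in a small sketch. This is an illustration under stated assumptions rather than the server's code: `decode_with_backoff` is a hypothetical name, and the `decode` callback stands in for `llama_decode` on a batch view, returning 0 on success, a positive value when no free KV cache space was found, and a negative value on an unrecoverable error.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>

// Hypothetical helper: walks `total` tokens in chunks of at most `n_batch`.
// `decode(offset, count)` is a stand-in for llama_decode on a batch view.
// Returns true if every token was decoded, false if decoding had to give up.
inline bool decode_with_backoff(int32_t total, int32_t n_batch,
                                const std::function<int(int32_t, int32_t)> & decode) {
    for (int32_t i = 0; i < total; i += n_batch) {
        const int32_t n_tokens = std::min(n_batch, total - i);
        const int ret = decode(i, n_tokens);
        if (ret != 0) {
            if (n_batch == 1 || ret < 0) {
                return false; // cannot shrink further, or unrecoverable error
            }
            // retry the same offset with half the batch size:
            // subtracting the new n_batch cancels the loop increment
            n_batch /= 2;
            i -= n_batch;
            continue;
        }
    }
    return true;
}
```

Note that the smaller batch size is kept for the rest of the loop, as in the code above; the sketch does not try to grow it back after a successful chunk.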
2024-03-02 22:00:14 +01:00
2024-03-07 10:41:53 +01:00
json model_meta() const {
return json {
{"vocab_type", llama_vocab_type(model)},
{"n_vocab", llama_n_vocab(model)},
{"n_ctx_train", llama_n_ctx_train(model)},
{"n_embd", llama_n_embd(model)},
{"n_params", llama_model_n_params(model)},
{"size", llama_model_size(model)},
2024-03-02 22:00:14 +01:00
};
}
2023-06-17 13:53:04 +02:00
} ;
2024-03-07 10:41:53 +01:00
static void server_print_usage ( const char * argv0 , const gpt_params & params , const server_params & sparams ) {
2023-09-05 21:10:27 +02:00
printf ( " usage: %s [options] \n " , argv0 ) ;
printf ( " \n " ) ;
printf ( " options: \n " ) ;
2023-10-24 22:10:43 +02:00
printf ( " -h, --help show this help message and exit \n " ) ;
printf ( " -v, --verbose verbose output (default: %s) \n " , server_verbose ? " enabled " : " disabled " ) ;
2023-11-01 23:04:33 +01:00
printf ( " -t N, --threads N number of threads to use during computation (default: %d) \n " , params . n_threads ) ;
2023-10-24 22:10:43 +02:00
printf ( " -tb N, --threads-batch N number of threads to use during batch and prompt processing (default: same as --threads) \n " ) ;
2024-03-03 08:48:36 +01:00
printf ( " --threads-http N number of threads in the http server pool to process requests (default: max(hardware concurrency - 1, --parallel N + 2)) \n " ) ;
2023-11-01 23:04:33 +01:00
printf ( " -c N, --ctx-size N size of the prompt context (default: %d) \n " , params . n_ctx ) ;
printf ( " --rope-scaling {none,linear,yarn} \n " ) ;
printf ( " RoPE frequency scaling method, defaults to linear unless specified by the model \n " ) ;
2023-10-24 22:10:43 +02:00
printf ( " --rope-freq-base N RoPE base frequency (default: loaded from model) \n " ) ;
2023-11-01 23:04:33 +01:00
printf ( " --rope-freq-scale N RoPE frequency scaling factor, expands context by a factor of 1/N \n " ) ;
printf ( " --yarn-ext-factor N YaRN: extrapolation mix factor (default: 1.0, 0.0 = full interpolation) \n " ) ;
printf ( " --yarn-attn-factor N YaRN: scale sqrt(t) or attention magnitude (default: 1.0) \n " ) ;
printf ( " --yarn-beta-slow N YaRN: high correction dim or alpha (default: %.1f) \n " , params . yarn_beta_slow ) ;
printf ( " --yarn-beta-fast N YaRN: low correction dim or beta (default: %.1f) \n " , params . yarn_beta_fast ) ;
2024-03-07 10:41:53 +01:00
printf ( " --pooling {none,mean,cls} pooling type for embeddings, use model default if unspecified \n " ) ;
2024-03-09 23:41:49 +01:00
printf ( " -dt N, --defrag-thold N \n " ) ;
printf ( " KV cache defragmentation threshold (default: %.1f, < 0 - disabled) \n " , params . defrag_thold ) ;
2024-03-13 18:54:21 +01:00
printf ( " -b N, --batch-size N logical maximum batch size (default: %d) \n " , params . n_batch ) ;
printf ( " -ub N, --ubatch-size N physical maximum batch size (default: %d) \n " , params . n_ubatch ) ;
2024-03-07 10:41:53 +01:00
if ( llama_supports_mlock ( ) ) {
2024-01-30 19:17:30 +01:00
printf ( " --mlock force system to keep model in RAM rather than swapping or compressing \n " ) ;
2023-06-17 13:53:04 +02:00
}
2024-03-07 10:41:53 +01:00
if ( llama_supports_mmap ( ) ) {
2024-01-30 19:17:30 +01:00
printf ( " --no-mmap do not memory-map model (slower load but may reduce pageouts if not using mlock) \n " ) ;
2023-05-21 19:51:18 +02:00
}
2024-02-16 10:31:07 +01:00
printf ( " --numa TYPE attempt optimizations that help on some NUMA systems \n " ) ;
printf ( " - distribute: spread execution evenly over all nodes \n " ) ;
printf ( " - isolate: only spawn threads on CPUs on the node that execution started on \n " ) ;
printf ( " - numactl: use the CPU map provided by numactl \n " ) ;
2024-01-31 16:30:17 +01:00
if ( llama_supports_gpu_offload ( ) ) {
printf ( " -ngl N, --n-gpu-layers N \n " ) ;
printf ( " number of layers to store in VRAM \n " ) ;
printf ( " -sm SPLIT_MODE, --split-mode SPLIT_MODE \n " ) ;
printf ( " how to split the model across multiple GPUs, one of: \n " ) ;
printf ( " - none: use one GPU only \n " ) ;
printf ( " - layer (default): split layers and KV across GPUs \n " ) ;
printf ( " - row: split rows across GPUs \n " ) ;
printf ( " -ts SPLIT --tensor-split SPLIT \n " ) ;
printf ( " fraction of the model to offload to each GPU, comma-separated list of proportions, e.g. 3,1 \n " ) ;
printf ( " -mg i, --main-gpu i the GPU to use for the model (with split-mode = none), \n " ) ;
printf ( " or for intermediate results and KV (with split-mode = row) \n " ) ;
2024-04-04 08:33:48 +02:00
printf ( " -nkvo, --no-kv-offload \n " ) ;
printf ( " disable KV offload \n " ) ;
2024-01-31 16:30:17 +01:00
}
2023-09-05 21:10:27 +02:00
printf ( " -m FNAME, --model FNAME \n " ) ;
2024-04-30 01:52:50 +02:00
printf ( " model path (default: models/$filename with filename from --hf-file or --model-url if set, otherwise %s) \n " , DEFAULT_MODEL_PATH ) ;
2024-03-17 19:12:37 +01:00
printf ( " -mu MODEL_URL, --model-url MODEL_URL \n " ) ;
2024-03-23 18:07:00 +01:00
printf ( " model download url (default: unused) \n " ) ;
printf ( " -hfr REPO, --hf-repo REPO \n " ) ;
printf ( " Hugging Face model repository (default: unused) \n " ) ;
printf ( " -hff FILE, --hf-file FILE \n " ) ;
printf ( " Hugging Face model file (default: unused) \n " ) ;
2023-09-05 21:10:27 +02:00
printf ( " -a ALIAS, --alias ALIAS \n " ) ;
2024-01-30 19:17:30 +01:00
printf ( " set an alias for the model, will be added as `model` field in completion response \n " ) ;
printf ( " --lora FNAME apply LoRA adapter (implies --no-mmap) \n " ) ;
printf ( " --lora-base FNAME optional model to use as a base for the layers modified by the LoRA adapter \n " ) ;
printf ( " --host ip address to listen on (default: %s) \n " , sparams . hostname . c_str ( ) ) ;
printf ( " --port PORT port to listen on (default: %d) \n " , sparams . port ) ;
2024-03-09 11:27:53 +01:00
printf ( " --path PUBLIC_PATH path from which to serve static files (default: disabled) \n " ) ;
2024-01-30 19:17:30 +01:00
printf ( " --api-key API_KEY optional api key to enhance server security. If set, requests must include this key for access. \n " ) ;
printf ( " --api-key-file FNAME path to file containing api keys delimited by new lines. If set, requests must include one of the keys for access. \n " ) ;
2024-03-09 10:57:09 +01:00
# ifdef CPPHTTPLIB_OPENSSL_SUPPORT
printf ( " --ssl-key-file FNAME path to a file containing a PEM-encoded SSL private key \n " ) ;
printf ( " --ssl-cert-file FNAME path to a file containing a PEM-encoded SSL certificate \n " ) ;
# endif
2024-01-30 19:17:30 +01:00
printf ( " -to N, --timeout N server read/write timeout in seconds (default: %d) \n " , sparams . read_timeout ) ;
2024-03-07 10:41:53 +01:00
printf ( " --embeddings enable embedding vector output (default: %s) \n " , params . embedding ? " enabled " : " disabled " ) ;
2024-01-30 19:17:30 +01:00
printf ( " -np N, --parallel N number of slots for processing requests (default: %d) \n " , params . n_parallel ) ;
2024-03-22 12:08:28 +01:00
printf ( " -cb, --cont-batching enable continuous batching (a.k.a dynamic batching) (default: enabled) \n " ) ;
    printf("  -fa, --flash-attn         enable Flash Attention (default: %s)\n", params.flash_attn ? "enabled" : "disabled");
    printf("  -spf FNAME, --system-prompt-file FNAME\n");
    printf("                            set a file to load a system prompt (initial prompt of all slots), this is useful for chat applications.\n");
    printf("  -ctk TYPE, --cache-type-k TYPE\n");
    printf("                            KV cache data type for K (default: f16)\n");
    printf("  -ctv TYPE, --cache-type-v TYPE\n");
    printf("                            KV cache data type for V (default: f16)\n");
    printf("  --log-format              log output format: json or text (default: json)\n");
    printf("  --log-disable             disables logging to a file.\n");
    printf("  --slots-endpoint-disable  disables slots monitoring endpoint.\n");
    printf("  --metrics                 enable prometheus compatible metrics endpoint (default: %s).\n", sparams.metrics_endpoint ? "enabled" : "disabled");
    printf("  --slot-save-path PATH     path to save slot kv cache (default: disabled)\n");
    printf("\n");
    printf("  -n, --n-predict           maximum tokens to predict (default: %d)\n", params.n_predict);
    printf("  --override-kv KEY=TYPE:VALUE\n");
    printf("                            advanced option to override model metadata by key. may be specified multiple times.\n");
    printf("                            types: int, float, bool, str. example: --override-kv tokenizer.ggml.add_bos_token=bool:false\n");
    printf("  -gan N, --grp-attn-n N    set the group attention factor to extend context size through self-extend (default: 1=disabled), used together with group attention width `--grp-attn-w`\n");
    printf("  -gaw N, --grp-attn-w N    set the group attention width to extend context size through self-extend (default: 512), used together with group attention factor `--grp-attn-n`\n");
    printf("  --chat-template JINJA_TEMPLATE\n");
    printf("                            set custom jinja chat template (default: template taken from model's metadata)\n");
    printf("                            only commonly used templates are accepted:\n");
    printf("                            https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template\n");
    printf("\n");
}

static void server_params_parse(int argc, char ** argv, server_params & sparams, gpt_params & params) {
    gpt_params default_params;
    server_params default_sparams;
    std::string arg;
    bool invalid_param = false;

    for (int i = 1; i < argc; i++) {
        arg = argv[i];
        if (arg == "--port") {
            if (++i >= argc) {
                invalid_param = true;
                break;
            }
            // note: std::stoi throws std::invalid_argument on non-numeric input
            sparams.port = std::stoi(argv[i]);
        } else if (arg == "--host") {
            if (++i >= argc) {
                invalid_param = true;
                break;
            }
            sparams.hostname = argv[i];
        } else if (arg == "--path") {
            if (++i >= argc) {
                invalid_param = true;
                break;
            }
            sparams.public_path = argv[i];
        } else if (arg == "--api-key") {
            if (++i >= argc) {
                invalid_param = true;
                break;
            }
            sparams.api_keys.push_back(argv[i]);
        } else if (arg == "--api-key-file") {
            if (++i >= argc) {
                invalid_param = true;
                break;
            }
            std::ifstream key_file(argv[i]);
            if (!key_file) {
                fprintf(stderr, "error: failed to open file '%s'\n", argv[i]);
                invalid_param = true;
                break;
            }
            std::string key;
            while (std::getline(key_file, key)) {
                if (key.size() > 0) {
                    sparams.api_keys.push_back(key);
                }
            }
            key_file.close();
        }
#ifdef CPPHTTPLIB_OPENSSL_SUPPORT
        else if (arg == "--ssl-key-file") {
            if (++i >= argc) {
                invalid_param = true;
                break;
            }
            sparams.ssl_key_file = argv[i];
        } else if (arg == "--ssl-cert-file") {
            if (++i >= argc) {
                invalid_param = true;
                break;
            }
            sparams.ssl_cert_file = argv[i];
        }
#endif
        else if (arg == "--timeout" || arg == "-to") {
            if (++i >= argc) {
                invalid_param = true;
                break;
            }
            // a single value sets both the read and the write timeout
            sparams.read_timeout = std::stoi(argv[i]);
            sparams.write_timeout = std::stoi(argv[i]);
        } else if (arg == "-m" || arg == "--model") {
            if (++i >= argc) {
                invalid_param = true;
                break;
            }
            params.model = argv[i];
        } else if (arg == "-mu" || arg == "--model-url") {
            if (++i >= argc) {
                invalid_param = true;
                break;
            }
            params.model_url = argv[i];
        } else if (arg == "-hfr" || arg == "--hf-repo") {
            if (++i >= argc) {
                invalid_param = true;
                break;
            }
            params.hf_repo = argv[i];
        } else if (arg == "-hff" || arg == "--hf-file") {
            if (++i >= argc) {
                invalid_param = true;
                break;
            }
            params.hf_file = argv[i];
        } else if (arg == "-a" || arg == "--alias") {
            if (++i >= argc) {
                invalid_param = true;
                break;
            }
            params.model_alias = argv[i];
        } else if (arg == "-h" || arg == "--help") {
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
    server_print_usage(argv[0], default_params, default_sparams);
    exit(0);
2024-03-07 10:41:53 +01:00
} else if (arg == "-c" || arg == "--ctx-size" || arg == "--ctx_size") {
    if (++i >= argc) {
2023-06-17 13:53:04 +02:00
        invalid_param = true;
        break;
    }
    params.n_ctx = std::stoi(argv[i]);
2024-03-07 10:41:53 +01:00
} else if (arg == "--rope-scaling") {
    if (++i >= argc) {
2023-11-01 23:04:33 +01:00
        invalid_param = true;
        break;
    }
    std::string value(argv[i]);
2024-02-25 11:09:09 +01:00
    /**/ if (value == "none")   { params.rope_scaling_type = LLAMA_ROPE_SCALING_TYPE_NONE; }
    else if (value == "linear") { params.rope_scaling_type = LLAMA_ROPE_SCALING_TYPE_LINEAR; }
    else if (value == "yarn")   { params.rope_scaling_type = LLAMA_ROPE_SCALING_TYPE_YARN; }
2023-11-01 23:04:33 +01:00
    else { invalid_param = true; break; }
2024-03-07 10:41:53 +01:00
} else if (arg == "--rope-freq-base") {
    if (++i >= argc) {
llama : add custom RoPE (#2054)
* Implement customizable RoPE
The original RoPE has pre-defined parameters
theta_i = 10000^(−2(i−1)/d), for i in [1, 2, ..., d/2]
Our customizable RoPE, ggml_rope_custom_inplace, uses
theta_i = scale * base^(−2(i−1)/d), for i in [1, 2, ..., d/2]
with the default matches the original
scale = 1.0
base = 10000
The new command line arguments
--rope-freq-base
--rope-freq-scale
set the two new RoPE parameter.
Recent researches show changing these two parameters extends the context limit with minimal loss.
1. Extending Context to 8K
kaiokendev
https://kaiokendev.github.io/til#extending-context-to-8k
2. Extending Context Window of Large Language Models via Positional Interpolation
Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian
https://arxiv.org/abs/2306.15595
3. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation.
https://www.reddit.com/user/bloc97
https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/
For the bold, try adding the following command line parameters to your favorite model:
-c 16384 --rope-freq-base 80000 --rope-freq-scale 0.5
* ggml-metal: fix custom rope
* common: fix argument names in help
* llama: increase MEM_REQ_EVAL for MODEL_3B
It avoids crashing for quantized weights on CPU.
Better ways to calculate the required buffer size would be better.
* llama: make MEM_REQ_EVAL depend on n_ctx
* server: use proper Content-Type in curl examples
Without the header Content-Type: application/json, curl will POST with
Content-Type: application/x-www-form-urlencoded
Though our simple server doesn't care, the httplib.h used has a limit
with CPPHTTPLIB_FORM_URL_ENCODED_PAYLOAD_MAX_LENGTH 8192
With Content-Type: application/json, we can send large json data.
* style : minor fixes, mostly indentations
* ggml : fix asserts
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
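The frequency formula quoted in the commit message above can be sketched as a small standalone function. This is only an illustration of theta_i = scale * base^(-2(i-1)/d); the helper name `rope_thetas` and its signature are hypothetical and are not part of llama.cpp or ggml.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper (not a llama.cpp API): evaluates
//   theta_i = scale * base^(-2(i-1)/d),  for i in [1, 2, ..., d/2]
// The defaults (base = 10000, scale = 1.0) reproduce the original RoPE values
// described in the commit message.
std::vector<double> rope_thetas(int d, double base = 10000.0, double scale = 1.0) {
    std::vector<double> thetas;
    thetas.reserve(d / 2);
    for (int i = 1; i <= d / 2; ++i) {
        thetas.push_back(scale * std::pow(base, -2.0 * (i - 1) / d));
    }
    return thetas;
}
```

Under this sketch, the values suggested in the commit message (`--rope-freq-base 80000 --rope-freq-scale 0.5`) would correspond to calling `rope_thetas(d, 80000.0, 0.5)`, where `d` is the per-head dimension of the model.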
2023-07-15 12:34:16 +02:00
        invalid_param = true;
        break;
    }
    params.rope_freq_base = std::stof(argv[i]);
2024-03-07 10:41:53 +01:00
} else if (arg == "--rope-freq-scale") {
    if (++i >= argc) {
2023-07-15 12:34:16 +02:00
        invalid_param = true;
        break;
    }
    params.rope_freq_scale = std::stof(argv[i]);
2024-03-07 10:41:53 +01:00
} else if (arg == "--yarn-ext-factor") {
2023-11-01 23:04:33 +01:00
    if (++i >= argc) {
        invalid_param = true;
        break;
    }
    params.yarn_ext_factor = std::stof(argv[i]);
}
2024-03-07 10:41:53 +01:00
else if (arg == "--yarn-attn-factor") {
2023-11-01 23:04:33 +01:00
    if (++i >= argc) {
        invalid_param = true;
        break;
    }
    params.yarn_attn_factor = std::stof(argv[i]);
2024-03-07 10:41:53 +01:00
} else if (arg == "--yarn-beta-fast") {
2023-11-01 23:04:33 +01:00
    if (++i >= argc) {
        invalid_param = true;
        break;
    }
    params.yarn_beta_fast = std::stof(argv[i]);
2024-03-07 10:41:53 +01:00
} else if (arg == "--yarn-beta-slow") {
2023-11-01 23:04:33 +01:00
    if (++i >= argc) {
        invalid_param = true;
        break;
    }
    params.yarn_beta_slow = std::stof(argv[i]);
2024-03-07 10:41:53 +01:00
} else if (arg == "--pooling") {
2024-03-04 21:31:20 +01:00
    if (++i >= argc) {
        invalid_param = true;
        break;
    }
    std::string value(argv[i]);
    /**/ if (value == "none") { params.pooling_type = LLAMA_POOLING_TYPE_NONE; }
    else if (value == "mean") { params.pooling_type = LLAMA_POOLING_TYPE_MEAN; }
    else if (value == "cls")  { params.pooling_type = LLAMA_POOLING_TYPE_CLS; }
    else { invalid_param = true; break; }
2024-03-09 23:41:49 +01:00
} else if (arg == "--defrag-thold" || arg == "-dt") {
    if (++i >= argc) {
        invalid_param = true;
        break;
    }
    params.defrag_thold = std::stof(argv[i]);
2024-03-07 10:41:53 +01:00
} else if (arg == "--threads" || arg == "-t") {
2023-07-05 22:51:13 +02:00
    if (++i >= argc)
    {
2023-06-17 13:53:04 +02:00
        invalid_param = true;
        break;
    }
    params.n_threads = std::stoi(argv[i]);
2024-03-07 10:41:53 +01:00
} else if (arg == "--grp-attn-n" || arg == "-gan") {
2024-01-27 14:38:05 +01:00
    if (++i >= argc) {
        invalid_param = true;
        break;
    }
    params.grp_attn_n = std::stoi(argv[i]);
2024-03-07 10:41:53 +01:00
} else if (arg == "--grp-attn-w" || arg == "-gaw") {
    if (++i >= argc) {
2024-01-27 14:38:05 +01:00
        invalid_param = true;
        break;
    }
    params.grp_attn_w = std::stoi(argv[i]);
2024-03-07 10:41:53 +01:00
} else if (arg == "--threads-batch" || arg == "-tb") {
    if (++i >= argc) {
2023-10-24 22:10:43 +02:00
        invalid_param = true;
        break;
    }
    params.n_threads_batch = std::stoi(argv[i]);
2024-03-07 10:41:53 +01:00
} else if (arg == "--threads-http") {
    if (++i >= argc) {
2024-03-01 10:08:08 +01:00
        invalid_param = true;
        break;
    }
    sparams.n_threads_http = std::stoi(argv[i]);
2024-03-07 10:41:53 +01:00
} else if (arg == "-b" || arg == "--batch-size") {
    if (++i >= argc) {
2023-06-17 13:53:04 +02:00
        invalid_param = true;
        break;
    }
    params.n_batch = std::stoi(argv[i]);
2024-03-13 18:54:21 +01:00
} else if (arg == "-ub" || arg == "--ubatch-size") {
    if (++i >= argc) {
        invalid_param = true;
        break;
    }
    params.n_ubatch = std::stoi(argv[i]);
2024-03-07 10:41:53 +01:00
} else if (arg == "--gpu-layers" || arg == "-ngl" || arg == "--n-gpu-layers") {
    if (++i >= argc) {
2023-06-17 13:53:04 +02:00
        invalid_param = true;
        break;
    }
2024-01-31 16:30:17 +01:00
    if (llama_supports_gpu_offload()) {
        params.n_gpu_layers = std::stoi(argv[i]);
    } else {
2024-03-07 10:41:53 +01:00
        LOG_WARNING(
            "Not compiled with GPU offload support, --n-gpu-layers option will be ignored. "
            "See main README.md for information on enabling GPU BLAS support",
            {{"n_gpu_layers", params.n_gpu_layers}});
2024-01-31 16:30:17 +01:00
    }
2024-04-04 08:33:48 +02:00
} else if (arg == "-nkvo" || arg == "--no-kv-offload") {
    params.no_kv_offload = true;
2024-03-07 10:41:53 +01:00
} else if (arg == "--split-mode" || arg == "-sm") {
2024-01-12 20:07:38 +01:00
    if (++i >= argc) {
        invalid_param = true;
        break;
    }
    std::string arg_next = argv[i];
2024-03-07 10:41:53 +01:00
    if (arg_next == "none") {
2024-02-25 11:09:09 +01:00
        params.split_mode = LLAMA_SPLIT_MODE_NONE;
2024-03-07 10:41:53 +01:00
    } else if (arg_next == "layer") {
2024-02-25 11:09:09 +01:00
        params.split_mode = LLAMA_SPLIT_MODE_LAYER;
2024-03-07 10:41:53 +01:00
    } else if (arg_next == "row") {
2024-02-25 11:09:09 +01:00
        params.split_mode = LLAMA_SPLIT_MODE_ROW;
2024-03-07 10:41:53 +01:00
    } else {
2024-01-12 20:07:38 +01:00
        invalid_param = true;
        break;
    }
2024-03-26 01:16:01 +01:00
#ifndef GGML_USE_CUDA
    fprintf(stderr, "warning: llama.cpp was compiled without CUDA. Setting the split mode has no effect.\n");
#endif // GGML_USE_CUDA
2024-03-07 10:41:53 +01:00
} else if (arg == "--tensor-split" || arg == "-ts") {
    if (++i >= argc) {
2023-06-17 13:53:04 +02:00
        invalid_param = true;
        break;
    }
2024-03-26 01:16:01 +01:00
#if defined(GGML_USE_CUDA) || defined(GGML_USE_SYCL)
2023-06-17 13:53:04 +02:00
    std::string arg_next = argv[i];
2023-06-06 21:33:23 +02:00
2023-06-17 13:53:04 +02:00
    // split string by , and /
2023-07-05 22:51:13 +02:00
    const std::regex regex{R"([,/]+)"};
    std::sregex_token_iterator it{arg_next.begin(), arg_next.end(), regex, -1};
    std::vector<std::string> split_arg{it, {}};
2024-01-31 16:30:17 +01:00
    GGML_ASSERT(split_arg.size() <= llama_max_devices());
2023-06-06 21:33:23 +02:00
2024-03-07 10:41:53 +01:00
    for (size_t i_device = 0; i_device < llama_max_devices(); ++i_device) {
        if (i_device < split_arg.size()) {
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplifies the logic and removes a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
params.tensor_split[i_device] = std::stof(split_arg[i_device]);
2024-03-07 10:41:53 +01:00
} else {
2023-06-17 13:53:04 +02:00
params.tensor_split[i_device] = 0.0f;
}
}
2023-06-06 21:33:23 +02:00
#else
2024-03-26 01:16:01 +01:00
LOG_WARNING("llama.cpp was compiled without CUDA. It is not possible to set a tensor split.\n", {});
#endif // GGML_USE_CUDA
2024-03-07 10:41:53 +01:00
} else if (arg == "--main-gpu" || arg == "-mg") {
if (++i >= argc) {
2023-06-17 13:53:04 +02:00
invalid_param = true;
break;
}
2024-03-26 01:16:01 +01:00
#if defined(GGML_USE_CUDA) || defined(GGML_USE_SYCL)
2023-06-17 13:53:04 +02:00
params.main_gpu = std::stoi(argv[i]);
2023-06-06 21:33:23 +02:00
#else
2024-03-26 01:16:01 +01:00
LOG_WARNING("llama.cpp was compiled without CUDA. It is not possible to set a main GPU.", {});
2023-05-28 19:48:57 +02:00
#endif
2024-03-07 10:41:53 +01:00
} else if (arg == "--lora") {
if (++i >= argc) {
2023-06-17 13:53:04 +02:00
invalid_param = true;
break;
}
2024-02-03 12:23:37 +01:00
params.lora_adapter.emplace_back(argv[i], 1.0f);
train : finetune LORA (#2632)
* fix track_max_mem in forward_batch_wo_cache_flash_attn_train
* remove unnecessary Adam(W) optimizer tensors.
reduces optimizer memory overhead from 7*modelsize to 2*modelsize.
additionally allows optimizing models with more than 2^31 parameters by replacing int with int64_t.
bumps training checkpoint file version, but old checkpoints can still be read.
new version with fewer tensors is saved.
* add gradient clipping to AdamW
* Fix reset of unused g->nodes and g->grads to NULL
* implement gradient checkpointing for training
reduces memory overhead from O(n_layer) to O(sqrt(n_layer))
as explained in readme of https://github.com/cybertronai/gradient-checkpointing
* remove unused compute buffer 3
* add and use function ggml_build_backward_expand to avoid stack overflows with large maximum number of nodes
GGML_API void ggml_build_backward_expand(struct ggml_context * ctx, struct ggml_cgraph * gf, struct ggml_cgraph * gb, bool keep);
* change AdamW decay parameter to work like the torch AdamW decay parameter
It is now relative to Adam learning rate `alpha*sched`.
Before that it was relative to `sched` only.
`alpha` being the maximum learning rate and `sched` being a scaling parameter in [0..1]
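A minimal sketch of the decay step described above, not the actual ggml code: the function name and signature are hypothetical, and only the scaling factor `alpha*sched` is taken from the description.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: decoupled weight decay applied to a parameter vector.
// alpha: maximum learning rate, sched: schedule factor in [0, 1],
// decay: weight decay coefficient.
void apply_weight_decay(std::vector<double> & weights, double alpha, double sched, double decay) {
    assert(sched >= 0.0 && sched <= 1.0);
    // New behaviour: decay is relative to the effective learning rate alpha*sched.
    // The previous behaviour corresponds to (1.0 - sched * decay).
    const double keep = 1.0 - alpha * sched * decay;
    for (double & w : weights) {
        w *= keep;
    }
}
```

With `alpha = 0.5`, `sched = 0.5` and `decay = 0.5` every weight is multiplied by `0.875`, whereas the old scaling would have multiplied it by `0.75`.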
* change default AdamW weight decay parameter used in training to 0.1 as used in nanoGPT
* change default AdamW weight decay parameter defined in ggml to 0.0, making Adam default instead of AdamW
btw: the default weight decay parameter for torch.optim.AdamW is 0.01
* bug fixes for cross entropy loss
ggml_cross_entropy_loss: sums were not correctly added in workload of each thread
ggml_cross_entropy_loss_back: simplify backward process, reducing numerical issues
guard usage of exp f16 lookup in cross entropy by #define GGML_CROSS_ENTROPY_EXP_FP16
cross entropy loss is only used once during training, but it is quite sensitive to numerical errors introduced by exp-f16-lookup.
so exp-f16-lookup for cross entropy loss is disabled by default, trading better gradients for very slightly worse runtime performance.
* fix test-grad0 for cross_entropy_loss
the second argument to cross_entropy_loss must sum up to 1 for each row
* fix test-grad0 for soft_max
don't use only sum as aggregation: the sum of softmax is always 1, so its gradient is zero and finite differences have nothing to measure
instead use sum(log(soft_max()*(1-eps)+eps)); use eps to avoid log(0)
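The following sketch illustrates why the aggregation matters. It is a standalone illustration, not the test-grad0 code; all function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable softmax over a single row.
std::vector<double> soft_max(const std::vector<double> & x) {
    assert(!x.empty());
    const double max_x = *std::max_element(x.begin(), x.end());
    std::vector<double> y(x.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = std::exp(x[i] - max_x);
        sum += y[i];
    }
    for (double & v : y) {
        v /= sum;
    }
    return y;
}

// Aggregation that is constant (always 1), so it carries no gradient signal.
double sum_objective(const std::vector<double> & x) {
    double s = 0.0;
    for (double p : soft_max(x)) {
        s += p;
    }
    return s;
}

// Aggregation from the note above: sum(log(p*(1-eps)+eps)); eps avoids log(0).
double log_objective(const std::vector<double> & x, double eps) {
    double s = 0.0;
    for (double p : soft_max(x)) {
        s += std::log(p * (1.0 - eps) + eps);
    }
    return s;
}
```

Perturbing one input leaves `sum_objective` unchanged but changes `log_objective`, which is what a finite-difference gradient check needs.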
* improve finite differences of test-grad0 by using double instead of float
* change cross_entropy_loss to output average over all rows
this helps keeping the loss and gradients in a sane range
* improve gradient checkpointing
sqrt(n_layers) is only the best checkpoint step when mem size of checkpoints and mem size of layers are equal.
since layers require more memory than the single-tensor-checkpoint we use, the optimal values are computed differently:
```
given: n, u, v
objective: minimize(a*u+b*v) where a*b=n, a>0, b>0
b=n/a
minimize(a*u+v*n/a)
diff(a*u+v*n/a, a) = u - (v*n/a)/a
diff(a*u+v*n/a, a) == 0
u - (v*n/a)/a == 0
u == v*n/(a*a)
u*a*a = v*n
a*a = v*n/u
a = sqrt(n*v/u)
```
this change results in more checkpoints, requiring fewer layers to store between checkpoints, overall improving memory usage.
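The closed form from the derivation above can be evaluated directly. This is a sketch only: `u` and `v` are treated as the abstract per-unit costs from the objective `a*u + b*v`, the struct and function names are hypothetical, and rounding to whole layer counts is left out.

```cpp
#include <cassert>
#include <cmath>

struct checkpoint_split {
    double a;  // minimizer of a*u + v*n/a
    double b;  // b = n / a, so that a*b = n
};

// Cost being minimized in the derivation: a*u + b*v with b = n/a.
double split_cost(double a, double n, double u, double v) {
    return a * u + v * n / a;
}

// a = sqrt(n*v/u). The second derivative 2*v*n/a^3 is positive for a > 0,
// so this stationary point is a minimum.
checkpoint_split optimal_split(double n, double u, double v) {
    assert(n > 0.0 && u > 0.0 && v > 0.0);
    const double a = std::sqrt(n * v / u);
    return { a, n / a };
}
```

For `n = 16`, `u = 1`, `v = 4` this gives `a = 8`, `b = 2` with cost 16, while the equal-cost choice `a = sqrt(n) = 4` costs 20.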
* disable gradient checkpointing debug output
* llama : fix rope usage in train-text-from-scratch after ChatGLM change
* add more training parameters:
--enable-restart N Only for Adam optimizer. Enable restarts of cos-decay
--disable-restart N Only for Adam optimizer. Disable restarts of cos-decay
--opt-past N Number of optimization iterations to track for delta convergence test. Disabled when zero.
--opt-delta N Maximum delta for delta convergence test. Disabled when <= zero.
--opt-max-no-improvement N Maximum number of optimization iterations with no improvement. Disabled when <= zero.
--adam-epsf N AdamW epsilon for convergence test. Disabled when <= zero.
--adam-min-alpha N Adam minimum learning rate alpha, usually 0.1 * alpha
* replace memcpy with reshape operation so that the graph is not cut at the input
this makes it possible to store other values into the input tensor and then simply recompute the graph without rebuilding it
* remove unused function argument from get_example_targets_batch
* measure and print total training time
* add optimization callback to ggml_opt_resume_g
this callback is called before each iteration with custom data and pointer to learning schedule parameter (only used in Adam(W)).
can be used for dynamic learning schedule and setting input data for batches before each iteration
* use optimization callback in training
allows dynamic learning schedule and different batch data for each iteration without relying on low n_iter and high n_examples parameters
reduces runtime by avoiding restart of optimization function and improves training convergence by providing a different batch for each iteration
* add minimum number of tensor dimensions to apply weight decay (default 2)
this makes it possible to skip weight decay for bias parameters
* rename training parameter cos-decay-alpha to cos-decay-min and clarify that adam-min-alpha also applies to warmup
* fix increase of model.train_samples and model.train_tokens
now that each optimizer iteration gets its own batch we need to multiply by number of opt iterations
* change sampling parameters for prediction after training to defaults of common.h
and clarify what is context for prediction and what are generated tokens
* tighten abs error bounds for cross_entropy_loss in test-grad0
* add conditional compilation of using F16 exp in flash attention
uncomment `// #define GGML_FLASH_ATTN_EXP_FP16` to enable usage of f16 exp in flash attention
* tighten abs error bounds for flash_attn in test-grad0
* tighten abs error bounds for sqrt in test-grad0
* remove out-commented vectorized code of opt_adam
the vectorized code might be a bit faster for a low number of parameters, but it had a big memory usage overhead
* ggml : update ggml_rms_norm_back with configurable eps
* llama training : fix ggml_rms_norm_back calls to pass configurable eps
* remove trailing whitespace
* add train function using automatic gradient checkpointing backward pass and allocator
* in train function replace add_inplace by regular add
because using add_inplace seems to result in different gradients
* don't use allocate hash_map on context
because the context has no_alloc=True when using memory allocator resulting in NULL data pointers
* correctly clone reshape and permute operations by also cloning tensor->nb values
* fix variable name and add missing type cast
* terminate recursive tensor cloning when reaching tensor without src tensors
* correctly clone view tensors by setting data pointers
without this the checkpointing would only work when being used together with memory allocator
* fix variable names
* swap arguments to commutative ops to be the same as in `forward_batch_wo_cache_flash_attn`
* add input tensors as checkpoints
so that recursive tensor cloning of gradient checkpointing terminates on input tensors
* fix variable name and add missing boolean negation
* make sure some tensors are not reallocated by inserting new temporary nodes depending on them:
output and parameter gradient tensors need to be available at the end of the graph execution
parameter gradient tensors also need to be available before the graph execution because they are set to zero before each optimizer iteration
checkpoint tensors are allocated all together to reduce memory allocator fragmentation
afterwards, in addition to the temporary nodes, we also need to reset the temporary leafs
* fix ASSERT to work with zero layers
* add training options whether to use allocator and/or unified training function
* integrate unified training function which may use memory allocator
the unified training function also supports arguments whether to use flash attention and/or gradient checkpointing
* format name of cloned tensors with " (clone)" suffix
* set names for tensors in unified train function for easier debugging
* allocate graph on context using ggml_new_graph
* remove handwritten training functions
* remove unused training parameters "use_scratch" and "use_unified"
* remove trailing whitespace
* remove unused train params: mem_compute1_gb & mem_compute2_gb
mem_compute_gb is used for compute when automatic memory allocator is not enabled, otherwise it can be very small to only hold the tensor definitions
mem_compute0_gb is used for automatic memory allocator (as long as measurement of max required size is not implemented)
* remove unused forward_batch function
* add debug asserts in ggml_allocr_alloc to some common pitfalls when using this function directly
* only use ggml_allocr_alloc when tensor has NULL data and is no view
* fix test when to create temporary backward graph
temporary backward graph is only necessary when using checkpointing
* fix memory "leak" in optimizers
each iteration a new cplan with new memory for work data was allocated.
now cplan creation only happens at the start of optimization, with each iteration reusing the cplan and its work data.
* reverse order of for loop in ggml_build_backward_expand to save memory when using gradient checkpointing and allocator
with this loop order gradient checkpointing with allocator saves 13% memory on a 16 layer model; on a 2 layer model it saves 2% memory.
the computation results are the same
* add API functions to access llama model tensors
* add stub example for finetuning, based on train-text-from-scratch
* move and remove code
* add API functions to access remaining model parameters:
mult, head and rot
* first draft for LORA finetune training
* remove const model and layer arguments in API functions for accessing model tensors
* bug fixes to make finetune compile
automatic allocator does not work yet
* add debug prints for training memory improvements
* fix names of lora tensors
* avoid stack overflow resulting from big ggml_cgraph
replace stack allocation and ggml_build_forward by ggml_new_graph in combination with ggml_build_forward_expand
* replace llama API functions to get model tensors by one function to get model tensor by name
LLAMA_API struct ggml_tensor * llama_get_model_tensor(struct llama_model * model, const char * name);
* remove unused call to non-existent llama_get_layer_from_model
* implement ggml_compute_forward_out_prod_q_f32
* remove trailing whitespace
* add lora finetune support on quantized base model tensors
* add ggml_add_cast API function
this function works like ggml_add, but accepts a data type for the resulting tensor.
only supported for quantized src0 input.
* use ggml_add_cast in finetuning
lora-applied weights will now have data type F32, which improves gradients when finetuning quantized base models
* bug fix: actually use result type passed to ggml_add_cast
* make sure base model tensors data cannot be used in viewable operations
memory allocator would try to make lora application inplace on base model tensors.
since those are memory mapped this will result in memory access violations
* fix bug in ggml_out_prod which resulted in wrong n_dims of result tensors
* avoid keeping in memory ALL of the gradients
The problem here stems from ggml_graph_reset. This function is called in the optimization function, before each graph computation, to reset the gradients to zero. This required a unique memory slot for each gradient: allocating memory from a previously freed memory location might lead to non-zero input gradients.
During ggml_compute_backward the gradients are built stepwise by adding or subtracting new values, starting from an OP_NONE tensor which needs to contain zero values. This requires the graph reset.
To avoid this I now remember in ggml_build_backward_expand the original OP_NONE gradient tensors in a hash table, which is passed to ggml_compute_backward. There instead of using add (or sub or similar) I test whether the existing gradient to be changed is a zero-valued-tensor by looking up its existence in the hash table. When it is such a zero-tensor it will not be modified, but replaced by the value to be added, otherwise the regular add (not inplace, allocator will take care of this) will be used. This way none of those zero-tensor values will be necessary in the final backward graph and more importantly they won't need a unique memory slot, just to make them zero.
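The add-or-replace idea can be sketched as follows. This is a simplified stand-in using plain vectors and a pointer set; the `tensor` struct and `add_or_set` helper are hypothetical and do not reproduce ggml's types or signatures.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Simplified stand-in for a gradient tensor.
struct tensor {
    std::vector<float> data;
};

// If grad is registered as a zero placeholder, its contents are never read
// and the result is simply value. Otherwise a regular (non-inplace) add is used.
tensor add_or_set(const tensor & grad, const tensor & value,
                  const std::unordered_set<const tensor *> & zero_table) {
    assert(grad.data.size() == value.data.size());
    if (zero_table.count(&grad) > 0) {
        return value;
    }
    tensor sum = grad;
    for (std::size_t i = 0; i < sum.data.size(); ++i) {
        sum.data[i] += value.data[i];
    }
    return sum;
}
```

Because a placeholder's data is never read, it does not need to be zeroed and therefore does not need its own memory slot, which is the saving the text describes.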
* remove trailing whitespace
* remove debug prints and function to compute tensor data hash
* improve optimization iteration prints
* adjust maximal values to support finetuning 3B models
* change default finetune params lora_r and lora_alpha to match the n_rank parameters of 4
* bug fix: make sure finetune input gradient is allocated at the beginning and kept until the end
* remove unnecessary src tensor from ggml_get_rows_back
we don't need data of src[2] for computation, only to setup the correct output shape.
remove dependency on src[2], so that allocator can work more freely.
the computational graph is still completely determined, because the output shape is naturally included.
this is similar to how ggml_reshape does it.
* remove unnecessary src tensor from ggml_repeat & ggml_repeat_back
we don't need data of src[1] for computation, only to setup the correct output shape.
remove dependency on src[1], so that allocator can work more freely.
the computational graph is still completely determined, because the output shape is naturally included
* resolve todo
allocator will only make it inplace when they are of the same type
* mixing multiple LORA adapters is now possible
pass more than one '--lora FNAME' argument to apply more than one LORA.
use '--lora-scaled FNAME S' when you want to specify a user-defined scale for an adapter.
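A minimal sketch of how the two options could be collected into a list of (file name, scale) pairs. The function and type names are hypothetical; only the option names and the default scale of 1.0 come from the surrounding text and code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using lora_list = std::vector<std::pair<std::string, float>>;

// Collects every '--lora FNAME' (scale 1.0) and '--lora-scaled FNAME S' occurrence.
// Returns false if an option is missing its value(s); other arguments are ignored.
// Note: std::stof throws if S is not a valid number.
bool parse_lora_args(const std::vector<std::string> & args, lora_list & out) {
    for (std::size_t i = 0; i < args.size(); ++i) {
        if (args[i] == "--lora") {
            if (i + 1 >= args.size()) {
                return false;
            }
            out.emplace_back(args[++i], 1.0f);
        } else if (args[i] == "--lora-scaled") {
            if (i + 2 >= args.size()) {
                return false;
            }
            const std::string & fname = args[++i];
            out.emplace_back(fname, std::stof(args[++i]));
        }
    }
    return true;
}
```

Passing `--lora a.bin --lora-scaled b.bin 0.5` yields two adapters, the first with scale 1.0 and the second with scale 0.5.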
* add option to save finetune output every N iterations
* also save latest finetune output with ITERATION="LATEST" and print where files are saved
saving with LATEST makes it easier to resume training from the latest checkpoint
the string "LATEST" can be configured with command line option "--fn-latest STR"
* update checkpoint train stats before saving via "--save-every"
* add command line option `--rank-wo N` for rank of wo tensor
* update finetune README
* fix dump_non_result_info_yaml to output multiple lora adapters
* bug fix: replace GGML_TYPE_SIZE[t] by ggml_type_size(t)
* replace llama_n_mult by llama_n_ff
* finetune bug fixes to compile with merged in code from master
* remove prediction related code to reduce duplicated code with main
use main instead
* reduce large memory overhead in train-text-from-scratch
all gradients had to be pinned so that graph_reset works correctly.
this is no longer necessary with the changes to ggml_compute_backward introduced in this PR.
* add comment explaining why finetune checkpoints are allocated in one block
* make default value of float member a float literal
* handle rms_norm and rope parameters the same as in train-text-from-scratch
* remove unused code
* remove vocab related code as it is unnecessary
* add LLM_KV_TRAINING_TYPE to train-text-from-scratch checkpoints
so that they can be differentiated from lora finetune checkpoints
* add gguf constants and load/save functions from train-text-from-scratch
* add load & save lora finetune checkpoints via gguf
* add python script to convert old finetune checkpoint files to gguf
* remove old checkpoint save & load code
* remove code to print data checksums which was used to verify correctness of new gguf code
* omit tokenization when training is disabled, only save llama lora adapter
training can be disabled by passing '-n 0' to finetune
* remove trailing whitespace
* update README.md
* implement ggml_compute_forward_repeat_f16
* avoid stack overflow of large cgraphs in test-grad0
* add ggml API functions ggml_unravel_index, ggml_get_i32_nd and its analogs for set and for f32
ggml_get_i32_1d, ggml_set_i32_1d, ggml_get_f32_1d, ggml_set_f32_1d now support non-contiguous tensors.
in case of non-contiguous tensor, the 1d index is unraveled into a multi index using ggml_unravel_index to be passed to '_nd' function equivalent.
this fixes a bug in test-grad0 which happens due to ggml_build_backward not building purely contiguous tensors anymore
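Unraveling a flat index can be sketched like this. The function below is an illustration, not the ggml implementation, and it assumes that dimension 0 varies fastest.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Converts a flat element index into a 4-dimensional multi index for a
// tensor with extents ne[0..3]. Assumption: dimension 0 varies fastest.
std::array<int64_t, 4> unravel_index(int64_t i, const std::array<int64_t, 4> & ne) {
    assert(i >= 0 && i < ne[0] * ne[1] * ne[2] * ne[3]);
    std::array<int64_t, 4> idx{};
    for (int d = 0; d < 4; ++d) {
        idx[d] = i % ne[d];
        i /= ne[d];
    }
    return idx;
}
```

The multi index can then be combined with the tensor's strides to locate the element, which is what makes the 1d accessors usable on non-contiguous tensors.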
* increase test-grad0 context mem size to accommodate for bigger cgraph
* add sanity check to ggml_compute_backward, asserting the correct shape of gradients
* fix ggml_acc_or_set to return tensor of correct shape
* remove unused 'inplace' argument from ggml_compute_backward function
inplace operations to add gradients are no longer created by ggml_compute_backward
use allocator to automatically make inplace operations
* add missing argument 'int i0' to ggml_get_i32_nd & ggml_set_i32_nd header declarations
* fix error message in ggml_allocr_alloc to display actual max_avail
* fix check_gradient
ggml_build_backward_expand was previously replaced by ggml_build_backward, but the assignment of forward graph to backward graph was missing
* use tensor->view_src instead of ggml_is_view and get_view_source
* move gradient checkpointing code into ggml, new API function:
// build gradient checkpointing backward graph gb for gf using provided checkpoints
// gb_tmp will contain original backward graph with rewritten backward process nodes,
// but without the second forward pass nodes.
GGML_API void ggml_build_backward_gradient_checkpointing(
struct ggml_context * ctx,
struct ggml_cgraph * gf,
struct ggml_cgraph * gb,
struct ggml_cgraph * gb_tmp,
struct ggml_tensor * * checkpoints,
int n_checkpoints);
* replace custom data getters and setters by ggml functions
* train-text-from-scratch can train (full finetune) gguf models
just pass the gguf model via `--checkpoint-in FN`.
after this, to continue training, pass the generated checkpoint instead of the original gguf model.
tested with smaller models, bigger models may exceed available memory.
use (LORA) finetune for those.
* remove trailing whitespace
* add option to save train-text-from-scratch output every N iterations
* update README.md
* fix warnings
* fix warnings
* remove finetune option to disable allocator
the allocator should always be used.
by making sure that it is always used it gets easier to implement automatic memory requirements computation
* add tensor checkpoints only when gradient checkpointing is enabled
* initialize opt ggml context if none was provided
* add ggml-alloc API function 'ggml_allocr_max_size' to get max size of alloc
GGML_API size_t ggml_allocr_max_size(struct ggml_allocr * alloc);
* finetune: automatically allocate all memory and changes to command line options
remove '--n_examples N' parameter, as it no longer makes sense to call optimization process multiple times in a loop.
add '--only_write_lora' command line option: will skip tokenization and training, to only write a llama.cpp compatible LORA adapter.
remove memory buffer related command line options.
improve iteration console output.
* add finetune to Makefile
* update README.md
* print time per iteration and estimate remaining time
* increase measured alloc size by tensor_alignment
ggml_allocr_reset will reduce the given size by up to tensor_alignment-1
* fix README.md
* add some more allocator debug prints
* bug fix, probably solves the 'ggml_allocr_alloc: not enough space in the buffer' issue
* revert last commit
"bug fix, probably solves the 'ggml_allocr_alloc: not enough space in the buffer' issue"
"alloc was freeing an externally allocated tensor, because it calculated the end of allocator memory as alloc->data + alloc->max_size instead of alloc->data + alloc->size."
This is intentional to reduce the risk of freeing external tensors when measuring. Unless max_size is not properly calculated, I don't see why this is an issue.
* remove unnecessary "0x" before "%p" output
* move measurement memory segment to upper region of the address space
* update README.md
* fix printf format warnings
* add missing gguf_free in load_checkpoint_lora_file
* load default rms_norm and rope parameters from base model
* add gradient accumulation
specify the number of accumulation steps with '--grad-acc N'.
this will simulate a bigger batch size of grad_acc*batch.
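A sketch of the accumulation idea, under the assumption that microbatch gradients are averaged (the text does not say whether they are summed or averaged). The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Combines the gradients of grad_acc microbatches into one gradient, so that a
// single optimizer step behaves like a step on a batch of size grad_acc*batch.
// Assumption: the combined gradient is the mean of the microbatch gradients.
std::vector<double> accumulate_gradients(const std::vector<std::vector<double>> & micro_grads) {
    assert(!micro_grads.empty());
    std::vector<double> acc(micro_grads[0].size(), 0.0);
    for (const auto & g : micro_grads) {
        assert(g.size() == acc.size());
        for (std::size_t i = 0; i < acc.size(); ++i) {
            acc[i] += g[i];
        }
    }
    for (double & a : acc) {
        a /= static_cast<double>(micro_grads.size());
    }
    return acc;
}
```

Each entry of `micro_grads` has to come from a different batch; feeding the same batch to every microstep would just reproduce the gradient of a single batch.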
* fix tracking of train_samples and train_tokens
* build : fix compile warnings
* ggml : fix L-BFGS linesearch loop
* improve finetune time measurement
fix printf warnings on systems where int64_t is (long int).
change time datatypes to double because values get big with long training times.
exclude file saving from time measurement.
converge faster to actual time per iteration by removing very small first duration before first iteration was performed.
fix bug in output of total training time, the reported value was 1000 times too small.
* specify default lora rank with '--lora-r N'
'--lora-r N' will specify default rank for all tensors
'--rank-wq N', etc. will override this default rank for specific tensor types.
* fix gradient accumulation bug where the same batch was used for each microstep
* support grouped-query-attention in ggml_flash_attn and ggml_flash_attn_back
k and v can now be repeated in q along ne[2]
in forward pass just use modulo to compute k and v indices, like ik2 = iq2 % nek2.
in the backward pass this won't work as easily, because multiple threads will compete to accumulate to the same k->grad[:,ik1,ik2,ik3] and v->grad[:,iv1,iv2,iv3].
so we change the parallelization over q rows to be over k rows. this ensures non-overlapping (ik2,ik3) across threads.
in each thread we then iterate over the number of repetitions of k/v in q to compute iq2 as iq2 = ik2 + irep*nek2.
since ne2 is not the same for q,k and v we also change how the gradients are concatenated into the result tensor.
additionally the offsets of gradq, gradk and gradv in the result tensor are now memory aligned.
we also simplify the compute_backward part of flash_attn to use ggml_reshape instead of switching over the number of dimensions.
this needs a small change to ggml_reshape, removing the assertion of second argument to be contiguous.
since only the shape (ne) of the second reshape argument is of relevance, its memory layout (nb) is irrelevant -> it can very well be non-contiguous.
change test-grad0 to also test for repeated k/v in q.
this changes the rng and now results in small gradient differences in softmax. these solely come from using f16 exp table lookup in forward softmax: when temporarily changing softmax to use actual exp function, the reported gradient differences go away. gradient differences coming solely from f16 table lookup are acceptable.
added a note to explain this.
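a minimal sketch of the index mapping described above, not the ggml kernel itself. `nek2`, `iq2`, `ik2` and `irep` follow the names in this message; `neq2` (number of q heads along ne[2]) and both function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// forward pass: the k/v index used by q index iq2
int kv_index_for_q(int iq2, int nek2) {
    assert(nek2 > 0 && iq2 >= 0);
    return iq2 % nek2;
}

// backward pass: all q indices that accumulate into k/v index ik2,
// computed as iq2 = ik2 + irep*nek2
std::vector<int> q_indices_for_kv(int ik2, int nek2, int neq2) {
    assert(nek2 > 0 && neq2 % nek2 == 0 && ik2 >= 0 && ik2 < nek2);
    std::vector<int> iq2s;
    const int nrep = neq2 / nek2; // repetitions of k/v in q
    for (int irep = 0; irep < nrep; ++irep) {
        iq2s.push_back(ik2 + irep*nek2);
    }
    return iq2s;
}
```

because each thread owns a distinct ik2, the q indices it visits never overlap with those of another thread, so no two threads write to the same k/v gradient rows.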
* add llama API functions to get grouped-query-attention n_head parameter 'n_head_kv'.
* fix finetune to support grouped-query-attention (using flash-attention)
note: ggml changes to ggml_out_prod are necessary to support grouped-query-attention without flash-attention.
* support broadcastable a in out_prod(a, b) and backward pass of broadcasting mul_mat(a, b)
* test broadcasting mul_mat backward pass
* decouple random number generator of each operation test
when changing one test the rng of other tests is not influenced anymore
* add comment briefly describing what ggml_repeat_back does
* simplify broadcasting mul_mat backward using ggml_repeat_back
* add cgraph evaluation order member and corresponding enum type
this controls in which order ggml_build_forward visits source nodes.
by default the nodes are visited left to right, i.e. src[0] first.
in some cases it is beneficial for ggml-alloc to visit in a different order.
two possible orders are supported: left-to-right (src[0] first) and right-to-left (src[0] last).
* measure max compute size for each cgraph eval order and use best order
this can bring huge memory savings:
e.g. codellama-34b with n_ctx=64, n_batch=1 goes from 92927.8 MB down to 4627.6 MB
* remove unused command line options
* add sample start patterns and options to force new or by default resume last shuffling
* update shuffle rng state on reshuffle
* exclude known zero values from computations in flash_attn_f32 & flash_attn_back_f32
* remove probably unnecessary exception type flags from stringstream
* pass correct max number of tokens to llama_tokenize
* account for possible leading whitespace that will be added by tokenizer
e.g. '\t' will be tokenized by llama spm tokenizer to [29871, 12]
* use unrolled vec_mad in out_prod
y is vec_mad result vec.
x is vec_mad input vec.
v is vec_mad input scalar.
ggml_vec_mad_f32_unroll will internally loop over x and v with same y.
GGML_VEC_MAD_UNROLL is by default defined to 32.
This value is empirically optimized using performance test runs of out-prod in openllama-3b finetune with 256 context length and batch size 1. It gives a 23% performance boost for out_prod.
Full measurements of out-prod runtime in ms:
unroll  unroll_xv   unroll_yv
1       67014.643   87826.469
2       77117.552   89077.656
4       72091.311   109121.657
8       61077.543   88678.334
16      56914.67    79514.947
24      59024.595   84350.254
28      55952.446   83368.73
32      51476.658   85177.745
36      55973.792   84659.92
40      55139.616   93844.738
48      60736.392   93330.267
64      99856.878   116994.99
The unroll_yv column is when unrolling yv instead of xv
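a scalar sketch of what looping over x and v with the same y means. this is not the ggml implementation (which is vectorized); `vec_mad_unroll` is a hypothetical name and std::vector stands in for raw float pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y[i] += sum over k of xs[k][i]*vs[k]
// several (x, v) pairs are folded into the same y in one pass,
// so each y[i] is loaded and stored once instead of once per pair
void vec_mad_unroll(std::vector<float> & y, const std::vector<std::vector<float>> & xs, const std::vector<float> & vs) {
    assert(xs.size() == vs.size());
    for (const std::vector<float> & x : xs) {
        assert(x.size() == y.size());
    }
    for (std::size_t i = 0; i < y.size(); ++i) {
        float acc = y[i];
        for (std::size_t k = 0; k < xs.size(); ++k) {
            acc += xs[k][i]*vs[k];
        }
        y[i] = acc;
    }
}
```

the number of (x, v) pairs handled per pass plays the role of the unroll factor measured in the table.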
* set lora_alpha to value of lora_r if it is not set via command line
otherwise only changing lora_r will change scaling of lora adapter used in prediction
* reshuffle original sample order instead of the previous shuffled order
otherwise resumed reshuffle will not result in same sample order
* block tiling for out-prod inspired by mul-mat
block sizes are empirically optimized
roughly doubles the flops of out-prod
* exclude some more known zero values from computations in flash_attn_f32 & flash_attn_back_f32
* add static keywords
* remove outcommented old code
* update train-text-from-scratch with tokenization, sample selection and shuffling from finetune
* remove lbfgs related train parameters
* move common train functions into common/train.[h|cpp]
* move train state into struct train_state
* move train data saving code into callback to unify code of opt_callback
train_params are still different in finetune and train-text-from-scratch, so it can't yet be moved to train.h|cpp
* move common train params into common/train
* move common opt_callback into common/train
* fix consume_common_train_arg
* save and load head_count_kv in lora checkpoints
* increase train_samples by used_samples instead of number of batches
one batch can contain more than one sample when option "fill_with_next_samples" is used
* fix usage of llama_tokenize
* remove static from process_escape since we need it exposed in header
* fix code formatting of long function declarations
* fix condition in load_train_state_gguf
* use die("msg") to replace GGML_ASSERT(!"msg") or throw std::runtime_error("msg")
* fix saving and loading of training type
* remove terminating '\0' from tokenization
(llama_tokenize is now passed the string length instead of relying on terminating '\0')
* fix compile warnings
* use new/delete for train_state instead of malloc/free
using malloc may result in seg faults when trying to assign string fields
* assert that sample_count > 0, avoiding division by zero
* fix frand to return value in interval [0,1)
* add train option "--sample-random-offsets"
Use samples beginning at random offsets.
The offset is only applied to the first sample in each batch context window.
Together with "--fill-with-next-samples" this may help for training endless text generation.
For example given a dataset containing samples "abcd", "ABCD", "0123".
With context size of 8 and options "--fill-with-next-samples", "--no-separate-with-eos", "--no-separate-with-bos",
the context windows of batches could only be filled with "abcdABCD", "ABCDabcd", "0123abcd", etc.
With "--sample-random-offsets" it can also be filled with "23abcdAB", "bcd0123A", etc.
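the example above can be reproduced with a small sketch. characters stand in for tokens, separators are left out (as with "--no-separate-with-eos" and "--no-separate-with-bos"), and `fill_window` is a hypothetical helper, not the function used by finetune.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// fill a context window of n_ctx characters, starting at `offset` inside
// samples[first] and continuing with the following samples (wrapping around).
// the offset is only applied to the first sample of the window.
std::string fill_window(const std::vector<std::string> & samples, std::size_t first, std::size_t offset, std::size_t n_ctx) {
    assert(first < samples.size());
    assert(offset < samples[first].size());
    std::string window = samples[first].substr(offset);
    for (std::size_t i = first + 1; window.size() < n_ctx; ++i) {
        window += samples[i % samples.size()];
    }
    window.resize(n_ctx); // cut off what does not fit into the context
    return window;
}
```

with the sample order "abcd", "ABCD", "0123", starting at the third sample with offset 2 gives "23abcdAB"; with the shuffled order "abcd", "0123", "ABCD", starting at the first sample with offset 1 gives "bcd0123A".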
* deduplicate code into function
* remove n_rot hparam, as it must always be hparam.n_embd_head()
* align code
* assert correct base model tensor shapes
* move some params from lora hparams into model hparams and load model params from gguf
this equalizes the model definition in finetune and text-from-scratch and removes the need for additional llama api functions to get model parameters
* remove now unnecessary llama API functions to get model params that were added by this PR
* train-text-from-scratch: automatically allocate model tensors, remove option '--mem-model N'
* train-text-from-scratch: automatically allocate opt context
* train-text-from-scratch: automatically allocate input tensors
* train-text-from-scratch: automatically allocate compute memory
* remove unused options and equalize train-text-from-scratch with finetune
* initialize opt->loss_after with zero
* add export-lora program
* remove trailing whitespace
* add export-lora build in Makefile
* remove unused struct tensor_info from export-lora
* add export-lora build dependency to llama
because it depends on common, which depends on llama
* update finetune README.md
* cancel optimization when specified number of epochs is completed
* improve handling of export-lora arguments
print errors and warnings when files could not be read or created
* Fix export-lora.cpp "not enough space in the context's memory pool" (#1)
* Fix export-lora.cpp "not enough space in the context's memory pool"
Without this patch, export-lora would sometimes error with "not enough space in the context's memory pool (needed 656784, available 656800)".
* increase required context size by 5*GGML_MEM_ALIGN instead of plain 16
---------
Co-authored-by: xaedes <xaedes@gmail.com>
* improve handling of not yet supported tensor types
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: meatbag-18a <145869052+meatbag-18a@users.noreply.github.com>
2023-09-28 20:40:11 +02:00
params.use_mmap = false;
2024-03-07 10:41:53 +01:00
} else if (arg == "--lora-scaled") {
if (++i >= argc) {
train : finetune LORA (#2632)
* fix track_max_mem in forward_batch_wo_cache_flash_attn_train
* remove unnecessary Adam(W) optimizer tensors.
reduces optimizer memory overhead from 7*modelsize to 2*modelsize.
additionally allows to optimize models with more than 2^31 parameters by replacing int with int64_t.
bumps training checkpoint file version, but old checkpoints can still be read.
new version with less tensors is saved.
* add gradient clipping to AdamW
* Fix reset of unused g->nodes and g->grads to NULL
* implement gradient checkpointing for training
reduces memory overhead from O(n_layer) to O(sqrt(n_layer))
as explained in readme of https://github.com/cybertronai/gradient-checkpointing
* remove unused compute buffer 3
* add and use function ggml_build_backward_expand to avoid stack overflows with large maximum number of nodes
GGML_API void ggml_build_backward_expand(struct ggml_context * ctx, struct ggml_cgraph * gf, struct ggml_cgraph * gb, bool keep);
* change AdamW decay parameter to work like the torch AdamW decay parameter
It is now relative to Adam learning rate `alpha*sched`.
Before that it was relative to `sched` only.
`alpha` being the maximum learning rate and `sched` being a scaling parameter in [0..1]
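a minimal sketch of the decay step in the new convention, assuming the decoupled form used by torch AdamW (parameter scaled by 1 - lr*weight_decay, with lr = alpha*sched). `apply_weight_decay` is a hypothetical helper; how ggml combines this with the rest of the Adam update is not shown here.

```cpp
#include <cassert>

// decoupled weight decay applied to one parameter value x
// alpha: maximum learning rate, sched: schedule factor in [0..1], decay: weight decay
float apply_weight_decay(float x, float alpha, float sched, float decay) {
    assert(sched >= 0.0f && sched <= 1.0f);
    return x*(1.0f - alpha*sched*decay);
}
```

with decay = 0 the parameter is left untouched, which corresponds to plain Adam.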
* change default AdamW weight decay parameter used in training to 0.1 as used in nanoGPT
* change default AdamW weight decay parameter defined in ggml to 0.0, making Adam default instead of AdamW
btw: the default weight decay parameter for torch.optim.AdamW is 0.01
* bug fixes for cross entropy loss
ggml_cross_entropy_loss: sums were not correctly added in workload of each thread
ggml_cross_entropy_loss_back: simplify backward process, reducing numerical issues
guard usage of exp f16 lookup in cross entropy by #define GGML_CROSS_ENTROPY_EXP_FP16
cross entropy loss is only used once during training, but it is quite sensitive to numerical errors introduced by exp-f16-lookup.
so exp-f16-lookup for cross entropy loss is disabled by default, trading better gradients for very slightly worse runtime performance.
* fix test-grad0 for cross_entropy_loss
the second argument to cross_entropy_loss must sum up to 1 for each row
* fix test-grad0 for soft_max
don't use only sum as aggregation, because sum of softmax is always 1 -> finite differences should not work
instead use sum(log(soft_max()*(1-eps)+eps)); use eps to avoid log(0)
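a small sketch of the two aggregations, written with std::vector<double> instead of ggml tensors. `soft_max`, `sum_soft_max` and `sum_log_soft_max` are hypothetical helpers for illustration, not the test-grad0 code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// numerically stable softmax over one row
std::vector<double> soft_max(const std::vector<double> & x) {
    assert(!x.empty());
    const double max = *std::max_element(x.begin(), x.end());
    std::vector<double> p(x.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        p[i] = std::exp(x[i] - max);
        sum += p[i];
    }
    for (double & v : p) {
        v /= sum;
    }
    return p;
}

// plain sum: always 1, so it carries no gradient information
double sum_soft_max(const std::vector<double> & x) {
    double s = 0.0;
    for (double p : soft_max(x)) {
        s += p;
    }
    return s;
}

// sum(log(soft_max(x)*(1-eps)+eps)): depends on x, eps avoids log(0)
double sum_log_soft_max(const std::vector<double> & x, double eps) {
    double s = 0.0;
    for (double p : soft_max(x)) {
        s += std::log(p*(1.0 - eps) + eps);
    }
    return s;
}
```

the plain sum stays at 1 for any input, while the log-based aggregation changes with the input and therefore gives finite differences something to measure.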
* improve finite differences of test-grad0 by using double instead of float
* change cross_entropy_loss to output average over all rows
this helps keeping the loss and gradients in a sane range
* improve gradient checkpointing
sqrt(n_layers) is only the best checkpoint step when mem size of checkpoints and mem size of layers are equal.
since layers require more memory than the single-tensor-checkpoint we use, the optimal values are computed differently:
```
given: n, u, v
objective: minimize(a*u+b*v) where a*b=n, a>0, b>0
b=n/a
minimize(a*u+v*n/a)
diff(a*u+v*n/a, a) = u - (v*n/a)/a
diff(a*u+v*n/a, a) == 0
u - (v*n/a)/a == 0
u == v*n/(a*a)
u*a*a = v*n
a*a = v*n/u
a = sqrt(n*v/u)
```
this change results in more checkpoints, requiring fewer layers to store between checkpoints, overall improving memory usage.
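the result of the derivation can be checked numerically. a minimal sketch; reading a as the number of checkpoints, u as the memory per checkpoint, b as the layers between checkpoints and v as the memory per layer is an assumption (the derivation itself only names the symbols), and both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// cost model from the derivation: a*u + b*v with a*b = n, i.e. b = n/a
double checkpoint_cost(double a, double n, double u, double v) {
    assert(a > 0.0 && n > 0.0);
    return a*u + (n/a)*v;
}

// continuous minimizer a = sqrt(n*v/u)
double optimal_checkpoints(double n, double u, double v) {
    assert(n > 0.0 && u > 0.0 && v > 0.0);
    return std::sqrt(n*v/u);
}
```

for u == v this reduces to sqrt(n); when v is larger than u the optimum moves to more checkpoints, e.g. n=16, u=1, v=4 gives a=8 instead of 4.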
* disable gradient checkpointing debug output
* llama : fix rope usage in train-text-from-scratch after ChatGLM change
* add more training parameters:
--enable-restart N Only for Adam optimizer. Enable restarts of cos-decay
--disable-restart N Only for Adam optimizer. Disable restarts of cos-decay
--opt-past N Number of optimization iterations to track for delta convergence test. Disabled when zero.
--opt-delta N Maximum delta for delta convergence test. Disabled when <= zero.
--opt-max-no-improvement N Maximum number of optimization iterations with no improvement. Disabled when <= zero.
--adam-epsf N AdamW epsilon for convergence test. Disabled when <= zero.
--adam-min-alpha N Adam minimum learning rate alpha, usually 0.1 * alpha
* replace memcpy with reshape operation so that the graph is not cut at the input
this makes it possible to store other values into the input tensor and then simply recompute the graph without rebuilding it
* remove unused function argument from get_example_targets_batch
* measure and print total training time
* add optimization callback to ggml_opt_resume_g
this callback is called before each iteration with custom data and pointer to learning schedule parameter (only used in Adam(W)).
can be used for dynamic learning schedule and setting input data for batches before each iteration
* use optimization callback in training
allows dynamic learning schedule and different batch data for each iteration without relying on low n_iter and high n_examples parameters
reduces runtime by avoiding restart of optimization function and improves training convergence by providing a different batch for each iteration
* add minimum number of tensor dimensions to apply weight decay (default 2)
this allows to not apply weight decay to bias parameters
* rename training parameter cos-decay-alpha to cos-decay-min and clarify that adam-min-alpha also applies to warmup
* fix increase of model.train_samples and model.train_tokens
now that each optimizer iteration gets its own batch we need to multiply by number of opt iterations
* change sampling parameters for prediction after training to defaults of common.h
and clarify what is context for prediction and what are generated tokens
* tighten abs error bounds for cross_entropy_loss in test-grad0
* add conditional compilation of using F16 exp in flash attention
uncomment `// #define GGML_FLASH_ATTN_EXP_FP16` to enable usage of f16 exp in flash attention
* tighten abs error bounds for flash_attn in test-grad0
* tighten abs error bounds for sqrt in test-grad0
* remove out-commented vectorized code of opt_adam
the vectorized code might be a bit faster for a low number of parameters, but it had a big memory usage overhead
* ggml : update ggml_rms_norm_back with configurable eps
* llama training : fix ggml_rms_norm_back calls to pass configurable eps
* remove trailing whitespace
* add train function using automatic gradient checkpointing backward pass and allocator
* in train function replace add_inplace by regular add
because using add_inplace seems to result in different gradients
* don't allocate hash_map on context
because the context has no_alloc=True when using memory allocator resulting in NULL data pointers
* correctly clone reshape and permute operations by also cloning tensor->nb values
* fix variable name and add missing type cast
* terminate recursive tensor cloning when reaching tensor without src tensors
* correctly clone view tensors by setting data pointers
without this the checkpointing would only work when being used together with memory allocator
* fix variable names
* swap arguments to commutative ops to be the same as in `forward_batch_wo_cache_flash_attn`
* add input tensors as checkpoints
so that recursive tensor cloning of gradient checkpointing terminates on input tensors
* fix variable name and add missing boolean negation
* make sure some tensors are not reallocated by inserting new temporary nodes depending on them:
output and parameter gradient tensors need to be available at the end of the graph execution
parameter gradient tensors also need to be available before the graph execution because they are set to zero before each optimizer iteration
checkpoint tensors are allocated all together to reduce memory allocator fragmentation
afterwards, in addition to the temporary nodes, we also need to reset the temporary leafs
* fix ASSERT to work with zero layers
* add training options whether to use allocator and/or unified training function
* integrate unified training function which may use memory allocator
the unified training function also supports arguments whether to use flash attention and/or gradient checkpointing
* format name of cloned tensors with " (clone)" suffix
* set names for tensors in unified train function for easier debugging
* allocate graph on context using ggml_new_graph
* remove handwritten training functions
* remove unused training parameters "use_scratch" and "use_unified"
* remove trailing whitespace
* remove unused train params: mem_compute1_gb & mem_compute2_gb
mem_compute_gb is used for compute when automatic memory allocator is not enabled, otherwise it can be very small to only hold the tensor definitions
mem_compute0_gb is used for automatic memory allocator (as long as measurement of max required size is not implemented)
* remove unused forward_batch function
* add debug asserts in ggml_allocr_alloc to catch some common pitfalls when using this function directly
* only use ggml_allocr_alloc when tensor has NULL data and is no view
* fix test when to create temporary backward graph
temporary backward graph is only necessary when using checkpointing
* fix memory "leak" in optimizers
each iteration a new cplan with new memory for work data was allocated.
now cplan creation only happens at the start of optimization, with each iteration reusing the cplan and its work data.
* reverse order of for loop in ggml_build_backward_expand to save memory when using gradient checkpointing and allocator
with this loop order gradient checkpointing with allocator on a 16 layer model saves 13% memory; on a 2 layer model it saves 2% memory.
the computation results are the same
* add API functions to access llama model tensors
* add stub example for finetuning, based on train-text-from-scratch
* move and remove code
* add API functions to access remaining model parameters:
mult, head and rot
* first draft for LORA finetune training
* remove const model and layer arguments in API functions for accessing model tensors
* bug fixes to make finetune compile
automatic allocator does not work yet
* add debug prints for training memory improvements
* fix names of lora tensors
* avoid stack overflow resulting from big ggml_cgraph
replace stack allocation and ggml_build_forward by ggml_new_graph in combination with ggml_build_forward_expand
* replace llama API functions to get model tensors by one function to get model tensor by name
LLAMA_API struct ggml_tensor * llama_get_model_tensor(struct llama_model * model, const char * name);
* remove unused call to not existing llama_get_layer_from_model
* implement ggml_compute_forward_out_prod_q_f32
* remove trailing whitespace
* add lora finetune support on quantized base model tensors
* add ggml_add_cast API function
this function works like ggml_add, but accepts a data type for the resulting tensor.
only supported for quantized src0 input.
* use ggml_add_cast in finetuning
lora-applied weights will now have data type F32, which improves gradients when finetuning quantized base models
* bug fix: actually use result type passed to ggml_add_cast
* make sure base model tensors data cannot be used in viewable operations
memory allocator would try to make lora application inplace on base model tensors.
since those are memory mapped this will result in memory access violations
* fix bug in ggml_out_prod which resulted in wrong n_dims of result tensors
* avoid keeping in memory ALL of the gradients
The problem here stems from ggml_graph_reset. This function is called in the optimization function, before each graph computation, to reset the gradients to zero. This required a unique memory slot for each gradient: allocating memory from a previously freed memory location might lead to non-zero input gradients.
During ggml_compute_backward the gradients are built stepwise by adding or subtracting new values, starting from an OP_NONE tensor which needs to contain zero-values. This requires the graph reset.
To avoid this I now remember in ggml_build_backward_expand the original OP_NONE gradient tensors in a hash table, which is passed to ggml_compute_backward. There instead of using add (or sub or similar) I test whether the existing gradient to be changed is a zero-valued-tensor by looking up its existence in the hash table. When it is such a zero-tensor it will not be modified, but replaced by the value to be added, otherwise the regular add (not inplace, allocator will take care of this) will be used. This way none of those zero-tensor values will be necessary in the final backward graph and more importantly they won't need a unique memory slot, just to make them zero.
* remove trailing whitespace
* remove debug prints and function to compute tensor data hash
* improve optimization iteration prints
* adjust maximal values to support finetuning 3B models
* change default finetune params lora_r and lora_alpha to match the n_rank parameters of 4
* bug fix: make sure finetune input gradient is allocated at begin and kept until end
* remove unnecessary src tensor from ggml_get_rows_back
we don't need data of src[2] for computation, only to setup the correct output shape.
remove dependency on src[2], so that allocator can work more freely.
the computational graph is still completely determined, because the output shape is naturally included.
this is similar to how ggml_reshape does it.
* remove unnecessary src tensor from ggml_repeat & ggml_repeat_back
we don't need data of src[1] for computation, only to setup the correct output shape.
remove dependency on src[1], so that allocator can work more freely.
the computational graph is still completely determined, because the output shape is naturally included
* resolve todo
allocator will only make it inplace when they are of the same type
* mixing multiple LORA adapters is now possible
pass more than one '--lora FNAME' argument to apply more than one LORA.
use '--lora-scaled FNAME S' when you want to specify a user-defined scale for an adapter.
* add option to save finetune output every N iterations
* also save latest finetune output with ITERATION="LATEST" and print where files are saved
saving with LATEST makes it easier to resume training from the latest checkpoint
the string "LATEST" can be configured with command line option "--fn-latest STR"
* update checkpoint train stats before saving via "--save-every"
* add command line option `--rank-wo N` for rank of wo tensor
* update finetune README
* fix dump_non_result_info_yaml to output multiple lora adapters
* bug fix: replace GGML_TYPE_SIZE[t] by ggml_type_size(t)
* replace llama_n_mult by llama_n_ff
* finetune bug fixes to compile with merged in code from master
* remove prediction related code to reduce duplicated code with main
use main instead
* reduce large memory overhead in train-text-from-scratch
all gradients had to be pinned so that graph_reset works correctly.
this is no longer necessary with the changes to ggml_compute_backward introduced in this PR.
* add comment explaining why finetune checkpoints are allocated in one block
* make default value of float member a float literal
* handle rms_norm and rope parameters the same as in train-text-from-scratch
* remove unused code
* remove vocab related code as it is unnecessary
* add LLM_KV_TRAINING_TYPE to train-text-from-scratch checkpoints
so that they can be differentiated from lora finetune checkpoints
* add gguf constants and load/save functions from train-text-from-scratch
* add load & save lora finetune checkpoints via gguf
* add python script to convert old finetune checkpoint files to gguf
* remove old checkpoint save & load code
* remove code to print data checksums which was used to verify correctness of new gguf code
* omit tokenization when training is disabled, only save llama lora adapter
training can be disabled by passing '-n 0' to finetune
* remove trailing whitespace
* update README.md
* implement ggml_compute_forward_repeat_f16
* avoid stack overflow of large cgraphs in test-grad0
* add ggml API functions ggml_unravel_index, ggml_get_i32_nd and its analogs for set and for f32
ggml_get_i32_1d, ggml_set_i32_1d, ggml_get_f32_1d, ggml_set_f32_1d now support non-contiguous tensors.
in case of non-contiguous tensor, the 1d index is unraveled into a multi index using ggml_unravel_index to be passed to '_nd' function equivalent.
this fixes a bug in test-grad0 which happens due to ggml_build_backward not building purely contiguous tensors anymore
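a minimal sketch of the unraveling idea with plain arrays instead of a ggml_tensor. `unravel_index` and `byte_offset` are hypothetical helpers; dimension 0 is assumed to be the fastest-moving one, and `nb` holds the byte strides of a possibly non-contiguous layout.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// convert a flat element index into a multi index for shape ne
std::array<int64_t, 4> unravel_index(int64_t i, const std::array<int64_t, 4> & ne) {
    assert(i >= 0 && i < ne[0]*ne[1]*ne[2]*ne[3]);
    std::array<int64_t, 4> idx{};
    for (int d = 0; d < 4; ++d) {
        idx[d] = i % ne[d];
        i /= ne[d];
    }
    return idx;
}

// byte offset of an element given its multi index and the byte strides nb
int64_t byte_offset(const std::array<int64_t, 4> & idx, const std::array<int64_t, 4> & nb) {
    int64_t off = 0;
    for (int d = 0; d < 4; ++d) {
        off += idx[d]*nb[d];
    }
    return off;
}
```

the multi index is independent of the memory layout; only the strides decide where the element lives, which is why the 1d accessors can support non-contiguous tensors this way.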
* increase test-grad0 context mem size to accommodate for bigger cgraph
* add sanity check to ggml_compute_backward, asserting the correct shape of gradients
* fix ggml_acc_or_set to return tensor of correct shape
* remove unused 'inplace' argument from ggml_compute_backward function
inplace operations to add gradients are no longer created by ggml_compute_backward
use allocator to automatically make inplace operations
* add missing argument 'int i0' to ggml_get_i32_nd & ggml_set_i32_nd header declarations
* fix error message in ggml_allocr_alloc to display actual max_avail
* fix check_gradient
ggml_build_backward_expand was previously replaced by ggml_build_backward, but the assignment of forward graph to backward graph was missing
* use tensor->view_src instead of ggml_is_view and get_view_source
* move gradient checkpointing code into ggml, new API function:
// build gradient checkpointing backward graph gb for gf using provided checkpoints
// gb_tmp will contain original backward graph with rewritten backward process nodes,
// but without the second forward pass nodes.
GGML_API void ggml_build_backward_gradient_checkpointing(
struct ggml_context * ctx,
struct ggml_cgraph * gf,
struct ggml_cgraph * gb,
struct ggml_cgraph * gb_tmp,
struct ggml_tensor * * checkpoints,
int n_checkpoints);
* replace custom data getters and setters by ggml functions
* train-text-from-scratch can train (full finetune) gguf models
just pass the gguf model via `--checkpoint-in FN`.
after this, to continue training, pass the generated checkpoint instead of the original gguf model.
tested with smaller models, bigger models may exceed available memory.
use (LORA) finetune for those.
* remove trailing whitespace
* add option to save train-text-from-scratch output every N iterations
* update README.md
* fix warnings
* remove finetune option to disable allocator
the allocator should always be used.
by making sure that it is always used it gets easier to implement automatic memory requirements computation
* add tensor checkpoints only when gradient checkpointing is enabled
* initialize opt ggml context if none was provided
* add ggml-alloc API function 'ggml_allocr_max_size' to get max size of alloc
GGML_API size_t ggml_allocr_max_size(struct ggml_allocr * alloc);
* finetune: automatically allocate all memory and changes to command line options
remove '--n_examples N' parameter, as it no longer makes sense to call optimization process multiple times in a loop.
add '--only_write_lora' command line option: will skip tokenization and training, to only write a llama.cpp compatible LORA adapter.
remove memory buffer related command line options.
improve iteration console output.
* add finetune to Makefile
* update README.md
* print time per iteration and estimate remaining time
* increase measured alloc size by tensor_alignment
ggml_allocr_reset will reduce the given size by up to tensor_alignment-1
* fix README.md
* add some more allocator debug prints
* bug fix, probably solves the 'ggml_allocr_alloc: not enough space in the buffer' issue
* revert last commit
"bug fix, probably solves the 'ggml_allocr_alloc: not enough space in the buffer' issue"
"alloc was freeing an externally allocated tensor, because it calculated the end of allocator memory as alloc->data + alloc->max_size instead of alloc->data + alloc->size."
This is intentional to reduce the risk of freeing external tensors when measuring. Unless max_size is not properly calculated, I don't see why this is an issue.
* remove unnecessary "0x" before "%p" output
* move measurement memory segment to upper region of the address space
* update README.md
* fix printf format warnings
* add missing gguf_free in load_checkpoint_lora_file
* load default rms_norm and rope parameters from base model
* add gradient accumulation
specify the number of accumulation steps with '--grad-acc N'.
this will simulate a bigger batch size of grad_acc*batch.
* fix tracking of train_samples and train_tokens
* build : fix compile warnings
* ggml : fix L-BFGS linesearch loop
* improve finetune time measurement
fix printf warnings on system where int64_t is (long int).
change time datatypes to double because values get big with long training times.
exclude file saving from time measurement.
converge faster to actual time per iteration by removing very small first duration before first iteration was performed.
fix bug in output of total training time, the reported value was 1000 times too small.
* specify default lora rank with '--lora-r N'
'--lora-r N' will specify default rank for all tensors
'--rank-wq N', etc. will override this default rank for specific tensor types.
* fix gradient accumulation bug where the same batch was used for each microstep
* support grouped-query-attention in ggml_flash_attn and ggml_flash_attn_back
k and v can now be repeated in q along ne[2]
in forward pass just use modulo to compute k and v indices, like ik2 = iq2 % nek2.
in the backward pass this won't work as easily, because multiple threads will compete to accumulate to the same k->grad[:,ik1,ik2,ik3] and v->grad[:,iv1,iv2,iv3].
so we change the parallelization over q rows to be over k rows. this ensures non-overlapping (ik2,ik3) across threads.
in each thread we then iterate over the number of repetitions of k/v in q to compute iq2 as iq2 = ik2 + irep*nek2.
since ne2 is not the same for q,k and v we also change how the gradients are concatenated into the result tensor.
additionally the offsets of gradq, gradk and gradv in the result tensor are now memory aligned.
we also simplify the compute_backward part of flash_attn to use ggml_reshape instead of switching over the number of dimensions.
this needs a small change to ggml_reshape, removing the assertion of second argument to be contiguous.
since only the shape (ne) of the second reshape argument is of relevance, its memory layout (nb) is irrelevant -> it can very well be non-contiguous.
change test-grad0 to also test for repeated k/v in q.
this changes the rng and now results in small gradient differences in softmax. these solely come from using f16 exp table lookup in forward softmax: when temporarily changing softmax to use actual exp function, the reported gradient differences go away. gradient differences coming solely from f16 table lookup are acceptable.
added a note to explain this.
* add llama API functions to get grouped-query-attention n_head parameter 'n_head_kv'.
* fix finetune to support grouped-query-attention (using flash-attention)
note: ggml changes to ggml_out_prod are necessary to support grouped-query-attention without flash-attention.
* support broadcastable a in out_prod(a, b) and backward pass of broadcasting mul_mat(a, b)
* test broadcasting mul_mat backward pass
* decouple random number generator of each operation test
when changing one test the rng of other tests is not influenced anymore
* add comment briefly describing what ggml_repeat_back does
* simplify broadcasting mul_mat backward using ggml_repeat_back
* add cgraph evaluation order member and corresponding enum type
this controls in which order ggml_build_forward visits source nodes.
by default the nodes are visited left to right, i.e. src[0] first.
in some cases it is beneficial for ggml-alloc to visit in a different order.
two possible orders are supported: left-to-right (src[0] first) and right-to-left (src[0] last).
* measure max compute size for each cgraph eval order and use best order
this can bring huge memory savings:
e.g. codellama-34b with n_ctx=64, n_batch=1 goes from 92927.8 MB down to 4627.6 MB
* remove unused command line options
* add sample start patterns and options to force new or by default resume last shuffling
* update shuffle rng state on reshuffle
* exclude known zero values from computations in flash_attn_f32 & flash_attn_back_f32
* remove probably unnecessary exception type flags from stringstream
* pass correct max number of tokens to llama_tokenize
* account for possible leading whitespace that will be added by tokenizer
e.g. '\t' will be tokenized by llama spm tokenizer to [29871, 12]
* use unrolled vec_mad in out_prod
y is vec_mad result vec.
x is vec_mad input vec.
v is vec_mad input scalar.
ggml_vec_mad_f32_unroll will internally loop over x and v with same y.
GGML_VEC_MAD_UNROLL is by default defined to 32.
This value was empirically optimized using performance test runs of out-prod in openllama-3b finetune with 256 context length and batch size 1. It gives a 23% performance boost for out_prod.
Full measurements of out-prod runtime in ms:
unroll  unroll_xv   unroll_yv
1       67014.643   87826.469
2       77117.552   89077.656
4       72091.311   109121.657
8       61077.543   88678.334
16      56914.67    79514.947
24      59024.595   84350.254
28      55952.446   83368.73
32      51476.658   85177.745
36      55973.792   84659.92
40      55139.616   93844.738
48      60736.392   93330.267
64      99856.878   116994.99
The unroll_yv column is when unrolling yv instead of xv
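a scalar sketch of the idea behind the unrolled vec_mad, assuming the goal is to reuse each y element across several (x, v) pairs. `vec_mad_unroll` is a hypothetical helper, not `ggml_vec_mad_f32_unroll`; the real function works on raw pointers with a fixed unroll count and SIMD.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y[i] += sum over k of x[k][i] * v[k]
// each y[i] is loaded and stored once for all k,
// instead of once per (x, v) pair as with repeated plain vec_mad calls.
inline void vec_mad_unroll(std::vector<float> & y,
                           const std::vector<std::vector<float>> & x,
                           const std::vector<float> & v) {
    assert(x.size() == v.size());
    for (std::size_t i = 0; i < y.size(); ++i) {
        float acc = y[i];
        for (std::size_t k = 0; k < x.size(); ++k) {
            assert(x[k].size() == y.size());
            acc += x[k][i]*v[k];
        }
        y[i] = acc;
    }
}
```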
* set lora_alpha to value of lora_r if it is not set via command line
otherwise only changing lora_r will change scaling of lora adapter used in prediction
* reshuffle original sample order instead of the previous shuffled order
otherwise resumed reshuffle will not result in same sample order
* block tiling for out-prod inspired by mul-mat
block sizes are empirically optimized
roughly doubles the flops of out-prod
* exclude some more known zero values from computations in flash_attn_f32 & flash_attn_back_f32
* add static keywords
* remove commented-out old code
* update train-text-from-scratch with tokenization, sample selection and shuffling from finetune
* remove lbfgs related train parameters
* move common train functions into common/train.[h|cpp]
* move train state into struct train_state
* move train data saving code into callback to unify code of opt_callback
train_params are still different in finetune and train-text-from-scratch, so it can't yet be moved to train.h|cpp
* move common train params into common/train
* move common opt_callback into common/train
* fix consume_common_train_arg
* save and load head_count_kv in lora checkpoints
* increase train_samples by used_samples instead of number of batches
one batch can contain more than one sample when option "fill_with_next_samples" is used
* fix usage of llama_tokenize
* remove static from process_escape since we need it exposed in header
* fix code formatting of long function declarations
* fix condition in load_train_state_gguf
* use die("msg") to replace GGML_ASSERT(!"msg") or throw std::runtime_error("msg")
* fix saving and loading of training type
* remove terminating '\0' from tokenization
(llama_tokenize is now passed the string length instead of relying on terminating '\0')
* fix compile warnings
* fix compile warnings
* use new/delete for train_state instead of malloc/free
using malloc may result in seg faults when trying to assign string fields
* assert that sample_count > 0, avoiding division by zero
* fix frand to return value in interval [0,1)
* add train option "--sample-random-offsets"
Use samples beginning at random offsets.
The offset is only applied to the first sample in each batch context window.
Together with "--fill-with-next-samples" this may help for training endless text generation.
For example given a dataset containing samples "abcd", "ABCD", "0123".
With context size of 8 and options "--fill-with-next-samples", "--no-separate-with-eos", "--no-separate-with-bos",
the context windows of batches could only be filled with "abcdABCD", "ABCDabcd", "0123abcd", etc.
With "--sample-random-offsets" it can also be filled with "23abcdAB", "bcd0123A", etc.
* deduplicate code into function
* remove n_rot hparam, as it must always be hparam.n_embd_head()
* align code
* assert correct base model tensor shapes
* move some params from lora hparams into model hparams and load model params from gguf
this equalizes the model definition in finetune and text-from-scratch and removes the need for additional llama api functions to get model parameters
* remove now unnecessary llama API functions to get model params that were added by this PR
* train-text-from-scratch: automatically allocate model tensors, remove option '--mem-model N'
* train-text-from-scratch: automatically allocate opt context
* train-text-from-scratch: automatically allocate input tensors
* train-text-from-scratch: automatically allocate compute memory
* remove unused options and equalize train-text-from-scratch with finetune
* initialize opt->loss_after with zero
* add export-lora program
* remove trailing whitespace
* add export-lora build in Makefile
* remove unused struct tensor_info from export-lora
* add export-lora build dependency to llama
because it depends on common, which depends on llama
* update finetune README.md
* cancel optimization when specified number of epochs is completed
* improve handling of export-lora arguments
print errors and warnings when files could not be read or created
* Fix export-lora.cpp "not enough space in the context's memory pool" (#1)
* Fix export-lora.cpp "not enough space in the context's memory pool"
Without this patch, export-lora would sometimes error with "not enough space in the context's memory pool (needed 656784, available 656800)".
* increase required context size by 5*GGML_MEM_ALIGN instead of plain 16
---------
Co-authored-by: xaedes <xaedes@gmail.com>
* improve handling of not yet supported tensor types
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: meatbag-18a <145869052+meatbag-18a@users.noreply.github.com>
2023-09-28 20:40:11 +02:00
invalid_param = true;
break;
}
const char * lora_adapter = argv[i];
2024-03-07 10:41:53 +01:00
if (++i >= argc) {
train : finetune LORA (#2632)
* fix track_max_mem in forward_batch_wo_cache_flash_attn_train
* remove unnecessary Adam(W) optimizer tensors.
reduces optimizer memory overhead from 7*modelsize to 2*modelsize.
additionally allows optimizing models with more than 2^31 parameters by replacing int with int64_t.
bumps training checkpoint file version, but old checkpoints can still be read.
new version with fewer tensors is saved.
* add gradient clipping to AdamW
* Fix reset of unused g->nodes and g->grads to NULL
* implement gradient checkpointing for training
reduces memory overhead from O(n_layer) to O(sqrt(n_layer))
as explained in readme of https://github.com/cybertronai/gradient-checkpointing
* remove unused compute buffer 3
* add and use function ggml_build_backward_expand to avoid stack overflows with large maximum number of nodes
GGML_API void ggml_build_backward_expand(struct ggml_context * ctx, struct ggml_cgraph * gf, struct ggml_cgraph * gb, bool keep);
* change AdamW decay parameter to work like the torch AdamW decay parameter
It is now relative to Adam learning rate `alpha*sched`.
Before that it was relative to `sched` only.
`alpha` being the maximum learning rate and `sched` being a scaling parameter in [0..1]
* change default AdamW weight decay parameter used in training to 0.1 as used in nanoGPT
* change default AdamW weight decay parameter defined in ggml to 0.0, making Adam default instead of AdamW
btw: the default weight decay parameter for torch.optim.AdamW is 0.01
* bug fixes for cross entropy loss
ggml_cross_entropy_loss: sums were not correctly added in workload of each thread
ggml_cross_entropy_loss_back: simplify backward process, reducing numerical issues
guard usage of exp f16 lookup in cross entropy by #define GGML_CROSS_ENTROPY_EXP_FP16
cross entropy loss is only used once during training, but it is quite sensitive to numerical errors introduced by exp-f16-lookup.
so exp-f16-lookup for cross entropy loss is disabled by default, trading better gradients for very slightly worse runtime performance.
* fix test-grad0 for cross_entropy_loss
the second argument to cross_entropy_loss must sum up to 1 for each row
* fix test-grad0 for soft_max
don't use only sum as aggregation, because sum of softmax is always 1 -> finite differences should not work
instead use sum(log(soft_max()*(1-eps)+eps)); use eps to avoid log(0)
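a small sketch of why the aggregation matters, assuming a single row and double precision. `soft_max`, `plain_sum` and `log_objective` are illustrative helpers, not the test-grad0 code: the plain sum is 1 for every input, so it carries no gradient signal, while the log-based objective changes with the input.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// numerically stable softmax over one row
inline std::vector<double> soft_max(const std::vector<double> & x) {
    assert(!x.empty());
    const double mx = *std::max_element(x.begin(), x.end());
    std::vector<double> p(x.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        p[i] = std::exp(x[i] - mx);
        sum += p[i];
    }
    for (double & pi : p) {
        pi /= sum;
    }
    return p;
}

// sum(soft_max(x)): constant 1, useless as a scalar test objective
inline double plain_sum(const std::vector<double> & x) {
    double s = 0.0;
    for (double pi : soft_max(x)) {
        s += pi;
    }
    return s;
}

// sum(log(soft_max(x)*(1-eps)+eps)): depends on x, eps avoids log(0)
inline double log_objective(const std::vector<double> & x, double eps) {
    double s = 0.0;
    for (double pi : soft_max(x)) {
        s += std::log(pi*(1.0 - eps) + eps);
    }
    return s;
}
```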
* improve finite differences of test-grad0 by using double instead of float
* change cross_entropy_loss to output average over all rows
this helps keeping the loss and gradients in a sane range
* improve gradient checkpointing
sqrt(n_layers) is only the best checkpoint step when mem size of checkpoints and mem size of layers are equal.
since layers require more memory than the single-tensor-checkpoint we use, the optimal values are computed differently:
```
given: n, u, v
objective: minimize(a*u+b*v) where a*b=n, a>0, b>0
b=n/a
minimize(a*u+v*n/a)
diff(a*u+v*n/a, a) = u - (v*n/a)/a
diff(a*u+v*n/a, a) == 0
u - (v*n/a)/a == 0
u == v*n/(a*a)
u*a*a = v*n
a*a = v*n/u
a = sqrt(n*v/u)
```
this change results in more checkpoints, requiring fewer layers to store between checkpoints, overall improving memory usage.
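the closed form from the derivation above can be checked numerically. this is a sketch only: `optimal_a` and `cost` are hypothetical helper names, and n, u, v are treated as plain positive doubles exactly as in the derivation. with u == v the result reduces to sqrt(n).

```cpp
#include <cassert>
#include <cmath>

// minimizer of a*u + b*v subject to a*b == n, with a > 0 and b > 0
inline double optimal_a(double n, double u, double v) {
    assert(n > 0.0 && u > 0.0 && v > 0.0);
    return std::sqrt(n*v/u);
}

// objective value after substituting b = n/a
inline double cost(double a, double n, double u, double v) {
    assert(a > 0.0);
    return a*u + v*n/a;
}
```

for example, n=16, u=4, v=1 gives a=2, and moving a slightly in either direction increases the cost.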
* disable gradient checkpointing debug output
* llama : fix rope usage in train-text-from-scratch after ChatGLM change
* add more training parameters:
--enable-restart N Only for Adam optimizer. Enable restarts of cos-decay
--disable-restart N Only for Adam optimizer. Disable restarts of cos-decay
--opt-past N Number of optimization iterations to track for delta convergence test. Disabled when zero.
--opt-delta N Maximum delta for delta convergence test. Disabled when <= zero.
--opt-max-no-improvement N Maximum number of optimization iterations with no improvement. Disabled when <= zero.
--adam-epsf N AdamW epsilon for convergence test. Disabled when <= zero.
--adam-min-alpha N Adam minimum learning rate alpha, usually 0.1 * alpha
* replace memcpy with reshape operation so that the graph is not cut at the input
this makes it possible to store other values into the input tensor and then simply recompute the graph without rebuilding it
* remove unused function argument from get_example_targets_batch
* measure and print total training time
* add optimization callback to ggml_opt_resume_g
this callback is called before each iteration with custom data and pointer to learning schedule parameter (only used in Adam(W)).
can be used for dynamic learning schedule and setting input data for batches before each iteration
* use optimization callback in training
allows dynamic learning schedule and different batch data for each iteration without relying on low n_iter and high n_examples parameters
reduces runtime by avoiding restart of optimization function and improves training convergence by providing a different batch for each iteration
* add minimum number of tensor dimensions to apply weight decay (default 2)
this allows to not apply weight decay to bias parameters
* rename training parameter cos-decay-alpha to cos-decay-min and clarify that adam-min-alpha also applies to warmup
* fix increase of model.train_samples and model.train_tokens
now that each optimizer iteration gets its own batch we need to multiply by number of opt iterations
* change sampling parameters for prediction after training to defaults of common.h
and clarify what is context for prediction and what are generated tokens
* tighten abs error bounds for cross_entropy_loss in test-grad0
* add conditional compilation of using F16 exp in flash attention
uncomment `// #define GGML_FLASH_ATTN_EXP_FP16` to enable usage of f16 exp in flash attention
* tighten abs error bounds for flash_attn in test-grad0
* tighten abs error bounds for sqrt in test-grad0
* remove out-commented vectorized code of opt_adam
the vectorized code might be a bit faster for low number of parameters, but it had a big memory usage overhead
* ggml : update ggml_rms_norm_back with configurable eps
* llama training : fix ggml_rms_norm_back calls to pass configurable eps
* remove trailing whitespace
* add train function using automatic gradient checkpointing backward pass and allocator
* in train function replace add_inplace by regular add
because using add_inplace seems to result in different gradients
* don't allocate hash_map on context
because the context has no_alloc=True when using memory allocator resulting in NULL data pointers
* correctly clone reshape and permute operations by also cloning tensor->nb values
* fix variable name and add missing type cast
* terminate recursive tensor cloning when reaching tensor without src tensors
* correctly clone view tensors by setting data pointers
without this the checkpointing would only work when being used together with memory allocator
* fix variable names
* swap arguments to commutative ops to be the same as in `forward_batch_wo_cache_flash_attn`
* add input tensors as checkpoints
so that recursive tensor cloning of gradient checkpointing terminates on input tensors
* fix variable name and add missing boolean negation
* make sure some tensors are not reallocated by inserting new temporary nodes depending on them:
output and parameter gradient tensors need to be available at the end of the graph execution
parameter gradient tensors also need to be available before the graph execution because they are set to zero before each optimizer iteration
checkpoint tensors are allocated all together to reduce memory allocator fragmentation
afterwards, in addition to the temporary nodes, we also need to reset the temporary leafs
* fix ASSERT to work with zero layers
* add training options whether to use allocator and/or unified training function
* integrate unified training function which may use memory allocator
the unified training function also supports arguments whether to use flash attention and/or gradient checkpointing
* format name of cloned tensors with " (clone)" suffix
* set names for tensors in unified train function for easier debugging
* allocate graph on context using ggml_new_graph
* remove handwritten training functions
* remove unused training parameters "use_scratch" and "use_unified"
* remove trailing whitespace
* remove unused train params: mem_compute1_gb & mem_compute2_gb
mem_compute_gb is used for compute when automatic memory allocator is not enabled, otherwise it can be very small to only hold the tensor definitions
mem_compute0_gb is used for automatic memory allocator (as long as measurement of max required size is not implemented)
* remove unused forward_batch function
* add debug asserts in ggml_allocr_alloc to some common pitfalls when using this function directly
* only use ggml_allocr_alloc when tensor has NULL data and is no view
* fix test when to create temporary backward graph
temporary backward graph is only necessary when using checkpointing
* fix memory "leak" in optimizers
each iteration a new cplan with new memory for work data was allocated.
now cplan creation only happens at the start of optimization, with each iteration reusing the cplan and its work data.
* reverse order of for loop in ggml_build_backward_expand to save memory when using gradient checkpointing and allocator
with this loop order gradient checkpointing with allocator on a 16 layer model saves 13% memory; on a 2 layer model it saves 2% memory.
the computation results are the same
* add API functions to access llama model tensors
* add stub example for finetuning, based on train-text-from-scratch
* move and remove code
* add API functions to access remaining model parameters:
mult, head and rot
* first draft for LORA finetune training
* remove const model and layer arguments in API functions for accessing model tensors
* bug fixes to make finetune compile
automatic allocator does not work yet
* add debug prints for training memory improvements
* fix names of lora tensors
* avoid stack overflow resulting from big ggml_cgraph
replace stack allocation and ggml_build_forward by ggml_new_graph in combination with ggml_build_forward_expand
* replace llama API functions to get model tensors by one function to get model tensor by name
LLAMA_API struct ggml_tensor * llama_get_model_tensor(struct llama_model * model, const char * name);
* remove unused call to not existing llama_get_layer_from_model
* implement ggml_compute_forward_out_prod_q_f32
* remove trailing whitespace
* add lora finetune support on quantized base model tensors
* add ggml_add_cast API function
this function works like ggml_add, but accepts a data type for the resulting tensor.
only supported for quantized src0 input.
* use ggml_add_cast in finetuning
lora-applied weights will now have data type F32, which improves gradients when finetuning quantized base models
* bug fix: actually use result type passed to ggml_add_cast
* make sure base model tensors data cannot be used in viewable operations
memory allocator would try to make lora application inplace on base model tensors.
since those are memory mapped this will result in memory access violations
* fix bug in ggml_out_prod which resulted in wrong n_dims of result tensors
* avoid keeping in memory ALL of the gradients
The problem here stems from ggml_graph_reset. This function is called in the optimization function, before each graph computation, to reset the gradients to zero. This required a unique memory slot for each gradient: allocating memory from a previously freed memory location might lead to non-zero input gradients.
During ggml_compute_backward the gradients are built stepwise by adding or subtracting new values, starting from an OP_NONE tensor which needs to contain zero-values. This requires the graph reset.
To avoid this I now remember in ggml_build_backward_expand the original OP_NONE gradient tensors in a hash table, which is passed to ggml_compute_backward. There instead of using add (or sub or similar) I test whether the existing gradient to be changed is a zero-valued-tensor by looking up its existence in the hash table. When it is such a zero-tensor it will not be modified, but replaced by the value to be added, otherwise the regular add (not inplace, allocator will take care of this) will be used. This way none of those zero-tensor values will be necessary in the final backward graph and more importantly they won't need a unique memory slot, just to make them zero.
* remove trailing whitespace
* remove debug prints and function to compute tensor data hash
* improve optimization iteration prints
* adjust maximal values to support finetuning 3B models
* change default finetune params lora_r and lora_alpha to match the n_rank parameters of 4
* bug fix: make sure finetune input gradient is allocated at begin and kept until end
* remove unnecessary src tensor from ggml_get_rows_back
we don't need data of src[2] for computation, only to setup the correct output shape.
remove dependency on src[2], so that allocator can work more freely.
the computational graph is still completely determined, because the output shape is naturally included.
this is similar to how ggml_reshape does it.
* remove unnecessary src tensor from ggml_repeat & ggml_repeat_back
we don't need data of src[1] for computation, only to setup the correct output shape.
remove dependency on src[1], so that allocator can work more freely.
the computational graph is still completely determined, because the output shape is naturally included
* resolve todo
allocator will only make it inplace when they are of the same type
* mixing multiple LORA adapters is now possible
pass more than one '--lora FNAME' argument to apply more than one LORA.
use '--lora-scaled FNAME S' when you want to specify a user-defined scale for an adapter.
* add option to save finetune output every N iterations
* also save latest finetune output with ITERATION="LATEST" and print where files are saved
saving with LATEST makes it easier to resume training from the latest checkpoint
the string "LATEST" can be configured with command line option "--fn-latest STR"
* update checkpoint train stats before saving via "--save-every"
* add command line option `--rank-wo N` for rank of wo tensor
* update finetune README
* fix dump_non_result_info_yaml to output multiple lora adapters
* bug fix: replace GGML_TYPE_SIZE[t] by ggml_type_size(t)
* replace llama_n_mult by llama_n_ff
* finetune bug fixes to compile with merged in code from master
* remove prediction related code to reduce duplicated code with main
use main instead
* reduce large memory overhead in train-text-from-scratch
all gradients had to be pinned so that graph_reset works correctly.
this is no longer necessary with the changes to ggml_compute_backward introduced in this PR.
* add comment explaining why finetune checkpoints are allocated in one block
* make default value of float member a float literal
* handle rms_norm and rope parameters the same as in train-text-from-scratch
* remove unused code
* remove vocab related code as it is unnecessary
* add LLM_KV_TRAINING_TYPE to train-text-from-scratch checkpoints
so that they can be differentiated from lora finetune checkpoints
* add gguf constants and load/save functions from train-text-from-scratch
* add load & save lora finetune checkpoints via gguf
* add python script to convert old finetune checkpoint files to gguf
* remove old checkpoint save & load code
* remove code to print data checksums which was used to verify correctness of new gguf code
* omit tokenization when training is disabled, only save llama lora adapter
training can be disabled by passing '-n 0' to finetune
* remove trailing whitespace
* update README.md
* implement ggml_compute_forward_repeat_f16
* avoid stack overflow of large cgraphs in test-grad0
* add ggml API functions ggml_unravel_index, ggml_get_i32_nd and its analogs for set and for f32
ggml_get_i32_1d, ggml_set_i32_1d, ggml_get_f32_1d, ggml_set_f32_1d now support non-contiguous tensors.
in case of non-contiguous tensor, the 1d index is unraveled into a multi index using ggml_unravel_index to be passed to '_nd' function equivalent.
this fixes a bug in test-grad0 which happens due to ggml_build_backward not building purely contiguous tensors anymore
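a minimal sketch of unraveling a flat index into a 4d multi index and turning it into a byte offset via strides. it assumes the usual ggml convention that ne[0] is the fastest-varying dimension; `shape4`, `unravel_index` and `byte_offset` are illustrative names, not the ggml API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using shape4 = std::array<int64_t, 4>;

// split flat element index i into (i0, i1, i2, i3) for shape ne, i0 fastest
inline shape4 unravel_index(const shape4 & ne, int64_t i) {
    assert(i >= 0 && i < ne[0]*ne[1]*ne[2]*ne[3]);
    shape4 idx;
    idx[0] = i % ne[0];
    idx[1] = (i / ne[0]) % ne[1];
    idx[2] = (i / (ne[0]*ne[1])) % ne[2];
    idx[3] = i / (ne[0]*ne[1]*ne[2]);
    return idx;
}

// byte offset of a multi index given per-dimension strides nb (in bytes).
// with non-contiguous strides this differs from i*sizeof(element).
inline std::size_t byte_offset(const shape4 & idx, const std::array<std::size_t, 4> & nb) {
    std::size_t off = 0;
    for (int d = 0; d < 4; ++d) {
        off += static_cast<std::size_t>(idx[d])*nb[d];
    }
    return off;
}
```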
* increase test-grad0 context mem size to accommodate for bigger cgraph
* add sanity check to ggml_compute_backward, asserting the correct shape of gradients
* fix ggml_acc_or_set to return tensor of correct shape
* remove unused 'inplace' argument from ggml_compute_backward function
inplace operations to add gradients are no longer created by ggml_compute_backward
use allocator to automatically make inplace operations
* add missing argument 'int i0' to ggml_get_i32_nd & ggml_set_i32_nd header declarations
* fix error message in ggml_allocr_alloc to display actual max_avail
* fix check_gradient
ggml_build_backward_expand was previously replaced by ggml_build_backward, but the assignment of forward graph to backward graph was missing
* use tensor->view_src instead of ggml_is_view and get_view_source
* move gradient checkpointing code into ggml, new API function:
// build gradient checkpointing backward graph gb for gf using provided checkpoints
// gb_tmp will contain original backward graph with rewritten backward process nodes,
// but without the second forward pass nodes.
GGML_API void ggml_build_backward_gradient_checkpointing(
struct ggml_context * ctx,
struct ggml_cgraph * gf,
struct ggml_cgraph * gb,
struct ggml_cgraph * gb_tmp,
struct ggml_tensor * * checkpoints,
int n_checkpoints);
* replace custom data getters and setters by ggml functions
* train-text-from-scratch can train (full finetune) gguf models
just pass the gguf model via `--checkpoint-in FN`.
after this, to continue training, pass the generated checkpoint instead of the original gguf model.
tested with smaller models, bigger models may exceed available memory.
use (LORA) finetune for those.
* remove trailing whitespace
* add option to save train-text-from-scratch output every N iterations
* update README.md
* fix warnings
* fix warnings
* remove finetune option to disable allocator
the allocator should always be used.
by making sure that it is always used it gets easier to implement automatic memory requirements computation
* add tensor checkpoints only when gradient checkpointing is enabled
* initialize opt ggml context if none was provided
* add ggml-alloc API function 'ggml_allocr_max_size' to get max size of alloc
GGML_API size_t ggml_allocr_max_size(struct ggml_allocr * alloc);
* finetune: automatically allocate all memory and changes to command line options
remove '--n_examples N' parameter, as it no longer makes sense to call optimization process multiple times in a loop.
add '--only_write_lora' command line option: will skip tokenization and training, to only write a llama.cpp compatible LORA adapter.
remove memory buffer related command line options.
improve iteration console output.
* add finetune to Makefile
* update README.md
* print time per iteration and estimate remaining time
* increase measured alloc size by tensor_alignment
ggml_allocr_reset will reduce the given size by up to tensor_alignment-1
* fix README.md
* add some more allocator debug prints
* bug fix, probably solves the 'ggml_allocr_alloc: not enough space in the buffer' issue
* revert last commit
"bug fix, probably solves the 'ggml_allocr_alloc: not enough space in the buffer' issue"
"alloc was freeing an externally allocated tensor, because it calculated the end of allocator memory as alloc->data + alloc->max_size instead of alloc->data + alloc->size."
This is intentional to reduce the risk of freeing external tensors when measuring. Unless max_size is not properly calculated, I don't see why this is an issue.
* remove unnecessary "0x" before "%p" output
* move measurement memory segment to upper region of the address space
* update README.md
* fix printf format warnings
* add missing gguf_free in load_checkpoint_lora_file
* load default rms_norm and rope parameters from base model
* add gradient accumulation
specify number of accumulation steps with '--grad-acc N'.
this will simulate a bigger batch size of grad_acc*batch.
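a sketch of the accumulation loop, under the assumption that micro-batch gradients are averaged before a single optimizer step (the text above does not say whether they are summed or averaged). `accumulate_gradients` and the `grad_fn` callback are hypothetical and stand in for one forward/backward pass per micro-batch.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// accumulate gradients over grad_acc micro-batches and return their mean.
// grad_fn(step) is a hypothetical callback that must evaluate a different
// micro-batch for each step.
inline std::vector<float> accumulate_gradients(
        std::size_t n_params,
        int grad_acc,
        const std::function<std::vector<float>(int)> & grad_fn) {
    assert(grad_acc > 0);
    std::vector<float> acc(n_params, 0.0f);
    for (int step = 0; step < grad_acc; ++step) {
        const std::vector<float> g = grad_fn(step);
        assert(g.size() == n_params);
        for (std::size_t i = 0; i < n_params; ++i) {
            acc[i] += g[i];
        }
    }
    for (float & a : acc) {
        a /= static_cast<float>(grad_acc); // averaging is an assumption
    }
    return acc;
}
```

the optimizer then sees one gradient that covers grad_acc*batch samples while only one micro-batch is held in memory at a time.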
* fix tracking of train_samples and train_tokens
* build : fix compile warnings
* ggml : fix L-BFGS linesearch loop
* improve finetune time measurement
fix printf warnings on systems where int64_t is (long int).
change time datatypes to double because values get big with long training times.
exclude file saving from time measurement.
converge faster to actual time per iteration by removing very small first duration before first iteration was performed.
fix bug in output of total training time, the reported value was 1000 times too small.
* specify default lora rank with '--lora-r N'
'--lora-r N' will specify default rank for all tensors
'--rank-wq N', etc. will override this default rank for specific tensor types.
* fix gradient accumulation bug where the same batch was used for each microstep
* fix gradient accumulation bug where the same batch was used for each microstep
* support grouped-query-attention in ggml_flash_attn and ggml_flash_attn_back
k and v can now be repeated in q along ne[2]
in forward pass just use modulo to compute k and v indices, like ik2 = iq2 % nek2.
in backward pass this won't work as easily, because multiple threads will compete to accumulate to the same k->grad[:,ik1,ik2,ik3] and v->grad[:,iv1,iv2,iv3].
so we change the parallelization over q rows to be over k rows. this ensures non-overlapping (ik2,ik3) across threads.
in each thread we then iterate over the number of repetitions of k/v in q to compute iq2 as iq2 = ik2 + irep*nek2.
since ne2 is not the same for q,k and v we also change how the gradients are concatenated into the result tensor.
additionally the offsets of gradq, gradk and gradv in the result tensor are now memory aligned.
we also simplify the compute_backward part of flash_attn to use ggml_reshape instead of switching over the number of dimensions.
this needs a small change to ggml_reshape, removing the assertion of second argument to be contiguous.
since only the shape (ne) of the second reshape argument is of relevance, its memory layout (nb) is irrelevant -> it can very well be non-contiguous.
change test-grad0 to also test for repeated k/v in q.
this changes the rng and now results in small gradient differences in softmax. these solely come from using f16 exp table lookup in forward softmax: when temporarily changing softmax to use actual exp function, the reported gradient differences go away. gradient differences coming solely from f16 table lookup are acceptable.
added a note to explain this.
* add llama API functions to get grouped-query-attention n_head parameter 'n_head_kv'.
* fix finetune to support grouped-query-attention (using flash-attention)
note: ggml changes to ggml_out_prod are necessary to support grouped-query-attention without flash-attention.
* support broadcastable a in out_prod(a, b) and backward pass of broadcasting mul_mat(a, b)
* test broadcasting mul_mat backward pass
* decouple random number generator of each operation test
when changing one test the rng of other tests is not influenced anymore
* add comment briefly describing what ggml_repeat_back does
* simplify broadcasting mul_mat backward using ggml_repeat_back
* add cgraph evaluation order member and corresponding enum type
this controls in which order ggml_build_forward visits source nodes.
by default the nodes are visited left to right, i.e. src[0] first.
in some cases it is beneficial for ggml-alloc to visit in a different order.
two possible orders are supported: left-to-right (src[0] first) and right-to-left (src[0] last).
* measure max compute size for each cgraph eval order and use best order
this can bring huge memory savings:
e.g. codellama-34b with n_ctx=64, n_batch=1 goes from 92927.8 MB down to 4627.6 MB
* remove unused command line options
* add sample start patterns and options to force new or by default resume last shuffling
* update shuffle rng state on reshuffle
* exclude known zero values from computations in flash_attn_f32 & flash_attn_back_f32
* remove probably unnecessary exception type flags from stringstream
* pass correct max number of tokens to llama_tokenize
* account for possible leading whitespace that will be added by tokenizer
e.g. '\t' will be tokenized by llama spm tokenizer to [29871, 12]
* use unrolled vec_mad in out_prod
y is vec_mad result vec.
x is vec_mad input vec.
v is vec_mad input scalar.
ggml_vec_mad_f32_unroll will internally loop over x and v with same y.
GGML_VEC_MAD_UNROLL is by default defined to 32.
This value was empirically optimized using performance test runs of out-prod in openllama-3b finetune with 256 context length and batch size 1. It gives a 23% performance boost for out_prod.
Full measurements of out-prod runtime in ms:
unroll  unroll_xv   unroll_yv
1       67014.643   87826.469
2       77117.552   89077.656
4       72091.311   109121.657
8       61077.543   88678.334
16      56914.67    79514.947
24      59024.595   84350.254
28      55952.446   83368.73
32      51476.658   85177.745
36      55973.792   84659.92
40      55139.616   93844.738
48      60736.392   93330.267
64      99856.878   116994.99
The unroll_yv column is when unrolling yv instead of xv
* set lora_alpha to value of lora_r if it is not set via command line
otherwise only changing lora_r will change scaling of lora adapter used in prediction
* reshuffle original sample order instead of the previous shuffled order
otherwise resumed reshuffle will not result in same sample order
* block tiling for out-prod inspired by mul-mat
block sizes are empirically optimized
roughly doubles the flops of out-prod
* exclude some more known zero values from computations in flash_attn_f32 & flash_attn_back_f32
* add static keywords
* remove commented-out old code
* update train-text-from-scratch with tokenization, sample selection and shuffling from finetune
* remove lbfgs related train parameters
* move common train functions into common/train.[h|cpp]
* move train state into struct train_state
* move train data saving code into callback to unify code of opt_callback
train_params are still different in finetune and train-text-from-scratch, so it can't yet be moved to train.h|cpp
* move common train params into common/train
* move common opt_callback into common/train
* fix consume_common_train_arg
* save and load head_count_kv in lora checkpoints
* increase train_samples by used_samples instead of number of batches
one batch can contain more than one sample when option "fill_with_next_samples" is used
* fix usage of llama_tokenize
* remove static from process_escape since we need it exposed in header
* fix code formatting of long function declarations
* fix condition in load_train_state_gguf
* use die("msg") instead of GGML_ASSERT(!"msg") or throw std::runtime_error("msg")
* fix saving and loading of training type
* remove terminating '\0' from tokenization
(llama_tokenize is now passed the string length instead of relying on terminating '\0')
* fix compile warnings
* fix compile warnings
* use new/delete for train_state instead of malloc/free
using malloc may result in seg faults when trying to assign string fields
* assert that sample_count > 0, avoiding division by zero
* fix frand to return value in interval [0,1)
* add train option "--sample-random-offsets"
Use samples beginning at random offsets.
The offset is only applied to the first sample in each batch context window.
Together with "--fill-with-next-samples" this may help for training endless text generation.
For example given a dataset containing samples "abcd", "ABCD", "0123".
With context size of 8 and options "--fill-with-next-samples", "--no-separate-with-eos", "--no-separate-with-bos",
the context windows of batches could only be filled with "abcdABCD", "ABCDabcd", "0123abcd", etc.
With "--sample-random-offsets" it can also be filled with "23abcdAB", "bcd0123A", etc.
* deduplicate code into function
* remove n_rot hparam, as it must always be hparam.n_embd_head()
* align code
* assert correct base model tensor shapes
* move some params from lora hparams into model hparams and load model params from gguf
this equalizes the model definition in finetune and text-from-scratch and removes the need for additional llama api functions to get model parameters
* remove now unnecessary llama API functions to get model params that were added by this PR
* train-text-from-scratch: automatically allocate model tensors, remove option '--mem-model N'
* train-text-from-scratch: automatically allocate opt context
* train-text-from-scratch: automatically allocate input tensors
* train-text-from-scratch: automatically allocate compute memory
* remove unused options and equalize train-text-from-scratch with finetune
* initialize opt->loss_after with zero
* add export-lora program
* remove trailing whitespace
* add export-lora build in Makefile
* remove unused struct tensor_info from export-lora
* add export-lora build dependency to llama
because it depends on common, which depends on llama
* update finetune README.md
* cancel optimization when specified number of epochs is completed
* improve handling of export-lora arguments
print errors and warnings when files could not be read or created
* Fix export-lora.cpp "not enough space in the context's memory pool" (#1)
* Fix export-lora.cpp "not enough space in the context's memory pool"
Without this patch, export-lora would sometimes error with "not enough space in the context's memory pool (needed 656784, available 656800)".
* increase required context size by 5*GGML_MEM_ALIGN instead of plain 16
---------
Co-authored-by: xaedes <xaedes@gmail.com>
* improve handling of not yet supported tensor types
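A rough C++ sketch of the "use unrolled vec_mad in out_prod" item above, where y is the result vector, x the input vectors and v the input scalars. The names `vec_mad_unroll` and `kUnroll` are hypothetical stand-ins and not ggml's `ggml_vec_mad_f32_unroll` or `GGML_VEC_MAD_UNROLL`; the unroll factor is 4 here only to keep the sketch short, whereas the commit reports 32 as the default.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical unroll factor for illustration (the commit reports a default of 32).
constexpr int kUnroll = 4;

// For every element i: y[i] += x[0][i]*v[0] + ... + x[kUnroll-1][i]*v[kUnroll-1].
// All kUnroll (x, v) pairs share the same y, so y is read and written once per element
// instead of once per pair.
void vec_mad_unroll(std::size_t n, float * y,
                    const float * const x[kUnroll], const float v[kUnroll]) {
    assert(y != nullptr && x != nullptr && v != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        float acc = y[i];
        for (int k = 0; k < kUnroll; ++k) {
            acc += x[k][i] * v[k];
        }
        y[i] = acc;
    }
}
```

The measurements in the commit message suggest that the best unroll factor is hardware and workload dependent, so any fixed value should be treated as a tuning parameter.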
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: meatbag-18a <145869052+meatbag-18a@users.noreply.github.com>
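A minimal C++ sketch of the "--sample-random-offsets" example from the commit message above. It assumes samples are plain strings, that "--fill-with-next-samples" is in effect with no BOS/EOS separators, and that the caller supplies the (possibly shuffled) sample order. `fill_window` and its parameters are hypothetical names, not the functions used in common/train.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Builds one context window of n_ctx characters.
// The offset is applied only to the first sample; following samples are appended whole,
// wrapping around the sample list, and the result is cut to n_ctx.
std::string fill_window(const std::vector<std::string> & samples,
                        std::size_t first, std::size_t offset, std::size_t n_ctx) {
    assert(!samples.empty());
    assert(first < samples.size());
    assert(offset < samples[first].size());
    std::string window = samples[first].substr(offset);
    for (std::size_t next = first + 1; window.size() < n_ctx; ++next) {
        window += samples[next % samples.size()];
    }
    window.resize(n_ctx);
    return window;
}
```

With the samples "abcd", "ABCD", "0123" in that order, starting at "0123" with offset 2 and a context size of 8 gives "23abcdAB". The "bcd0123A" window from the commit message needs a shuffled order in which "0123" follows "abcd".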
2023-09-28 20:40:11 +02:00
                invalid_param = true;
                break;
            }
2024-02-03 12:23:37 +01:00
            params.lora_adapter.emplace_back(lora_adapter, std::stof(argv[i]));
2023-07-13 15:58:25 +02:00
            params.use_mmap = false;
2024-03-07 10:41:53 +01:00
        } else if (arg == "--lora-base") {
            if (++i >= argc) {
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort: there are 24 PRs closed in the submitter's repo alone, over 160 commits, and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
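One of the items above mentions handling a generation that ends on a partial stop string while streaming. A minimal C++ sketch of that check, under the assumption that the server should hold back any trailing text that could still grow into a stop word; `find_partial_stop` is a hypothetical helper written for illustration and is not taken from server.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Returns the position in `text` where a proper prefix of `stop` starts at the very end
// of `text`, or std::string::npos if the text does not end with any such prefix.
// A complete match of `stop` is not reported here and would be handled separately.
std::size_t find_partial_stop(const std::string & stop, const std::string & text) {
    assert(!stop.empty());
    const std::size_t max_len = std::min(stop.size() - 1, text.size());
    // Try the longest candidate prefix first.
    for (std::size_t len = max_len; len > 0; --len) {
        if (text.compare(text.size() - len, len, stop, 0, len) == 0) {
            return text.size() - len;
        }
    }
    return std::string::npos;
}
```

When streaming, text before the returned position can be sent immediately, while the tail is kept until later tokens either complete the stop word or rule it out.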
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
                invalid_param = true;
                break;
            }
            params.lora_base = argv[i];
2024-03-07 10:41:53 +01:00
        } else if (arg == "-v" || arg == "--verbose") {
2023-06-17 13:53:04 +02:00
#if SERVER_VERBOSE != 1
            LOG_WARNING("server.cpp is not built with verbose logging.", {});
#else
            server_verbose = true;
#endif
2024-03-07 10:41:53 +01:00
        } else if (arg == "--mlock") {
2023-06-17 13:53:04 +02:00
            params.use_mlock = true;
2024-03-07 10:41:53 +01:00
        } else if (arg == "--no-mmap") {
2023-06-17 13:53:04 +02:00
            params.use_mmap = false;
2024-03-07 10:41:53 +01:00
        } else if (arg == "--numa") {
2024-02-16 10:31:07 +01:00
            if (++i >= argc) {
                invalid_param = true;
                break;
            } else {
                std::string value(argv[i]);
                /**/ if (value == "distribute" || value == "") { params.numa = GGML_NUMA_STRATEGY_DISTRIBUTE; }
2024-03-07 10:41:53 +01:00
                else if (value == "isolate") { params.numa = GGML_NUMA_STRATEGY_ISOLATE; }
                else if (value == "numactl") { params.numa = GGML_NUMA_STRATEGY_NUMACTL; }
2024-02-16 10:31:07 +01:00
                else { invalid_param = true; break; }
            }
2024-03-07 10:41:53 +01:00
        } else if (arg == "--embedding" || arg == "--embeddings") {
2023-10-22 21:53:08 +02:00
            params.embedding = true;
2024-03-07 10:41:53 +01:00
        } else if (arg == "-cb" || arg == "--cont-batching") {
2023-10-22 21:53:08 +02:00
            params.cont_batching = true;
ggml : add Flash Attention (#5021)
* ggml : add ggml_flash_attn_ext API
* ggml : fix GQA support in ggml_flash_attn_ext
* ggml : online attention (CPU)
* metal : initial implementation
* metal : f16 precision
* metal : reduce branches
* metal : specialize for head size
* wip : 8 rows per simd group
* wip : 4 rows per simd group
* wip : template for rows per warp
* metal : parallelize across KV size
* metal : parallel reduce across heads
* metal : efficient flash_attn_f16 implementation
* metal : avoid redundant loads of the attention
* metal : scale and mask in matrix form
* metal : fix comment
* llama : avoid ggml_cast, use F32 query
* metal : add parallel reduce version (disabled)
* metal : move output into local memory + optimize
- the result from each simdgroup now stays in the registers
- significantly reduced SRAM usage
- more efficient skipping of -INF blocks
- avoid simdgroup barrier in hot loop
- add comments
* metal : add tests, fix scaling, support C > 32
* metal : improve precision
* ggml : fix f16 mad
* metal : minor
* metal : support Q > 8
* tests : add ATTN tests
* metal : disable buffer allocation logs
* tests : more
* metal : faster inner loop for C == 32
* metal : fix array initialization
* tests : ifdef
* ggml : switch to padded F16 mask for ggml_soft_max, ggml_flash_attn_ext
* ggml : fix ggml_soft_max mask requirement
* cuda : fix soft_max to use correct mask size
* cuda : add flash_attn kernel (wip)
* metal : optimize softmax for C > 32
* metal : optimize softmax
* tests : minor fix
* cuda : avoid zeroing fragments
* tests : update dims
* cuda : fix __hisinf() result check
* cuda : avoid warp_reduce for smax
* cuda : use int instead of int64_t
Noticeably improves performance (thanks to Johannes)
* cuda : make loops use the same loop values
Thanks Johannes again for the tip
* cuda : unroll some of the loops
* cuda : avoid __hisinf branches
* cuda : use half2 in softmax
* cuda : switch to 1 warp for bs > 16
* cuda : speed-up reduce part of the kernel
* cuda : unroll Q*K^T loop
* cuda : fix -INF block check
* cuda : simplify softmax
* cuda : fix matrix names
* cuda : minor
* llama : adapt to F16 KQ_pos
* llama : adapt new models to F16 KQ_mask
* ggml : fix F16 store (ARM NEON)
* llama : fix type of KQ_mask and KQ_pos
* ggml : fix CPU soft_max
* tests : add hs=256
* cuda : fix build
* metal : improve perf via smaller int registers
* cuda : adapt soft_max to F16 mask and pos
* CUDA: faster FlashAttention, kernel for bs == 1
* 16 cols for Phi-2
* no vec for hs, no hs==256 ncols==32 for Volta
* adjust kernel selection logic
* 4 warps, 256 stride for all D
* no ncols == 64
* Multiple parallel blocks for batch size 1
* fix compile warnings
* fix excessive KQ_b loads
* fix cmake build
* fix KV cache padding, NaN from INFINITY (#6438)
* llama : flash_attn cparam + fix defrag
* server: support flash_attn param
* server: bench: enable flash_attn param
* CUDA: refactor host code, dyn. par. blocks
* fix flash_attn_vec_f16 race condition
* flush softmax exp below threshold to 0
* store temp KQ in registers
* Calculate KQ as FP32 if KQV has GGML_PREC_F32
* Add __hgt2_mask implementation for CUDA 11
* fix KQ FP32 precision for parallel_blocks > 1
* llama-bench : add -fa,--flash-attn arg
* metal : add BS=1 kernel for flash attention (#6508)
* metal : add BS=1 kernel for flash attention (wip)
* metal : support more than 1 warps
* metal : opts
* metal : opt
* metal : switch to parallel reduce
* metal : reduce registers
* metal : simplify
* metal : initial FA vec kernel
* metal : use F32 attention accumulators
* batched-bench : add fattn arg
* llama : simplify llama_build_kv_store
ggml-ci
* llama : adapt build_olmo to changes
* ggml : fix arm fp16 store on windows
* metal : clean-up
* metal : clean-up kernel code
* metal : minor
* tests : remove benchmarks
ggml-ci
* ggml : fix avx512 const correctness
ggml-ci
* ggml : fix soft_max with bias on CPU
ggml-ci
* common : print --flash-attn in help
* ggml : fix num dimensions in ggml_flash_attn_ext
* llama : force disable flash attention for incompatible models
* ggml : ggml_soft_max support F16/F32 mask/pos
ggml-ci
* cuda : uint -> uint32_t
* cuda : "constexpr dim3" -> "const dim3"
ggml-ci
* cuda : try to fix __hgt2_mask
ggml-ci
* ggml : add TODO's for F16/F32 mask/pos support in other backends
* llama : replace bool need_kq_pos with use_alibi
* llama : prep ALiBi support for BERT models
ggml-ci
* llama : fix n_batch requirements
ggml-ci
* cont
* server : add help for --flash-attn arg
* llama : disable FA for AMD
* tests : remove TMP_ATTN_BENCH
ggml-ci
* llama : support save/load state with FA enabled
ggml-ci
* ci : add CUDA save-load-state tests
ggml-ci
* llama : llama_kv_cache_clear zeroes data + fix save-load seq
ggml-ci
* llama : fix copy-paste errors, add TODO
* llama : disallow incompatible states
* llama : update llama_state_get_size after v_trans field
* metal : remove tmp log
* llama : add static reminder for llama_state_get_size
* metal : fix max nsg
ggml-ci
* ci : fix arg order
ggml-ci
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Pierrick HYMBERT <pierrick.hymbert@gmail.com>
2024-04-30 11:16:08 +02:00
        } else if (arg == "-fa" || arg == "--flash-attn") {
            params.flash_attn = true;
2024-03-07 10:41:53 +01:00
        } else if (arg == "-np" || arg == "--parallel") {
            if (++i >= argc) {
2023-10-22 21:53:08 +02:00
                invalid_param = true;
                break;
            }
            params.n_parallel = std::stoi(argv[i]);
2024-03-07 10:41:53 +01:00
        } else if (arg == "-n" || arg == "--n-predict") {
            if (++i >= argc) {
2023-10-22 21:53:08 +02:00
                invalid_param = true;
                break;
            }
            params.n_predict = std::stoi(argv[i]);
2024-03-07 10:41:53 +01:00
        } else if (arg == "-spf" || arg == "--system-prompt-file") {
            if (++i >= argc) {
2023-10-22 21:53:08 +02:00
                invalid_param = true;
                break;
            }
            std::ifstream file(argv[i]);
            if (!file) {
                fprintf(stderr, "error: failed to open file '%s'\n", argv[i]);
                invalid_param = true;
                break;
            }
2024-03-07 10:41:53 +01:00
            std::string system_prompt;
2023-10-22 21:53:08 +02:00
            std::copy(
                std::istreambuf_iterator<char>(file),
                std::istreambuf_iterator<char>(),
2024-03-07 10:41:53 +01:00
                std::back_inserter(system_prompt)
2023-10-22 21:53:08 +02:00
            );
2024-03-07 10:41:53 +01:00
            sparams.system_prompt = system_prompt;
        } else if (arg == "-ctk" || arg == "--cache-type-k") {
2024-02-23 20:31:54 +01:00
            params.cache_type_k = argv[++i];
2024-03-07 10:41:53 +01:00
        } else if (arg == "-ctv" || arg == "--cache-type-v") {
2024-02-23 20:31:54 +01:00
            params.cache_type_v = argv[++i];
2024-03-07 10:41:53 +01:00
        } else if (arg == "--log-format") {
            if (++i >= argc) {
2024-02-25 13:50:32 +01:00
                invalid_param = true;
                break;
            }
2024-03-07 10:41:53 +01:00
            if (std::strcmp(argv[i], "json") == 0) {
2024-02-25 13:50:32 +01:00
                server_log_json = true;
2024-03-07 10:41:53 +01:00
            } else if (std::strcmp(argv[i], "text") == 0) {
2024-02-25 13:50:32 +01:00
                server_log_json = false;
2024-03-07 10:41:53 +01:00
            } else {
2024-02-25 13:50:32 +01:00
                invalid_param = true;
                break;
            }
2024-03-07 10:41:53 +01:00
        } else if (arg == "--log-disable") {
2023-11-30 23:25:49 +01:00
            log_set_target(stdout);
            LOG_INFO("logging to file is disabled.", {});
2024-03-07 10:41:53 +01:00
        } else if (arg == "--slots-endpoint-disable") {
2024-02-18 18:39:57 +01:00
            sparams.slots_endpoint = false;
2024-03-07 10:41:53 +01:00
        } else if (arg == "--metrics") {
2024-02-25 13:49:43 +01:00
            sparams.metrics_endpoint = true;
2024-04-08 14:43:30 +02:00
        } else if (arg == "--slot-save-path") {
            if (++i >= argc) {
                invalid_param = true;
                break;
            }
            sparams.slot_save_path = argv[i];
            // if doesn't end with DIRECTORY_SEPARATOR, add it
            if (!sparams.slot_save_path.empty() && sparams.slot_save_path[sparams.slot_save_path.size() - 1] != DIRECTORY_SEPARATOR) {
                sparams.slot_save_path += DIRECTORY_SEPARATOR;
            }
2024-03-07 10:41:53 +01:00
        } else if (arg == "--chat-template") {
            if (++i >= argc) {
2024-02-11 11:16:22 +01:00
                invalid_param = true;
                break;
            }
2024-02-20 15:58:27 +01:00
            if (!verify_custom_template(argv[i])) {
                fprintf(stderr, "error: the supplied chat template is not supported: %s\n", argv[i]);
                fprintf(stderr, "note: llama.cpp does not use jinja parser, we only support commonly used templates\n");
2024-02-11 11:16:22 +01:00
                invalid_param = true;
                break;
            }
2024-02-20 15:58:27 +01:00
            sparams.chat_template = argv[i];
2024-03-07 10:41:53 +01:00
        } else if (arg == "--override-kv") {
2024-01-02 11:38:15 +01:00
            if (++i >= argc) {
                invalid_param = true;
                break;
            }
2024-04-26 20:06:33 +02:00
            if (!parse_kv_override(argv[i], params.kv_overrides)) {
2024-01-02 11:38:15 +01:00
                fprintf(stderr, "error: Invalid type for KV override: %s\n", argv[i]);
                invalid_param = true;
                break;
            }
2024-03-07 10:41:53 +01:00
        } else {
2023-06-17 13:53:04 +02:00
            fprintf(stderr, "error: unknown argument: %s\n", arg.c_str());
            server_print_usage(argv[0], default_params, default_sparams);
            exit(1);
        }
2023-05-21 19:51:18 +02:00
    }
2024-03-07 10:41:53 +01:00
2024-04-30 01:52:50 +02:00
    gpt_params_handle_model_default(params);
2024-01-02 11:38:15 +01:00
    if (!params.kv_overrides.empty()) {
2024-02-03 12:23:37 +01:00
        params.kv_overrides.emplace_back();
2024-01-02 11:38:15 +01:00
        params.kv_overrides.back().key[0] = 0;
    }
2023-06-17 13:53:04 +02:00
2024-03-07 10:41:53 +01:00
    if (invalid_param) {
2023-06-17 13:53:04 +02:00
        fprintf(stderr, "error: invalid parameter for argument: %s\n", arg.c_str());
        server_print_usage(argv[0], default_params, default_sparams);
        exit(1);
2023-05-21 19:51:18 +02:00
}
}
2024-03-07 10:41:53 +01:00
static void log_server_request(const httplib::Request & req, const httplib::Response & res) {
2024-02-25 13:50:32 +01:00
    // skip GH copilot requests when using default port
2024-03-07 10:41:53 +01:00
    if (req.path == "/v1/health" || req.path == "/v1/completions") {
2024-02-25 13:50:32 +01:00
        return;
    }
2023-06-17 13:53:04 +02:00
    LOG_INFO("request", {
2024-02-25 13:50:32 +01:00
        {"remote_addr", req.remote_addr},
        {"remote_port", req.remote_port},
        {"status",      res.status},
        {"method",      req.method},
        {"path",        req.path},
        {"params",      req.params},
    });
2023-07-04 16:05:27 +02:00
    LOG_VERBOSE("request", {
2024-02-25 13:50:32 +01:00
        {"request",  req.body},
        {"response", res.body},
    });
2023-06-17 13:53:04 +02:00
}
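The early return in log_server_request can be read as a small path filter. The sketch below isolates that idea; should_skip_request_log is a hypothetical helper name that does not exist in the server source, and the set simply holds the two paths the function above compares against.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Hypothetical helper (not part of the server source): returns true when a
// request path should be left out of the request log.
inline bool should_skip_request_log(const std::string & path) {
    // The two paths that log_server_request() returns early for.
    static const std::unordered_set<std::string> skipped = {
        "/v1/health",
        "/v1/completions",
    };
    return skipped.count(path) > 0;
}
```

Keeping the paths in one container would make it easier to extend the filter later, at the cost of a hash lookup per request instead of two string comparisons.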
2023-05-21 19:51:18 +02:00
2024-02-18 17:23:16 +01:00
std::function<void(int)> shutdown_handler;
2024-02-28 09:55:37 +01:00
std::atomic_flag is_terminating = ATOMIC_FLAG_INIT;
2024-03-07 10:41:53 +01:00
2024-02-28 09:55:37 +01:00
inline void signal_handler(int signal) {
    if (is_terminating.test_and_set()) {
        // in case it hangs, we can force terminate the server by hitting Ctrl+C twice
        // this is for better developer experience, we can remove when the server is stable enough
        fprintf(stderr, "Received second interrupt, terminating immediately.\n");
        exit(1);
    }
2024-03-07 10:41:53 +01:00
2024-02-28 09:55:37 +01:00
    shutdown_handler(signal);
}
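The handler above relies on std::atomic_flag::test_and_set() returning the previous value of the flag: the first interrupt sees false and shuts down gracefully, and any later interrupt sees true and exits immediately. A minimal sketch of that decision on its own, assuming hypothetical names (interrupt_gate, interrupt_action) that are not part of the server source:

```cpp
#include <atomic>
#include <cassert>

// Hypothetical names for illustration only; the server calls
// shutdown_handler() or exit(1) directly instead of returning a value.
enum class interrupt_action { graceful_shutdown, force_exit };

struct interrupt_gate {
    std::atomic_flag seen = ATOMIC_FLAG_INIT;

    // test_and_set() atomically sets the flag and returns its previous value,
    // so only the very first call observes false.
    interrupt_action on_signal() {
        return seen.test_and_set() ? interrupt_action::force_exit
                                   : interrupt_action::graceful_shutdown;
    }
};
```

Returning a value instead of calling exit() keeps the sketch testable; the real handler performs the side effects in place.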
2024-02-18 17:23:16 +01:00
2024-03-07 10:41:53 +01:00
int main(int argc, char ** argv) {
2023-12-17 16:02:16 +01:00
#if SERVER_VERBOSE != 1
    log_disable();
#endif
2023-06-17 13:53:04 +02:00
// own arguments required by this example
2024-03-07 10:41:53 +01:00
    gpt_params params;
2023-06-17 13:53:04 +02:00
    server_params sparams;
    // struct that contains llama context and inference
2024-03-07 10:41:53 +01:00
    server_context ctx_server;
2023-06-17 13:53:04 +02:00
2024-03-07 10:41:53 +01:00
    server_params_parse(argc, argv, sparams, params);
2023-06-17 13:53:04 +02:00
2024-03-07 10:41:53 +01:00
    if (!sparams.system_prompt.empty()) {
        ctx_server.system_prompt_set(json::parse(sparams.system_prompt));
    }
    if (params.model_alias == "unknown") {
2023-06-17 13:53:04 +02:00
        params.model_alias = params.model;
}
2024-02-16 10:31:07 +01:00
    llama_backend_init();
    llama_numa_init(params.numa);
2023-06-17 13:53:04 +02:00
2024-03-07 10:41:53 +01:00
    LOG_INFO("build info", {
        {"build",  LLAMA_BUILD_NUMBER},
        {"commit", LLAMA_COMMIT}
    });
2023-10-22 21:53:08 +02:00
2023-06-17 13:53:04 +02:00
    LOG_INFO("system info", {
2024-03-07 10:41:53 +01:00
        {"n_threads",       params.n_threads},
        {"n_threads_batch", params.n_threads_batch},
        {"total_threads",   std::thread::hardware_concurrency()},
        {"system_info",     llama_print_system_info()},
    });
2023-06-17 13:53:04 +02:00
2024-03-09 10:57:09 +01:00
    std::unique_ptr<httplib::Server> svr;
#ifdef CPPHTTPLIB_OPENSSL_SUPPORT
    if (sparams.ssl_key_file != "" && sparams.ssl_cert_file != "") {
        LOG_INFO("Running with SSL", {{"key", sparams.ssl_key_file}, {"cert", sparams.ssl_cert_file}});
        svr.reset(
            new httplib::SSLServer(sparams.ssl_cert_file.c_str(), sparams.ssl_key_file.c_str())
        );
    } else {
        LOG_INFO("Running without SSL", {});
        svr.reset(new httplib::Server());
    }
#else
    svr.reset(new httplib::Server());
#endif
2024-01-10 20:56:05 +01:00
2024-01-11 08:10:34 +01:00
    std::atomic<server_state> state{SERVER_STATE_LOADING_MODEL};
2024-01-10 20:56:05 +01:00
2024-03-09 10:57:09 +01:00
    svr->set_default_headers({{"Server", "llama.cpp"}});
2024-01-11 19:02:48 +01:00
    // CORS preflight
2024-03-09 10:57:09 +01:00
    svr->Options(R"(.*)", [](const httplib::Request & req, httplib::Response & res) {
2024-03-07 10:41:53 +01:00
        res.set_header("Access-Control-Allow-Origin", req.get_header_value("Origin"));
2024-01-11 19:02:48 +01:00
        res.set_header("Access-Control-Allow-Credentials", "true");
2024-03-07 10:41:53 +01:00
        res.set_header("Access-Control-Allow-Methods", "POST");
        res.set_header("Access-Control-Allow-Headers", "*");
2024-03-13 11:39:11 +01:00
        return res.set_content("", "application/json; charset=utf-8");
2024-01-11 19:02:48 +01:00
    });
2024-01-10 20:56:05 +01:00
2024-03-09 10:57:09 +01:00
    svr->set_logger(log_server_request);
2024-01-10 20:56:05 +01:00
2024-03-11 10:56:41 +01:00
    auto res_error = [](httplib::Response & res, json error_data) {
        json final_response {{"error", error_data}};
        res.set_content(final_response.dump(), "application/json; charset=utf-8");
        res.status = json_value(error_data, "code", 500);
    };
2024-01-10 20:56:05 +01:00
2024-03-11 10:56:41 +01:00
    svr->set_exception_handler([&res_error](const httplib::Request &, httplib::Response & res, std::exception_ptr ep) {
        std::string message;
2024-03-07 10:41:53 +01:00
        try {
            std::rethrow_exception(std::move(ep));
2024-03-11 10:56:41 +01:00
        } catch (std::exception & e) {
            message = e.what();
2024-03-07 10:41:53 +01:00
        } catch (...) {
2024-03-11 10:56:41 +01:00
            message = "Unknown Exception";
2024-03-07 10:41:53 +01:00
        }
2024-03-11 10:56:41 +01:00
        json formatted_error = format_error_response(message, ERROR_TYPE_SERVER);
        LOG_VERBOSE("Got exception", formatted_error);
        res_error(res, formatted_error);
2024-03-07 10:41:53 +01:00
    });
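The exception handler above turns a std::exception_ptr into a message by rethrowing it and catching by type. The same technique can be sketched as a standalone function using only the standard library; describe_exception is a hypothetical name, and the sketch assumes the pointer is non-null, since std::rethrow_exception has undefined behavior on a null std::exception_ptr.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper (not part of the server source): extracts a printable
// message from a non-null std::exception_ptr.
inline std::string describe_exception(std::exception_ptr ep) {
    try {
        // Rethrow so the catch clauses below can inspect the dynamic type.
        std::rethrow_exception(std::move(ep));
    } catch (const std::exception & e) {
        return e.what();
    } catch (...) {
        // Anything not derived from std::exception carries no message.
        return "Unknown Exception";
    }
}
```

For example, describe_exception(std::make_exception_ptr(std::runtime_error("boom"))) yields the string passed to the exception's constructor, while an exception_ptr holding an int falls through to the catch-all branch.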
2024-03-11 10:56:41 +01:00
    svr->set_error_handler([&res_error](const httplib::Request &, httplib::Response & res) {
2024-03-07 10:41:53 +01:00
        if (res.status == 404) {
2024-03-11 10:56:41 +01:00
            res_error(res, format_error_response("File Not Found", ERROR_TYPE_NOT_FOUND));
2024-03-07 10:41:53 +01:00
        }
2024-03-11 10:56:41 +01:00
        // for other error codes, we skip processing here because it's already done by res_error()
2024-03-07 10:41:53 +01:00
    });
2024-01-10 20:56:05 +01:00
// set timeouts and change hostname and port
2024-03-09 10:57:09 +01:00
    svr->set_read_timeout(sparams.read_timeout);
    svr->set_write_timeout(sparams.write_timeout);
2024-01-10 20:56:05 +01:00
2024-03-09 10:57:09 +01:00
    if (!svr->bind_to_port(sparams.hostname, sparams.port)) {
2024-01-10 20:56:05 +01:00
        fprintf(stderr, "\ncouldn't bind to server socket: hostname=%s port=%d\n\n", sparams.hostname.c_str(), sparams.port);
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
        return 1;
    }

    std::unordered_map<std::string, std::string> log_data;

    log_data["hostname"] = sparams.hostname;
    log_data["port"]     = std::to_string(sparams.port);

    if (sparams.api_keys.size() == 1) {
        auto key = sparams.api_keys[0];
        log_data["api_key"] = "api_key: ****" + key.substr(std::max((int)(key.length() - 4), 0));
    } else if (sparams.api_keys.size() > 1) {
        log_data["api_key"] = "api_key: " + std::to_string(sparams.api_keys.size()) + " keys loaded";
    }
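
The masking expression above depends on casting an unsigned difference to `int` before clamping it at zero. A minimal sketch of the same idea with explicit bounds handling follows; `mask_api_key` is a hypothetical helper name used only for illustration and is not defined by the server.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the server): hides all but the last
// four characters of a key, and also copes with keys shorter than four.
static std::string mask_api_key(const std::string & key) {
    const std::size_t n_visible = std::min<std::size_t>(4, key.size());
    return "****" + key.substr(key.size() - n_visible);
}
```

Clamping the visible length first keeps the arithmetic in `std::size_t`, so no signed/unsigned conversion is involved.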

    // load the model
    if (!ctx_server.load_model(params)) {
        state.store(SERVER_STATE_ERROR);
        return 1;
    } else {
        ctx_server.init();
        state.store(SERVER_STATE_READY);
    }

    LOG_INFO("model loaded", {});

    const auto model_meta = ctx_server.model_meta();

    // if a custom chat template is not supplied, we will use the one that comes with the model (if any)
    if (sparams.chat_template.empty()) {
        if (!ctx_server.validate_model_chat_template()) {
            LOG_ERROR("The chat template that comes with this model is not yet supported, falling back to chatml. This may cause the model to output suboptimal responses", {});
            sparams.chat_template = "chatml";
        }
    }

    // print sample chat example to make it clear which template is used
    {
        json chat;
        chat.push_back({{"role", "system"},    {"content", "You are a helpful assistant"}});
        chat.push_back({{"role", "user"},      {"content", "Hello"}});
        chat.push_back({{"role", "assistant"}, {"content", "Hi there"}});
        chat.push_back({{"role", "user"},      {"content", "How are you?"}});

        const std::string chat_example = format_chat(ctx_server.model, sparams.chat_template, chat);

        LOG_INFO("chat template", {
            {"chat_example", chat_example},
            {"built_in",     sparams.chat_template.empty()},
        });
    }

    //
    // Middlewares
    //

    auto middleware_validate_api_key = [&sparams, &res_error](const httplib::Request & req, httplib::Response & res) {
        // TODO: should we apply API key to all endpoints, including "/health" and "/models"?
        static const std::set<std::string> protected_endpoints = {
            "/props",
            "/completion",
            "/completions",
            "/v1/completions",
            "/chat/completions",
            "/v1/chat/completions",
            "/infill",
            "/tokenize",
            "/detokenize",
            "/embedding",
            "/embeddings",
            "/v1/embeddings",
        };

        // If API key is not set, skip validation
        if (sparams.api_keys.empty()) {
            return true;
        }

        // If path is not in protected_endpoints list, skip validation
        if (protected_endpoints.find(req.path) == protected_endpoints.end()) {
            return true;
        }

        // Check for API key in the header
        auto auth_header = req.get_header_value("Authorization");

        std::string prefix = "Bearer ";
        if (auth_header.substr(0, prefix.size()) == prefix) {
            std::string received_api_key = auth_header.substr(prefix.size());
            if (std::find(sparams.api_keys.begin(), sparams.api_keys.end(), received_api_key) != sparams.api_keys.end()) {
                return true; // API key is valid
            }
        }

        // API key is invalid or not provided
        // TODO: make another middleware for CORS related logic
        res.set_header("Access-Control-Allow-Origin", req.get_header_value("Origin"));
        res_error(res, format_error_response("Invalid API Key", ERROR_TYPE_AUTHENTICATION));

        LOG_WARNING("Unauthorized: Invalid API Key", {});

        return false;
    };
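
The header check in the middleware above can be isolated into a small pure function, which makes it easy to exercise without an HTTP server. This is only a sketch: `is_authorized` is a hypothetical name, and it takes the raw `Authorization` header value and the list of accepted keys as plain arguments rather than reading them from `httplib::Request` or `sparams`.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper (not part of the server): returns true when the
// Authorization header value is "Bearer <key>" and <key> is in api_keys.
static bool is_authorized(const std::string & auth_header, const std::vector<std::string> & api_keys) {
    static const std::string prefix = "Bearer ";
    // compare() returns non-zero when the header is shorter than the prefix
    // or does not start with it
    if (auth_header.compare(0, prefix.size(), prefix) != 0) {
        return false;
    }
    const std::string received = auth_header.substr(prefix.size());
    return std::find(api_keys.begin(), api_keys.end(), received) != api_keys.end();
}
```

Note that the trailing space in the prefix matters: without it, the extracted key would start with a space and never match.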

    // register server middlewares
    svr->set_pre_routing_handler([&middleware_validate_api_key](const httplib::Request & req, httplib::Response & res) {
        if (!middleware_validate_api_key(req, res)) {
            return httplib::Server::HandlerResponse::Handled;
        }
        return httplib::Server::HandlerResponse::Unhandled;
    });

    //
    // Route handlers (or controllers)
    //

    const auto handle_health = [&](const httplib::Request & req, httplib::Response & res) {
        server_state current_state = state.load();
        switch (current_state) {
            case SERVER_STATE_READY:
                {
                    // request slots data using task queue
                    server_task task;
                    task.id = ctx_server.queue_tasks.get_new_id();
                    task.type = SERVER_TASK_TYPE_METRICS;
                    task.id_target = -1;

                    ctx_server.queue_results.add_waiting_task_id(task.id);
                    ctx_server.queue_tasks.post(task);

                    // get the result
                    server_task_result result = ctx_server.queue_results.recv(task.id);
                    ctx_server.queue_results.remove_waiting_task_id(task.id);

                    const int n_idle_slots       = result.data["idle"];
                    const int n_processing_slots = result.data["processing"];

                    json health = {
                        {"status",           "ok"},
                        {"slots_idle",       n_idle_slots},
                        {"slots_processing", n_processing_slots}
                    };

                    res.status = 200; // HTTP OK
                    if (sparams.slots_endpoint && req.has_param("include_slots")) {
                        health["slots"] = result.data["slots"];
                    }

                    if (n_idle_slots == 0) {
                        health["status"] = "no slot available";
                        if (req.has_param("fail_on_no_slot")) {
                            res.status = 503; // HTTP Service Unavailable
                        }
                    }

                    res.set_content(health.dump(), "application/json");
                    break;
                }
            case SERVER_STATE_LOADING_MODEL:
                {
                    res_error(res, format_error_response("Loading model", ERROR_TYPE_UNAVAILABLE));
                } break;
            case SERVER_STATE_ERROR:
                {
                    res_error(res, format_error_response("Model failed to load", ERROR_TYPE_SERVER));
                } break;
        }
    };

    const auto handle_slots = [&](const httplib::Request &, httplib::Response & res) {
        if (!sparams.slots_endpoint) {
            res_error(res, format_error_response("This server does not support slots endpoint.", ERROR_TYPE_NOT_SUPPORTED));
            return;
        }

        // request slots data using task queue
        server_task task;
        task.id = ctx_server.queue_tasks.get_new_id();
        task.id_multi  = -1;
        task.id_target = -1;
        task.type = SERVER_TASK_TYPE_METRICS;

        ctx_server.queue_results.add_waiting_task_id(task.id);
        ctx_server.queue_tasks.post(task);

        // get the result
        server_task_result result = ctx_server.queue_results.recv(task.id);
        ctx_server.queue_results.remove_waiting_task_id(task.id);

        res.set_content(result.data["slots"].dump(), "application/json");
        res.status = 200; // HTTP OK
    };

    const auto handle_metrics = [&](const httplib::Request &, httplib::Response & res) {
        if (!sparams.metrics_endpoint) {
            res_error(res, format_error_response("This server does not support metrics endpoint.", ERROR_TYPE_NOT_SUPPORTED));
            return;
        }

        // request slots data using task queue
        server_task task;
        task.id = ctx_server.queue_tasks.get_new_id();
        task.id_multi  = -1;
        task.id_target = -1;
        task.type = SERVER_TASK_TYPE_METRICS;
        task.data.push_back({{"reset_bucket", true}});

        ctx_server.queue_results.add_waiting_task_id(task.id);
        ctx_server.queue_tasks.post(task);

        // get the result
        server_task_result result = ctx_server.queue_results.recv(task.id);
        ctx_server.queue_results.remove_waiting_task_id(task.id);

        json data = result.data;

        const uint64_t n_prompt_tokens_processed = data["n_prompt_tokens_processed"];
        const uint64_t t_prompt_processing       = data["t_prompt_processing"];

        const uint64_t n_tokens_predicted  = data["n_tokens_predicted"];
        const uint64_t t_tokens_generation = data["t_tokens_generation"];

        const int32_t kv_cache_used_cells = data["kv_cache_used_cells"];

        // metrics definition: https://prometheus.io/docs/practices/naming/#metric-names
        json all_metrics_def = json {
            {"counter", {{
                    {"name",  "prompt_tokens_total"},
                    {"help",  "Number of prompt tokens processed."},
                    {"value", (uint64_t) data["n_prompt_tokens_processed_total"]}
            }, {
                    {"name",  "prompt_seconds_total"},
                    {"help",  "Prompt process time"},
                    {"value", (uint64_t) data["t_prompt_processing_total"] / 1.e3}
            }, {
                    {"name",  "tokens_predicted_total"},
                    {"help",  "Number of generation tokens processed."},
                    {"value", (uint64_t) data["n_tokens_predicted_total"]}
            }, {
                    {"name",  "tokens_predicted_seconds_total"},
                    {"help",  "Predict process time"},
                    {"value", (uint64_t) data["t_tokens_generation_total"] / 1.e3}
            }}},
            {"gauge", {{
                    {"name",  "prompt_tokens_seconds"},
                    {"help",  "Average prompt throughput in tokens/s."},
                    {"value", n_prompt_tokens_processed ? 1.e3 / t_prompt_processing * n_prompt_tokens_processed : 0.}
            }, {
                    {"name",  "predicted_tokens_seconds"},
                    {"help",  "Average generation throughput in tokens/s."},
                    {"value", n_tokens_predicted ? 1.e3 / t_tokens_generation * n_tokens_predicted : 0.}
            }, {
                    {"name",  "kv_cache_usage_ratio"},
                    {"help",  "KV-cache usage. 1 means 100 percent usage."},
                    {"value", 1. * kv_cache_used_cells / params.n_ctx}
            }, {
                    {"name",  "kv_cache_tokens"},
                    {"help",  "KV-cache tokens."},
                    {"value", (uint64_t) data["kv_cache_tokens_count"]}
            }, {
                    {"name",  "requests_processing"},
                    {"help",  "Number of request processing."},
                    {"value", (uint64_t) data["processing"]}
            }, {
                    {"name",  "requests_deferred"},
                    {"help",  "Number of request deferred."},
                    {"value", (uint64_t) data["deferred"]}
            }}}
        };

        std::stringstream prometheus;

        for (const auto & el : all_metrics_def.items()) {
            const auto & type        = el.key();
            const auto & metrics_def = el.value();

            for (const auto & metric_def : metrics_def) {
                const std::string name = metric_def["name"];
                const std::string help = metric_def["help"];

                auto value = json_value(metric_def, "value", 0.);

                prometheus << "# HELP llamacpp:" << name << " " << help  << "\n"
                           << "# TYPE llamacpp:" << name << " " << type  << "\n"
                           << "llamacpp:"        << name << " " << value << "\n";
            }
        }

        const int64_t t_start = data["t_start"];
        res.set_header("Process-Start-Time-Unix", std::to_string(t_start));

        res.set_content(prometheus.str(), "text/plain; version=0.0.4");
        res.status = 200; // HTTP OK
    };
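
For reference, the three lines emitted per metric by the loop above can be reproduced in isolation. The sketch below is an assumption-laden illustration: `format_metric` and `tokens_per_second` are hypothetical helper names, and the value is passed as a plain `double` instead of being read from a `json` object.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: renders one metric as HELP, TYPE and sample lines,
// using the same "llamacpp:" name prefix as the handler above.
static std::string format_metric(const std::string & type, const std::string & name,
                                 const std::string & help, double value) {
    std::ostringstream out;
    out << "# HELP llamacpp:" << name << " " << help << "\n"
        << "# TYPE llamacpp:" << name << " " << type << "\n"
        << "llamacpp:"        << name << " " << value << "\n";
    return out.str();
}

// Hypothetical helper: average throughput in tokens/s from a token count
// and an elapsed time in milliseconds; returns 0 when nothing was processed.
static double tokens_per_second(uint64_t n_tokens, uint64_t t_ms) {
    return n_tokens ? 1.e3 / t_ms * n_tokens : 0.;
}
```

Guarding on the token count mirrors the gauges in the handler and avoids reporting a division by zero when no tokens have been processed yet.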

    const auto handle_slots_save = [&ctx_server, &res_error, &sparams](const httplib::Request & req, httplib::Response & res, int id_slot) {
        json request_data = json::parse(req.body);
        std::string filename = request_data["filename"];
        if (!validate_file_name(filename)) {
            res_error(res, format_error_response("Invalid filename", ERROR_TYPE_INVALID_REQUEST));
            return;
        }
        std::string filepath = sparams.slot_save_path + filename;

        server_task task;
        task.type = SERVER_TASK_TYPE_SLOT_SAVE;
        task.data = {
            { "id_slot",  id_slot },
            { "filename", filename },
            { "filepath", filepath }
        };

        const int id_task = ctx_server.queue_tasks.post(task);
        ctx_server.queue_results.add_waiting_task_id(id_task);

        server_task_result result = ctx_server.queue_results.recv(id_task);
        ctx_server.queue_results.remove_waiting_task_id(id_task);

        if (result.error) {
            res_error(res, result.data);
        } else {
            res.set_content(result.data.dump(), "application/json");
        }
    };

    const auto handle_slots_restore = [&ctx_server, &res_error, &sparams](const httplib::Request & req, httplib::Response & res, int id_slot) {
        json request_data = json::parse(req.body);
        std::string filename = request_data["filename"];
        if (!validate_file_name(filename)) {
            res_error(res, format_error_response("Invalid filename", ERROR_TYPE_INVALID_REQUEST));
            return;
        }
        std::string filepath = sparams.slot_save_path + filename;

        server_task task;
        task.type = SERVER_TASK_TYPE_SLOT_RESTORE;
        task.data = {
            { "id_slot",  id_slot },
            { "filename", filename },
            { "filepath", filepath }
        };

        const int id_task = ctx_server.queue_tasks.post(task);
        ctx_server.queue_results.add_waiting_task_id(id_task);

        server_task_result result = ctx_server.queue_results.recv(id_task);
        ctx_server.queue_results.remove_waiting_task_id(id_task);

        if (result.error) {
            res_error(res, result.data);
        } else {
            res.set_content(result.data.dump(), "application/json");
        }
    };

    const auto handle_slots_erase = [&ctx_server, &res_error](const httplib::Request & /* req */, httplib::Response & res, int id_slot) {
        server_task task;
        task.type = SERVER_TASK_TYPE_SLOT_ERASE;
        task.data = {
            { "id_slot", id_slot },
        };

        const int id_task = ctx_server.queue_tasks.post(task);
        ctx_server.queue_results.add_waiting_task_id(id_task);

        server_task_result result = ctx_server.queue_results.recv(id_task);
        ctx_server.queue_results.remove_waiting_task_id(id_task);

        if (result.error) {
            res_error(res, result.data);
        } else {
            res.set_content(result.data.dump(), "application/json");
        }
    };

    const auto handle_slots_action = [&res_error, &handle_slots_save, &handle_slots_restore, &handle_slots_erase](const httplib::Request & req, httplib::Response & res) {
        res.set_header("Access-Control-Allow-Origin", req.get_header_value("Origin"));

        std::string id_slot_str = req.path_params.at("id_slot");
        int id_slot;

        try {
            id_slot = std::stoi(id_slot_str);
        } catch (const std::exception &) {
            res_error(res, format_error_response("Invalid slot ID", ERROR_TYPE_INVALID_REQUEST));
            return;
        }

        std::string action = req.get_param_value("action");

        if (action == "save") {
            handle_slots_save(req, res, id_slot);
        } else if (action == "restore") {
            handle_slots_restore(req, res, id_slot);
        } else if (action == "erase") {
            handle_slots_erase(req, res, id_slot);
        } else {
            res_error(res, format_error_response("Invalid action", ERROR_TYPE_INVALID_REQUEST));
        }
    };

    const auto handle_props = [&ctx_server](const httplib::Request & req, httplib::Response & res) {
        res.set_header("Access-Control-Allow-Origin", req.get_header_value("Origin"));
        json data = {
            { "user_name",                   ctx_server.name_user.c_str() },
            { "assistant_name",              ctx_server.name_assistant.c_str() },
            { "default_generation_settings", ctx_server.default_generation_settings_for_props },
            { "total_slots",                 ctx_server.params.n_parallel }
        };

        res.set_content(data.dump(), "application/json; charset=utf-8");
    };

    const auto handle_completions = [&ctx_server, &res_error](const httplib::Request & req, httplib::Response & res) {
        res.set_header("Access-Control-Allow-Origin", req.get_header_value("Origin"));

        json data = json::parse(req.body);

        const int id_task = ctx_server.queue_tasks.get_new_id();

        ctx_server.queue_results.add_waiting_task_id(id_task);
        ctx_server.request_completion(id_task, -1, data, false, false);

        if (!json_value(data, "stream", false)) {
            server_task_result result = ctx_server.queue_results.recv(id_task);
            if (!result.error && result.stop) {
                res.set_content(result.data.dump(-1, ' ', false, json::error_handler_t::replace), "application/json; charset=utf-8");
            } else {
                res_error(res, result.data);
            }

            ctx_server.queue_results.remove_waiting_task_id(id_task);
        } else {
            const auto chunked_content_provider = [id_task, &ctx_server](size_t, httplib::DataSink & sink) {
                while (true) {
                    server_task_result result = ctx_server.queue_results.recv(id_task);
                    if (!result.error) {
                        const std::string str =
                            "data: " +
                            result.data.dump(-1, ' ', false, json::error_handler_t::replace) +
                            "\n\n";

                        LOG_VERBOSE("data stream", {
                            { "to_send", str }
                        });

                        if (!sink.write(str.c_str(), str.size())) {
                            ctx_server.queue_results.remove_waiting_task_id(id_task);
                            return false;
                        }

                        if (result.stop) {
                            break;
                        }
                    } else {
                        const std::string str =
                            "error: " +
                            result.data.dump(-1, ' ', false, json::error_handler_t::replace) +
                            "\n\n";

                        LOG_VERBOSE("data stream", {
                            { "to_send", str }
                        });
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00

                        if (!sink.write(str.c_str(), str.size())) {
                            ctx_server.queue_results.remove_waiting_task_id(id_task);
                            return false;
                        }

                        break;
                    }
                }

                ctx_server.queue_results.remove_waiting_task_id(id_task);
                sink.done();

                return true;
            };

            auto on_complete = [id_task, &ctx_server](bool) {
                // cancel
                ctx_server.request_cancel(id_task);
                ctx_server.queue_results.remove_waiting_task_id(id_task);
            };

            res.set_chunked_content_provider("text/event-stream", chunked_content_provider, on_complete);
        }
    };

    const auto handle_models = [&params, &model_meta](const httplib::Request & req, httplib::Response & res) {
        res.set_header("Access-Control-Allow-Origin", req.get_header_value("Origin"));

        json models = {
            {"object", "list"},
            {"data", {
                {
                    {"id",       params.model_alias},
                    {"object",   "model"},
                    {"created",  std::time(0)},
                    {"owned_by", "llamacpp"},
                    {"meta",     model_meta}
                },
            }}
        };

        res.set_content(models.dump(), "application/json; charset=utf-8");
    };

    const auto handle_chat_completions = [&ctx_server, &sparams, &res_error](const httplib::Request & req, httplib::Response & res) {
        res.set_header("Access-Control-Allow-Origin", req.get_header_value("Origin"));
        json data = oaicompat_completion_params_parse(ctx_server.model, json::parse(req.body), sparams.chat_template);

        const int id_task = ctx_server.queue_tasks.get_new_id();

        ctx_server.queue_results.add_waiting_task_id(id_task);
        ctx_server.request_completion(id_task, -1, data, false, false);

        const auto completion_id = gen_chatcmplid();
        if (!json_value(data, "stream", false)) {
            server_task_result result = ctx_server.queue_results.recv(id_task);

            if (!result.error && result.stop) {
                json result_oai = format_final_response_oaicompat(data, result.data, completion_id);

                res.set_content(result_oai.dump(-1, ' ', false, json::error_handler_t::replace), "application/json; charset=utf-8");
            } else {
                res_error(res, result.data);
            }
            ctx_server.queue_results.remove_waiting_task_id(id_task);
        } else {
            const auto chunked_content_provider = [id_task, &ctx_server, completion_id](size_t, httplib::DataSink & sink) {
                while (true) {
                    server_task_result result = ctx_server.queue_results.recv(id_task);
                    if (!result.error) {
                        std::vector<json> result_array = format_partial_response_oaicompat(result.data, completion_id);

                        for (auto it = result_array.begin(); it != result_array.end(); ++it) {
                            if (!it->empty()) {
                                const std::string str =
                                    "data: " +
                                    it->dump(-1, ' ', false, json::error_handler_t::replace) +
                                    "\n\n";
                                LOG_VERBOSE("data stream", {{"to_send", str}});
                                if (!sink.write(str.c_str(), str.size())) {
                                    ctx_server.queue_results.remove_waiting_task_id(id_task);
                                    return false;
                                }
                            }
                        }
                        if (result.stop) {
                            break;
                        }
                    } else {
                        const std::string str =
                            "error: " +
                            result.data.dump(-1, ' ', false, json::error_handler_t::replace) +
                            "\n\n";
                        LOG_VERBOSE("data stream", {{"to_send", str}});
                        if (!sink.write(str.c_str(), str.size())) {
                            ctx_server.queue_results.remove_waiting_task_id(id_task);
                            return false;
                        }
                        break;
                    }
                }
                sink.done();
                ctx_server.queue_results.remove_waiting_task_id(id_task);
                return true;
            };

            auto on_complete = [id_task, &ctx_server](bool) {
                // cancel request
                ctx_server.request_cancel(id_task);
                ctx_server.queue_results.remove_waiting_task_id(id_task);
            };

            res.set_chunked_content_provider("text/event-stream", chunked_content_provider, on_complete);
        }
    };
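
The streaming branches above write each chunk as an event-stream frame: a field name, a colon and a space, the serialized payload, and a blank line. The following is a minimal sketch of that framing only, without httplib or the JSON library. `sse_frame` and `sse_split` are hypothetical helpers introduced for illustration and are not part of the server.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: build one frame such as "data: {...}\n\n".
// Assumes the payload is a single line, as compact JSON from dump(-1, ...) would be.
std::string sse_frame(const std::string & field, const std::string & payload) {
    assert(payload.find('\n') == std::string::npos);
    return field + ": " + payload + "\n\n";
}

// Hypothetical helper: split a buffer of concatenated frames on the blank-line
// separator. An incomplete trailing frame (no "\n\n" yet) is left out.
std::vector<std::string> sse_split(const std::string & stream) {
    std::vector<std::string> frames;
    std::size_t pos = 0;
    while (true) {
        const std::size_t end = stream.find("\n\n", pos);
        if (end == std::string::npos) {
            break;
        }
        frames.push_back(stream.substr(pos, end - pos));
        pos = end + 2; // skip the separator
    }
    return frames;
}
```

A client reading the stream can buffer incoming bytes and call something like `sse_split` whenever new data arrives, keeping the unterminated remainder for the next read.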

    const auto handle_infill = [&ctx_server, &res_error](const httplib::Request & req, httplib::Response & res) {
        res.set_header("Access-Control-Allow-Origin", req.get_header_value("Origin"));

        json data = json::parse(req.body);

        const int id_task = ctx_server.queue_tasks.get_new_id();

        ctx_server.queue_results.add_waiting_task_id(id_task);
        ctx_server.request_completion(id_task, -1, data, true, false);

        if (!json_value(data, "stream", false)) {
            server_task_result result = ctx_server.queue_results.recv(id_task);
            if (!result.error && result.stop) {
                res.set_content(result.data.dump(-1, ' ', false, json::error_handler_t::replace), "application/json; charset=utf-8");
            } else {
                res_error(res, result.data);
            }

            ctx_server.queue_results.remove_waiting_task_id(id_task);
        } else {
            const auto chunked_content_provider = [id_task, &ctx_server](size_t, httplib::DataSink & sink) {
                while (true) {
                    server_task_result result = ctx_server.queue_results.recv(id_task);
                    if (!result.error) {
                        const std::string str =
                            "data: " +
                            result.data.dump(-1, ' ', false, json::error_handler_t::replace) +
                            "\n\n";

                        LOG_VERBOSE("data stream", {
                            { "to_send", str }
                        });

                        if (!sink.write(str.c_str(), str.size())) {
                            ctx_server.queue_results.remove_waiting_task_id(id_task);
                            return false;
                        }

                        if (result.stop) {
                            break;
                        }
                    } else {
                        break;
                    }
                }

                ctx_server.queue_results.remove_waiting_task_id(id_task);
                sink.done();

                return true;
            };

            auto on_complete = [id_task, &ctx_server](bool) {
                ctx_server.request_cancel(id_task);
            };

            res.set_chunked_content_provider("text/event-stream", chunked_content_provider, on_complete);
        }
    };

    const auto handle_tokenize = [&ctx_server](const httplib::Request & req, httplib::Response & res) {
        res.set_header("Access-Control-Allow-Origin", req.get_header_value("Origin"));
        const json body = json::parse(req.body);

        std::vector<llama_token> tokens;
        if (body.count("content") != 0) {
            tokens = ctx_server.tokenize(body["content"], false);
        }
        const json data = format_tokenizer_response(tokens);
        return res.set_content(data.dump(), "application/json; charset=utf-8");
    };

    const auto handle_detokenize = [&ctx_server](const httplib::Request & req, httplib::Response & res) {
        res.set_header("Access-Control-Allow-Origin", req.get_header_value("Origin"));
        const json body = json::parse(req.body);

        std::string content;
        if (body.count("tokens") != 0) {
            const std::vector<llama_token> tokens = body["tokens"];
            content = tokens_to_str(ctx_server.ctx, tokens.cbegin(), tokens.cend());
        }

        const json data = format_detokenized_response(content);
        return res.set_content(data.dump(), "application/json; charset=utf-8");
    };

    const auto handle_embeddings = [&params, &ctx_server, &res_error](const httplib::Request & req, httplib::Response & res) {
        res.set_header("Access-Control-Allow-Origin", req.get_header_value("Origin"));
        if (!params.embedding) {
            res.status = 501;
            res.set_content("This server does not support embeddings. Start it with `--embeddings`", "text/plain; charset=utf-8");
            return;
        }

        const json body = json::parse(req.body);
        bool is_openai = false;

        // an input prompt can be a string or a list of tokens (integer)
        json prompt;
        if (body.count("input") != 0) {
            is_openai = true;
            prompt = body["input"];
        } else if (body.count("content") != 0) {
            // with "content", we only support single prompt
            prompt = std::vector<std::string>{body["content"]};
        } else {
            res_error(res, format_error_response("\"input\" or \"content\" must be provided", ERROR_TYPE_INVALID_REQUEST));
            return;
        }

        // create and queue the task
        json responses;
        {
            const int id_task = ctx_server.queue_tasks.get_new_id();
            ctx_server.queue_results.add_waiting_task_id(id_task);
            ctx_server.request_completion(id_task, -1, {{"prompt", prompt}}, false, true);

            // get the result
            server_task_result result = ctx_server.queue_results.recv(id_task);
            ctx_server.queue_results.remove_waiting_task_id(id_task);
            if (!result.error) {
                if (result.data.count("results")) {
                    // result for multi-task
                    responses = result.data["results"];
                } else {
                    // result for single task
                    responses = std::vector<json>{result.data};
                }
            } else {
                // error received, ignore everything else
                res_error(res, result.data);
                return;
            }
        }

        // write JSON response
        json root = is_openai
            ? format_embeddings_response_oaicompat(body, responses)
            : responses[0];
        return res.set_content(root.dump(), "application/json; charset=utf-8");
    };
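
The embeddings handler above accepts two request shapes: an OpenAI-style `input` field, or a legacy `content` field that carries a single prompt, and it rejects a body that has neither. A simplified sketch of that selection logic follows. It assumes prompts are plain strings (the real handler also accepts token lists), and `embed_request`, `embed_plan` and `plan_embeddings` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the parsed JSON request body
struct embed_request {
    bool                     has_input   = false;
    std::vector<std::string> input;            // OpenAI-style "input"
    bool                     has_content = false;
    std::string              content;          // legacy "content"
};

// Hypothetical description of what to queue and how to format the reply
struct embed_plan {
    bool                     is_openai = false;
    std::vector<std::string> prompts;
};

// Returns false when neither field is present (the handler reports an
// invalid-request error in that case); "input" takes precedence.
bool plan_embeddings(const embed_request & req, embed_plan & out) {
    out = embed_plan();
    if (req.has_input) {
        out.is_openai = true;
        out.prompts   = req.input;
        return true;
    }
    if (req.has_content) {
        out.prompts.push_back(req.content);
        assert(out.prompts.size() == 1); // "content" carries a single prompt
        return true;
    }
    return false;
}
```

Keeping the `is_openai` flag next to the prompts mirrors how the handler later picks between the OpenAI-compatible response format and the first raw result.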

    auto handle_static_file = [](unsigned char * content, size_t len, const char * mime_type) {
        return [content, len, mime_type](const httplib::Request &, httplib::Response & res) {
            res.set_content(reinterpret_cast<const char*>(content), len, mime_type);
            return false;
        };
    };
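
`handle_static_file` above is a factory: it takes a buffer, its length and a MIME type, and returns a closure that serves that buffer for every request. A minimal sketch of the same pattern without httplib is shown here. `fake_response` and `make_static_handler` are hypothetical stand-ins, not library types.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical stand-in for httplib::Response
struct fake_response {
    std::string body;
    std::string mime;
};

using handler_fn = std::function<void(fake_response &)>;

// Returns a handler that copies the captured buffer into the response.
// The closure stores raw pointers, so the buffer and the MIME string must
// outlive the handler (true for statically embedded assets).
handler_fn make_static_handler(const unsigned char * content, std::size_t len, const char * mime_type) {
    assert(content != nullptr && mime_type != nullptr);
    return [content, len, mime_type](fake_response & res) {
        res.body.assign(reinterpret_cast<const char *>(content), len);
        res.mime = mime_type;
    };
}
```

One factory call per asset keeps the route table short, since each route only differs in the three captured values.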

    //
    // Router
    //

    // register static assets routes
    if (!sparams.public_path.empty()) {
        // Set the base directory for serving static files
        svr->set_base_dir(sparams.public_path);
    }

    // using embedded static files
    svr->Get("/",              handle_static_file(index_html, index_html_len, "text/html; charset=utf-8"));
    svr->Get("/index.js",      handle_static_file(index_js, index_js_len, "text/javascript; charset=utf-8"));
    svr->Get("/completion.js", handle_static_file(completion_js, completion_js_len, "text/javascript; charset=utf-8"));
    svr->Get("/json-schema-to-grammar.mjs", handle_static_file(
        json_schema_to_grammar_mjs, json_schema_to_grammar_mjs_len, "text/javascript; charset=utf-8"));

    // register API routes
    svr->Get ("/health",              handle_health);
    svr->Get ("/slots",               handle_slots);
    svr->Get ("/metrics",             handle_metrics);
    svr->Get ("/props",               handle_props);
    svr->Get ("/v1/models",           handle_models);
    svr->Post("/completion",          handle_completions); // legacy
    svr->Post("/completions",         handle_completions);
    svr->Post("/v1/completions",      handle_completions);
    svr->Post("/chat/completions",    handle_chat_completions);
    svr->Post("/v1/chat/completions", handle_chat_completions);
    svr->Post("/infill",              handle_infill);
    svr->Post("/embedding",           handle_embeddings); // legacy
    svr->Post("/embeddings",          handle_embeddings);
    svr->Post("/v1/embeddings",       handle_embeddings);
    svr->Post("/tokenize",            handle_tokenize);
    svr->Post("/detokenize",          handle_detokenize);
    if (!sparams.slot_save_path.empty()) {
        // only enable slot endpoints if slot_save_path is set
        svr->Post("/slots/:id_slot",  handle_slots_action);
    }

    //
    // Start the server
    //
    if (sparams.n_threads_http < 1) {
        // +2 threads for monitoring endpoints
        sparams.n_threads_http = std::max(params.n_parallel + 2, (int32_t) std::thread::hardware_concurrency() - 1);
    }
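
The default above picks the larger of `n_parallel + 2` and one less than the hardware thread count, and only applies when no positive value was configured. The arithmetic can be sketched as a standalone function. `resolve_http_threads` is a hypothetical name, and the hardware thread count is passed in as a parameter so the result does not depend on the machine.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the default shown above.
// hw_threads plays the role of std::thread::hardware_concurrency(),
// which may legitimately report 0 when the value is not computable.
int32_t resolve_http_threads(int32_t requested, int32_t n_parallel, unsigned int hw_threads) {
    assert(n_parallel >= 0);
    if (requested >= 1) {
        return requested; // an explicit setting always wins
    }
    // +2 threads for monitoring endpoints
    return std::max(n_parallel + 2, (int32_t) hw_threads - 1);
}
```

With 4 parallel slots on an 8-thread machine this yields 7; with 8 slots on a 4-thread machine the slot-based term dominates and it yields 10.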
    log_data["n_threads_http"] = std::to_string(sparams.n_threads_http);
    svr->new_task_queue = [&sparams] { return new httplib::ThreadPool(sparams.n_threads_http); };

    LOG_INFO("HTTP server listening", log_data);

    // run the HTTP server in a thread - see comment below
    std::thread t([&]() {
        if (!svr->listen_after_bind()) {
            state.store(SERVER_STATE_ERROR);
            return 1;
        }

        return 0;
    });

    ctx_server.queue_tasks.on_new_task(std::bind(
        &server_context::process_single_task, &ctx_server, std::placeholders::_1));
    ctx_server.queue_tasks.on_finish_multitask(std::bind(
        &server_context::on_finish_multitask, &ctx_server, std::placeholders::_1));
    ctx_server.queue_tasks.on_update_slots(std::bind(
        &server_context::update_slots, &ctx_server));
    ctx_server.queue_results.on_multitask_update(std::bind(
        &server_queue::update_multitask,
        &ctx_server.queue_tasks,
        std::placeholders::_1,
        std::placeholders::_2,
        std::placeholders::_3
    ));
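
The registrations above use `std::bind` to turn member functions into plain callbacks: the object pointer is fixed at bind time and `std::placeholders::_1` forwards the argument supplied later by the queue. A self-contained sketch of that wiring follows. `demo_context` and `demo_queue` are hypothetical types that only imitate the shape of the real ones.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical receiver whose member functions become callbacks
struct demo_context {
    std::vector<int> seen;
    int              updates = 0;

    void process_single_task(int id) { seen.push_back(id); }
    void update_slots()              { ++updates; }
};

// Hypothetical queue that stores callbacks and invokes them per task
struct demo_queue {
    std::function<void(int)> callback_new_task;
    std::function<void()>    callback_update_slots;

    void on_new_task(std::function<void(int)> cb) { callback_new_task     = std::move(cb); }
    void on_update_slots(std::function<void()> cb) { callback_update_slots = std::move(cb); }

    // Deliver one task, then run the slot update
    void post(int id) {
        assert(static_cast<bool>(callback_new_task));
        assert(static_cast<bool>(callback_update_slots));
        callback_new_task(id);
        callback_update_slots();
    }
};
```

A lambda such as `[&ctx](int id) { ctx.process_single_task(id); }` would express the same binding; `std::bind` is used here only because the code above does.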

    shutdown_handler = [&](int) {
        ctx_server.queue_tasks.terminate();
    };

#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__))
    struct sigaction sigint_action;
    sigint_action.sa_handler = signal_handler;
    sigemptyset (&sigint_action.sa_mask);
    sigint_action.sa_flags = 0;
    sigaction(SIGINT, &sigint_action, NULL);
    sigaction(SIGTERM, &sigint_action, NULL);
#elif defined (_WIN32)
    auto console_ctrl_handler = +[](DWORD ctrl_type) -> BOOL {
        return (ctrl_type == CTRL_C_EVENT) ? (signal_handler(SIGINT), true) : false;
    };
    SetConsoleCtrlHandler(reinterpret_cast<PHANDLER_ROUTINE>(console_ctrl_handler), true);
#endif
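
The block above routes SIGINT and SIGTERM through a plain function (`signal_handler`) that forwards to a replaceable `shutdown_handler` callable. A portable sketch of that indirection using `std::signal` instead of `sigaction` is given below. The names `g_shutdown_handler`, `on_signal` and `install_shutdown` are hypothetical, and invoking a `std::function` from a signal handler is not strictly async-signal-safe, so this should be read as an illustration of the wiring rather than a hardened implementation.

```cpp
#include <cassert>
#include <csignal>
#include <functional>
#include <utility>

// Hypothetical global slot for the shutdown callback; a signal handler must be
// a plain function, so it reaches the callable through this global.
std::function<void(int)> g_shutdown_handler;

// Plain function registered with the C runtime; forwards to the callable
extern "C" void on_signal(int sig) {
    if (g_shutdown_handler) {
        g_shutdown_handler(sig);
    }
}

// Store the callback first, then register the trampoline for both signals
void install_shutdown(std::function<void(int)> fn) {
    assert(static_cast<bool>(fn));
    g_shutdown_handler = std::move(fn);
    std::signal(SIGINT,  on_signal);
    std::signal(SIGTERM, on_signal);
}
```

Storing the callback before registering the handler avoids a window in which a signal could arrive while the slot is still empty.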

    ctx_server.queue_tasks.start_loop();

    svr->stop();
    t.join();

    llama_backend_free();

    return 0;
}