2023-05-21 19:51:18 +02:00
# include "common.h"
# include "llama.h"
2023-10-22 21:53:08 +02:00
# include "grammar-parser.h"
2024-01-26 13:42:20 +01:00
# include "utils.hpp"
# include "oai.hpp"
2023-10-22 21:53:08 +02:00
# include "../llava/clip.h"
2024-02-20 20:07:22 +01:00
# include "../llava/llava.h"
2023-10-22 21:53:08 +02:00
# include "stb_image.h"
2023-05-21 19:51:18 +02:00
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
# ifndef NDEBUG
// crash the server in debug mode, otherwise send an http 500 error
# define CPPHTTPLIB_NO_EXCEPTIONS 1
# endif
2023-12-17 15:54:37 +01:00
// increase max payload length to allow use of larger context size
# define CPPHTTPLIB_FORM_URL_ENCODED_PAYLOAD_MAX_LENGTH 1048576
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
# include "httplib.h"
# include "json.hpp"
2023-07-04 16:05:27 +02:00
// auto generated files (update with ./deps.sh)
# include "index.html.hpp"
# include "index.js.hpp"
# include "completion.js.hpp"
2023-08-15 00:14:14 +02:00
# include "json-schema-to-grammar.mjs.hpp"
2023-07-04 16:05:27 +02:00
2023-09-01 15:34:50 +02:00
# include <cstddef>
2023-10-22 21:53:08 +02:00
# include <thread>
# include <chrono>
2023-12-29 15:24:12 +01:00
# include <condition_variable>
2024-01-10 20:56:05 +01:00
# include <atomic>
2024-02-18 17:23:16 +01:00
# include <signal.h>
2023-09-01 15:34:50 +02:00
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
using json = nlohmann : : json ;
2024-02-29 21:42:11 +01:00
struct server_params {
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
std : : string hostname = " 127.0.0.1 " ;
2024-01-11 18:51:17 +01:00
std : : vector < std : : string > api_keys ;
2023-07-04 16:05:27 +02:00
std : : string public_path = " examples/server/public " ;
2024-02-20 15:58:27 +01:00
std : : string chat_template = " " ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
int32_t port = 8080 ;
int32_t read_timeout = 600 ;
int32_t write_timeout = 600 ;
2024-02-18 18:39:57 +01:00
bool slots_endpoint = true ;
2024-02-25 13:49:43 +01:00
bool metrics_endpoint = false ;
2024-03-01 10:08:08 +01:00
int n_threads_http = - 1 ;
2023-05-21 19:51:18 +02:00
} ;
2024-01-26 13:42:20 +01:00
bool server_verbose = false ;
2024-02-25 13:50:32 +01:00
bool server_log_json = true ;
2023-07-02 23:38:44 +02:00
2024-02-29 21:42:11 +01:00
enum stop_type {
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
STOP_FULL ,
STOP_PARTIAL ,
} ;
2024-02-29 21:42:11 +01:00
// TODO: can become bool if we can't find use of more states
enum slot_state {
IDLE ,
PROCESSING ,
} ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2024-02-29 21:42:11 +01:00
enum slot_command {
NONE ,
LOAD_PROMPT ,
RELEASE ,
} ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2024-02-29 21:42:11 +01:00
struct slot_params {
bool stream = true ;
bool cache_prompt = false ; // remember the prompt to avoid reprocessing all prompt
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2024-02-29 21:42:11 +01:00
uint32_t seed = - 1 ; // RNG seed
int32_t n_keep = 0 ; // number of tokens to keep from initial prompt
int32_t n_predict = - 1 ; // new tokens to predict
2023-07-02 23:38:44 +02:00
2024-02-29 21:42:11 +01:00
std : : vector < std : : string > antiprompt ;
2023-07-02 23:38:44 +02:00
2024-02-29 21:42:11 +01:00
json input_prefix ;
json input_suffix ;
} ;
struct slot_image {
int32_t id ;
bool request_encode_image = false ;
float * image_embedding = nullptr ;
int32_t image_tokens = 0 ;
clip_image_u8 * img_data ;
std : : string prefix_prompt ; // before of this image
} ;
struct server_slot {
2023-10-22 21:53:08 +02:00
int id ;
int task_id = - 1 ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2023-10-22 21:53:08 +02:00
struct slot_params params ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2023-10-22 21:53:08 +02:00
slot_state state = IDLE ;
slot_command command = NONE ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2023-10-22 21:53:08 +02:00
// used to determine the slot that has been used the longest
int64_t t_last_used = - 1 ;
2023-10-20 20:07:23 +02:00
2023-10-22 21:53:08 +02:00
// generation props
int32_t n_ctx = 0 ; // context size per slot
int32_t n_past = 0 ;
int32_t n_decoded = 0 ;
int32_t n_remaining = - 1 ;
int32_t i_batch = - 1 ;
2024-02-18 17:30:09 +01:00
int32_t n_predict = - 1 ;
2023-10-20 20:07:23 +02:00
2024-02-29 21:42:11 +01:00
int32_t n_prompt_tokens = 0 ;
int32_t n_prompt_tokens_processed = 0 ;
2023-10-22 21:53:08 +02:00
json prompt ;
std : : string generated_text ;
llama_token sampled ;
std : : vector < llama_token > cache_tokens ;
std : : vector < completion_token_output > generated_token_probs ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2023-10-22 21:53:08 +02:00
bool infill = false ;
2023-11-01 10:28:28 +01:00
bool embedding = false ;
2023-10-22 21:53:08 +02:00
bool has_next_token = true ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
bool truncated = false ;
bool stopped_eos = false ;
bool stopped_word = false ;
bool stopped_limit = false ;
2023-10-22 21:53:08 +02:00
2023-11-25 10:29:06 +01:00
bool oaicompat = false ;
std : : string oaicompat_model ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
std : : string stopping_word ;
2023-10-22 21:53:08 +02:00
// sampling
struct llama_sampling_params sparams ;
llama_sampling_context * ctx_sampling = nullptr ;
2024-01-27 14:38:05 +01:00
int32_t ga_i = 0 ; // group-attention state
2024-01-30 19:17:30 +01:00
int32_t ga_n = 1 ; // group-attention factor
2024-01-27 14:38:05 +01:00
int32_t ga_w = 512 ; // group-attention width
int32_t n_past_se = 0 ; // self-extend
2023-10-22 21:53:08 +02:00
// multimodal
std : : vector < slot_image > images ;
// stats
2024-02-29 21:42:11 +01:00
size_t n_sent_text = 0 ; // number of sent text character
size_t n_sent_token_probs = 0 ;
2023-10-22 21:53:08 +02:00
int64_t t_start_process_prompt ;
int64_t t_start_genereration ;
double t_prompt_processing ; // ms
double t_token_generation ; // ms
2023-11-30 23:25:04 +01:00
// multitasks
int multitask_id = - 1 ;
2023-10-22 21:53:08 +02:00
void reset ( ) {
2024-02-29 21:42:11 +01:00
n_prompt_tokens = 0 ;
2023-10-22 21:53:08 +02:00
generated_text = " " ;
truncated = false ;
stopped_eos = false ;
stopped_word = false ;
stopped_limit = false ;
stopping_word = " " ;
n_past = 0 ;
2024-02-29 21:42:11 +01:00
n_sent_text = 0 ;
n_sent_token_probs = 0 ;
2023-10-22 21:53:08 +02:00
infill = false ;
2024-01-27 14:38:05 +01:00
ga_i = 0 ;
2024-01-30 19:17:30 +01:00
n_past_se = 0 ;
2023-10-22 21:53:08 +02:00
generated_token_probs . clear ( ) ;
2024-02-29 21:42:11 +01:00
for ( slot_image & img : images ) {
2023-10-22 21:53:08 +02:00
free ( img . image_embedding ) ;
2023-12-30 22:24:42 +01:00
if ( img . img_data ) {
clip_image_u8_free ( img . img_data ) ;
}
2023-10-22 21:53:08 +02:00
img . prefix_prompt = " " ;
}
images . clear ( ) ;
}
bool has_budget ( gpt_params & global_params ) {
2024-02-29 21:42:11 +01:00
if ( params . n_predict = = - 1 & & global_params . n_predict = = - 1 ) {
2024-01-07 07:45:26 +01:00
return true ; // limitless
}
2023-10-22 21:53:08 +02:00
n_remaining = - 1 ;
2024-01-07 07:45:26 +01:00
2024-02-29 21:42:11 +01:00
if ( params . n_predict ! = - 1 ) {
2023-10-22 21:53:08 +02:00
n_remaining = params . n_predict - n_decoded ;
2024-02-29 21:42:11 +01:00
} else if ( global_params . n_predict ! = - 1 ) {
2023-10-22 21:53:08 +02:00
n_remaining = global_params . n_predict - n_decoded ;
}
2024-01-07 07:45:26 +01:00
return n_remaining > 0 ; // no budget
2023-10-22 21:53:08 +02:00
}
bool available ( ) const {
return state = = IDLE & & command = = NONE ;
}
bool is_processing ( ) const {
return ( state = = IDLE & & command = = LOAD_PROMPT ) | | state = = PROCESSING ;
}
void add_token_string ( const completion_token_output & token ) {
2024-02-29 21:42:11 +01:00
if ( command = = RELEASE ) {
2023-10-22 21:53:08 +02:00
return ;
}
cache_tokens . push_back ( token . tok ) ;
generated_token_probs . push_back ( token ) ;
}
void release ( ) {
2024-01-26 13:42:20 +01:00
if ( state = = PROCESSING )
2023-10-22 21:53:08 +02:00
{
t_token_generation = ( ggml_time_us ( ) - t_start_genereration ) / 1e3 ;
command = RELEASE ;
}
}
json get_formated_timings ( ) {
return json
{
2024-02-29 21:42:11 +01:00
{ " prompt_n " , n_prompt_tokens_processed } ,
2023-10-22 21:53:08 +02:00
{ " prompt_ms " , t_prompt_processing } ,
2024-02-29 21:42:11 +01:00
{ " prompt_per_token_ms " , t_prompt_processing / n_prompt_tokens_processed } ,
{ " prompt_per_second " , 1e3 / t_prompt_processing * n_prompt_tokens_processed } ,
2023-10-22 21:53:08 +02:00
{ " predicted_n " , n_decoded } ,
{ " predicted_ms " , t_token_generation } ,
{ " predicted_per_token_ms " , t_token_generation / n_decoded } ,
{ " predicted_per_second " , 1e3 / t_token_generation * n_decoded } ,
} ;
}
2023-11-25 10:29:06 +01:00
void print_timings ( ) const {
2024-02-25 13:50:32 +01:00
char buffer [ 512 ] ;
2024-02-29 21:42:11 +01:00
double t_token = t_prompt_processing / n_prompt_tokens_processed ;
double n_tokens_second = 1e3 / t_prompt_processing * n_prompt_tokens_processed ;
2024-02-25 13:50:32 +01:00
sprintf ( buffer , " prompt eval time = %10.2f ms / %5d tokens (%8.2f ms per token, %8.2f tokens per second) " ,
2024-02-29 21:42:11 +01:00
t_prompt_processing , n_prompt_tokens_processed ,
2024-02-25 13:50:32 +01:00
t_token , n_tokens_second ) ;
LOG_INFO ( buffer , {
2024-02-29 21:42:11 +01:00
{ " slot_id " , id } ,
{ " task_id " , task_id } ,
{ " t_prompt_processing " , t_prompt_processing } ,
{ " n_prompt_tokens_processed " , n_prompt_tokens_processed } ,
{ " t_token " , t_token } ,
{ " n_tokens_second " , n_tokens_second } ,
2024-02-25 13:50:32 +01:00
} ) ;
t_token = t_token_generation / n_decoded ;
n_tokens_second = 1e3 / t_token_generation * n_decoded ;
sprintf ( buffer , " generation eval time = %10.2f ms / %5d runs (%8.2f ms per token, %8.2f tokens per second) " ,
t_token_generation , n_decoded ,
t_token , n_tokens_second ) ;
LOG_INFO ( buffer , {
{ " slot_id " , id } ,
{ " task_id " , task_id } ,
{ " t_token_generation " , t_token_generation } ,
{ " n_decoded " , n_decoded } ,
{ " t_token " , t_token } ,
{ " n_tokens_second " , n_tokens_second } ,
} ) ;
sprintf ( buffer , " total time = %10.2f ms " , t_prompt_processing + t_token_generation ) ;
LOG_INFO ( buffer , {
{ " slot_id " , id } ,
{ " task_id " , task_id } ,
{ " t_prompt_processing " , t_prompt_processing } ,
{ " t_token_generation " , t_token_generation } ,
{ " t_total " , t_prompt_processing + t_token_generation } ,
} ) ;
2023-07-04 16:05:27 +02:00
}
2023-10-22 21:53:08 +02:00
} ;
2024-02-29 21:42:11 +01:00
struct server_metrics {
2024-02-25 13:49:43 +01:00
uint64_t n_prompt_tokens_processed_total = 0 ;
uint64_t n_tokens_predicted_total = 0 ;
uint64_t n_prompt_tokens_processed = 0 ;
uint64_t t_prompt_processing = 0 ;
uint64_t n_tokens_predicted = 0 ;
uint64_t t_tokens_generation = 0 ;
2024-02-29 21:42:11 +01:00
void on_prompt_eval ( const server_slot & slot ) {
n_prompt_tokens_processed_total + = slot . n_prompt_tokens_processed ;
n_prompt_tokens_processed + = slot . n_prompt_tokens_processed ;
t_prompt_processing + = slot . t_prompt_processing ;
2024-02-25 13:49:43 +01:00
}
2024-02-29 21:42:11 +01:00
void on_prediction ( const server_slot & slot ) {
2024-02-25 13:49:43 +01:00
n_tokens_predicted_total + = slot . n_decoded ;
2024-02-29 21:42:11 +01:00
n_tokens_predicted + = slot . n_decoded ;
t_tokens_generation + = slot . t_token_generation ;
2024-02-25 13:49:43 +01:00
}
void reset_bucket ( ) {
n_prompt_tokens_processed = 0 ;
t_prompt_processing = 0 ;
n_tokens_predicted = 0 ;
t_tokens_generation = 0 ;
}
} ;
2023-10-22 21:53:08 +02:00
struct llama_server_context
{
llama_model * model = nullptr ;
llama_context * ctx = nullptr ;
clip_ctx * clp_ctx = nullptr ;
gpt_params params ;
llama_batch batch ;
bool multimodal = false ;
bool clean_kv_cache = true ;
bool all_slots_are_idle = false ;
2023-11-17 03:14:37 +01:00
bool add_bos_token = true ;
2023-10-22 21:53:08 +02:00
int32_t n_ctx ; // total context for all clients / slots
// system prompt
bool system_need_update = false ;
std : : string system_prompt ;
std : : vector < llama_token > system_tokens ;
std : : string name_user ; // this should be the antiprompt
std : : string name_assistant ;
// slots / clients
2024-02-29 21:42:11 +01:00
std : : vector < server_slot > slots ;
2024-02-05 09:10:22 +01:00
json default_generation_settings_for_props ;
2023-10-22 21:53:08 +02:00
2024-02-29 21:42:11 +01:00
llama_server_queue queue_tasks ;
2024-01-26 13:42:20 +01:00
llama_server_response queue_results ;
2023-07-04 16:05:27 +02:00
2024-02-29 21:42:11 +01:00
server_metrics metrics ;
2024-02-25 13:49:43 +01:00
2023-07-05 22:51:13 +02:00
~ llama_server_context ( )
{
if ( ctx )
{
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
llama_free ( ctx ) ;
ctx = nullptr ;
2023-05-21 19:51:18 +02:00
}
2023-07-05 22:51:13 +02:00
if ( model )
{
2023-06-24 10:47:58 +02:00
llama_free_model ( model ) ;
model = nullptr ;
}
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
}
2023-10-22 21:53:08 +02:00
bool load_model ( const gpt_params & params_ )
2023-07-05 22:51:13 +02:00
{
2023-10-22 21:53:08 +02:00
params = params_ ;
if ( ! params . mmproj . empty ( ) ) {
multimodal = true ;
2024-02-25 13:50:32 +01:00
LOG_INFO ( " Multi Modal Mode Enabled " , { } ) ;
2023-10-22 21:53:08 +02:00
clp_ctx = clip_model_load ( params . mmproj . c_str ( ) , /*verbosity=*/ 1 ) ;
if ( clp_ctx = = nullptr ) {
LOG_ERROR ( " unable to load clip model " , { { " model " , params . mmproj } } ) ;
return false ;
}
2023-08-10 12:16:38 +02:00
2023-10-22 21:53:08 +02:00
if ( params . n_ctx < 2048 ) { // request larger context for the image embedding
params . n_ctx = 2048 ;
}
2023-08-10 12:16:38 +02:00
}
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2023-06-24 10:47:58 +02:00
std : : tie ( model , ctx ) = llama_init_from_gpt_params ( params ) ;
2023-07-05 22:51:13 +02:00
if ( model = = nullptr )
{
2023-10-22 21:53:08 +02:00
LOG_ERROR ( " unable to load model " , { { " model " , params . model } } ) ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
return false ;
2023-05-21 19:51:18 +02:00
}
2023-10-22 21:53:08 +02:00
if ( multimodal ) {
const int n_embd_clip = clip_n_mmproj_embd ( clp_ctx ) ;
const int n_embd_llm = llama_n_embd ( model ) ;
if ( n_embd_clip ! = n_embd_llm ) {
LOG_TEE ( " %s: embedding dim of the multimodal projector (%d) is not equal to that of LLaMA (%d). Make sure that you use the correct mmproj file. \n " , __func__ , n_embd_clip , n_embd_llm ) ;
llama_free ( ctx ) ;
llama_free_model ( model ) ;
return false ;
}
}
2023-09-28 21:42:38 +02:00
n_ctx = llama_n_ctx ( ctx ) ;
2023-10-22 21:53:08 +02:00
2023-11-17 03:14:37 +01:00
add_bos_token = llama_should_add_bos_token ( model ) ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
return true ;
2023-05-21 19:51:18 +02:00
}
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2024-02-22 09:33:24 +01:00
void validate_model_chat_template ( server_params & sparams ) {
llama_chat_message chat [ ] = { { " user " , " test " } } ;
std : : vector < char > buf ( 1 ) ;
int res = llama_chat_apply_template ( model , nullptr , chat , 1 , true , buf . data ( ) , buf . size ( ) ) ;
if ( res < 0 ) {
LOG_ERROR ( " The chat template comes with this model is not yet supported, falling back to chatml. This may cause the model to output suboptimal responses " , { } ) ;
sparams . chat_template = " <|im_start|> " ; // llama_chat_apply_template only checks if <|im_start|> exist in the template
}
}
2023-10-22 21:53:08 +02:00
void initialize ( ) {
// create slots
all_slots_are_idle = true ;
const int32_t n_ctx_slot = n_ctx / params . n_parallel ;
2024-02-25 13:50:32 +01:00
LOG_INFO ( " initializing slots " , { { " n_slots " , params . n_parallel } } ) ;
2023-10-22 21:53:08 +02:00
for ( int i = 0 ; i < params . n_parallel ; i + + )
{
2024-02-29 21:42:11 +01:00
server_slot slot ;
2023-10-22 21:53:08 +02:00
slot . id = i ;
slot . n_ctx = n_ctx_slot ;
2024-02-18 17:30:09 +01:00
slot . n_predict = params . n_predict ;
2023-10-22 21:53:08 +02:00
2024-02-25 13:50:32 +01:00
LOG_INFO ( " new slot " , {
{ " slot_id " , slot . id } ,
{ " n_ctx_slot " , slot . n_ctx }
} ) ;
2024-01-27 14:38:05 +01:00
const int ga_n = params . grp_attn_n ;
const int ga_w = params . grp_attn_w ;
if ( ga_n ! = 1 ) {
GGML_ASSERT ( ga_n > 0 & & " ga_n must be positive " ) ; // NOLINT
GGML_ASSERT ( ga_w % ga_n = = 0 & & " ga_w must be a multiple of ga_n " ) ; // NOLINT
//GGML_ASSERT(n_ctx_train % ga_w == 0 && "n_ctx_train must be a multiple of ga_w"); // NOLINT
//GGML_ASSERT(n_ctx >= n_ctx_train * ga_n && "n_ctx must be at least n_ctx_train * ga_n"); // NOLINT
2024-02-25 13:50:32 +01:00
LOG_INFO ( " slot self-extend " , {
{ " slot_id " , slot . id } ,
{ " ga_n " , ga_n } ,
{ " ga_w " , ga_w }
} ) ;
2024-01-27 14:38:05 +01:00
}
slot . ga_i = 0 ;
slot . ga_n = ga_n ;
slot . ga_w = ga_w ;
slot . reset ( ) ;
2023-10-22 21:53:08 +02:00
slots . push_back ( slot ) ;
}
2024-02-05 09:10:22 +01:00
default_generation_settings_for_props = get_formated_generation ( slots . front ( ) ) ;
default_generation_settings_for_props [ " seed " ] = - 1 ;
2023-10-22 21:53:08 +02:00
batch = llama_batch_init ( n_ctx , 0 , params . n_parallel ) ;
}
2023-09-07 19:22:29 +02:00
std : : vector < llama_token > tokenize ( const json & json_prompt , bool add_bos ) const
2023-08-23 09:12:12 +02:00
{
2023-11-25 10:29:06 +01:00
// TODO: currently, we tokenize using special tokens by default
// this is not always correct (see https://github.com/ggerganov/llama.cpp/pull/4160#issuecomment-1824826216)
// but it's better compared to completely ignoring ChatML and other chat templates
const bool TMP_FORCE_SPECIAL = true ;
2023-08-23 09:12:12 +02:00
// If `add_bos` is true, we only add BOS, when json_prompt is a string,
// or the first element of the json_prompt array is a string.
std : : vector < llama_token > prompt_tokens ;
if ( json_prompt . is_array ( ) )
{
bool first = true ;
for ( const auto & p : json_prompt )
{
if ( p . is_string ( ) )
{
auto s = p . template get < std : : string > ( ) ;
std : : vector < llama_token > p ;
if ( first )
{
2023-11-25 10:29:06 +01:00
p = : : llama_tokenize ( ctx , s , add_bos , TMP_FORCE_SPECIAL ) ;
2023-08-23 09:12:12 +02:00
first = false ;
}
else
{
2023-11-25 10:29:06 +01:00
p = : : llama_tokenize ( ctx , s , false , TMP_FORCE_SPECIAL ) ;
2023-08-23 09:12:12 +02:00
}
prompt_tokens . insert ( prompt_tokens . end ( ) , p . begin ( ) , p . end ( ) ) ;
}
else
{
if ( first )
{
first = false ;
}
prompt_tokens . push_back ( p . template get < llama_token > ( ) ) ;
}
}
}
else
{
auto s = json_prompt . template get < std : : string > ( ) ;
2023-11-25 10:29:06 +01:00
prompt_tokens = : : llama_tokenize ( ctx , s , add_bos , TMP_FORCE_SPECIAL ) ;
2023-08-23 09:12:12 +02:00
}
return prompt_tokens ;
}
2024-02-29 21:42:11 +01:00
server_slot * get_slot ( int id ) {
2023-10-22 21:53:08 +02:00
int64_t t_last = ggml_time_us ( ) ;
2024-02-29 21:42:11 +01:00
server_slot * last_used = nullptr ;
2023-10-20 20:07:23 +02:00
2024-02-29 21:42:11 +01:00
for ( server_slot & slot : slots )
2023-10-22 21:53:08 +02:00
{
if ( slot . id = = id & & slot . available ( ) )
{
return & slot ;
}
2023-10-20 20:07:23 +02:00
2023-10-22 21:53:08 +02:00
if ( slot . available ( ) & & slot . t_last_used < t_last )
{
last_used = & slot ;
t_last = slot . t_last_used ;
}
}
2023-10-20 20:07:23 +02:00
2023-10-22 21:53:08 +02:00
return last_used ;
2023-08-08 15:29:19 +02:00
}
2024-02-29 21:42:11 +01:00
bool launch_slot_with_data ( server_slot * & slot , json data ) {
2023-10-22 21:53:08 +02:00
slot_params default_params ;
llama_sampling_params default_sparams ;
2023-11-25 10:29:06 +01:00
if ( data . count ( " __oaicompat " ) ! = 0 ) {
slot - > oaicompat = true ;
slot - > oaicompat_model = json_value ( data , " model " , std : : string ( DEFAULT_OAICOMPAT_MODEL ) ) ;
} else {
slot - > oaicompat = false ;
slot - > oaicompat_model = " " ;
}
2024-02-06 10:20:00 +01:00
slot - > params . stream = json_value ( data , " stream " , false ) ;
slot - > params . cache_prompt = json_value ( data , " cache_prompt " , false ) ;
slot - > params . n_predict = json_value ( data , " n_predict " , default_params . n_predict ) ;
slot - > sparams . top_k = json_value ( data , " top_k " , default_sparams . top_k ) ;
slot - > sparams . top_p = json_value ( data , " top_p " , default_sparams . top_p ) ;
slot - > sparams . min_p = json_value ( data , " min_p " , default_sparams . min_p ) ;
slot - > sparams . tfs_z = json_value ( data , " tfs_z " , default_sparams . tfs_z ) ;
slot - > sparams . typical_p = json_value ( data , " typical_p " , default_sparams . typical_p ) ;
slot - > sparams . temp = json_value ( data , " temperature " , default_sparams . temp ) ;
slot - > sparams . dynatemp_range = json_value ( data , " dynatemp_range " , default_sparams . dynatemp_range ) ;
slot - > sparams . dynatemp_exponent = json_value ( data , " dynatemp_exponent " , default_sparams . dynatemp_exponent ) ;
slot - > sparams . penalty_last_n = json_value ( data , " repeat_last_n " , default_sparams . penalty_last_n ) ;
slot - > sparams . penalty_repeat = json_value ( data , " repeat_penalty " , default_sparams . penalty_repeat ) ;
slot - > sparams . penalty_freq = json_value ( data , " frequency_penalty " , default_sparams . penalty_freq ) ;
slot - > sparams . penalty_present = json_value ( data , " presence_penalty " , default_sparams . penalty_present ) ;
slot - > sparams . mirostat = json_value ( data , " mirostat " , default_sparams . mirostat ) ;
slot - > sparams . mirostat_tau = json_value ( data , " mirostat_tau " , default_sparams . mirostat_tau ) ;
slot - > sparams . mirostat_eta = json_value ( data , " mirostat_eta " , default_sparams . mirostat_eta ) ;
slot - > sparams . penalize_nl = json_value ( data , " penalize_nl " , default_sparams . penalize_nl ) ;
slot - > params . n_keep = json_value ( data , " n_keep " , slot - > params . n_keep ) ;
slot - > params . seed = json_value ( data , " seed " , default_params . seed ) ;
slot - > sparams . grammar = json_value ( data , " grammar " , default_sparams . grammar ) ;
slot - > sparams . n_probs = json_value ( data , " n_probs " , default_sparams . n_probs ) ;
2024-02-18 20:11:16 +01:00
slot - > sparams . min_keep = json_value ( data , " min_keep " , default_sparams . min_keep ) ;
2023-10-22 21:53:08 +02:00
2024-02-18 17:30:09 +01:00
if ( slot - > n_predict > 0 & & slot - > params . n_predict > slot - > n_predict ) {
// Might be better to reject the request with a 400 ?
LOG_WARNING ( " Max tokens to predict exceeds server configuration " , {
{ " params.n_predict " , slot - > params . n_predict } ,
{ " slot.n_predict " , slot - > n_predict } ,
} ) ;
slot - > params . n_predict = slot - > n_predict ;
}
2023-10-22 21:53:08 +02:00
// infill
if ( data . count ( " input_prefix " ) ! = 0 )
{
slot - > params . input_prefix = data [ " input_prefix " ] ;
}
else
{
slot - > params . input_prefix = " " ;
2023-10-10 09:31:21 +02:00
}
2023-10-22 21:53:08 +02:00
if ( data . count ( " input_suffix " ) ! = 0 )
{
slot - > params . input_suffix = data [ " input_suffix " ] ;
}
else
{
slot - > params . input_suffix = " " ;
2023-10-10 09:31:21 +02:00
}
2023-10-20 20:07:23 +02:00
2023-10-22 21:53:08 +02:00
if ( data . count ( " prompt " ) ! = 0 )
{
slot - > prompt = data [ " prompt " ] ;
}
else
{
slot - > prompt = " " ;
}
2023-10-02 09:42:02 +02:00
2023-12-23 10:31:49 +01:00
slot - > sparams . penalty_prompt_tokens . clear ( ) ;
slot - > sparams . use_penalty_prompt_tokens = false ;
const auto & penalty_prompt = data . find ( " penalty_prompt " ) ;
if ( penalty_prompt ! = data . end ( ) )
{
if ( penalty_prompt - > is_string ( ) )
{
const auto penalty_prompt_string = penalty_prompt - > get < std : : string > ( ) ;
auto penalty_tokens = llama_tokenize ( model , penalty_prompt_string , false ) ;
slot - > sparams . penalty_prompt_tokens . swap ( penalty_tokens ) ;
if ( slot - > params . n_predict > 0 )
{
slot - > sparams . penalty_prompt_tokens . reserve ( slot - > sparams . penalty_prompt_tokens . size ( ) + slot - > params . n_predict ) ;
}
slot - > sparams . use_penalty_prompt_tokens = true ;
}
else if ( penalty_prompt - > is_array ( ) )
{
const auto n_tokens = penalty_prompt - > size ( ) ;
slot - > sparams . penalty_prompt_tokens . reserve ( n_tokens + std : : max ( 0 , slot - > params . n_predict ) ) ;
const int n_vocab = llama_n_vocab ( model ) ;
for ( const auto & penalty_token : * penalty_prompt )
{
if ( penalty_token . is_number_integer ( ) )
{
const auto tok = penalty_token . get < llama_token > ( ) ;
if ( tok > = 0 & & tok < n_vocab )
{
slot - > sparams . penalty_prompt_tokens . push_back ( tok ) ;
}
}
}
slot - > sparams . use_penalty_prompt_tokens = true ;
}
}
2023-10-22 21:53:08 +02:00
slot - > sparams . logit_bias . clear ( ) ;
2023-10-02 09:42:02 +02:00
2023-10-22 21:53:08 +02:00
if ( json_value ( data , " ignore_eos " , false ) )
2023-10-02 09:42:02 +02:00
{
2023-10-23 21:40:03 +02:00
slot - > sparams . logit_bias [ llama_token_eos ( model ) ] = - INFINITY ;
2023-10-02 09:42:02 +02:00
}
2023-10-22 21:53:08 +02:00
const auto & logit_bias = data . find ( " logit_bias " ) ;
if ( logit_bias ! = data . end ( ) & & logit_bias - > is_array ( ) )
2023-10-02 09:42:02 +02:00
{
2023-10-22 21:53:08 +02:00
const int n_vocab = llama_n_vocab ( model ) ;
for ( const auto & el : * logit_bias )
{
2024-02-11 14:38:14 +01:00
if ( el . is_array ( ) & & el . size ( ) = = 2 )
2023-10-22 21:53:08 +02:00
{
2024-02-11 14:38:14 +01:00
float bias ;
if ( el [ 1 ] . is_number ( ) )
2023-10-22 21:53:08 +02:00
{
2024-02-11 14:38:14 +01:00
bias = el [ 1 ] . get < float > ( ) ;
}
else if ( el [ 1 ] . is_boolean ( ) & & ! el [ 1 ] . get < bool > ( ) )
{
bias = - INFINITY ;
}
else
{
continue ;
}
if ( el [ 0 ] . is_number_integer ( ) )
{
llama_token tok = el [ 0 ] . get < llama_token > ( ) ;
if ( tok > = 0 & & tok < n_vocab )
2023-10-22 21:53:08 +02:00
{
2024-02-11 14:38:14 +01:00
slot - > sparams . logit_bias [ tok ] = bias ;
2023-10-22 21:53:08 +02:00
}
2024-02-11 14:38:14 +01:00
}
else if ( el [ 0 ] . is_string ( ) )
{
auto toks = llama_tokenize ( model , el [ 0 ] . get < std : : string > ( ) , false ) ;
for ( auto tok : toks )
2023-10-22 21:53:08 +02:00
{
2024-02-11 14:38:14 +01:00
slot - > sparams . logit_bias [ tok ] = bias ;
2023-10-22 21:53:08 +02:00
}
}
}
}
2023-10-02 09:42:02 +02:00
}
2023-10-20 20:07:23 +02:00
2023-10-22 21:53:08 +02:00
slot - > params . antiprompt . clear ( ) ;
2023-10-24 22:08:20 +02:00
2023-10-22 21:53:08 +02:00
const auto & stop = data . find ( " stop " ) ;
if ( stop ! = data . end ( ) & & stop - > is_array ( ) )
2023-10-02 09:42:02 +02:00
{
2023-10-22 21:53:08 +02:00
for ( const auto & word : * stop )
{
if ( ! word . empty ( ) )
{
slot - > params . antiprompt . push_back ( word ) ;
}
}
2023-10-02 09:42:02 +02:00
}
2024-02-16 12:33:25 +01:00
const auto & samplers_sequence = data . find ( " samplers " ) ;
if ( samplers_sequence ! = data . end ( ) & & samplers_sequence - > is_array ( ) )
{
std : : vector < std : : string > sampler_names ;
for ( const auto & sampler_name : * samplers_sequence )
{
if ( sampler_name . is_string ( ) )
{
sampler_names . emplace_back ( sampler_name ) ;
}
}
slot - > sparams . samplers_sequence = sampler_types_from_names ( sampler_names , false ) ;
}
else
{
slot - > sparams . samplers_sequence = default_sparams . samplers_sequence ;
}
2023-10-22 21:53:08 +02:00
if ( multimodal )
{
const auto & images_data = data . find ( " image_data " ) ;
if ( images_data ! = data . end ( ) & & images_data - > is_array ( ) )
{
for ( const auto & img : * images_data )
{
2023-12-30 22:24:42 +01:00
const std : : vector < uint8_t > image_buffer = base64_decode ( img [ " data " ] . get < std : : string > ( ) ) ;
2023-10-22 21:53:08 +02:00
slot_image img_sl ;
img_sl . id = img . count ( " id " ) ! = 0 ? img [ " id " ] . get < int > ( ) : slot - > images . size ( ) ;
2023-12-30 22:24:42 +01:00
img_sl . img_data = clip_image_u8_init ( ) ;
if ( ! clip_image_load_from_bytes ( image_buffer . data ( ) , image_buffer . size ( ) , img_sl . img_data ) )
{
2024-02-25 13:50:32 +01:00
LOG_ERROR ( " failed to load image " , {
{ " slot_id " , slot - > id } ,
{ " img_sl_id " , img_sl . id }
} ) ;
2023-10-22 21:53:08 +02:00
return false ;
}
2024-02-25 13:50:32 +01:00
LOG_VERBOSE ( " image loaded " , {
{ " slot_id " , slot - > id } ,
{ " img_sl_id " , img_sl . id }
} ) ;
2023-10-22 21:53:08 +02:00
img_sl . request_encode_image = true ;
slot - > images . push_back ( img_sl ) ;
}
// process prompt
// example: system prompt [img-102] user [img-103] describe [img-134] -> [{id: 102, prefix: 'system prompt '}, {id: 103, prefix: ' user '}, {id: 134, prefix: ' describe '}]}
if ( slot - > images . size ( ) > 0 & & ! slot - > prompt . is_array ( ) )
{
std : : string prompt = slot - > prompt . get < std : : string > ( ) ;
size_t pos = 0 , begin_prefix = 0 ;
std : : string pattern = " [img- " ;
while ( ( pos = prompt . find ( pattern , pos ) ) ! = std : : string : : npos ) {
size_t end_prefix = pos ;
pos + = pattern . length ( ) ;
2024-01-27 15:25:55 +01:00
size_t end_pos = prompt . find ( ' ] ' , pos ) ;
2023-10-22 21:53:08 +02:00
if ( end_pos ! = std : : string : : npos )
{
std : : string image_id = prompt . substr ( pos , end_pos - pos ) ;
try
{
int img_id = std : : stoi ( image_id ) ;
bool found = false ;
for ( slot_image & img : slot - > images )
{
if ( img . id = = img_id ) {
found = true ;
img . prefix_prompt = prompt . substr ( begin_prefix , end_prefix - begin_prefix ) ;
begin_prefix = end_pos + 1 ;
break ;
}
}
if ( ! found ) {
LOG_TEE ( " ERROR: Image with id: %i, not found. \n " , img_id ) ;
slot - > images . clear ( ) ;
return false ;
}
} catch ( const std : : invalid_argument & e ) {
LOG_TEE ( " Invalid image number id in prompt \n " ) ;
slot - > images . clear ( ) ;
return false ;
}
}
}
slot - > prompt = " " ;
slot - > params . input_suffix = prompt . substr ( begin_prefix ) ;
slot - > params . cache_prompt = false ; // multimodal doesn't support cache prompt
}
}
}
2023-10-12 08:29:04 +02:00
2023-10-22 21:53:08 +02:00
if ( slot - > ctx_sampling ! = nullptr )
2023-10-02 09:42:02 +02:00
{
2023-10-22 21:53:08 +02:00
llama_sampling_free ( slot - > ctx_sampling ) ;
2023-10-02 09:42:02 +02:00
}
2023-10-22 21:53:08 +02:00
slot - > ctx_sampling = llama_sampling_init ( slot - > sparams ) ;
2023-12-28 20:20:00 +01:00
llama_set_rng_seed ( ctx , slot - > params . seed ) ;
2023-10-22 21:53:08 +02:00
slot - > command = LOAD_PROMPT ;
2023-10-02 09:42:02 +02:00
2023-10-22 21:53:08 +02:00
all_slots_are_idle = false ;
2023-10-12 08:29:04 +02:00
2024-02-25 13:50:32 +01:00
LOG_INFO ( " slot is processing task " , {
{ " slot_id " , slot - > id } ,
{ " task_id " , slot - > task_id } ,
} ) ;
2023-10-02 09:42:02 +02:00
2023-10-22 21:53:08 +02:00
return true ;
}
void kv_cache_clear ( ) {
// clear the entire KV cache
2023-10-29 18:31:40 +01:00
llama_kv_cache_clear ( ctx ) ;
2023-10-22 21:53:08 +02:00
clean_kv_cache = false ;
2023-10-02 09:42:02 +02:00
}
2023-08-23 09:12:12 +02:00
2024-02-29 21:42:11 +01:00
void system_prompt_update ( ) {
2024-02-16 11:00:56 +01:00
kv_cache_clear ( ) ;
system_tokens . clear ( ) ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2024-02-16 11:00:56 +01:00
if ( ! system_prompt . empty ( ) ) {
system_tokens = : : llama_tokenize ( ctx , system_prompt , add_bos_token ) ;
2023-10-22 21:53:08 +02:00
2024-02-16 11:00:56 +01:00
llama_batch_clear ( batch ) ;
2023-10-22 21:53:08 +02:00
2024-02-16 11:00:56 +01:00
for ( int i = 0 ; i < ( int ) system_tokens . size ( ) ; + + i )
{
llama_batch_add ( batch , system_tokens [ i ] , i , { 0 } , false ) ;
}
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2024-02-25 19:43:50 +01:00
for ( int32_t i = 0 ; i < ( int32_t ) batch . n_tokens ; i + = params . n_batch )
2024-02-16 11:00:56 +01:00
{
2024-02-25 19:43:50 +01:00
const int32_t n_tokens = std : : min ( params . n_batch , ( int32_t ) ( batch . n_tokens - i ) ) ;
llama_batch batch_view = {
n_tokens ,
batch . token + i ,
nullptr ,
batch . pos + i ,
batch . n_seq_id + i ,
batch . seq_id + i ,
batch . logits + i ,
0 , 0 , 0 , // unused
} ;
if ( llama_decode ( ctx , batch_view ) ! = 0 )
{
LOG_TEE ( " %s: llama_decode() failed \n " , __func__ ) ;
return ;
}
2024-02-16 11:00:56 +01:00
}
2023-10-20 20:07:23 +02:00
2024-02-16 11:00:56 +01:00
// assign the system KV cache to all parallel sequences
for ( int32_t i = 1 ; i < params . n_parallel ; + + i )
{
llama_kv_cache_seq_cp ( ctx , 0 , i , 0 , system_tokens . size ( ) ) ;
}
2023-05-21 19:51:18 +02:00
}
2023-10-22 21:53:08 +02:00
LOG_TEE ( " system prompt updated \n " ) ;
system_need_update = false ;
}
2023-09-28 18:04:36 +02:00
2024-02-29 21:42:11 +01:00
void system_prompt_notify ( ) {
2023-10-22 21:53:08 +02:00
// release all slots
2024-02-29 21:42:11 +01:00
for ( server_slot & slot : slots )
2023-07-05 22:51:13 +02:00
{
2023-10-22 21:53:08 +02:00
slot . release ( ) ;
2023-05-21 19:51:18 +02:00
}
2023-10-22 21:53:08 +02:00
system_need_update = true ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
}
2024-02-29 21:42:11 +01:00
void system_prompt_process ( const json & sys_props ) {
2023-10-22 21:53:08 +02:00
system_prompt = sys_props . value ( " prompt " , " " ) ;
name_user = sys_props . value ( " anti_prompt " , " " ) ;
name_assistant = sys_props . value ( " assistant_name " , " " ) ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2024-02-16 11:00:56 +01:00
2024-02-29 21:42:11 +01:00
system_prompt_notify ( ) ;
2023-10-22 21:53:08 +02:00
}
2023-09-28 18:04:36 +02:00
2023-10-22 21:53:08 +02:00
static size_t find_stopping_strings ( const std : : string & text , const size_t last_token_size ,
2024-02-29 21:42:11 +01:00
const stop_type type , server_slot & slot )
2023-10-22 21:53:08 +02:00
{
size_t stop_pos = std : : string : : npos ;
2023-09-28 18:04:36 +02:00
2023-10-22 21:53:08 +02:00
for ( const std : : string & word : slot . params . antiprompt )
{
size_t pos ;
if ( type = = STOP_FULL )
{
const size_t tmp = word . size ( ) + last_token_size ;
const size_t from_pos = text . size ( ) > tmp ? text . size ( ) - tmp : 0 ;
pos = text . find ( word , from_pos ) ;
}
else
2023-09-28 18:04:36 +02:00
{
2023-10-22 21:53:08 +02:00
pos = find_partial_stop_string ( word , text ) ;
2023-09-28 18:04:36 +02:00
}
2023-10-22 21:53:08 +02:00
if ( pos ! = std : : string : : npos & &
( stop_pos = = std : : string : : npos | | pos < stop_pos ) )
{
if ( type = = STOP_FULL )
{
2024-02-29 21:42:11 +01:00
slot . stopped_word = true ;
slot . stopping_word = word ;
2023-10-22 21:53:08 +02:00
slot . has_next_token = false ;
}
stop_pos = pos ;
}
}
return stop_pos ;
}
2024-02-29 21:42:11 +01:00
bool process_token ( completion_token_output & result , server_slot & slot ) {
2023-10-22 21:53:08 +02:00
// remember which tokens were sampled - used for repetition penalties during sampling
const std : : string token_str = llama_token_to_piece ( ctx , result . tok ) ;
slot . sampled = result . tok ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2023-10-22 21:53:08 +02:00
// search stop word and delete it
slot . generated_text + = token_str ;
slot . has_next_token = true ;
2023-12-23 10:31:49 +01:00
if ( slot . ctx_sampling - > params . use_penalty_prompt_tokens & & result . tok ! = - 1 )
{
// we can change penalty_prompt_tokens because it is always created from scratch each request
slot . ctx_sampling - > params . penalty_prompt_tokens . push_back ( result . tok ) ;
}
2023-12-13 20:57:15 +01:00
// check if there is incomplete UTF-8 character at the end
bool incomplete = false ;
for ( unsigned i = 1 ; i < 5 & & i < = slot . generated_text . size ( ) ; + + i )
2023-10-22 21:53:08 +02:00
{
2023-12-13 20:57:15 +01:00
unsigned char c = slot . generated_text [ slot . generated_text . size ( ) - i ] ;
if ( ( c & 0xC0 ) = = 0x80 )
{
// continuation byte: 10xxxxxx
continue ;
}
2023-10-22 21:53:08 +02:00
if ( ( c & 0xE0 ) = = 0xC0 )
{
2023-12-13 20:57:15 +01:00
// 2-byte character: 110xxxxx ...
incomplete = i < 2 ;
2023-10-22 21:53:08 +02:00
}
else if ( ( c & 0xF0 ) = = 0xE0 )
{
2023-12-13 20:57:15 +01:00
// 3-byte character: 1110xxxx ...
incomplete = i < 3 ;
2023-10-22 21:53:08 +02:00
}
else if ( ( c & 0xF8 ) = = 0xF0 )
{
2023-12-13 20:57:15 +01:00
// 4-byte character: 11110xxx ...
incomplete = i < 4 ;
2023-10-22 21:53:08 +02:00
}
2023-12-13 20:57:15 +01:00
// else 1-byte character or invalid byte
break ;
2023-05-21 19:51:18 +02:00
}
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2023-12-13 20:57:15 +01:00
if ( ! incomplete )
2023-07-05 22:51:13 +02:00
{
2024-02-29 21:42:11 +01:00
size_t pos = std : : min ( slot . n_sent_text , slot . generated_text . size ( ) ) ;
2023-10-22 21:53:08 +02:00
const std : : string str_test = slot . generated_text . substr ( pos ) ;
bool is_stop_full = false ;
size_t stop_pos = find_stopping_strings ( str_test , token_str . size ( ) , STOP_FULL , slot ) ;
if ( stop_pos ! = std : : string : : npos )
{
is_stop_full = true ;
slot . generated_text . erase (
slot . generated_text . begin ( ) + pos + stop_pos ,
slot . generated_text . end ( ) ) ;
2024-02-29 21:42:11 +01:00
pos = std : : min ( slot . n_sent_text , slot . generated_text . size ( ) ) ;
2023-10-22 21:53:08 +02:00
}
else
2023-07-05 22:51:13 +02:00
{
2023-10-22 21:53:08 +02:00
is_stop_full = false ;
stop_pos = find_stopping_strings ( str_test , token_str . size ( ) , STOP_PARTIAL , slot ) ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
}
2023-09-28 18:04:36 +02:00
2023-10-22 21:53:08 +02:00
// check if there is any token to predict
if ( stop_pos = = std : : string : : npos | | ( ! slot . has_next_token & & ! is_stop_full & & stop_pos > 0 ) )
2023-07-05 22:51:13 +02:00
{
2023-10-22 21:53:08 +02:00
// no send the stop word in the response
result . text_to_send = slot . generated_text . substr ( pos , std : : string : : npos ) ;
2024-02-29 21:42:11 +01:00
slot . n_sent_text + = result . text_to_send . size ( ) ;
2023-10-22 21:53:08 +02:00
// add the token to slot queue and cache
}
slot . add_token_string ( result ) ;
if ( slot . params . stream )
{
send_partial_response ( slot , result ) ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
}
2023-05-21 19:51:18 +02:00
}
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2023-12-13 20:57:15 +01:00
if ( incomplete )
2023-07-05 22:51:13 +02:00
{
2023-10-22 21:53:08 +02:00
slot . has_next_token = true ;
2023-06-20 00:12:39 +02:00
}
2023-10-22 21:53:08 +02:00
// check the limits
2024-01-07 07:45:26 +01:00
if ( slot . n_decoded > 0 & & slot . has_next_token & & ! slot . has_budget ( params ) )
2023-05-21 19:51:18 +02:00
{
2023-10-22 21:53:08 +02:00
slot . stopped_limit = true ;
slot . has_next_token = false ;
}
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2023-10-23 21:40:03 +02:00
if ( ! slot . cache_tokens . empty ( ) & & result . tok = = llama_token_eos ( model ) )
2023-10-22 21:53:08 +02:00
{
slot . stopped_eos = true ;
slot . has_next_token = false ;
LOG_VERBOSE ( " eos token found " , { } ) ;
}
LOG_VERBOSE ( " next token " , {
{ " token " , result . tok } ,
{ " token_text " , tokens_to_output_formatted_string ( ctx , result . tok ) } ,
{ " has_next_token " , slot . has_next_token } ,
{ " n_remain " , slot . n_remaining } ,
{ " num_tokens_predicted " , slot . n_decoded } ,
{ " stopped_eos " , slot . stopped_eos } ,
{ " stopped_word " , slot . stopped_word } ,
{ " stopped_limit " , slot . stopped_limit } ,
{ " stopping_word " , slot . stopping_word } ,
} ) ;
return slot . has_next_token ; // continue
}
2023-08-08 15:29:19 +02:00
2024-02-29 21:42:11 +01:00
bool process_images ( server_slot & slot ) const
2023-10-22 21:53:08 +02:00
{
for ( slot_image & img : slot . images )
{
if ( ! img . request_encode_image )
2023-07-05 22:51:13 +02:00
{
2023-10-22 21:53:08 +02:00
continue ;
2023-08-08 15:29:19 +02:00
}
llava : support v1.6 (#5267)
* Create llava-survery-v2.py
* Update convert-image-encoder-to-gguf.py
* Update convert-image-encoder-to-gguf.py
* Rename llava-survery-v2.py to llava-surgery-v2.py
* Update convert-image-encoder-to-gguf.py
will now search for projector
* Update convert-image-encoder-to-gguf.py
whoops
* Update llava-surgery-v2.py
* Clip: Bugfix for normalization (it did not loat the 3 std and mean values)
Clip: bicubic resize function
Clip: added save-to-bmp/pil for debugging and conversion from/to 32/8 images
Clip: added normalization with FP16 precision simulation (image tensors match HF implementation, can be switched off, only used for llava-1.6)
Clip: added newline tensor, mergetype kv, image-grid kv, new resize-pad function with resolution from gridpoints
Clip: clip_image_preprocess now returns a float * vector instead of float, this way llava 1.5 and 1.6 is supported
llava: added ggml cpu graph for embedding patching, added spatial_unpad preliminary support, added a lot of comments that need to be cleaned when all is final
convert-image-encoder: fixed image-grid flattening
* whitespace corrections
* ws
* Tensors are now properly permuted.
Before the embeddings were inserted 1:1, now they are split into the 24x24 patches as in reference.
* ws
* added verbose_prompt support into cli
added stopwords for llava-1.6 into cli
* moved llava functions to llava.cpp, made clip.h C compatible API, replaced vector style functions with pointers, added a debug define to remove functions from compilation while not needed
* ws
* convert : skip unknown tensors (need for LLaVA)
* llava : update readme
* llava : fix compile warnings
* llava : style
* convert : add --skip-unknown CLI arg
* server : remove clip structs
* bugfix for non llava-1.6
It should now work with llava-1.5 as well
* clip : minor code rearrange
* llava : update readme a bit
---------
Co-authored-by: John <cmt-nct@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-14 08:38:35 +01:00
2024-02-20 20:07:22 +01:00
if ( ! llava_image_embed_make_with_clip_img ( clp_ctx , params . n_threads , img . img_data , & img . image_embedding , & img . image_tokens ) ) {
LOG_TEE ( " Error processing the given image " ) ;
2023-10-22 21:53:08 +02:00
return false ;
}
llava : support v1.6 (#5267)
* Create llava-survery-v2.py
* Update convert-image-encoder-to-gguf.py
* Update convert-image-encoder-to-gguf.py
* Rename llava-survery-v2.py to llava-surgery-v2.py
* Update convert-image-encoder-to-gguf.py
will now search for projector
* Update convert-image-encoder-to-gguf.py
whoops
* Update llava-surgery-v2.py
* Clip: Bugfix for normalization (it did not loat the 3 std and mean values)
Clip: bicubic resize function
Clip: added save-to-bmp/pil for debugging and conversion from/to 32/8 images
Clip: added normalization with FP16 precision simulation (image tensors match HF implementation, can be switched off, only used for llava-1.6)
Clip: added newline tensor, mergetype kv, image-grid kv, new resize-pad function with resolution from gridpoints
Clip: clip_image_preprocess now returns a float * vector instead of float, this way llava 1.5 and 1.6 is supported
llava: added ggml cpu graph for embedding patching, added spatial_unpad preliminary support, added a lot of comments that need to be cleaned when all is final
convert-image-encoder: fixed image-grid flattening
* whitespace corrections
* ws
* Tensors are now properly permuted.
Before the embeddings were inserted 1:1, now they are split into the 24x24 patches as in reference.
* ws
* added verbose_prompt support into cli
added stopwords for llava-1.6 into cli
* moved llava functions to llava.cpp, made clip.h C compatible API, replaced vector style functions with pointers, added a debug define to remove functions from compilation while not needed
* ws
* convert : skip unknown tensors (need for LLaVA)
* llava : update readme
* llava : fix compile warnings
* llava : style
* convert : add --skip-unknown CLI arg
* server : remove clip structs
* bugfix for non llava-1.6
It should now work with llava-1.5 as well
* clip : minor code rearrange
* llava : update readme a bit
---------
Co-authored-by: John <cmt-nct@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-14 08:38:35 +01:00
2023-10-22 21:53:08 +02:00
img . request_encode_image = false ;
}
2023-08-08 15:29:19 +02:00
2023-10-22 21:53:08 +02:00
return slot . images . size ( ) > 0 ;
}
2024-01-13 18:31:26 +01:00
void send_error ( task_server & task , const std : : string & error )
2023-10-22 21:53:08 +02:00
{
2024-01-13 18:31:26 +01:00
LOG_TEE ( " task %i - error: %s \n " , task . id , error . c_str ( ) ) ;
2023-10-22 21:53:08 +02:00
task_result res ;
2023-11-30 23:25:04 +01:00
res . id = task . id ;
res . multitask_id = task . multitask_id ;
2023-11-23 22:56:53 +01:00
res . stop = false ;
2023-10-22 21:53:08 +02:00
res . error = true ;
res . result_json = { { " content " , error } } ;
2024-01-26 13:42:20 +01:00
queue_results . send ( res ) ;
2023-11-30 23:25:04 +01:00
}
2024-02-29 21:42:11 +01:00
json get_formated_generation ( server_slot & slot )
2023-10-22 21:53:08 +02:00
{
2023-10-23 21:40:03 +02:00
const auto eos_bias = slot . sparams . logit_bias . find ( llama_token_eos ( model ) ) ;
2023-10-22 21:53:08 +02:00
const bool ignore_eos = eos_bias ! = slot . sparams . logit_bias . end ( ) & &
eos_bias - > second < 0.0f & & std : : isinf ( eos_bias - > second ) ;
2024-02-16 12:33:25 +01:00
std : : vector < std : : string > samplers_sequence ;
for ( const auto & sampler_type : slot . sparams . samplers_sequence )
{
samplers_sequence . emplace_back ( sampler_type_to_name_string ( sampler_type ) ) ;
}
2023-10-22 21:53:08 +02:00
return json {
{ " n_ctx " , slot . n_ctx } ,
2024-02-18 17:30:09 +01:00
{ " n_predict " , slot . n_predict } ,
2023-10-22 21:53:08 +02:00
{ " model " , params . model_alias } ,
{ " seed " , slot . params . seed } ,
2023-12-28 20:20:00 +01:00
{ " temperature " , slot . sparams . temp } ,
2024-02-06 10:20:00 +01:00
{ " dynatemp_range " , slot . sparams . dynatemp_range } ,
{ " dynatemp_exponent " , slot . sparams . dynatemp_exponent } ,
2023-10-22 21:53:08 +02:00
{ " top_k " , slot . sparams . top_k } ,
{ " top_p " , slot . sparams . top_p } ,
2023-11-09 03:00:34 +01:00
{ " min_p " , slot . sparams . min_p } ,
2023-10-22 21:53:08 +02:00
{ " tfs_z " , slot . sparams . tfs_z } ,
{ " typical_p " , slot . sparams . typical_p } ,
{ " repeat_last_n " , slot . sparams . penalty_last_n } ,
{ " repeat_penalty " , slot . sparams . penalty_repeat } ,
{ " presence_penalty " , slot . sparams . penalty_present } ,
{ " frequency_penalty " , slot . sparams . penalty_freq } ,
2023-12-23 10:31:49 +01:00
{ " penalty_prompt_tokens " , slot . sparams . penalty_prompt_tokens } ,
{ " use_penalty_prompt_tokens " , slot . sparams . use_penalty_prompt_tokens } ,
2023-10-22 21:53:08 +02:00
{ " mirostat " , slot . sparams . mirostat } ,
{ " mirostat_tau " , slot . sparams . mirostat_tau } ,
{ " mirostat_eta " , slot . sparams . mirostat_eta } ,
{ " penalize_nl " , slot . sparams . penalize_nl } ,
{ " stop " , slot . params . antiprompt } ,
{ " n_predict " , slot . params . n_predict } ,
{ " n_keep " , params . n_keep } ,
{ " ignore_eos " , ignore_eos } ,
{ " stream " , slot . params . stream } ,
{ " logit_bias " , slot . sparams . logit_bias } ,
{ " n_probs " , slot . sparams . n_probs } ,
2024-02-18 20:11:16 +01:00
{ " min_keep " , slot . sparams . min_keep } ,
2023-10-22 21:53:08 +02:00
{ " grammar " , slot . sparams . grammar } ,
2024-02-16 12:33:25 +01:00
{ " samplers " , samplers_sequence }
2023-10-22 21:53:08 +02:00
} ;
}
2024-02-29 21:42:11 +01:00
void send_partial_response ( server_slot & slot , completion_token_output tkn )
2023-10-22 21:53:08 +02:00
{
task_result res ;
res . id = slot . task_id ;
2023-11-30 23:25:04 +01:00
res . multitask_id = slot . multitask_id ;
2023-10-22 21:53:08 +02:00
res . error = false ;
res . stop = false ;
res . result_json = json
{
{ " content " , tkn . text_to_send } ,
{ " stop " , false } ,
{ " slot_id " , slot . id } ,
{ " multimodal " , multimodal }
} ;
if ( slot . sparams . n_probs > 0 )
{
std : : vector < completion_token_output > probs_output = { } ;
const std : : vector < llama_token > to_send_toks = llama_tokenize ( ctx , tkn . text_to_send , false ) ;
2024-02-29 21:42:11 +01:00
size_t probs_pos = std : : min ( slot . n_sent_token_probs , slot . generated_token_probs . size ( ) ) ;
size_t probs_stop_pos = std : : min ( slot . n_sent_token_probs + to_send_toks . size ( ) , slot . generated_token_probs . size ( ) ) ;
2023-10-22 21:53:08 +02:00
if ( probs_pos < probs_stop_pos )
2023-07-05 22:51:13 +02:00
{
2023-10-22 21:53:08 +02:00
probs_output = std : : vector < completion_token_output > ( slot . generated_token_probs . begin ( ) + probs_pos , slot . generated_token_probs . begin ( ) + probs_stop_pos ) ;
2023-07-02 23:38:44 +02:00
}
2024-02-29 21:42:11 +01:00
slot . n_sent_token_probs = probs_stop_pos ;
2023-10-22 21:53:08 +02:00
res . result_json [ " completion_probabilities " ] = probs_vector_to_json ( ctx , probs_output ) ;
}
2023-08-08 15:29:19 +02:00
2023-11-25 10:29:06 +01:00
if ( slot . oaicompat )
{
res . result_json [ " oaicompat_token_ctr " ] = slot . n_decoded ;
res . result_json [ " model " ] = slot . oaicompat_model ;
}
2024-01-26 13:42:20 +01:00
queue_results . send ( res ) ;
2023-10-22 21:53:08 +02:00
}
2024-02-29 21:42:11 +01:00
void send_final_response ( server_slot & slot )
2023-10-22 21:53:08 +02:00
{
task_result res ;
res . id = slot . task_id ;
2023-11-30 23:25:04 +01:00
res . multitask_id = slot . multitask_id ;
2023-10-22 21:53:08 +02:00
res . error = false ;
res . stop = true ;
2023-10-18 15:21:57 +02:00
2023-10-22 21:53:08 +02:00
res . result_json = json
{
{ " content " , ! slot . params . stream ? slot . generated_text : " " } ,
{ " slot_id " , slot . id } ,
{ " stop " , true } ,
{ " model " , params . model_alias } ,
{ " tokens_predicted " , slot . n_decoded } ,
2024-02-29 21:42:11 +01:00
{ " tokens_evaluated " , slot . n_prompt_tokens } ,
2023-10-22 21:53:08 +02:00
{ " generation_settings " , get_formated_generation ( slot ) } ,
{ " prompt " , slot . prompt } ,
{ " truncated " , slot . truncated } ,
{ " stopped_eos " , slot . stopped_eos } ,
{ " stopped_word " , slot . stopped_word } ,
{ " stopped_limit " , slot . stopped_limit } ,
{ " stopping_word " , slot . stopping_word } ,
{ " tokens_cached " , slot . n_past } ,
{ " timings " , slot . get_formated_timings ( ) }
} ;
if ( slot . sparams . n_probs > 0 )
{
std : : vector < completion_token_output > probs = { } ;
if ( ! slot . params . stream & & slot . stopped_word )
{
const std : : vector < llama_token > stop_word_toks = llama_tokenize ( ctx , slot . stopping_word , false ) ;
probs = std : : vector < completion_token_output > ( slot . generated_token_probs . begin ( ) , slot . generated_token_probs . end ( ) - stop_word_toks . size ( ) ) ;
}
else
{
probs = std : : vector < completion_token_output > (
slot . generated_token_probs . begin ( ) ,
2024-01-04 18:56:33 +01:00
slot . generated_token_probs . end ( ) ) ;
2023-10-05 16:02:55 +02:00
}
2023-10-22 21:53:08 +02:00
res . result_json [ " completion_probabilities " ] = probs_vector_to_json ( ctx , probs ) ;
2023-05-21 19:51:18 +02:00
}
2023-11-25 10:29:06 +01:00
if ( slot . oaicompat )
{
res . result_json [ " oaicompat_token_ctr " ] = slot . n_decoded ;
res . result_json [ " model " ] = slot . oaicompat_model ;
}
2024-01-26 13:42:20 +01:00
queue_results . send ( res ) ;
2023-10-22 21:53:08 +02:00
}
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2024-02-29 21:42:11 +01:00
void send_embedding ( server_slot & slot )
2023-10-22 21:53:08 +02:00
{
task_result res ;
res . id = slot . task_id ;
2023-11-30 23:25:04 +01:00
res . multitask_id = slot . multitask_id ;
2023-10-22 21:53:08 +02:00
res . error = false ;
res . stop = true ;
const int n_embd = llama_n_embd ( model ) ;
if ( ! params . embedding )
2023-07-05 22:51:13 +02:00
{
2024-02-29 21:42:11 +01:00
LOG_WARNING ( " embedding disabled " , { { " params.embedding " , params . embedding } } ) ;
2023-10-22 21:53:08 +02:00
res . result_json = json
{
{ " embedding " , std : : vector < float > ( n_embd , 0.0f ) } ,
} ;
}
else
{
const float * data = llama_get_embeddings ( ctx ) ;
std : : vector < float > embedding ( data , data + n_embd ) ;
res . result_json = json
{
2024-02-29 21:42:11 +01:00
{ " embedding " , embedding } ,
2023-10-22 21:53:08 +02:00
} ;
2023-05-21 19:51:18 +02:00
}
2024-01-26 13:42:20 +01:00
queue_results . send ( res ) ;
2023-10-22 21:53:08 +02:00
}
2023-05-21 19:51:18 +02:00
2024-01-26 13:42:20 +01:00
void request_completion ( int task_id , json data , bool infill , bool embedding , int multitask_id )
2023-10-22 21:53:08 +02:00
{
task_server task ;
2024-01-26 13:42:20 +01:00
task . id = task_id ;
2023-11-23 22:56:53 +01:00
task . target_id = 0 ;
2023-11-25 10:29:06 +01:00
task . data = std : : move ( data ) ;
2023-10-22 21:53:08 +02:00
task . infill_mode = infill ;
2023-11-01 10:28:28 +01:00
task . embedding_mode = embedding ;
2024-01-11 08:10:34 +01:00
task . type = TASK_TYPE_COMPLETION ;
2023-11-30 23:25:04 +01:00
task . multitask_id = multitask_id ;
// when a completion task's prompt array is not a singleton, we split it into multiple requests
// otherwise, it's a single-prompt task, we actually queue it
2024-02-06 09:16:23 +01:00
// if there's numbers in the prompt array it will be treated as an array of tokens
if ( task . data . count ( " prompt " ) ! = 0 & & task . data . at ( " prompt " ) . size ( ) > 1 ) {
bool numbers = false ;
for ( const auto & e : task . data . at ( " prompt " ) ) {
if ( e . is_number ( ) ) {
numbers = true ;
break ;
}
}
// NOTE: split_multiprompt_task() does not handle a mix of strings and numbers,
// it will completely stall the server. I don't know where the bug for this is.
//
// if there are numbers, it needs to be treated like a single prompt,
// queue_tasks handles a mix of strings and numbers just fine.
if ( numbers ) {
queue_tasks . post ( task ) ;
} else {
split_multiprompt_task ( task_id , task ) ;
}
} else {
2024-02-26 23:15:48 +01:00
// an empty prompt can make slot become buggy
if ( task . data . contains ( " prompt " ) & & task . data [ " prompt " ] . is_string ( ) & & task . data [ " prompt " ] . get < std : : string > ( ) . empty ( ) ) {
task . data [ " prompt " ] = " " ; // add a space so that we have one token
}
2024-02-06 09:16:23 +01:00
queue_tasks . post ( task ) ;
}
2023-10-22 21:53:08 +02:00
}
// for multiple images processing
2024-02-29 21:42:11 +01:00
bool ingest_images ( server_slot & slot , int n_batch )
2023-10-22 21:53:08 +02:00
{
int image_idx = 0 ;
while ( image_idx < ( int ) slot . images . size ( ) )
{
slot_image & img = slot . images [ image_idx ] ;
// process prefix prompt
for ( int32_t i = 0 ; i < ( int32_t ) batch . n_tokens ; i + = n_batch )
2023-07-05 22:51:13 +02:00
{
2023-10-22 21:53:08 +02:00
const int32_t n_tokens = std : : min ( n_batch , ( int32_t ) ( batch . n_tokens - i ) ) ;
llama_batch batch_view = {
n_tokens ,
batch . token + i ,
nullptr ,
batch . pos + i ,
batch . n_seq_id + i ,
batch . seq_id + i ,
batch . logits + i ,
0 , 0 , 0 , // unused
} ;
if ( llama_decode ( ctx , batch_view ) )
2023-07-05 22:51:13 +02:00
{
2023-10-22 21:53:08 +02:00
LOG_TEE ( " %s : failed to eval \n " , __func__ ) ;
return false ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
}
2023-10-22 21:53:08 +02:00
}
// process image with llm
for ( int i = 0 ; i < img . image_tokens ; i + = n_batch )
{
int n_eval = img . image_tokens - i ;
if ( n_eval > n_batch )
{
n_eval = n_batch ;
}
const int n_embd = llama_n_embd ( model ) ;
2024-02-29 21:42:11 +01:00
llama_batch batch_img = {
n_eval ,
nullptr ,
( img . image_embedding + i * n_embd ) ,
nullptr ,
nullptr ,
nullptr ,
nullptr ,
slot . n_past ,
1 , 0
} ;
2023-10-22 21:53:08 +02:00
if ( llama_decode ( ctx , batch_img ) )
{
LOG_TEE ( " %s : failed to eval image \n " , __func__ ) ;
return false ;
}
slot . n_past + = n_eval ;
}
image_idx + + ;
llama_batch_clear ( batch ) ;
// append prefix of next image
const auto json_prompt = ( image_idx > = ( int ) slot . images . size ( ) ) ?
slot . params . input_suffix : // no more images, then process suffix prompt
( json ) ( slot . images [ image_idx ] . prefix_prompt ) ;
std : : vector < llama_token > append_tokens = tokenize ( json_prompt , false ) ; // has next image
for ( int i = 0 ; i < ( int ) append_tokens . size ( ) ; + + i )
{
2024-01-30 19:17:30 +01:00
llama_batch_add ( batch , append_tokens [ i ] , system_tokens . size ( ) + slot . n_past , { slot . id } , true ) ;
2023-10-22 21:53:08 +02:00
slot . n_past + = 1 ;
2023-05-21 19:51:18 +02:00
}
}
2023-10-22 21:53:08 +02:00
return true ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
}
2023-10-22 21:53:08 +02:00
void request_cancel ( int task_id )
2023-07-05 22:51:13 +02:00
{
2023-10-22 21:53:08 +02:00
task_server task ;
2024-01-11 08:10:34 +01:00
task . type = TASK_TYPE_CANCEL ;
2023-10-22 21:53:08 +02:00
task . target_id = task_id ;
2024-01-26 13:42:20 +01:00
queue_tasks . post ( task ) ;
2023-10-22 21:53:08 +02:00
}
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2024-01-26 13:42:20 +01:00
void split_multiprompt_task ( int multitask_id , task_server & multiprompt_task )
2023-11-30 23:25:04 +01:00
{
2023-12-01 19:35:03 +01:00
int prompt_count = multiprompt_task . data . at ( " prompt " ) . size ( ) ;
2024-02-06 09:16:23 +01:00
if ( prompt_count < = 1 ) {
send_error ( multiprompt_task , " error while handling multiple prompts " ) ;
return ;
}
2023-11-30 23:25:04 +01:00
2024-01-26 13:42:20 +01:00
// generate all the ID for subtask
2023-11-30 23:25:04 +01:00
std : : vector < int > subtask_ids ( prompt_count ) ;
for ( int i = 0 ; i < prompt_count ; i + + )
2024-01-26 13:42:20 +01:00
{
subtask_ids [ i ] = queue_tasks . get_new_id ( ) ;
}
// queue up the multitask so we can track its subtask progression
queue_tasks . add_multitask ( multitask_id , subtask_ids ) ;
// add subtasks
for ( int i = 0 ; i < prompt_count ; i + + )
2023-11-30 23:25:04 +01:00
{
json subtask_data = multiprompt_task . data ;
subtask_data [ " prompt " ] = subtask_data [ " prompt " ] [ i ] ;
// subtasks inherit everything else (infill mode, embedding mode, etc.)
2024-01-26 13:42:20 +01:00
request_completion ( subtask_ids [ i ] , subtask_data , multiprompt_task . infill_mode , multiprompt_task . embedding_mode , multitask_id ) ;
2023-11-30 23:25:04 +01:00
}
}
2024-01-26 13:42:20 +01:00
void process_single_task ( task_server & task )
2023-10-22 21:53:08 +02:00
{
2024-01-26 13:42:20 +01:00
switch ( task . type )
2023-07-05 22:51:13 +02:00
{
2024-01-26 13:42:20 +01:00
case TASK_TYPE_COMPLETION : {
2024-02-29 21:42:11 +01:00
server_slot * slot = get_slot ( json_value ( task . data , " slot_id " , - 1 ) ) ;
2024-01-26 13:42:20 +01:00
if ( slot = = nullptr )
{
// if no slot is available, we defer this task for processing later
2024-02-25 13:50:32 +01:00
LOG_VERBOSE ( " no slot is available " , { { " task_id " , task . id } } ) ;
2024-01-26 13:42:20 +01:00
queue_tasks . defer ( task ) ;
break ;
}
if ( task . data . contains ( " system_prompt " ) )
{
if ( ! all_slots_are_idle ) {
send_error ( task , " system prompt can only be updated when all slots are idle " ) ;
2024-01-13 18:31:26 +01:00
break ;
2023-10-22 21:53:08 +02:00
}
2024-02-29 21:42:11 +01:00
system_prompt_process ( task . data [ " system_prompt " ] ) ;
2023-10-22 21:53:08 +02:00
2024-01-26 13:42:20 +01:00
// reset cache_tokens for all slots
2024-02-29 21:42:11 +01:00
for ( server_slot & slot : slots )
2023-10-22 21:53:08 +02:00
{
2024-01-26 13:42:20 +01:00
slot . cache_tokens . clear ( ) ;
2024-01-30 19:17:30 +01:00
slot . n_past = 0 ;
slot . n_past_se = 0 ;
2023-10-22 21:53:08 +02:00
}
2024-01-26 13:42:20 +01:00
}
2023-10-22 21:53:08 +02:00
2024-01-26 13:42:20 +01:00
slot - > reset ( ) ;
2023-10-22 21:53:08 +02:00
2024-01-26 13:42:20 +01:00
slot - > infill = task . infill_mode ;
slot - > embedding = task . embedding_mode ;
slot - > task_id = task . id ;
slot - > multitask_id = task . multitask_id ;
2023-10-22 21:53:08 +02:00
2024-01-26 13:42:20 +01:00
if ( ! launch_slot_with_data ( slot , task . data ) )
{
// send error result
send_error ( task , " internal_error " ) ;
break ;
}
} break ;
case TASK_TYPE_CANCEL : { // release slot linked with the task id
for ( auto & slot : slots )
{
if ( slot . task_id = = task . target_id )
2023-10-22 21:53:08 +02:00
{
2024-01-26 13:42:20 +01:00
slot . release ( ) ;
2023-10-22 21:53:08 +02:00
break ;
}
2024-01-26 13:42:20 +01:00
}
} break ;
case TASK_TYPE_NEXT_RESPONSE : {
// do nothing
} break ;
2024-02-25 13:49:43 +01:00
case TASK_TYPE_METRICS : {
2024-02-21 15:47:48 +01:00
json slots_data = json : : array ( ) ;
int n_idle_slots = 0 ;
int n_processing_slots = 0 ;
2024-02-29 21:42:11 +01:00
for ( server_slot & slot : slots ) {
2024-02-21 15:47:48 +01:00
json slot_data = get_formated_generation ( slot ) ;
slot_data [ " id " ] = slot . id ;
slot_data [ " task_id " ] = slot . task_id ;
slot_data [ " state " ] = slot . state ;
slot_data [ " prompt " ] = slot . prompt ;
slot_data [ " next_token " ] = {
2024-02-29 21:42:11 +01:00
{ " has_next_token " , slot . has_next_token } ,
{ " n_remain " , slot . n_remaining } ,
2024-02-21 15:47:48 +01:00
{ " num_tokens_predicted " , slot . n_decoded } ,
2024-02-29 21:42:11 +01:00
{ " stopped_eos " , slot . stopped_eos } ,
{ " stopped_word " , slot . stopped_word } ,
{ " stopped_limit " , slot . stopped_limit } ,
{ " stopping_word " , slot . stopping_word } ,
2024-02-21 15:47:48 +01:00
} ;
2024-02-24 12:28:55 +01:00
if ( slot_data [ " state " ] = = IDLE ) {
n_idle_slots + + ;
} else {
n_processing_slots + + ;
}
2024-02-21 15:47:48 +01:00
slots_data . push_back ( slot_data ) ;
}
2024-02-25 13:50:32 +01:00
LOG_INFO ( " slot data " , {
{ " task_id " , task . id } ,
{ " n_idle_slots " , n_idle_slots } ,
{ " n_processing_slots " , n_processing_slots }
} ) ;
LOG_VERBOSE ( " slot data " , {
{ " task_id " , task . id } ,
{ " n_idle_slots " , n_idle_slots } ,
{ " n_processing_slots " , n_processing_slots } ,
{ " slots " , slots_data }
} ) ;
2024-02-21 15:47:48 +01:00
task_result res ;
res . id = task . id ;
res . multitask_id = task . multitask_id ;
res . stop = true ;
res . error = false ;
res . result_json = {
2024-02-25 13:49:43 +01:00
{ " idle " , n_idle_slots } ,
{ " processing " , n_processing_slots } ,
{ " deferred " , queue_tasks . queue_tasks_deferred . size ( ) } ,
{ " n_prompt_tokens_processed_total " , metrics . n_prompt_tokens_processed_total } ,
{ " n_tokens_predicted_total " , metrics . n_tokens_predicted_total } ,
{ " n_prompt_tokens_processed " , metrics . n_prompt_tokens_processed } ,
{ " t_prompt_processing " , metrics . t_prompt_processing } ,
{ " n_tokens_predicted " , metrics . n_tokens_predicted } ,
{ " t_tokens_generation " , metrics . t_tokens_generation } ,
2024-02-29 21:42:11 +01:00
{ " kv_cache_tokens_count " , llama_get_kv_cache_token_count ( ctx ) } ,
{ " kv_cache_used_cells " , llama_get_kv_cache_used_cells ( ctx ) } ,
2024-02-25 13:49:43 +01:00
2024-02-29 21:42:11 +01:00
{ " slots " , slots_data } ,
2024-02-21 15:47:48 +01:00
} ;
2024-02-25 13:49:43 +01:00
metrics . reset_bucket ( ) ;
2024-02-21 15:47:48 +01:00
queue_results . send ( res ) ;
} break ;
2023-07-02 23:38:44 +02:00
}
2024-01-26 13:42:20 +01:00
}
2023-11-30 23:25:04 +01:00
2024-01-26 13:42:20 +01:00
void on_finish_multitask ( task_multi & multitask )
{
// all subtasks done == multitask is done
task_result result ;
result . id = multitask . id ;
result . stop = true ;
result . error = false ;
2024-01-18 21:33:05 +01:00
2024-01-26 13:42:20 +01:00
// collect json results into one json result
std : : vector < json > result_jsons ;
for ( auto & subres : multitask . results )
2023-11-30 23:25:04 +01:00
{
2024-01-26 13:42:20 +01:00
result_jsons . push_back ( subres . result_json ) ;
result . error = result . error & & subres . error ;
2023-11-30 23:25:04 +01:00
}
2024-01-26 13:42:20 +01:00
result . result_json = json { { " results " , result_jsons } } ;
queue_results . send ( result ) ;
2023-10-22 21:53:08 +02:00
}
2023-07-02 23:38:44 +02:00
2023-10-22 21:53:08 +02:00
bool update_slots ( ) {
2024-01-13 18:31:26 +01:00
if ( system_need_update )
2023-07-05 22:51:13 +02:00
{
2024-02-25 13:50:32 +01:00
LOG_INFO ( " updating system prompt " , { } ) ;
2024-02-29 21:42:11 +01:00
system_prompt_update ( ) ;
2023-07-05 22:51:13 +02:00
}
2023-10-22 21:53:08 +02:00
llama_batch_clear ( batch ) ;
if ( all_slots_are_idle )
2023-07-05 22:51:13 +02:00
{
2023-10-22 21:53:08 +02:00
if ( system_prompt . empty ( ) & & clean_kv_cache )
2023-07-05 22:51:13 +02:00
{
2024-02-25 13:50:32 +01:00
LOG_INFO ( " all slots are idle and system prompt is empty, clear the KV cache " , { } ) ;
2023-10-22 21:53:08 +02:00
kv_cache_clear ( ) ;
2023-07-05 22:51:13 +02:00
}
2024-01-26 13:42:20 +01:00
return true ;
2023-10-22 21:53:08 +02:00
}
2024-02-25 13:50:32 +01:00
LOG_VERBOSE ( " posting NEXT_RESPONSE " , { } ) ;
2024-01-30 19:17:30 +01:00
task_server task ;
task . type = TASK_TYPE_NEXT_RESPONSE ;
task . target_id = - 1 ;
queue_tasks . post ( task ) ;
2024-02-29 21:42:11 +01:00
for ( server_slot & slot : slots )
2023-10-22 21:53:08 +02:00
{
2024-01-27 14:38:05 +01:00
if ( slot . ga_n = = 1 )
2023-07-05 22:51:13 +02:00
{
2024-01-30 19:17:30 +01:00
if ( slot . is_processing ( ) & & system_tokens . size ( ) + slot . cache_tokens . size ( ) > = ( size_t ) slot . n_ctx )
2024-01-27 14:38:05 +01:00
{
// Shift context
2024-02-21 16:33:54 +01:00
const int n_keep = slot . params . n_keep + add_bos_token ;
2024-02-25 13:50:32 +01:00
const int n_left = ( int ) system_tokens . size ( ) + slot . n_past - n_keep ;
2024-01-27 14:38:05 +01:00
const int n_discard = n_left / 2 ;
2023-10-22 21:53:08 +02:00
2024-02-25 13:50:32 +01:00
LOG_INFO ( " slot context shift " , {
{ " slot_id " , slot . id } ,
{ " task_id " , slot . task_id } ,
{ " n_keep " , n_keep } ,
{ " n_left " , n_left } ,
{ " n_discard " , n_discard } ,
{ " n_ctx " , n_ctx } ,
{ " n_past " , slot . n_past } ,
{ " n_system_tokens " , system_tokens . size ( ) } ,
{ " n_cache_tokens " , slot . cache_tokens . size ( ) }
} ) ;
2024-02-25 21:12:24 +01:00
llama_kv_cache_seq_rm ( ctx , slot . id , n_keep , n_keep + n_discard ) ;
llama_kv_cache_seq_add ( ctx , slot . id , n_keep + n_discard , system_tokens . size ( ) + slot . n_past , - n_discard ) ;
2023-10-22 21:53:08 +02:00
2024-02-21 16:33:54 +01:00
for ( size_t i = n_keep + n_discard ; i < slot . cache_tokens . size ( ) ; i + + )
2024-01-27 14:38:05 +01:00
{
slot . cache_tokens [ i - n_discard ] = slot . cache_tokens [ i ] ;
}
2023-10-22 21:53:08 +02:00
2024-01-27 14:38:05 +01:00
slot . cache_tokens . resize ( slot . cache_tokens . size ( ) - n_discard ) ;
2023-10-22 21:53:08 +02:00
2024-01-27 14:38:05 +01:00
slot . n_past - = n_discard ;
2023-10-22 21:53:08 +02:00
2024-01-27 14:38:05 +01:00
slot . truncated = true ;
}
2023-07-05 22:51:13 +02:00
}
2023-10-22 21:53:08 +02:00
}
// decode any currently ongoing sequences
2024-02-25 13:50:32 +01:00
LOG_VERBOSE ( " decoding ongoing sequences " , { } ) ;
2023-10-22 21:53:08 +02:00
for ( auto & slot : slots )
{
// release the slot
2023-10-24 22:08:20 +02:00
if ( slot . command = = RELEASE )
2023-07-05 22:51:13 +02:00
{
2023-10-22 21:53:08 +02:00
slot . state = IDLE ;
slot . command = NONE ;
slot . t_last_used = ggml_time_us ( ) ;
2024-02-25 13:50:32 +01:00
LOG_INFO ( " slot released " , {
{ " slot_id " , slot . id } ,
{ " task_id " , slot . task_id } ,
{ " n_ctx " , n_ctx } ,
{ " n_past " , slot . n_past } ,
{ " n_system_tokens " , system_tokens . size ( ) } ,
{ " n_cache_tokens " , slot . cache_tokens . size ( ) } ,
{ " truncated " , slot . truncated }
} ) ;
2024-01-26 13:42:20 +01:00
queue_tasks . notify_slot_changed ( ) ;
2023-10-22 21:53:08 +02:00
continue ;
2023-07-05 22:51:13 +02:00
}
2023-10-22 21:53:08 +02:00
2023-10-24 22:08:20 +02:00
if ( slot . state = = IDLE )
2023-07-05 22:51:13 +02:00
{
2023-10-22 21:53:08 +02:00
continue ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
}
2023-10-22 21:53:08 +02:00
slot . i_batch = batch . n_tokens ;
2024-01-27 14:38:05 +01:00
const int32_t slot_npast = slot . n_past_se > 0 ? slot . n_past_se : slot . n_past ;
2023-10-22 21:53:08 +02:00
2024-01-30 19:17:30 +01:00
// TODO: we always have to take into account the "system_tokens"
// this is not great and needs to be improved somehow
llama_batch_add ( batch , slot . sampled , system_tokens . size ( ) + slot_npast , { slot . id } , true ) ;
2023-10-22 21:53:08 +02:00
slot . n_past + = 1 ;
2023-05-21 19:51:18 +02:00
}
2023-10-22 21:53:08 +02:00
// process in chunks of params.n_batch
int32_t n_batch = params . n_batch ;
// assign workload to the slots
if ( params . cont_batching | | batch . n_tokens = = 0 )
2023-07-05 22:51:13 +02:00
{
2023-10-22 21:53:08 +02:00
for ( auto & slot : slots )
{
2023-10-26 21:53:37 +02:00
const bool has_prompt = slot . prompt . is_array ( ) | | ( slot . prompt . is_string ( ) & & ! slot . prompt . get < std : : string > ( ) . empty ( ) ) | | ! slot . images . empty ( ) ;
2023-10-24 22:08:20 +02:00
// empty prompt passed -> release the slot and send empty response
2024-01-11 22:23:49 +01:00
// note: infill mode allows empty prompt
if ( slot . state = = IDLE & & slot . command = = LOAD_PROMPT & & ! has_prompt & & ! slot . infill )
2023-10-24 22:08:20 +02:00
{
slot . release ( ) ;
slot . print_timings ( ) ;
send_final_response ( slot ) ;
continue ;
}
2023-10-22 21:53:08 +02:00
// need process the prompt
if ( slot . state = = IDLE & & slot . command = = LOAD_PROMPT )
{
slot . state = PROCESSING ;
slot . command = NONE ;
std : : vector < llama_token > prompt_tokens ;
slot . t_start_process_prompt = ggml_time_us ( ) ;
slot . t_start_genereration = 0 ;
if ( slot . infill )
{
bool suff_rm_leading_spc = true ;
if ( params . input_suffix . find_first_of ( ' ' ) = = 0 & & params . input_suffix . size ( ) > 1 )
{
params . input_suffix . erase ( 0 , 1 ) ;
suff_rm_leading_spc = false ;
}
auto prefix_tokens = tokenize ( slot . params . input_prefix , false ) ;
auto suffix_tokens = tokenize ( slot . params . input_suffix , false ) ;
const int space_token = 29871 ; // TODO: this should not be hardcoded
if ( suff_rm_leading_spc & & ! suffix_tokens . empty ( ) & & suffix_tokens [ 0 ] = = space_token ) {
suffix_tokens . erase ( suffix_tokens . begin ( ) ) ;
}
2023-10-23 21:40:03 +02:00
prefix_tokens . insert ( prefix_tokens . begin ( ) , llama_token_prefix ( model ) ) ;
prefix_tokens . insert ( prefix_tokens . begin ( ) , llama_token_bos ( model ) ) ; // always add BOS
2024-01-30 19:17:30 +01:00
prefix_tokens . insert ( prefix_tokens . end ( ) , llama_token_suffix ( model ) ) ;
prefix_tokens . insert ( prefix_tokens . end ( ) , suffix_tokens . begin ( ) , suffix_tokens . end ( ) ) ;
2023-10-23 21:40:03 +02:00
prefix_tokens . push_back ( llama_token_middle ( model ) ) ;
2023-10-22 21:53:08 +02:00
prompt_tokens = prefix_tokens ;
}
else
{
2023-11-17 03:14:37 +01:00
prompt_tokens = tokenize ( slot . prompt , system_prompt . empty ( ) & & add_bos_token ) ; // add BOS if there isn't system prompt
2023-10-22 21:53:08 +02:00
}
2024-02-29 21:42:11 +01:00
slot . n_prompt_tokens = prompt_tokens . size ( ) ;
2023-10-22 21:53:08 +02:00
2023-11-11 06:48:21 +01:00
if ( slot . params . n_keep < 0 )
{
2024-02-29 21:42:11 +01:00
slot . params . n_keep = slot . n_prompt_tokens ;
2023-11-11 06:48:21 +01:00
}
slot . params . n_keep = std : : min ( slot . n_ctx - 4 , slot . params . n_keep ) ;
// if input prompt is too big, truncate it
2024-02-29 21:42:11 +01:00
if ( slot . n_prompt_tokens > = slot . n_ctx )
2023-11-11 06:48:21 +01:00
{
const int n_left = slot . n_ctx - slot . params . n_keep ;
const int n_block_size = n_left / 2 ;
2024-02-29 21:42:11 +01:00
const int erased_blocks = ( slot . n_prompt_tokens - slot . params . n_keep - n_block_size ) / n_block_size ;
2023-11-11 06:48:21 +01:00
2024-02-29 21:42:11 +01:00
std : : vector < llama_token > new_tokens (
prompt_tokens . begin ( ) ,
prompt_tokens . begin ( ) + slot . params . n_keep ) ;
new_tokens . insert (
new_tokens . end ( ) ,
prompt_tokens . begin ( ) + slot . params . n_keep + erased_blocks * n_block_size ,
prompt_tokens . end ( ) ) ;
2023-11-11 06:48:21 +01:00
LOG_VERBOSE ( " input truncated " , {
2024-02-29 21:42:11 +01:00
{ " n_ctx " , slot . n_ctx } ,
{ " n_keep " , slot . params . n_keep } ,
{ " n_left " , n_left } ,
2023-11-11 06:48:21 +01:00
{ " new_tokens " , tokens_to_str ( ctx , new_tokens . cbegin ( ) , new_tokens . cend ( ) ) } ,
} ) ;
slot . truncated = true ;
prompt_tokens = new_tokens ;
2024-02-29 21:42:11 +01:00
slot . n_prompt_tokens = prompt_tokens . size ( ) ;
GGML_ASSERT ( slot . n_prompt_tokens < slot . n_ctx ) ;
2023-11-11 06:48:21 +01:00
}
2023-10-22 21:53:08 +02:00
if ( ! slot . params . cache_prompt )
{
llama_sampling_reset ( slot . ctx_sampling ) ;
2024-02-29 21:42:11 +01:00
slot . n_past = 0 ;
2024-01-27 14:38:05 +01:00
slot . n_past_se = 0 ;
2024-02-29 21:42:11 +01:00
slot . ga_i = 0 ;
slot . n_prompt_tokens_processed = slot . n_prompt_tokens ;
2023-10-22 21:53:08 +02:00
}
else
{
// push the prompt into the sampling context (do not apply grammar)
for ( auto & token : prompt_tokens )
{
llama_sampling_accept ( slot . ctx_sampling , ctx , token , false ) ;
}
slot . n_past = common_part ( slot . cache_tokens , prompt_tokens ) ;
2024-02-25 19:43:50 +01:00
// the last token of the cache is not in the KV cache until the next call to llama_decode
// (it was sampled, pushed into the "cache_tokens", but not yet put in the context)
if ( slot . n_past > 0 & & slot . n_past = = ( int32_t ) slot . cache_tokens . size ( ) )
{
slot . n_past - = 1 ;
}
2024-02-29 21:42:11 +01:00
slot . n_prompt_tokens_processed = slot . n_prompt_tokens - slot . n_past ;
2023-10-22 21:53:08 +02:00
2024-01-27 14:38:05 +01:00
if ( slot . ga_n ! = 1 )
{
int ga_i = 0 ;
int32_t ga_n = slot . ga_n ;
int32_t ga_w = slot . ga_w ;
int32_t slot_npast = 0 ;
for ( int k = 0 ; k < slot . n_past ; + + k )
{
while ( slot_npast > = ga_i + ga_w ) {
const int bd = ( ga_w / ga_n ) * ( ga_n - 1 ) ;
slot_npast - = bd ;
ga_i + = ga_w / ga_n ;
}
slot_npast + + ;
}
slot . n_past_se = slot_npast ;
slot . ga_i = ga_i ;
}
2024-02-25 13:50:32 +01:00
LOG_INFO ( " slot progression " , {
{ " slot_id " , slot . id } ,
{ " task_id " , slot . task_id } ,
{ " n_past " , slot . n_past } ,
2024-02-29 21:42:11 +01:00
{ " n_prompt_tokens_processed " , slot . n_prompt_tokens_processed }
2024-02-25 13:50:32 +01:00
} ) ;
2023-10-22 21:53:08 +02:00
}
slot . cache_tokens = prompt_tokens ;
2024-02-29 21:42:11 +01:00
if ( slot . n_past = = slot . n_prompt_tokens & & slot . n_past > 0 )
2023-10-22 21:53:08 +02:00
{
// we have to evaluate at least 1 token to generate logits.
2024-02-25 13:50:32 +01:00
LOG_INFO ( " we have to evaluate at least 1 token to generate logits " , {
{ " slot_id " , slot . id } ,
{ " task_id " , slot . task_id }
} ) ;
2023-10-22 21:53:08 +02:00
slot . n_past - - ;
2024-01-27 14:38:05 +01:00
if ( slot . ga_i > 0 )
{
slot . n_past_se - - ;
}
2023-10-22 21:53:08 +02:00
}
2024-02-25 13:50:32 +01:00
int p0 = ( int ) system_tokens . size ( ) + slot . n_past ;
LOG_INFO ( " kv cache rm [p0, end) " , {
{ " slot_id " , slot . id } ,
{ " task_id " , slot . task_id } ,
{ " p0 " , p0 }
} ) ;
llama_kv_cache_seq_rm ( ctx , slot . id , p0 , - 1 ) ;
2024-02-09 11:49:49 +01:00
2023-10-22 21:53:08 +02:00
LOG_VERBOSE ( " prompt ingested " , {
2024-01-30 19:17:30 +01:00
{ " n_past " , slot . n_past } ,
{ " cached " , tokens_to_str ( ctx , slot . cache_tokens . cbegin ( ) , slot . cache_tokens . cbegin ( ) + slot . n_past ) } ,
2023-10-22 21:53:08 +02:00
{ " to_eval " , tokens_to_str ( ctx , slot . cache_tokens . cbegin ( ) + slot . n_past , slot . cache_tokens . cend ( ) ) } ,
} ) ;
const bool has_images = process_images ( slot ) ;
// process the prefix of first image
2023-11-17 03:14:37 +01:00
std : : vector < llama_token > prefix_tokens = has_images ? tokenize ( slot . images [ 0 ] . prefix_prompt , add_bos_token ) : prompt_tokens ;
2024-01-30 19:17:30 +01:00
2024-01-27 14:38:05 +01:00
int32_t slot_npast = slot . n_past_se > 0 ? slot . n_past_se : slot . n_past ;
2024-01-30 19:17:30 +01:00
int32_t ga_i = slot . ga_i ;
2024-01-27 14:38:05 +01:00
int32_t ga_n = slot . ga_n ;
int32_t ga_w = slot . ga_w ;
2024-01-30 19:17:30 +01:00
2023-10-22 21:53:08 +02:00
for ( ; slot . n_past < ( int ) prefix_tokens . size ( ) ; + + slot . n_past )
{
2024-01-27 14:38:05 +01:00
if ( slot . ga_n ! = 1 )
{
while ( slot_npast > = ga_i + ga_w ) {
const int bd = ( ga_w / ga_n ) * ( ga_n - 1 ) ;
slot_npast - = bd ;
ga_i + = ga_w / ga_n ;
}
}
llama_batch_add ( batch , prefix_tokens [ slot . n_past ] , system_tokens . size ( ) + slot_npast , { slot . id } , false ) ;
2024-01-30 19:17:30 +01:00
slot_npast + + ;
2023-10-22 21:53:08 +02:00
}
if ( has_images & & ! ingest_images ( slot , n_batch ) )
{
2024-02-25 13:50:32 +01:00
LOG_ERROR ( " failed processing images " , {
2024-02-29 21:42:11 +01:00
{ " slot_id " , slot . id } ,
{ " task_id " , slot . task_id } ,
2024-02-25 13:50:32 +01:00
} ) ;
// FIXME @phymbert: to be properly tested
// early returning without changing the slot state will block the slot for ever
// no one at the moment is checking the return value
2023-10-22 21:53:08 +02:00
return false ;
}
// extract the logits only for the last token
if ( batch . n_tokens > 0 )
{
batch . logits [ batch . n_tokens - 1 ] = true ;
}
slot . n_decoded = 0 ;
slot . i_batch = batch . n_tokens - 1 ;
}
}
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
}
2023-05-21 19:51:18 +02:00
2023-10-22 21:53:08 +02:00
if ( batch . n_tokens = = 0 )
2023-07-05 22:51:13 +02:00
{
2023-10-22 21:53:08 +02:00
all_slots_are_idle = true ;
return true ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
}
2023-05-21 19:51:18 +02:00
2023-10-22 21:53:08 +02:00
for ( int32_t i = 0 ; i < ( int32_t ) batch . n_tokens ; i + = n_batch )
{
const int32_t n_tokens = std : : min ( n_batch , ( int32_t ) ( batch . n_tokens - i ) ) ;
2024-01-27 14:38:05 +01:00
for ( auto & slot : slots )
{
if ( slot . ga_n ! = 1 )
{
// context extension via Self-Extend
while ( slot . n_past_se > = slot . ga_i + slot . ga_w )
{
const int ib = ( slot . ga_n * slot . ga_i ) / slot . ga_w ;
const int bd = ( slot . ga_w / slot . ga_n ) * ( slot . ga_n - 1 ) ;
const int dd = ( slot . ga_w / slot . ga_n ) - ib * bd - slot . ga_w ;
LOG_TEE ( " \n " ) ;
LOG_TEE ( " shift: [%6d, %6d] + %6d -> [%6d, %6d] \n " , slot . ga_i , slot . n_past_se , ib * bd , slot . ga_i + ib * bd , slot . n_past_se + ib * bd ) ;
LOG_TEE ( " div: [%6d, %6d] / %6d -> [%6d, %6d] \n " , slot . ga_i + ib * bd , slot . ga_i + ib * bd + slot . ga_w , slot . ga_n , ( slot . ga_i + ib * bd ) / slot . ga_n , ( slot . ga_i + ib * bd + slot . ga_w ) / slot . ga_n ) ;
LOG_TEE ( " shift: [%6d, %6d] + %6d -> [%6d, %6d] \n " , slot . ga_i + ib * bd + slot . ga_w , slot . n_past_se + ib * bd , dd , slot . ga_i + ib * bd + slot . ga_w + dd , slot . n_past_se + ib * bd + dd ) ;
2024-02-25 21:12:24 +01:00
llama_kv_cache_seq_add ( ctx , slot . id , slot . ga_i , slot . n_past_se , ib * bd ) ;
2024-01-27 14:38:05 +01:00
llama_kv_cache_seq_div ( ctx , slot . id , slot . ga_i + ib * bd , slot . ga_i + ib * bd + slot . ga_w , slot . ga_n ) ;
2024-02-25 21:12:24 +01:00
llama_kv_cache_seq_add ( ctx , slot . id , slot . ga_i + ib * bd + slot . ga_w , slot . n_past_se + ib * bd , dd ) ;
2024-01-27 14:38:05 +01:00
slot . n_past_se - = bd ;
slot . ga_i + = slot . ga_w / slot . ga_n ;
LOG_TEE ( " \n n_past_old = %d, n_past = %d, ga_i = %d \n \n " , slot . n_past_se + bd , slot . n_past_se , slot . ga_i ) ;
}
slot . n_past_se + = n_tokens ;
}
}
2024-01-30 19:17:30 +01:00
2023-10-22 21:53:08 +02:00
llama_batch batch_view =
{
n_tokens ,
batch . token + i ,
nullptr ,
batch . pos + i ,
batch . n_seq_id + i ,
batch . seq_id + i ,
batch . logits + i ,
0 , 0 , 0 , // unused
} ;
2023-05-21 19:51:18 +02:00
2023-10-22 21:53:08 +02:00
const int ret = llama_decode ( ctx , batch_view ) ;
2024-01-27 14:38:05 +01:00
2023-10-22 21:53:08 +02:00
if ( ret ! = 0 )
{
if ( n_batch = = 1 | | ret < 0 )
{
// if you get here, it means the KV cache is full - try increasing it via the context size
LOG_TEE ( " %s : failed to decode the batch, n_batch = %d, ret = %d \n " , __func__ , n_batch , ret ) ;
return false ;
}
2023-06-20 00:12:39 +02:00
2023-10-22 21:53:08 +02:00
LOG_TEE ( " %s : failed to find free space in the KV cache, retrying with smaller n_batch = %d \n " , __func__ , n_batch / 2 ) ;
// retry with half the batch size to try to find a free slot in the KV cache
n_batch / = 2 ;
i - = n_batch ;
continue ;
}
for ( auto & slot : slots )
{
if ( slot . i_batch < ( int ) i | | slot . i_batch > = ( int ) ( i + n_tokens ) )
{
continue ;
}
// prompt evaluated for embedding
2023-11-01 10:28:28 +01:00
if ( slot . embedding )
2023-10-22 21:53:08 +02:00
{
send_embedding ( slot ) ;
slot . release ( ) ;
slot . i_batch = - 1 ;
2024-02-24 19:16:04 +01:00
continue ;
2023-10-22 21:53:08 +02:00
}
completion_token_output result ;
const llama_token id = llama_sampling_sample ( slot . ctx_sampling , ctx , NULL , slot . i_batch - i ) ;
llama_sampling_accept ( slot . ctx_sampling , ctx , id , true ) ;
2024-01-07 07:45:26 +01:00
slot . n_decoded + = 1 ;
2023-10-22 21:53:08 +02:00
if ( slot . n_decoded = = 1 )
{
slot . t_start_genereration = ggml_time_us ( ) ;
slot . t_prompt_processing = ( slot . t_start_genereration - slot . t_start_process_prompt ) / 1e3 ;
2024-02-25 13:49:43 +01:00
metrics . on_prompt_eval ( slot ) ;
2023-10-22 21:53:08 +02:00
}
llama_token_data_array cur_p = { slot . ctx_sampling - > cur . data ( ) , slot . ctx_sampling - > cur . size ( ) , false } ;
result . tok = id ;
const int32_t n_probs = slot . sparams . n_probs ;
if ( slot . sparams . temp < = 0 & & n_probs > 0 )
{
// for llama_sample_token_greedy we need to sort candidates
llama_sample_softmax ( ctx , & cur_p ) ;
}
for ( size_t i = 0 ; i < std : : min ( cur_p . size , ( size_t ) n_probs ) ; + + i )
{
result . probs . push_back ( { cur_p . data [ i ] . id , cur_p . data [ i ] . p } ) ;
}
if ( ! process_token ( result , slot ) )
{
slot . release ( ) ;
slot . print_timings ( ) ;
2023-10-24 22:08:20 +02:00
send_final_response ( slot ) ;
2024-02-25 13:49:43 +01:00
metrics . on_prediction ( slot ) ;
2023-10-22 21:53:08 +02:00
}
slot . i_batch = - 1 ;
}
2023-06-20 00:12:39 +02:00
}
2024-02-25 13:50:32 +01:00
LOG_VERBOSE ( " slots updated " , { } ) ;
2023-10-22 21:53:08 +02:00
return true ;
2023-06-20 00:12:39 +02:00
}
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
} ;
2023-07-05 22:51:13 +02:00
static void server_print_usage ( const char * argv0 , const gpt_params & params ,
const server_params & sparams )
{
2023-09-05 21:10:27 +02:00
printf ( " usage: %s [options] \n " , argv0 ) ;
printf ( " \n " ) ;
printf ( " options: \n " ) ;
2023-10-24 22:10:43 +02:00
printf ( " -h, --help show this help message and exit \n " ) ;
printf ( " -v, --verbose verbose output (default: %s) \n " , server_verbose ? " enabled " : " disabled " ) ;
2023-11-01 23:04:33 +01:00
printf ( " -t N, --threads N number of threads to use during computation (default: %d) \n " , params . n_threads ) ;
2023-10-24 22:10:43 +02:00
printf ( " -tb N, --threads-batch N number of threads to use during batch and prompt processing (default: same as --threads) \n " ) ;
2024-03-01 10:08:08 +01:00
printf ( " --threads-http N number of threads in the http server pool to process requests (default: hardware concurrency) \n " ) ;
2023-11-01 23:04:33 +01:00
printf ( " -c N, --ctx-size N size of the prompt context (default: %d) \n " , params . n_ctx ) ;
printf ( " --rope-scaling {none,linear,yarn} \n " ) ;
printf ( " RoPE frequency scaling method, defaults to linear unless specified by the model \n " ) ;
2023-10-24 22:10:43 +02:00
printf ( " --rope-freq-base N RoPE base frequency (default: loaded from model) \n " ) ;
2023-11-01 23:04:33 +01:00
printf ( " --rope-freq-scale N RoPE frequency scaling factor, expands context by a factor of 1/N \n " ) ;
printf ( " --yarn-ext-factor N YaRN: extrapolation mix factor (default: 1.0, 0.0 = full interpolation) \n " ) ;
printf ( " --yarn-attn-factor N YaRN: scale sqrt(t) or attention magnitude (default: 1.0) \n " ) ;
printf ( " --yarn-beta-slow N YaRN: high correction dim or alpha (default: %.1f) \n " , params . yarn_beta_slow ) ;
printf ( " --yarn-beta-fast N YaRN: low correction dim or beta (default: %.1f) \n " , params . yarn_beta_fast ) ;
printf ( " -b N, --batch-size N batch size for prompt processing (default: %d) \n " , params . n_batch ) ;
2023-10-24 22:10:43 +02:00
printf ( " --memory-f32 use f32 instead of f16 for memory key+value (default: disabled) \n " ) ;
printf ( " not recommended: doubles context memory required and no measurable increase in quality \n " ) ;
2024-01-31 16:30:17 +01:00
if ( llama_supports_mlock ( ) )
2023-07-05 22:51:13 +02:00
{
2024-01-30 19:17:30 +01:00
printf ( " --mlock force system to keep model in RAM rather than swapping or compressing \n " ) ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
}
2024-01-31 16:30:17 +01:00
if ( llama_supports_mmap ( ) )
2023-07-05 22:51:13 +02:00
{
2024-01-30 19:17:30 +01:00
printf ( " --no-mmap do not memory-map model (slower load but may reduce pageouts if not using mlock) \n " ) ;
2023-05-21 19:51:18 +02:00
}
2024-02-16 10:31:07 +01:00
printf ( " --numa TYPE attempt optimizations that help on some NUMA systems \n " ) ;
printf ( " - distribute: spread execution evenly over all nodes \n " ) ;
printf ( " - isolate: only spawn threads on CPUs on the node that execution started on \n " ) ;
printf ( " - numactl: use the CPU map provided my numactl \n " ) ;
2024-01-31 16:30:17 +01:00
if ( llama_supports_gpu_offload ( ) ) {
printf ( " -ngl N, --n-gpu-layers N \n " ) ;
printf ( " number of layers to store in VRAM \n " ) ;
printf ( " -sm SPLIT_MODE, --split-mode SPLIT_MODE \n " ) ;
printf ( " how to split the model across multiple GPUs, one of: \n " ) ;
printf ( " - none: use one GPU only \n " ) ;
printf ( " - layer (default): split layers and KV across GPUs \n " ) ;
printf ( " - row: split rows across GPUs \n " ) ;
printf ( " -ts SPLIT --tensor-split SPLIT \n " ) ;
printf ( " fraction of the model to offload to each GPU, comma-separated list of proportions, e.g. 3,1 \n " ) ;
printf ( " -mg i, --main-gpu i the GPU to use for the model (with split-mode = none), \n " ) ;
printf ( " or for intermediate results and KV (with split-mode = row) \n " ) ;
}
2023-09-05 21:10:27 +02:00
printf ( " -m FNAME, --model FNAME \n " ) ;
2024-01-30 19:17:30 +01:00
printf ( " model path (default: %s) \n " , params . model . c_str ( ) ) ;
2023-09-05 21:10:27 +02:00
printf ( " -a ALIAS, --alias ALIAS \n " ) ;
2024-01-30 19:17:30 +01:00
printf ( " set an alias for the model, will be added as `model` field in completion response \n " ) ;
printf ( " --lora FNAME apply LoRA adapter (implies --no-mmap) \n " ) ;
printf ( " --lora-base FNAME optional model to use as a base for the layers modified by the LoRA adapter \n " ) ;
printf ( " --host ip address to listen (default (default: %s) \n " , sparams . hostname . c_str ( ) ) ;
printf ( " --port PORT port to listen (default (default: %d) \n " , sparams . port ) ;
printf ( " --path PUBLIC_PATH path from which to serve static files (default %s) \n " , sparams . public_path . c_str ( ) ) ;
printf ( " --api-key API_KEY optional api key to enhance server security. If set, requests must include this key for access. \n " ) ;
printf ( " --api-key-file FNAME path to file containing api keys delimited by new lines. If set, requests must include one of the keys for access. \n " ) ;
printf ( " -to N, --timeout N server read/write timeout in seconds (default: %d) \n " , sparams . read_timeout ) ;
printf ( " --embedding enable embedding vector output (default: %s) \n " , params . embedding ? " enabled " : " disabled " ) ;
printf ( " -np N, --parallel N number of slots for process requests (default: %d) \n " , params . n_parallel ) ;
printf ( " -cb, --cont-batching enable continuous batching (a.k.a dynamic batching) (default: disabled) \n " ) ;
printf ( " -spf FNAME, --system-prompt-file FNAME \n " ) ;
printf ( " set a file to load a system prompt (initial prompt of all slots), this is useful for chat applications. \n " ) ;
2024-02-23 20:31:54 +01:00
printf ( " -ctk TYPE, --cache-type-k TYPE \n " ) ;
printf ( " KV cache data type for K (default: f16) \n " ) ;
printf ( " -ctv TYPE, --cache-type-v TYPE \n " ) ;
printf ( " KV cache data type for V (default: f16) \n " ) ;
2024-01-30 19:17:30 +01:00
printf ( " --mmproj MMPROJ_FILE path to a multimodal projector file for LLaVA. \n " ) ;
2024-02-25 13:50:32 +01:00
printf ( " --log-format log output format: json or text (default: json) \n " ) ;
2024-01-30 19:17:30 +01:00
printf ( " --log-disable disables logging to a file. \n " ) ;
2024-02-18 18:39:57 +01:00
printf ( " --slots-endpoint-disable disables slots monitoring endpoint. \n " ) ;
2024-02-25 13:49:43 +01:00
printf ( " --metrics enable prometheus compatible metrics endpoint (default: %s). \n " , sparams . metrics_endpoint ? " enabled " : " disabled " ) ;
2023-09-05 21:10:27 +02:00
printf ( " \n " ) ;
2024-02-18 17:30:09 +01:00
printf ( " -n, --n-predict maximum tokens to predict (default: %d) \n " , params . n_predict ) ;
2024-01-02 11:38:15 +01:00
printf ( " --override-kv KEY=TYPE:VALUE \n " ) ;
2024-01-30 19:17:30 +01:00
printf ( " advanced option to override model metadata by key. may be specified multiple times. \n " ) ;
printf ( " types: int, float, bool. example: --override-kv tokenizer.ggml.add_bos_token=bool:false \n " ) ;
2024-03-01 08:59:43 +01:00
printf ( " -gan N, --grp-attn-n N set the group attention factor to extend context size through self-extend(default: 1=disabled), used together with group attention width `--grp-attn-w` \n " ) ;
printf ( " -gaw N, --grp-attn-w N set the group attention width to extend context size through self-extend(default: 512), used together with group attention factor `--grp-attn-n` \n " ) ;
2024-02-20 15:58:27 +01:00
printf ( " --chat-template JINJA_TEMPLATE \n " ) ;
printf ( " set custom jinja chat template (default: template taken from model's metadata) \n " ) ;
printf ( " Note: only commonly used templates are accepted, since we don't have jinja parser \n " ) ;
2024-01-02 11:38:15 +01:00
printf ( " \n " ) ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
}
2023-07-05 22:51:13 +02:00
static void server_params_parse ( int argc , char * * argv , server_params & sparams ,
2023-10-22 21:53:08 +02:00
gpt_params & params , llama_server_context & llama )
2023-07-05 22:51:13 +02:00
{
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
gpt_params default_params ;
server_params default_sparams ;
std : : string arg ;
bool invalid_param = false ;
2023-07-05 22:51:13 +02:00
for ( int i = 1 ; i < argc ; i + + )
{
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
arg = argv [ i ] ;
2023-07-05 22:51:13 +02:00
if ( arg = = " --port " )
{
if ( + + i > = argc )
{
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
invalid_param = true ;
break ;
}
sparams . port = std : : stoi ( argv [ i ] ) ;
2023-07-05 22:51:13 +02:00
}
else if ( arg = = " --host " )
{
if ( + + i > = argc )
{
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
invalid_param = true ;
break ;
}
sparams . hostname = argv [ i ] ;
2023-07-05 22:51:13 +02:00
}
else if ( arg = = " --path " )
{
if ( + + i > = argc )
{
2023-07-04 16:05:27 +02:00
invalid_param = true ;
break ;
}
sparams . public_path = argv [ i ] ;
2023-07-05 22:51:13 +02:00
}
2023-12-15 12:49:01 +01:00
else if ( arg = = " --api-key " )
{
if ( + + i > = argc )
{
invalid_param = true ;
break ;
}
2024-02-03 12:23:37 +01:00
sparams . api_keys . emplace_back ( argv [ i ] ) ;
2024-01-11 18:51:17 +01:00
}
else if ( arg = = " --api-key-file " )
{
if ( + + i > = argc )
{
invalid_param = true ;
break ;
}
std : : ifstream key_file ( argv [ i ] ) ;
if ( ! key_file ) {
fprintf ( stderr , " error: failed to open file '%s' \n " , argv [ i ] ) ;
invalid_param = true ;
break ;
}
std : : string key ;
while ( std : : getline ( key_file , key ) ) {
if ( key . size ( ) > 0 ) {
sparams . api_keys . push_back ( key ) ;
}
}
key_file . close ( ) ;
2023-12-15 12:49:01 +01:00
}
2023-07-05 22:51:13 +02:00
else if ( arg = = " --timeout " | | arg = = " -to " )
{
if ( + + i > = argc )
{
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
invalid_param = true ;
break ;
}
sparams . read_timeout = std : : stoi ( argv [ i ] ) ;
sparams . write_timeout = std : : stoi ( argv [ i ] ) ;
2023-07-05 22:51:13 +02:00
}
else if ( arg = = " -m " | | arg = = " --model " )
{
if ( + + i > = argc )
{
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
invalid_param = true ;
break ;
}
params . model = argv [ i ] ;
2023-07-05 22:51:13 +02:00
}
else if ( arg = = " -a " | | arg = = " --alias " )
{
if ( + + i > = argc )
{
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
invalid_param = true ;
break ;
}
params . model_alias = argv [ i ] ;
2023-07-05 22:51:13 +02:00
}
else if ( arg = = " -h " | | arg = = " --help " )
{
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
server_print_usage ( argv [ 0 ] , default_params , default_sparams ) ;
exit ( 0 ) ;
2023-07-05 22:51:13 +02:00
}
else if ( arg = = " -c " | | arg = = " --ctx-size " | | arg = = " --ctx_size " )
{
if ( + + i > = argc )
{
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
invalid_param = true ;
break ;
}
params . n_ctx = std : : stoi ( argv [ i ] ) ;
2023-07-05 22:51:13 +02:00
}
2023-11-01 23:04:33 +01:00
else if ( arg = = " --rope-scaling " )
{
if ( + + i > = argc )
{
invalid_param = true ;
break ;
}
std : : string value ( argv [ i ] ) ;
2024-02-25 11:09:09 +01:00
/**/ if ( value = = " none " ) { params . rope_scaling_type = LLAMA_ROPE_SCALING_TYPE_NONE ; }
else if ( value = = " linear " ) { params . rope_scaling_type = LLAMA_ROPE_SCALING_TYPE_LINEAR ; }
else if ( value = = " yarn " ) { params . rope_scaling_type = LLAMA_ROPE_SCALING_TYPE_YARN ; }
2023-11-01 23:04:33 +01:00
else { invalid_param = true ; break ; }
}
llama : add custom RoPE (#2054)
* Implement customizable RoPE
The original RoPE has pre-defined parameters
theta_i = 10000^(−2(i−1)/d), for i in [1, 2, ..., d/2]
Our customizable RoPE, ggml_rope_custom_inplace, uses
theta_i = scale * base^(−2(i−1)/d), for i in [1, 2, ..., d/2]
with the default matches the original
scale = 1.0
base = 10000
The new command line arguments
--rope-freq-base
--rope-freq-scale
set the two new RoPE parameter.
Recent researches show changing these two parameters extends the context limit with minimal loss.
1. Extending Context to 8K
kaiokendev
https://kaiokendev.github.io/til#extending-context-to-8k
2. Extending Context Window of Large Language Models via Positional Interpolation
Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian
https://arxiv.org/abs/2306.15595
3. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation.
https://www.reddit.com/user/bloc97
https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/
For the bold, try adding the following command line parameters to your favorite model:
-c 16384 --rope-freq-base 80000 --rope-freq-scale 0.5
* ggml-metal: fix custom rope
* common: fix argument names in help
* llama: increase MEM_REQ_EVAL for MODEL_3B
It avoids crashing for quantized weights on CPU.
Better ways to calculate the required buffer size would be better.
* llama: make MEM_REQ_EVAL depend on n_ctx
* server: use proper Content-Type in curl examples
Without the header Content-Type: application/json, curl will POST with
Content-Type: application/x-www-form-urlencoded
Though our simple server doesn't care, the httplib.h used has a limit
with CPPHTTPLIB_FORM_URL_ENCODED_PAYLOAD_MAX_LENGTH 8192
With Content-Type: application/json, we can send large json data.
* style : minor fixes, mostly indentations
* ggml : fix asserts
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-15 12:34:16 +02:00
else if ( arg = = " --rope-freq-base " )
{
2023-07-23 22:31:17 +02:00
if ( + + i > = argc )
{
llama : add custom RoPE (#2054)
* Implement customizable RoPE
The original RoPE has pre-defined parameters
theta_i = 10000^(−2(i−1)/d), for i in [1, 2, ..., d/2]
Our customizable RoPE, ggml_rope_custom_inplace, uses
theta_i = scale * base^(−2(i−1)/d), for i in [1, 2, ..., d/2]
with the default matches the original
scale = 1.0
base = 10000
The new command line arguments
--rope-freq-base
--rope-freq-scale
set the two new RoPE parameter.
Recent researches show changing these two parameters extends the context limit with minimal loss.
1. Extending Context to 8K
kaiokendev
https://kaiokendev.github.io/til#extending-context-to-8k
2. Extending Context Window of Large Language Models via Positional Interpolation
Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian
https://arxiv.org/abs/2306.15595
3. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation.
https://www.reddit.com/user/bloc97
https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/
For the bold, try adding the following command line parameters to your favorite model:
-c 16384 --rope-freq-base 80000 --rope-freq-scale 0.5
* ggml-metal: fix custom rope
* common: fix argument names in help
* llama: increase MEM_REQ_EVAL for MODEL_3B
It avoids crashing for quantized weights on CPU.
Better ways to calculate the required buffer size would be better.
* llama: make MEM_REQ_EVAL depend on n_ctx
* server: use proper Content-Type in curl examples
Without the header Content-Type: application/json, curl will POST with
Content-Type: application/x-www-form-urlencoded
Though our simple server doesn't care, the httplib.h used has a limit
with CPPHTTPLIB_FORM_URL_ENCODED_PAYLOAD_MAX_LENGTH 8192
With Content-Type: application/json, we can send large json data.
* style : minor fixes, mostly indentations
* ggml : fix asserts
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-15 12:34:16 +02:00
invalid_param = true ;
break ;
}
params . rope_freq_base = std : : stof ( argv [ i ] ) ;
}
else if ( arg = = " --rope-freq-scale " )
{
2023-07-23 22:31:17 +02:00
if ( + + i > = argc )
{
llama : add custom RoPE (#2054)
* Implement customizable RoPE
The original RoPE has pre-defined parameters
theta_i = 10000^(−2(i−1)/d), for i in [1, 2, ..., d/2]
Our customizable RoPE, ggml_rope_custom_inplace, uses
theta_i = scale * base^(−2(i−1)/d), for i in [1, 2, ..., d/2]
with the default matches the original
scale = 1.0
base = 10000
The new command line arguments
--rope-freq-base
--rope-freq-scale
set the two new RoPE parameter.
Recent researches show changing these two parameters extends the context limit with minimal loss.
1. Extending Context to 8K
kaiokendev
https://kaiokendev.github.io/til#extending-context-to-8k
2. Extending Context Window of Large Language Models via Positional Interpolation
Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian
https://arxiv.org/abs/2306.15595
3. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation.
https://www.reddit.com/user/bloc97
https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/
For the bold, try adding the following command line parameters to your favorite model:
-c 16384 --rope-freq-base 80000 --rope-freq-scale 0.5
* ggml-metal: fix custom rope
* common: fix argument names in help
* llama: increase MEM_REQ_EVAL for MODEL_3B
It avoids crashing for quantized weights on CPU.
Better ways to calculate the required buffer size would be better.
* llama: make MEM_REQ_EVAL depend on n_ctx
* server: use proper Content-Type in curl examples
Without the header Content-Type: application/json, curl will POST with
Content-Type: application/x-www-form-urlencoded
Though our simple server doesn't care, the httplib.h used has a limit
with CPPHTTPLIB_FORM_URL_ENCODED_PAYLOAD_MAX_LENGTH 8192
With Content-Type: application/json, we can send large json data.
* style : minor fixes, mostly indentations
* ggml : fix asserts
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-15 12:34:16 +02:00
invalid_param = true ;
break ;
}
params . rope_freq_scale = std : : stof ( argv [ i ] ) ;
}
2023-11-01 23:04:33 +01:00
else if ( arg = = " --yarn-ext-factor " )
{
if ( + + i > = argc ) {
invalid_param = true ;
break ;
}
params . yarn_ext_factor = std : : stof ( argv [ i ] ) ;
}
else if ( arg = = " --yarn-attn-factor " )
{
if ( + + i > = argc ) {
invalid_param = true ;
break ;
}
params . yarn_attn_factor = std : : stof ( argv [ i ] ) ;
}
else if ( arg = = " --yarn-beta-fast " )
{
if ( + + i > = argc ) {
invalid_param = true ;
break ;
}
params . yarn_beta_fast = std : : stof ( argv [ i ] ) ;
}
else if ( arg = = " --yarn-beta-slow " )
{
if ( + + i > = argc ) {
invalid_param = true ;
break ;
}
params . yarn_beta_slow = std : : stof ( argv [ i ] ) ;
}
2023-07-05 22:51:13 +02:00
else if ( arg = = " --threads " | | arg = = " -t " )
{
if ( + + i > = argc )
{
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
invalid_param = true ;
break ;
}
params . n_threads = std : : stoi ( argv [ i ] ) ;
2023-07-05 22:51:13 +02:00
}
2024-01-27 14:38:05 +01:00
else if ( arg = = " --grp-attn-n " | | arg = = " -gan " )
{
if ( + + i > = argc ) {
invalid_param = true ;
break ;
}
params . grp_attn_n = std : : stoi ( argv [ i ] ) ;
}
else if ( arg = = " --grp-attn-w " | | arg = = " -gaw " )
{
if ( + + i > = argc )
{
invalid_param = true ;
break ;
}
params . grp_attn_w = std : : stoi ( argv [ i ] ) ;
}
2023-10-24 22:10:43 +02:00
else if ( arg = = " --threads-batch " | | arg = = " -tb " )
{
if ( + + i > = argc )
{
invalid_param = true ;
break ;
}
params . n_threads_batch = std : : stoi ( argv [ i ] ) ;
}
2024-03-01 10:08:08 +01:00
else if ( arg = = " --threads-http " )
{
if ( + + i > = argc )
{
invalid_param = true ;
break ;
}
sparams . n_threads_http = std : : stoi ( argv [ i ] ) ;
}
2023-07-05 22:51:13 +02:00
else if ( arg = = " -b " | | arg = = " --batch-size " )
{
if ( + + i > = argc )
{
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
invalid_param = true ;
break ;
}
params . n_batch = std : : stoi ( argv [ i ] ) ;
params . n_batch = std : : min ( 512 , params . n_batch ) ;
2023-07-05 22:51:13 +02:00
}
else if ( arg = = " --gpu-layers " | | arg = = " -ngl " | | arg = = " --n-gpu-layers " )
{
if ( + + i > = argc )
{
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
invalid_param = true ;
break ;
}
2024-01-31 16:30:17 +01:00
if ( llama_supports_gpu_offload ( ) ) {
params . n_gpu_layers = std : : stoi ( argv [ i ] ) ;
} else {
LOG_WARNING ( " Not compiled with GPU offload support, --n-gpu-layers option will be ignored. "
2023-07-05 22:51:13 +02:00
" See main README.md for information on enabling GPU BLAS support " ,
{ { " n_gpu_layers " , params . n_gpu_layers } } ) ;
2024-01-31 16:30:17 +01:00
}
2024-01-12 20:07:38 +01:00
}
else if ( arg = = " --split-mode " | | arg = = " -sm " )
{
if ( + + i > = argc ) {
invalid_param = true ;
break ;
}
std : : string arg_next = argv [ i ] ;
if ( arg_next = = " none " )
{
2024-02-25 11:09:09 +01:00
params . split_mode = LLAMA_SPLIT_MODE_NONE ;
2024-01-12 20:07:38 +01:00
}
else if ( arg_next = = " layer " )
{
2024-02-25 11:09:09 +01:00
params . split_mode = LLAMA_SPLIT_MODE_LAYER ;
2024-01-12 20:07:38 +01:00
}
else if ( arg_next = = " row " )
{
2024-02-25 11:09:09 +01:00
params . split_mode = LLAMA_SPLIT_MODE_ROW ;
2024-01-12 20:07:38 +01:00
}
else {
invalid_param = true ;
break ;
}
# ifndef GGML_USE_CUBLAS
fprintf ( stderr , " warning: llama.cpp was compiled without cuBLAS. Setting the split mode has no effect. \n " ) ;
# endif // GGML_USE_CUBLAS
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
}
2023-07-05 22:51:13 +02:00
else if ( arg = = " --tensor-split " | | arg = = " -ts " )
{
if ( + + i > = argc )
{
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
invalid_param = true ;
break ;
}
ggml : add unified SYCL backend for Intel GPUs (#2690)
* first update for migration
* update init_cublas
* add debug functio, commit all help code
* step 1
* step 2
* step3 add fp16, slower 31->28
* add GGML_LIST_DEVICE function
* step 5 format device and print
* step6, enhance error check, remove CUDA macro, enhance device id to fix none-zero id issue
* support main device is non-zero
* step7 add debug for code path, rm log
* step 8, rename all macro & func from cuda by sycl
* fix error of select non-zero device, format device list
* ren ggml-sycl.hpp -> ggml-sycl.h
* clear CMAKE to rm unused lib and options
* correct queue: rm dtct:get_queue
* add print tensor function to debug
* fix error: wrong result in 658746bb26702e50f2c59c0e4ada8e9da6010481
* summary dpct definition in one header file to replace folder:dpct
* refactor device log
* mv dpct definition from folder dpct to ggml-sycl.h
* update readme, refactor build script
* fix build with sycl
* set nthread=1 when sycl, increase performance
* add run script, comment debug code
* add ls-sycl-device tool
* add ls-sycl-device, rm unused files
* rm rear space
* dos2unix
* Update README_sycl.md
* fix return type
* remove sycl version from include path
* restore rm code to fix hang issue
* add syc and link for sycl readme
* rm original sycl code before refactor
* fix code err
* add know issue for pvc hang issue
* enable SYCL_F16 support
* align pr4766
* check for sycl blas, better performance
* cleanup 1
* remove extra endif
* add build&run script, clean CMakefile, update guide by review comments
* rename macro to intel hardware
* editor config format
* format fixes
* format fixes
* editor format fix
* Remove unused headers
* skip build sycl tool for other code path
* replace tab by space
* fix blas matmul function
* fix mac build
* restore hip dependency
* fix conflict
* ren as review comments
* mv internal function to .cpp file
* export funciton print_sycl_devices(), mv class dpct definition to source file
* update CI/action for sycl code, fix CI error of repeat/dup
* fix action ID format issue
* rm unused strategy
* enable llama_f16 in ci
* fix conflict
* fix build break on MacOS, due to CI of MacOS depend on external ggml, instead of internal ggml
* fix ci cases for unsupported data type
* revert unrelated changed in cuda cmake
remove useless nommq
fix typo of GGML_USE_CLBLAS_SYCL
* revert hip cmake changes
* fix indent
* add prefix in func name
* revert no mmq
* rm cpu blas duplicate
* fix no_new_line
* fix src1->type==F16 bug.
* pass batch offset for F16 src1
* fix batch error
* fix wrong code
* revert sycl checking in test-sampling
* pass void as arguments of ggml_backend_sycl_print_sycl_devices
* remove extra blank line in test-sampling
* revert setting n_threads in sycl
* implement std::isinf for icpx with fast math.
* Update ci/run.sh
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update examples/sycl/run-llama2.sh
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update examples/sycl/run-llama2.sh
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add copyright and MIT license declare
* update the cmd example
---------
Co-authored-by: jianyuzh <jianyu.zhang@intel.com>
Co-authored-by: luoyu-intel <yu.luo@intel.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-28 16:56:23 +01:00
# if defined(GGML_USE_CUBLAS) || defined(GGML_USE_SYCL)
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
std : : string arg_next = argv [ i ] ;
2023-06-06 21:33:23 +02:00
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
// split string by , and /
2023-07-05 22:51:13 +02:00
const std : : regex regex { R " ([,/]+) " } ;
std : : sregex_token_iterator it { arg_next . begin ( ) , arg_next . end ( ) , regex , - 1 } ;
std : : vector < std : : string > split_arg { it , { } } ;
2024-01-31 16:30:17 +01:00
GGML_ASSERT ( split_arg . size ( ) < = llama_max_devices ( ) ) ;
2023-06-06 21:33:23 +02:00
2024-01-31 16:30:17 +01:00
for ( size_t i_device = 0 ; i_device < llama_max_devices ( ) ; + + i_device )
2023-07-05 22:51:13 +02:00
{
if ( i_device < split_arg . size ( ) )
{
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
params . tensor_split [ i_device ] = std : : stof ( split_arg [ i_device ] ) ;
}
2023-07-05 22:51:13 +02:00
else
{
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
params . tensor_split [ i_device ] = 0.0f ;
}
}
2023-06-06 21:33:23 +02:00
# else
2023-07-31 15:44:35 +02:00
LOG_WARNING ( " llama.cpp was compiled without cuBLAS. It is not possible to set a tensor split. \n " , { } ) ;
2023-06-06 21:33:23 +02:00
# endif // GGML_USE_CUBLAS
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
}
2023-07-05 22:51:13 +02:00
else if ( arg = = " --main-gpu " | | arg = = " -mg " )
{
if ( + + i > = argc )
{
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
invalid_param = true ;
break ;
}
ggml : add unified SYCL backend for Intel GPUs (#2690)
* first update for migration
* update init_cublas
* add debug functio, commit all help code
* step 1
* step 2
* step3 add fp16, slower 31->28
* add GGML_LIST_DEVICE function
* step 5 format device and print
* step6, enhance error check, remove CUDA macro, enhance device id to fix none-zero id issue
* support main device is non-zero
* step7 add debug for code path, rm log
* step 8, rename all macro & func from cuda by sycl
* fix error of select non-zero device, format device list
* ren ggml-sycl.hpp -> ggml-sycl.h
* clear CMAKE to rm unused lib and options
* correct queue: rm dtct:get_queue
* add print tensor function to debug
* fix error: wrong result in 658746bb26702e50f2c59c0e4ada8e9da6010481
* summary dpct definition in one header file to replace folder:dpct
* refactor device log
* mv dpct definition from folder dpct to ggml-sycl.h
* update readme, refactor build script
* fix build with sycl
* set nthread=1 when sycl, increase performance
* add run script, comment debug code
* add ls-sycl-device tool
* add ls-sycl-device, rm unused files
* rm rear space
* dos2unix
* Update README_sycl.md
* fix return type
* remove sycl version from include path
* restore rm code to fix hang issue
* add syc and link for sycl readme
* rm original sycl code before refactor
* fix code err
* add know issue for pvc hang issue
* enable SYCL_F16 support
* align pr4766
* check for sycl blas, better performance
* cleanup 1
* remove extra endif
* add build&run script, clean CMakefile, update guide by review comments
* rename macro to intel hardware
* editor config format
* format fixes
* format fixes
* editor format fix
* Remove unused headers
* skip build sycl tool for other code path
* replace tab by space
* fix blas matmul function
* fix mac build
* restore hip dependency
* fix conflict
* ren as review comments
* mv internal function to .cpp file
* export funciton print_sycl_devices(), mv class dpct definition to source file
* update CI/action for sycl code, fix CI error of repeat/dup
* fix action ID format issue
* rm unused strategy
* enable llama_f16 in ci
* fix conflict
* fix build break on MacOS, due to CI of MacOS depend on external ggml, instead of internal ggml
* fix ci cases for unsupported data type
* revert unrelated changed in cuda cmake
remove useless nommq
fix typo of GGML_USE_CLBLAS_SYCL
* revert hip cmake changes
* fix indent
* add prefix in func name
* revert no mmq
* rm cpu blas duplicate
* fix no_new_line
* fix src1->type==F16 bug.
* pass batch offset for F16 src1
* fix batch error
* fix wrong code
* revert sycl checking in test-sampling
* pass void as arguments of ggml_backend_sycl_print_sycl_devices
* remove extra blank line in test-sampling
* revert setting n_threads in sycl
* implement std::isinf for icpx with fast math.
* Update ci/run.sh
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update examples/sycl/run-llama2.sh
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update examples/sycl/run-llama2.sh
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add copyright and MIT license declare
* update the cmd example
---------
Co-authored-by: jianyuzh <jianyu.zhang@intel.com>
Co-authored-by: luoyu-intel <yu.luo@intel.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-28 16:56:23 +01:00
# if defined(GGML_USE_CUBLAS) || defined(GGML_USE_SYCL)
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
params . main_gpu = std : : stoi ( argv [ i ] ) ;
2023-06-06 21:33:23 +02:00
# else
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
LOG_WARNING ( " llama.cpp was compiled without cuBLAS. It is not possible to set a main GPU. " , { } ) ;
2023-05-28 19:48:57 +02:00
# endif
2023-07-05 22:51:13 +02:00
}
else if ( arg = = " --lora " )
{
if ( + + i > = argc )
{
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
invalid_param = true ;
break ;
}
2024-02-03 12:23:37 +01:00
params . lora_adapter . emplace_back ( argv [ i ] , 1.0f ) ;
train : finetune LORA (#2632)
* fix track_max_mem in forward_batch_wo_cache_flash_attn_train
* remove unnecessary Adam(W) optimizer tensors.
reduces optimizer memory overhead from 7*modelsize to 2*modelsize.
additionally allows to optimize models with more than 2^31 parameters by replacing int with int64_t.
bumps training checkpoint file version, but old checkpoints can still be read.
new version with less tensors is saved.
* add gradient clipping to AdamW
* Fix reset of unused g->nodes and g->grads to NULL
* implement gradient checkpointing for training
reduces memory overhead from O(n_layer) to O(sqrt(n_layer))
as explained in readme of https://github.com/cybertronai/gradient-checkpointing
* remove unused compute buffer 3
* add and use function ggml_build_backward_expand to avoid stack overflows with large maximum number of nodes
GGML_API void ggml_build_backward_expand(struct ggml_context * ctx, struct ggml_cgraph * gf, struct ggml_cgraph * gb, bool keep);
* change AdamW decay parameter to work like the torch AdamW decay parameter
It is now relative to Adam learning rate `alpha*sched`.
Before that it was relative to `sched` only.
`alpha` being the maximum learning rate and `sched` being a scaling parameter in [0..1]
* change default AdamW weight decay parameter used in training to 0.1 as used in nanoGPT
* change default AdamW weight decay parameter defined in ggml to 0.0, making Adam default instead of AdamW
btw: the default weight decay parameter for torch.optim.AdamW is 0.01
* bug fixes for cross entropy loss
ggml_cross_entropy_loss: sums where not correctly added in workload of each thread
ggml_cross_entropy_loss_back: simplify backward process, reducing numerical issues
guard usage of exp f16 lookup in cross entropy by #define GGML_CROSS_ENTROPY_EXP_FP16
cross entropy loss is only used once during training, but it is quite sensitive to numerical errors introduced by exp-f16-lookup.
so exp-f16-lookup for cross entropy loss is disabled by default, trading better gradients for very slightly worse runtime performance.
* fix test-grad0 for cross_entropy_loss
the second argument to cross_entropy_loss must sum up to 1 for each row
* fix test-grad0 for soft_max
dont use only sum as aggregation, because sum of softmax is always 1 -> finite differences should not work
instead use sum(log(soft_max()*(1-eps)+eps)); use eps to avoid log(0)
* improve finite differences of test-grad0 by using double instead of float
* change cross_entropy_loss to output average over all rows
this helps keeping the loss and gradients in a sane range
* improve gradient checkpointing
sqrt(n_layers) is only the best checkpoint step when mem size of checkpoints and mem size of layers are equal.
since layers require more memory than the single-tensor-checkpoint we use, the optimal values are compute different:
```
given: n, u, v
objective: minimize(a*u+b*v) where a*b=n, a>0, b>0
b=n/a
minimize(a*u+v*n/a)
diff(a*u+v*n/a, a) = u - (v*n/a)/a
diff(a*u+v*n/a, a) == 0
u - (v*n/a)/a == 0
u == v*n/(a*a)
u*a*a = v*n
a*a = v*n/u
a = sqrt(n*v/u)
```
this change results in more checkpoints, requiring less layers to store between checkpoints, overall improving memory usage.
* disable gradient checkpointing debug output
* llama : fix rope usage in train-text-from-scratch after ChatGLM change
* add more training parameters:
--enable-restart N Only for Adam optimizer. Enable restarts of cos-decay
--disable-restart N Only for Adam optimizer. Disable restarts of cos-decay
--opt-past N Number of optimization iterations to track for delta convergence test. Disabled when zero.
--opt-delta N Maximum delta for delta convergence test. Disabled when <= zero.
--opt-max-no-improvement N Maximum number of optimization iterations with no improvement. Disabled when <= zero.
--adam-epsf N AdamW epsilon for convergence test. Disabled when <= zero.
--adam-min-alpha N Adam minimum learning rate alpha, usually 0.1 * alpha
* replace memcpy with reshape operation so that the graph is not cut at the input
this makes it possible to store other values into the input tensor and then simply recompute the graph without rebuilding it
* remove unused function argument from get_example_targets_batch
* measure and print total training time
* add optimization callback to ggml_opt_resume_g
this callback is called before each iteration with custom data and pointer to learning schedule parameter (only used in Adam(W)).
can be used for dynamic learning schedule and setting input data for batches before each iteration
* use optimization callback in training
allows dynamic learning schedule and different batch data for each iteration without relying on low n_iter and high n_examples parameters
reduces runtime by avoiding restart of optimization function and improves training convergence by providing a different batch for each iteration
* add minimum number of tensor dimensions to apply weight decay (default 2)
this allows to not apply weight decay to bias parameters
* rename training parameter cos-decay-alpha to cos-decay-min and clarify that adam-min-alpha also applies to warmup
* fix increase of model.train_samples and model.train_tokens
now that each optimizer iteration gets its own batch we need to multiply by number of opt iterations
* change sampling parameters for prediction after training to defaults of common.h
and clarify what is context for prediction and what are generated tokens
* tighten abs error bounds for cross_entropy_loss in test-grad0
* add conditional compilation of using F16 exp in flash attention
uncomment `// #define GGML_FLASH_ATTN_EXP_FP16` to enable usage of f16 exp in flash attention
* tighten abs error bounds for flash_attn in test-grad0
* tighten abs error bounds for sqrt in test-grad0
* remove out-commented vectorized code of opt_adam
the vectorized code might be bit faster for low number of parameters, but it had a big memory usage overhead
* ggml : update ggml_rms_norm_back with configurable eps
* llama training : fix ggml_rms_norm_back calls to pass configurable eps
* remove trailing whitespace
* add train function using automatic gradient checkpointing backward pass and allocator
* in train function replace add_inplace by regular add
because using add_inplace seems to result in different gradients
* don't use allocate hash_map on context
because the context has no_alloc=True when using memory allocator resulting in NULL data pointers
* correctly clone reshape and permute operations by also cloning tensor->nb values
* fix variable name and add missing type cast
* terminate recursive tensor cloning when reaching tensor without src tensors
* correctly clone view tensors by setting data pointers
without this the checkpointing would only work when being used together with memory allocator
* fix variable names
* swap arguments to commutative ops to be the same as in `forward_batch_wo_cache_flash_attn`
* add input tensors as checkpoints
so that recursive tensor cloning of gradient checkpointing terminates on input tensors
* fix variable name and add missing boolean negation
* make sure some tensors are not reallocated by inserting new temporary nodes depending on them:
output and parameter gradient tensors need to be available at the end of the graph execution
parameter gradient tensors also need to be available before the graph execution because they are set to zero before each optimizer iteration
checkpoint tensors are allocated all together to reduce memory allocator fragmentation
afterwards, in addition to the temporary nodes, we also need to reset the temporary leafs
* fix ASSERT to work with zero layers
* add training options whether to use allocator and/or unified training function
* integrate unified training function which may use memory allocator
the unified training function also supports arguments whether to use flash attention and/or gradient checkpointing
* format name of cloned tensors with " (clone)" suffix
* set names for tensors in unified train function for easier debugging
* allocate graph on context using ggml_new_graph
* remove handwritten training functions
* remove unused training parameters "use_scratch" and "use_unified"
* remove trailing whitespace
* remove unused train params: mem_compute1_gb & mem_compute2_gb
mem_compute_gb is used for compute when automatic memory allocator is not enabled, otherwise it can be very small to only hold the tensor definitions
mem_compute0_gb is used for automatic memory allocator (as long as measurement of max required size is not implemented)
* remove unused forward_batch function
* add debug asserts in ggml_allocr_alloc to some common pitfalls when using this function directly
* only use ggml_allocr_alloc when tensor has NULL data and is no view
* fix test when to create temporary backward graph
temporary backward graph is only necessary when using checkpointing
* fix memory "leak" in optimizers
each iteration a new cplan with new memory for work data was allocated.
now cplan creation only happens at the start of optimization, with each iteration reusing the cplan and its work data.
* reverse order of for loop in ggml_build_backward_expand to save memory when using gradient checkpointing and allocator
with this loop order gradient checkpointing with allocator on 16 layer model saves 13% memory; 2 layer memory it saves 2% memory.
the computation results are the same
* add API functions to access llama model tensors
* add stub example for finetuning, based on train-text-from-scratch
* move and remove code
* add API functions to access remaining model parameters:
mult, head and rot
* first draft for LORA finetune training
* remove const model and layer arguments in API functions for accessing model tensors
* bug fixes to make finetune compile
automatic allocator does not work yet
* add debug prints for training memory improvements
* fix names of lora tensors
* avoid stack overflow resulting from big ggml_cgraph
replace stack allocation and ggml_build_forward by ggml_new_graph in combination with ggml_build_forward_expand
* replace llama API functions to get model tensors by one function to get model tensor by name
LLAMA_API struct ggml_tensor * llama_get_model_tensor(struct llama_model * model, const char * name);
* remove unused call to not existing llama_get_layer_from_model
* implement ggml_compute_forward_out_prod_q_f32
* remove trailing whitespace
* add lora finetune support on quantized base model tensors
* add ggml_add_cast API function
this function works like ggml_add, but accepts a data type for the resulting tensor.
only supported for quantized src0 input.
* use ggml_add_cast in finetuning
lora-applied weights will now have data type F32, which improves gradients when finetuning quantized base models
* bug fix: actually use result type passed to ggml_add_cast
* make sure base model tensors data cannot be used in viewable operations
memory allocator would try to make lora application inplace on base model tensors.
since those are memory mapped this will result in memory access violations
* fix bug in ggml_out_prod which resulted in wrong n_dims of result tensors
* avoid keeping in memory ALL of the gradients
The problem here stems from ggml_graph_reset. This function is called in the optimization function, before each graph computation, to reset the gradients to zero. This required a unique memory slot for each gradient: allocating memory from a previosly freed memory location might lead to non-zero input gradients.
During ggml_compute_backward the gradients are build stepwise by adding or substracting new values, starting from a OP_NONE tensor which needs to contain zero-values. This requires the graph reset.
To avoid this I now remember in ggml_build_backward_expand the original OP_NONE gradient tensors in a hash table, which is passed to ggml_compute_backward. There instead of using add (or sub or similar) I test whether the existing gradient to be changed is a zero-valued-tensor by looking up its existence in the hash table. When it is such a zero-tensor it will not be modified, but replaced by the value to be added, otherwise the regular add (not inplace, allocator will take care of this) will be used. This way none of those zero-tensor values will be necessary in the final backward graph and more importantly they won't need a unique memory slot, just to make them zero.
* remove trailing whitespace
* remove debug prints and function to compute tensor data hash
* improve optimization iteration prints
* adjust maximal values to support finetuning 3B models
* change default finetune params lora_r and lora_alpha to match the n_rank parameters of 4
* bug fix: make sure finetune input gradient is allocated at begin and kept until end
* remove unnecessary src tensor from ggml_get_rows_back
we don't need data of src[2] for computation, only to setup the correct output shape.
remove dependency on src[2], so that allocator can work more freely.
the computational graph is still completely determined, because the output shape is naturally included.
this is similar to how ggml_reshape does it.
* remove unnecessary src tensor from ggml_repeat & ggml_repeat_back
we don't need data of src[1] for computation, only to setup the correct output shape.
remove dependency on src[1], so that allocator can work more freely.
the computational graph is still completely determined, because the output shape is naturally included
* resolve todo
allocator will only make it inplace when they are of the same type
* mixing multiple LORA adapters is now possible
pass more than one '--lora FNAME' argument to apply more than one LORA.
use '--lora-scaled FNAME S' when you want to specify a user-defined scale for an adapter.
* add option to save finetune output every N iterations
* also save latest finetune output with ITERATION="LATEST" and print where files are saved
saving with LATEST makes it easier to resume training from the latest checkpoint
the string "LATEST" can be configured with command line option "--fn-latest STR"
* update checkpoint train stats before saving via "--save-every"
* add command line option `--rank-wo N` for rank of wo tensor
* update finetune README
* fix dump_non_result_info_yaml to output multiple lora adapters
* bug fix: replace GGML_TYPE_SIZE[t] by ggml_type_size(t)
* replace llama_n_mult by llama_n_ff
* finetune bug fixes to compile with merged in code from master
* remove prediction related code to reduce duplicated code with main
use main instead
* reduce large memory overhead in train-text-from-scratch
all gradients had to be pinned so that graph_reset works correctly.
this is no longer necessary with the changes to ggml_compute_backward introduced in this PR.
* add comment explaining why finetune checkpoints are allocated in one block
* make default value of float member a float literal
* handle rms_norm and rope parameters the same as in train-text-from-scratch
* remove unused code
* remove vocab related code as it is unnecessary
* add LLM_KV_TRAINING_TYPE to train-text-from-scratch checkpoints
so that they can be differentiated from lora finetune checkpoints
* add gguf constants and load/save functions from train-text-from-scratch
* add load & save lora finetune checkpoints via gguf
* add python script to convert old finetune checkpoint files to gguf
* remove old checkpoint save & load code
* remove code to print data checksums which was used to verify correctness of new gguf code
* omit tokenization when training is disabled, only save llama lora adapter
training can be disabled by passing '-n 0' to finetune
* remove trailing whitespace
* update README.md
* implement ggml_compute_forward_repeat_f16
* avoid stack overflow of large cgraphs in test-grad0
* add ggml API functions ggml_unravel_index, ggml_get_i32_nd and its analogs for set and for f32
ggml_get_i32_1d, ggml_set_i32_1d, ggml_get_f32_1d, ggml_set_f32_1d now support non-contiguous tensors.
in case of non-contiguous tensor, the 1d index is unraveled into a multi index using ggml_unravel_index to be passed to '_nd' function equivalent.
this fixes a bug in test-grad0 which happens due to ggml_build_backward not building purely contiguous tensors anymore
* increase test-grad0 context mem size to accommodate for bigger cgraph
* add sanity check to ggml_compute_backward, asserting the correct shape of gradients
* fix ggml_acc_or_set to return tensor of correct shape
* remove unused 'inplace' argument from ggml_compute_backward function
inplace operations to add gradients are no longer created by ggml_compute_backward
use allocator to automatically make inplace operations
* add missing argument 'int i0' to ggml_get_i32_nd & ggml_set_i32_nd header declarations
* fix error message in ggml_allocr_alloc to display actual max_avail
* fix check_gradient
ggml_build_backward_expand was previously replaced by ggml_build_backward, but the assignment of forward graph to backward graph missing
* use tensor->view_src instead of ggml_is_view and get_view_source
* move gradient checkpointing code into ggml, new API function:
// build gradient checkpointing backward graph gb for gf using provided checkpoints
// gb_tmp will contain original backward graph with rewritten backward process nodes,
// but without the second forward pass nodes.
GGML_API void ggml_build_backward_gradient_checkpointing(
struct ggml_context * ctx,
struct ggml_cgraph * gf,
struct ggml_cgraph * gb,
struct ggml_cgraph * gb_tmp,
struct ggml_tensor * * checkpoints,
int n_checkpoints);
* replace custom data getters and setters by ggml functions
* train-text-from-scratch can train (full finetune) gguf models
just pass the gguf model via `--checkpoint-in FN`.
after this, to continue training, pass the generated checkpoint instead of the original gguf model.
tested with smaller models, bigger models may exceed available memory.
use (LORA) finetune for those.
* remove trailing whitespace
* add option to save train-text-from-scratch output every N iterations
* update README.md
* fix warnings
* fix warnings
* remove finetune option to disable allocator
the allocator should always be used.
by making sure that it is always used it gets easier to implement automatic memory requirements computation
* add tensor checkpoints only when gradient checkpointing is enabled
* initialize opt ggml context if none was provided
* add ggml-alloc API function 'ggml_allocr_max_size' to get max size of alloc
GGML_API size_t ggml_allocr_max_size(struct ggml_allocr * alloc);
* finetune: automatically allocate all memory and changes to command line options
remove '--n_examples N' parameter, as it no longer makes sense to call optimization process multiple times in a loop.
add '--only_write_lora' command line option: will skip tokenization and training, to only write a llama.cpp comptabile LORA adapter.
remove memory buffer related command line options.
improve iteration console output.
* add finetune to Makefile
* update README.md
* print time per iteration and estimate remaining time
* increase measured alloc size by tensor_alignment
ggml_allocr_reset will reduce the given size by up to tensor_alignment-1
* fix README.md
* add some more allocator debug prints
* bug fix, probably solves the 'ggml_allocr_alloc: not enough space in the buffer' issue
* revert last commit
"bug fix, probably solves the 'ggml_allocr_alloc: not enough space in the buffer' issue"
"alloc was freeing an externally allocated tensor, because it calculated the end of allocator memory as alloc->data + alloc->max_size instead of alloc->data + alloc->size."
This is intentional to reduce the risk of freeing external tensors when measuring. Unless max_size is not properly calculated, I don't see why this is an issue.
* remove unnecessary "0x" before "%p" output
* move measurement memory segment to upper region of the address space
* update README.md
* fix printf format warnings
* add missing gguf_free in load_checkpoint_lora_file
* load default rms_norm and rope parameters from base model
* add gradient accumulation
specify number accumulation steps with '--grad-acc N'.
this will simulate a bigger batch size of grad_acc*batch.
* fix tracking of train_samples and train_tokens
* build : fix compile warnings
* ggml : fix L-BFGS linesearch loop
* improve finetune time measurement
fix printf warnings on system where int64_t is (long int).
change time datatypes to double because values get big with long training times.
exclude file saving from time measurement.
converge faster to actual time per iteration by removing very small first duration before first iteration was performed.
fix bug in output of total training time, the reported value was 1000 times to small.
* specify default lora rank with '--lora-r N'
'--lora-r N' will specify default rank for all tensors
'--rank-wq N', etc. will override this default rank for specific tensor types.
* fix gradient accumulation bug where the same batch was used for each microstep
* fix gradient accumulation bug where the same batch was used for each microstep
* support grouped-query-attention in ggml_flash_attn and ggml_flash_attn_back
k and v can now be repeated in q along ne[2]
in forward pass just use modulo to compute k and v indices, like ik2 = iq2 % nek2.
in backard pass this won't work as easy, because multiple threads will compete to accumulate to the same k->grad[:,ik1,ik2,ik3] and v->grad[:,iv1,iv2,iv3].
so we change the parallelization over q rows to be over k rows. this ensures non-overlapping (ik2,ik3) across threads.
in each thread we then iterate over the number of repetitions of k/v in q to compute iq2 as iq2 = ik2 + irep*nek2.
since ne2 is not the same for q,k and v we also change how the gradients are concatenated into the result tensor.
additionally the offsets of gradq, gradk and gradv in the result tensor are now memory aligned.
we also simplify the compute_backward part of flash_attn to use ggml_reshape instead of switching over the number of dimensions.
this needs a small change to ggml_reshape, removing the assertion of second argument to be contiguous.
since only the shape (ne) of the second reshape argument is of relevance, its memory layout (nb) is irrelevant -> it can very well be non-contiguous.
change test-grad0 to also test for repeated k/v in q.
this changes the rng and now results in small gradient differences in softmax. these solely come from using f16 exp table lookup in forward softmax: when temporarily changing softmax to use actual exp function, the reported gradient differences go away. gradient differences coming solely from f16 table lookup are acceptable.
added a note to explain this.
* add llama API functions to get grouped-query-attention n_head parameter 'n_head_kv'.
* fix finetune to support grouped-query-attention (using flash-attention)
note: ggml changes to ggml_out_prod are necessary to support grouped-query-attention without flash-attention.
* support broadcastable a in out_prod(a, b) and backward pass of broadcasting mul_mat(a, b)
* test broadcasting mul_mat backward pass
* decouple random number generator of each operation test
when changing one test the rng of others tests is not influenced anymore
* add comment briefly describing what ggml_repeat_back does
* simplify broadcasting mul_mat backward using ggml_repeat_back
* add cgraph evaluation order member and corresponding enum type
this controls in which order ggml_build_forward visits source nodes.
by default the nodes are visited left to right, i.e. src[0] first.
in some cases it is beneficial for ggml-alloc to visit in a different order.
two possible orders are supported: left-to-right (src[0] first) and right-to-left (src[0] last).
* measure max compute size for each cgraph eval order and use best order
this can bring huge memory savings:
e.g. codellama-34b with n_ctx=64, n_batch=1 goes from 92927.8mb down to 4627.6 MB
* remove unused command line options
* add sample start patterns and options to force new or by default resume last shuffling
* update shuffle rng state on reshuffle
* exclude known zero values from computations in flash_attn_f32 & flash_attn_back_f32
* remove probably unnecessary exception type flags from stringstream
* pass correct max number of tokens to llama_tokenize
* account for possible leading whitespace that will be added by tokenizer
e.g. '\t' will be tokenized by llama spm tokenizer to [29871, 12]
* use unrolled vec_mad in out_prod
y is vec_mad result vec.
x is vec_mad input vec.
v is vec_mad input scalar.
ggml_vec_mad_f32_unroll will internally loop over x and v with same y.
GGML_VEC_MAD_UNROLL is by default defined to 32.
This value is empirical optimized using performance test runs of out-prod in openllama-3b finetune with 256 context length and batch size 1. It gives 23% performance boost for out_prod.
Full measurements of out-prod runtime in ms:
unroll_xv unroll_yv
1 67014.643 87826.469
2 77117.552 89077.656
4 72091.311 109121.657
8 61077.543 88678.334
16 56914.67 79514.947
24 59024.595 84350.254
28 55952.446 83368.73
32 51476.658 85177.745
36 55973.792 84659.92
40 55139.616 93844.738
48 60736.392 93330.267
64 99856.878 116994.99
Second column is when unrollying yv instead of xv
* set lora_alpha to value of lora_r if it is not set via command line
otherwise only changing lora_r will change scaling of lora adapter used in prediction
* reshuffle original sample order instead of the previous shuffled order
otherwise resumed reshuffle will not result in same sample order
* block tiling for out-prod inspired by mul-mat
block sizes are empirically optimized
roughly doubles the flops of out-prod
* exclude some more known zero values from computations in flash_attn_f32 & flash_attn_back_f32
* add static keywords
* remove outcommented old code
* update train-text-from-scratch with tokenization, sample selection and shuffling from finetune
* remove lbfgs related train parameters
* move common train functions into common/train.[h|cpp]
* move train state into struct train_state
* move train data saving code into callback to unify code of opt_callback
train_params are still different in finetune and train-text-from-scratch, so it can't yet be moved to train.h|cpp
* move common train params into common/train
* move common opt_callback into common/train
* fix consume_common_train_arg
* save and load head_count_kv in lora checkpoints
* increase train_samples by used_samples instead of number of batches
on batch can contain more than one sample when option "fill_with_next_samples" is used
* fix usage of llama_tokenize
* remove static from process_escape since we need it exposed in header
* fix code formating of long function declarations
* fix condition in load_train_state_gguf
* use die("msg") instead of replace GGML_ASSERT(!"msg") or throw std::runtime_error("msg")
* fix saving and loading of training type
* remove terminating '\0' from tokenization
(llama_tokenize is now passed the string length instead of relying on terminating '\0')
* fix compile warnings
* fix compile warnings
* use new/delete for train_state instead of malloc/free
using malloc may result in seg faults when trying to assign string fields
* assert that sample_count > 0, avoiding division by zero
* fix frand to return value in interval [0,1)
* add train option "--sample-random-offsets"
Use samples beginning at random offsets.
The offset is only applied to the first sample in each batch context window.
Together with "--fill-with-next-samples" this may help for training endless text generation.
For example given a dataset containing samples "abcd", "ABCD", "0123".
With context size of 8 and options "--fill-with-next-samples", "--no-separate-with-eos", "--no-separate-with-bos",
the context windows of batches could only be filled with "abcdABCD", "ABCDabcd", "0123abcd", etc.
With "--sample-random-offsets" it can also be filled with "23abcdAB", "bcd0123A", etc.
* deduplicate code into function
* remove n_rot hparam, as it must always be hparam.n_embd_head()
* align code
* assert correct base model tensor shapes
* move some params from lora hparams into model hparams and load model params from gguf
this equalizes the model definition in finetune and text-from-scratch and removes the need for additional llama api functions to get model parameters
* remove now unnecessary llama API functions to get model params that where added by this PR
* train-text-from-scratch: automatically allocate model tensors, remove option '--mem-model N'
* train-text-from-scratch: automatically allocate opt context
* train-text-from-scratch: automatically allocate input tensors
* train-text-from-scratch: automatically allocate compute memory
* remove unused options and equalize train-text-from-scratch with finetune
* initialize opt->loss_after with zero
* add export-lora program
* remove trailing whitespace
* add export-lora build in Makefile
* remove unused struct tensor_info from export-lora
* add export-lora build dependency to llama
because it depends on common, which depends on llama
* update finetune README.md
* cancel optimization when specified number of epochs is completed
* improve handling of export-lora arguments
print errors and warnings when files could not be read or created
* Fix export-lora.cpp "not enough space in the context's memory pool" (#1)
* Fix export-lora.cpp "not enough space in the context's memory pool"
Without this patch, export-lora would sometimes error with "not enough space in the context's memory pool (needed 656784, available 656800)".
* increase required context size by 5*GGML_MEM_ALIGN instead of plain 16
---------
Co-authored-by: xaedes <xaedes@gmail.com>
* improve handling of not yet supported tensor types
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: meatbag-18a <145869052+meatbag-18a@users.noreply.github.com>
2023-09-28 20:40:11 +02:00
params . use_mmap = false ;
}
else if ( arg = = " --lora-scaled " )
{
if ( + + i > = argc )
{
invalid_param = true ;
break ;
}
const char * lora_adapter = argv [ i ] ;
if ( + + i > = argc )
{
invalid_param = true ;
break ;
}
2024-02-03 12:23:37 +01:00
params . lora_adapter . emplace_back ( lora_adapter , std : : stof ( argv [ i ] ) ) ;
2023-07-13 15:58:25 +02:00
params . use_mmap = false ;
2023-07-05 22:51:13 +02:00
}
else if ( arg = = " --lora-base " )
{
if ( + + i > = argc )
{
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
invalid_param = true ;
break ;
}
params . lora_base = argv [ i ] ;
2023-07-05 22:51:13 +02:00
}
else if ( arg = = " -v " | | arg = = " --verbose " )
{
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
# if SERVER_VERBOSE != 1
LOG_WARNING ( " server.cpp is not built with verbose logging. " , { } ) ;
# else
server_verbose = true ;
# endif
2023-07-05 22:51:13 +02:00
}
else if ( arg = = " --mlock " )
{
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
params . use_mlock = true ;
2023-07-05 22:51:13 +02:00
}
else if ( arg = = " --no-mmap " )
{
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
params . use_mmap = false ;
2023-07-05 22:51:13 +02:00
}
2024-02-16 10:31:07 +01:00
else if ( arg = = " --numa " ) {
if ( + + i > = argc ) {
invalid_param = true ;
break ;
} else {
std : : string value ( argv [ i ] ) ;
/**/ if ( value = = " distribute " | | value = = " " ) { params . numa = GGML_NUMA_STRATEGY_DISTRIBUTE ; }
else if ( value = = " isolate " ) { params . numa = GGML_NUMA_STRATEGY_ISOLATE ; }
else if ( value = = " numactl " ) { params . numa = GGML_NUMA_STRATEGY_NUMACTL ; }
else { invalid_param = true ; break ; }
}
2023-10-22 21:53:08 +02:00
}
else if ( arg = = " --embedding " )
{
params . embedding = true ;
}
else if ( arg = = " -cb " | | arg = = " --cont-batching " )
{
params . cont_batching = true ;
}
else if ( arg = = " -np " | | arg = = " --parallel " )
{
if ( + + i > = argc )
{
invalid_param = true ;
break ;
}
params . n_parallel = std : : stoi ( argv [ i ] ) ;
} else if ( arg = = " -n " | | arg = = " --n-predict " )
{
if ( + + i > = argc )
{
invalid_param = true ;
break ;
}
params . n_predict = std : : stoi ( argv [ i ] ) ;
} else if ( arg = = " -spf " | | arg = = " --system-prompt-file " )
2023-07-05 22:51:13 +02:00
{
2023-10-22 21:53:08 +02:00
if ( + + i > = argc )
{
invalid_param = true ;
break ;
}
std : : ifstream file ( argv [ i ] ) ;
if ( ! file ) {
fprintf ( stderr , " error: failed to open file '%s' \n " , argv [ i ] ) ;
invalid_param = true ;
break ;
}
std : : string systm_content ;
std : : copy (
std : : istreambuf_iterator < char > ( file ) ,
std : : istreambuf_iterator < char > ( ) ,
std : : back_inserter ( systm_content )
) ;
2024-02-29 21:42:11 +01:00
llama . system_prompt_process ( json : : parse ( systm_content ) ) ;
2023-10-22 21:53:08 +02:00
}
2024-02-23 20:31:54 +01:00
else if ( arg = = " -ctk " | | arg = = " --cache-type-k " ) {
params . cache_type_k = argv [ + + i ] ;
}
else if ( arg = = " -ctv " | | arg = = " --cache-type-v " ) {
params . cache_type_v = argv [ + + i ] ;
}
2023-10-22 21:53:08 +02:00
else if ( arg = = " --mmproj " )
{
if ( + + i > = argc )
{
invalid_param = true ;
break ;
}
params . mmproj = argv [ i ] ;
2023-07-05 22:51:13 +02:00
}
2024-02-25 13:50:32 +01:00
else if ( arg = = " --log-format " )
{
if ( + + i > = argc )
{
invalid_param = true ;
break ;
}
if ( std : : strcmp ( argv [ i ] , " json " ) = = 0 )
{
server_log_json = true ;
}
else if ( std : : strcmp ( argv [ i ] , " text " ) = = 0 )
{
server_log_json = false ;
}
else
{
invalid_param = true ;
break ;
}
}
2023-11-30 23:25:49 +01:00
else if ( arg = = " --log-disable " )
{
log_set_target ( stdout ) ;
LOG_INFO ( " logging to file is disabled. " , { } ) ;
}
2024-02-18 18:39:57 +01:00
else if ( arg = = " --slots-endpoint-disable " )
{
sparams . slots_endpoint = false ;
}
2024-02-25 13:49:43 +01:00
else if ( arg = = " --metrics " )
{
sparams . metrics_endpoint = true ;
}
2024-02-11 11:16:22 +01:00
else if ( arg = = " --chat-template " )
{
if ( + + i > = argc )
{
invalid_param = true ;
break ;
}
2024-02-20 15:58:27 +01:00
if ( ! verify_custom_template ( argv [ i ] ) ) {
fprintf ( stderr , " error: the supplied chat template is not supported: %s \n " , argv [ i ] ) ;
fprintf ( stderr , " note: llama.cpp does not use jinja parser, we only support commonly used templates \n " ) ;
2024-02-11 11:16:22 +01:00
invalid_param = true ;
break ;
}
2024-02-20 15:58:27 +01:00
sparams . chat_template = argv [ i ] ;
2024-02-11 11:16:22 +01:00
}
2024-01-02 12:28:15 +01:00
else if ( arg = = " --override-kv " )
{
2024-01-02 11:38:15 +01:00
if ( + + i > = argc ) {
invalid_param = true ;
break ;
}
char * sep = strchr ( argv [ i ] , ' = ' ) ;
if ( sep = = nullptr | | sep - argv [ i ] > = 128 ) {
fprintf ( stderr , " error: Malformed KV override: %s \n " , argv [ i ] ) ;
invalid_param = true ;
break ;
}
struct llama_model_kv_override kvo ;
std : : strncpy ( kvo . key , argv [ i ] , sep - argv [ i ] ) ;
kvo . key [ sep - argv [ i ] ] = 0 ;
sep + + ;
if ( strncmp ( sep , " int: " , 4 ) = = 0 ) {
sep + = 4 ;
2024-02-25 11:09:09 +01:00
kvo . tag = LLAMA_KV_OVERRIDE_TYPE_INT ;
2024-01-02 11:38:15 +01:00
kvo . int_value = std : : atol ( sep ) ;
} else if ( strncmp ( sep , " float: " , 6 ) = = 0 ) {
sep + = 6 ;
2024-02-25 11:09:09 +01:00
kvo . tag = LLAMA_KV_OVERRIDE_TYPE_FLOAT ;
2024-01-02 11:38:15 +01:00
kvo . float_value = std : : atof ( sep ) ;
} else if ( strncmp ( sep , " bool: " , 5 ) = = 0 ) {
sep + = 5 ;
2024-02-25 11:09:09 +01:00
kvo . tag = LLAMA_KV_OVERRIDE_TYPE_BOOL ;
2024-01-02 11:38:15 +01:00
if ( std : : strcmp ( sep , " true " ) = = 0 ) {
kvo . bool_value = true ;
} else if ( std : : strcmp ( sep , " false " ) = = 0 ) {
kvo . bool_value = false ;
} else {
fprintf ( stderr , " error: Invalid boolean value for KV override: %s \n " , argv [ i ] ) ;
invalid_param = true ;
break ;
}
} else {
fprintf ( stderr , " error: Invalid type for KV override: %s \n " , argv [ i ] ) ;
invalid_param = true ;
break ;
}
params . kv_overrides . push_back ( kvo ) ;
}
2023-07-05 22:51:13 +02:00
else
{
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
fprintf ( stderr , " error: unknown argument: %s \n " , arg . c_str ( ) ) ;
server_print_usage ( argv [ 0 ] , default_params , default_sparams ) ;
exit ( 1 ) ;
}
2023-05-21 19:51:18 +02:00
}
2024-01-02 11:38:15 +01:00
if ( ! params . kv_overrides . empty ( ) ) {
2024-02-03 12:23:37 +01:00
params . kv_overrides . emplace_back ( ) ;
2024-01-02 11:38:15 +01:00
params . kv_overrides . back ( ) . key [ 0 ] = 0 ;
}
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2023-07-05 22:51:13 +02:00
if ( invalid_param )
{
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
fprintf ( stderr , " error: invalid parameter for argument: %s \n " , arg . c_str ( ) ) ;
server_print_usage ( argv [ 0 ] , default_params , default_sparams ) ;
exit ( 1 ) ;
2023-05-21 19:51:18 +02:00
}
}
2023-11-25 10:29:06 +01:00
/* llama.cpp completion api semantics */
2023-09-15 21:38:27 +02:00
static json format_partial_response (
2024-02-29 21:42:11 +01:00
llama_server_context & llama , server_slot * slot , const std : : string & content , const std : : vector < completion_token_output > & probs
2023-09-15 21:38:27 +02:00
) {
2023-10-22 21:53:08 +02:00
json res = json
{
{ " content " , content } ,
{ " stop " , false } ,
{ " slot_id " , slot - > id } ,
{ " multimodal " , llama . multimodal }
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
} ;
2023-07-02 23:38:44 +02:00
2023-10-22 21:53:08 +02:00
if ( slot - > sparams . n_probs > 0 )
2023-07-05 22:51:13 +02:00
{
2023-07-02 23:38:44 +02:00
res [ " completion_probabilities " ] = probs_vector_to_json ( llama . ctx , probs ) ;
}
return res ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
}
2023-07-05 22:51:13 +02:00
static json format_tokenizer_response ( const std : : vector < llama_token > & tokens )
{
2024-02-25 13:50:32 +01:00
return json {
{ " tokens " , tokens }
} ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
}
2023-08-27 01:11:45 +02:00
static json format_detokenized_response ( std : : string content )
{
2024-02-25 13:50:32 +01:00
return json {
{ " content " , content }
} ;
2023-08-27 01:11:45 +02:00
}
2023-10-02 09:42:02 +02:00
2023-10-22 21:53:08 +02:00
static void log_server_request ( const httplib : : Request & req , const httplib : : Response & res )
2023-07-05 22:51:13 +02:00
{
2024-02-25 13:50:32 +01:00
// skip GH copilot requests when using default port
if ( req . path = = " /v1/health " | | req . path = = " /v1/completions " )
{
return ;
}
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
LOG_INFO ( " request " , {
2024-02-25 13:50:32 +01:00
{ " remote_addr " , req . remote_addr } ,
{ " remote_port " , req . remote_port } ,
{ " status " , res . status } ,
{ " method " , req . method } ,
{ " path " , req . path } ,
{ " params " , req . params } ,
} ) ;
2023-07-04 16:05:27 +02:00
LOG_VERBOSE ( " request " , {
2024-02-25 13:50:32 +01:00
{ " request " , req . body } ,
{ " response " , res . body } ,
} ) ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
}
2023-05-21 19:51:18 +02:00
2024-02-29 21:42:11 +01:00
static void append_to_generated_text_from_generated_token_probs ( llama_server_context & llama , server_slot * slot )
2023-09-15 21:38:27 +02:00
{
2023-10-22 21:53:08 +02:00
auto & gtps = slot - > generated_token_probs ;
2023-08-25 17:18:48 +02:00
auto translator = token_translator { llama . ctx } ;
auto add_strlen = [ = ] ( size_t sum , const completion_token_output & cto ) { return sum + translator ( cto ) . size ( ) ; } ;
const size_t len = std : : accumulate ( gtps . begin ( ) , gtps . end ( ) , size_t ( 0 ) , add_strlen ) ;
2023-10-22 21:53:08 +02:00
if ( slot - > generated_text . capacity ( ) < slot - > generated_text . size ( ) + len )
{
slot - > generated_text . reserve ( slot - > generated_text . size ( ) + len ) ;
2023-08-25 17:18:48 +02:00
}
2023-10-22 21:53:08 +02:00
for ( const completion_token_output & cto : gtps )
{
slot - > generated_text + = translator ( cto ) ;
2023-08-25 17:18:48 +02:00
}
}
2024-02-18 17:23:16 +01:00
std : : function < void ( int ) > shutdown_handler ;
2024-02-28 09:55:37 +01:00
std : : atomic_flag is_terminating = ATOMIC_FLAG_INIT ;
inline void signal_handler ( int signal ) {
if ( is_terminating . test_and_set ( ) ) {
// in case it hangs, we can force terminate the server by hitting Ctrl+C twice
// this is for better developer experience, we can remove when the server is stable enough
fprintf ( stderr , " Received second interrupt, terminating immediately. \n " ) ;
exit ( 1 ) ;
}
shutdown_handler ( signal ) ;
}
2024-02-18 17:23:16 +01:00
2023-07-05 22:51:13 +02:00
int main ( int argc , char * * argv )
{
2023-12-17 16:02:16 +01:00
# if SERVER_VERBOSE != 1
log_disable ( ) ;
# endif
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
// own arguments required by this example
gpt_params params ;
server_params sparams ;
// struct that contains llama context and inference
llama_server_context llama ;
2023-10-22 21:53:08 +02:00
server_params_parse ( argc , argv , sparams , params , llama ) ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2023-07-05 22:51:13 +02:00
if ( params . model_alias = = " unknown " )
{
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
params . model_alias = params . model ;
}
2024-02-16 10:31:07 +01:00
llama_backend_init ( ) ;
llama_numa_init ( params . numa ) ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2023-11-02 07:50:16 +01:00
LOG_INFO ( " build info " , { { " build " , LLAMA_BUILD_NUMBER } ,
{ " commit " , LLAMA_COMMIT } } ) ;
2023-10-22 21:53:08 +02:00
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
LOG_INFO ( " system info " , {
2023-07-05 22:51:13 +02:00
{ " n_threads " , params . n_threads } ,
2023-09-28 21:42:38 +02:00
{ " n_threads_batch " , params . n_threads_batch } ,
2023-07-05 22:51:13 +02:00
{ " total_threads " , std : : thread : : hardware_concurrency ( ) } ,
{ " system_info " , llama_print_system_info ( ) } ,
} ) ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2024-01-10 20:56:05 +01:00
httplib : : Server svr ;
2024-01-11 08:10:34 +01:00
std : : atomic < server_state > state { SERVER_STATE_LOADING_MODEL } ;
2024-01-10 20:56:05 +01:00
2024-01-11 19:02:48 +01:00
svr . set_default_headers ( { { " Server " , " llama.cpp " } } ) ;
// CORS preflight
svr . Options ( R " (.*) " , [ ] ( const httplib : : Request & req , httplib : : Response & res ) {
res . set_header ( " Access-Control-Allow-Origin " , req . get_header_value ( " Origin " ) ) ;
res . set_header ( " Access-Control-Allow-Credentials " , " true " ) ;
res . set_header ( " Access-Control-Allow-Methods " , " POST " ) ;
res . set_header ( " Access-Control-Allow-Headers " , " * " ) ;
} ) ;
2024-01-10 20:56:05 +01:00
2024-02-20 08:48:19 +01:00
svr . Get ( " /health " , [ & ] ( const httplib : : Request & req , httplib : : Response & res ) {
2024-01-11 08:10:34 +01:00
server_state current_state = state . load ( ) ;
2024-01-10 20:56:05 +01:00
switch ( current_state ) {
2024-02-20 08:48:19 +01:00
case SERVER_STATE_READY : {
2024-02-21 15:47:48 +01:00
// request slots data using task queue
task_server task ;
task . id = llama . queue_tasks . get_new_id ( ) ;
2024-02-25 13:49:43 +01:00
task . type = TASK_TYPE_METRICS ;
2024-02-21 15:47:48 +01:00
task . target_id = - 1 ;
llama . queue_results . add_waiting_task_id ( task . id ) ;
llama . queue_tasks . post ( task ) ;
// get the result
task_result result = llama . queue_results . recv ( task . id ) ;
llama . queue_results . remove_waiting_task_id ( task . id ) ;
int n_idle_slots = result . result_json [ " idle " ] ;
int n_processing_slots = result . result_json [ " processing " ] ;
json health = {
{ " status " , " ok " } ,
{ " slots_idle " , n_idle_slots } ,
{ " slots_processing " , n_processing_slots } } ;
res . status = 200 ; // HTTP OK
if ( sparams . slots_endpoint & & req . has_param ( " include_slots " ) ) {
health [ " slots " ] = result . result_json [ " slots " ] ;
2024-02-20 08:48:19 +01:00
}
2024-02-21 15:47:48 +01:00
if ( n_idle_slots = = 0 ) {
health [ " status " ] = " no slot available " ;
2024-02-20 08:48:19 +01:00
if ( req . has_param ( " fail_on_no_slot " ) ) {
2024-02-18 17:31:28 +01:00
res . status = 503 ; // HTTP Service Unavailable
}
}
2024-02-21 15:47:48 +01:00
res . set_content ( health . dump ( ) , " application/json " ) ;
2024-01-10 20:56:05 +01:00
break ;
2024-02-20 08:48:19 +01:00
}
2024-01-11 08:10:34 +01:00
case SERVER_STATE_LOADING_MODEL :
2024-01-10 20:56:05 +01:00
res . set_content ( R " ({ " status " : " loading model " }) " , " application/json " ) ;
res . status = 503 ; // HTTP Service Unavailable
break ;
2024-01-11 08:10:34 +01:00
case SERVER_STATE_ERROR :
2024-01-10 20:56:05 +01:00
res . set_content ( R " ({ " status " : " error " , " error " : " Model failed to load " }) " , " application/json " ) ;
res . status = 500 ; // HTTP Internal Server Error
break ;
}
} ) ;
2024-02-18 18:39:57 +01:00
if ( sparams . slots_endpoint ) {
svr . Get ( " /slots " , [ & ] ( const httplib : : Request & , httplib : : Response & res ) {
2024-02-21 15:47:48 +01:00
// request slots data using task queue
task_server task ;
task . id = llama . queue_tasks . get_new_id ( ) ;
2024-02-25 13:49:43 +01:00
task . type = TASK_TYPE_METRICS ;
2024-02-21 15:47:48 +01:00
task . target_id = - 1 ;
2024-02-18 18:39:57 +01:00
2024-02-21 15:47:48 +01:00
llama . queue_results . add_waiting_task_id ( task . id ) ;
llama . queue_tasks . post ( task ) ;
// get the result
task_result result = llama . queue_results . recv ( task . id ) ;
llama . queue_results . remove_waiting_task_id ( task . id ) ;
res . set_content ( result . result_json [ " slots " ] . dump ( ) , " application/json " ) ;
2024-02-18 18:39:57 +01:00
res . status = 200 ; // HTTP OK
} ) ;
}
2024-02-25 13:49:43 +01:00
if ( sparams . metrics_endpoint ) {
svr . Get ( " /metrics " , [ & ] ( const httplib : : Request & , httplib : : Response & res ) {
// request slots data using task queue
task_server task ;
task . id = llama . queue_tasks . get_new_id ( ) ;
task . type = TASK_TYPE_METRICS ;
task . target_id = - 1 ;
llama . queue_results . add_waiting_task_id ( task . id ) ;
llama . queue_tasks . post ( task ) ;
// get the result
task_result result = llama . queue_results . recv ( task . id ) ;
llama . queue_results . remove_waiting_task_id ( task . id ) ;
json data = result . result_json ;
uint64_t n_prompt_tokens_processed = data [ " n_prompt_tokens_processed " ] ;
uint64_t t_prompt_processing = data [ " t_prompt_processing " ] ;
uint64_t n_tokens_predicted = data [ " n_tokens_predicted " ] ;
uint64_t t_tokens_generation = data [ " t_tokens_generation " ] ;
int32_t kv_cache_used_cells = data [ " kv_cache_used_cells " ] ;
// metrics definition: https://prometheus.io/docs/practices/naming/#metric-names
json all_metrics_def = json {
{ " counter " , { {
{ " name " , " prompt_tokens_total " } ,
{ " help " , " Number of prompt tokens processed. " } ,
{ " value " , data [ " n_prompt_tokens_processed_total " ] }
} , {
{ " name " , " tokens_predicted_total " } ,
{ " help " , " Number of generation tokens processed. " } ,
{ " value " , data [ " n_tokens_predicted_total " ] }
} } } ,
{ " gauge " , { {
{ " name " , " prompt_tokens_seconds " } ,
{ " help " , " Average prompt throughput in tokens/s. " } ,
{ " value " , n_prompt_tokens_processed ? 1e3 / t_prompt_processing * n_prompt_tokens_processed : 0 }
} , {
{ " name " , " predicted_tokens_seconds " } ,
{ " help " , " Average generation throughput in tokens/s. " } ,
{ " value " , n_tokens_predicted ? 1e3 / t_tokens_generation * n_tokens_predicted : 0 }
} , {
{ " name " , " kv_cache_usage_ratio " } ,
{ " help " , " KV-cache usage. 1 means 100 percent usage. " } ,
{ " value " , 1. * kv_cache_used_cells / params . n_ctx }
} , {
{ " name " , " kv_cache_tokens " } ,
{ " help " , " KV-cache tokens. " } ,
{ " value " , data [ " kv_cache_tokens_count " ] }
} , {
{ " name " , " requests_processing " } ,
{ " help " , " Number of request processing. " } ,
{ " value " , data [ " processing " ] }
} , {
{ " name " , " requests_deferred " } ,
{ " help " , " Number of request deferred. " } ,
{ " value " , data [ " deferred " ] }
} } }
} ;
std : : stringstream prometheus ;
for ( const auto & el : all_metrics_def . items ( ) ) {
const auto & type = el . key ( ) ;
const auto & metrics_def = el . value ( ) ;
for ( const auto & metric_def : metrics_def ) {
std : : string name = metric_def [ " name " ] ;
std : : string help = metric_def [ " help " ] ;
prometheus < < " # HELP llamacpp: " < < name < < " " < < help < < " \n "
< < " # TYPE llamacpp: " < < name < < " " < < type < < " \n "
< < " llamacpp: " < < name < < " " < < metric_def [ " value " ] < < " \n " ;
}
}
res . set_content ( prometheus . str ( ) , " text/plain; version=0.0.4 " ) ;
res . status = 200 ; // HTTP OK
} ) ;
}
2024-01-10 20:56:05 +01:00
svr . set_logger ( log_server_request ) ;
svr . set_exception_handler ( [ ] ( const httplib : : Request & , httplib : : Response & res , std : : exception_ptr ep )
{
const char fmt [ ] = " 500 Internal Server Error \n %s " ;
char buf [ BUFSIZ ] ;
try
{
std : : rethrow_exception ( std : : move ( ep ) ) ;
}
catch ( std : : exception & e )
{
snprintf ( buf , sizeof ( buf ) , fmt , e . what ( ) ) ;
}
catch ( . . . )
{
snprintf ( buf , sizeof ( buf ) , fmt , " Unknown Exception " ) ;
}
res . set_content ( buf , " text/plain; charset=utf-8 " ) ;
res . status = 500 ;
} ) ;
svr . set_error_handler ( [ ] ( const httplib : : Request & , httplib : : Response & res )
{
if ( res . status = = 401 )
{
res . set_content ( " Unauthorized " , " text/plain; charset=utf-8 " ) ;
}
if ( res . status = = 400 )
{
res . set_content ( " Invalid request " , " text/plain; charset=utf-8 " ) ;
}
else if ( res . status = = 404 )
{
res . set_content ( " File Not Found " , " text/plain; charset=utf-8 " ) ;
res . status = 404 ;
}
} ) ;
// set timeouts and change hostname and port
svr . set_read_timeout ( sparams . read_timeout ) ;
svr . set_write_timeout ( sparams . write_timeout ) ;
if ( ! svr . bind_to_port ( sparams . hostname , sparams . port ) )
2023-07-05 22:51:13 +02:00
{
2024-01-10 20:56:05 +01:00
fprintf ( stderr , " \n couldn't bind to server socket: hostname=%s port=%d \n \n " , sparams . hostname . c_str ( ) , sparams . port ) ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
return 1 ;
}
2024-01-10 20:56:05 +01:00
// Set the base directory for serving static files
svr . set_base_dir ( sparams . public_path ) ;
2023-10-22 21:53:08 +02:00
2024-01-10 20:56:05 +01:00
std : : unordered_map < std : : string , std : : string > log_data ;
log_data [ " hostname " ] = sparams . hostname ;
log_data [ " port " ] = std : : to_string ( sparams . port ) ;
2024-01-11 18:51:17 +01:00
if ( sparams . api_keys . size ( ) = = 1 ) {
log_data [ " api_key " ] = " api_key: **** " + sparams . api_keys [ 0 ] . substr ( sparams . api_keys [ 0 ] . length ( ) - 4 ) ;
} else if ( sparams . api_keys . size ( ) > 1 ) {
log_data [ " api_key " ] = " api_key: " + std : : to_string ( sparams . api_keys . size ( ) ) + " keys loaded " ;
2024-01-10 20:56:05 +01:00
}
// load the model
if ( ! llama . load_model ( params ) )
{
2024-01-11 08:10:34 +01:00
state . store ( SERVER_STATE_ERROR ) ;
2024-01-10 20:56:05 +01:00
return 1 ;
} else {
llama . initialize ( ) ;
2024-01-11 08:10:34 +01:00
state . store ( SERVER_STATE_READY ) ;
2024-01-11 18:41:39 +01:00
LOG_INFO ( " model loaded " , { } ) ;
2024-01-10 20:56:05 +01:00
}
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2024-02-22 09:33:24 +01:00
if ( sparams . chat_template . empty ( ) ) { // custom chat template is not supplied
// check if the template comes with the model is supported by us
llama . validate_model_chat_template ( sparams ) ;
}
2023-12-15 12:49:01 +01:00
// Middleware for API key validation
auto validate_api_key = [ & sparams ] ( const httplib : : Request & req , httplib : : Response & res ) - > bool {
// If API key is not set, skip validation
2024-01-11 18:51:17 +01:00
if ( sparams . api_keys . empty ( ) ) {
2023-12-15 12:49:01 +01:00
return true ;
}
// Check for API key in the header
auto auth_header = req . get_header_value ( " Authorization " ) ;
std : : string prefix = " Bearer " ;
if ( auth_header . substr ( 0 , prefix . size ( ) ) = = prefix ) {
std : : string received_api_key = auth_header . substr ( prefix . size ( ) ) ;
2024-01-11 18:51:17 +01:00
if ( std : : find ( sparams . api_keys . begin ( ) , sparams . api_keys . end ( ) , received_api_key ) ! = sparams . api_keys . end ( ) ) {
2023-12-15 12:49:01 +01:00
return true ; // API key is valid
}
}
// API key is invalid or not provided
2023-12-17 15:56:09 +01:00
res . set_content ( " Unauthorized: Invalid API Key " , " text/plain; charset=utf-8 " ) ;
2023-12-15 12:49:01 +01:00
res . status = 401 ; // Unauthorized
LOG_WARNING ( " Unauthorized: Invalid API Key " , { } ) ;
return false ;
} ;
2023-07-04 16:05:27 +02:00
// this is only called if no index.html is found in the public --path
2023-10-22 21:53:08 +02:00
svr . Get ( " / " , [ ] ( const httplib : : Request & , httplib : : Response & res )
2023-07-05 22:51:13 +02:00
{
2023-12-17 15:56:09 +01:00
res . set_content ( reinterpret_cast < const char * > ( & index_html ) , index_html_len , " text/html; charset=utf-8 " ) ;
2023-10-22 21:53:08 +02:00
return false ;
} ) ;
2023-07-05 22:51:13 +02:00
// this is only called if no index.js is found in the public --path
2023-10-22 21:53:08 +02:00
svr . Get ( " /index.js " , [ ] ( const httplib : : Request & , httplib : : Response & res )
2023-07-05 22:51:13 +02:00
{
2023-12-17 15:56:09 +01:00
res . set_content ( reinterpret_cast < const char * > ( & index_js ) , index_js_len , " text/javascript; charset=utf-8 " ) ;
2023-10-22 21:53:08 +02:00
return false ;
} ) ;
2023-07-04 16:05:27 +02:00
// this is only called if no index.html is found in the public --path
2023-10-22 21:53:08 +02:00
svr . Get ( " /completion.js " , [ ] ( const httplib : : Request & , httplib : : Response & res )
2023-07-05 22:51:13 +02:00
{
2023-12-17 15:56:09 +01:00
res . set_content ( reinterpret_cast < const char * > ( & completion_js ) , completion_js_len , " application/javascript; charset=utf-8 " ) ;
2023-10-22 21:53:08 +02:00
return false ;
} ) ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2023-08-15 00:14:14 +02:00
// this is only called if no index.html is found in the public --path
2023-10-22 21:53:08 +02:00
svr . Get ( " /json-schema-to-grammar.mjs " , [ ] ( const httplib : : Request & , httplib : : Response & res )
2023-08-15 00:14:14 +02:00
{
2023-12-17 15:56:09 +01:00
res . set_content ( reinterpret_cast < const char * > ( & json_schema_to_grammar_mjs ) , json_schema_to_grammar_mjs_len , " application/javascript; charset=utf-8 " ) ;
2023-10-22 21:53:08 +02:00
return false ;
} ) ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2024-01-11 19:02:48 +01:00
svr . Get ( " /props " , [ & llama ] ( const httplib : : Request & req , httplib : : Response & res )
2023-10-22 21:53:08 +02:00
{
2024-01-11 19:02:48 +01:00
res . set_header ( " Access-Control-Allow-Origin " , req . get_header_value ( " Origin " ) ) ;
2023-10-22 21:53:08 +02:00
json data = {
{ " user_name " , llama . name_user . c_str ( ) } ,
2024-02-05 09:10:22 +01:00
{ " assistant_name " , llama . name_assistant . c_str ( ) } ,
2024-02-07 07:15:19 +01:00
{ " default_generation_settings " , llama . default_generation_settings_for_props } ,
{ " total_slots " , llama . params . n_parallel }
2023-10-22 21:53:08 +02:00
} ;
2023-12-17 15:56:09 +01:00
res . set_content ( data . dump ( ) , " application/json; charset=utf-8 " ) ;
2023-10-22 21:53:08 +02:00
} ) ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2023-12-15 12:49:01 +01:00
svr . Post ( " /completion " , [ & llama , & validate_api_key ] ( const httplib : : Request & req , httplib : : Response & res )
2023-10-22 21:53:08 +02:00
{
2024-01-11 19:02:48 +01:00
res . set_header ( " Access-Control-Allow-Origin " , req . get_header_value ( " Origin " ) ) ;
2023-12-15 12:49:01 +01:00
if ( ! validate_api_key ( req , res ) ) {
return ;
}
2023-10-22 21:53:08 +02:00
json data = json : : parse ( req . body ) ;
2024-01-26 13:42:20 +01:00
const int task_id = llama . queue_tasks . get_new_id ( ) ;
llama . queue_results . add_waiting_task_id ( task_id ) ;
llama . request_completion ( task_id , data , false , false , - 1 ) ;
2023-10-22 21:53:08 +02:00
if ( ! json_value ( data , " stream " , false ) ) {
std : : string completion_text ;
2024-01-26 13:42:20 +01:00
task_result result = llama . queue_results . recv ( task_id ) ;
2023-10-24 22:08:20 +02:00
if ( ! result . error & & result . stop ) {
2023-12-17 15:56:09 +01:00
res . set_content ( result . result_json . dump ( - 1 , ' ' , false , json : : error_handler_t : : replace ) , " application/json; charset=utf-8 " ) ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
}
2023-10-22 21:53:08 +02:00
else
{
res . status = 404 ;
2023-12-17 15:56:09 +01:00
res . set_content ( result . result_json [ " content " ] , " text/plain; charset=utf-8 " ) ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
}
2024-01-26 13:42:20 +01:00
llama . queue_results . remove_waiting_task_id ( task_id ) ;
2023-10-22 21:53:08 +02:00
} else {
const auto chunked_content_provider = [ task_id , & llama ] ( size_t , httplib : : DataSink & sink )
{
while ( true )
{
2024-01-26 13:42:20 +01:00
task_result result = llama . queue_results . recv ( task_id ) ;
2023-10-22 21:53:08 +02:00
if ( ! result . error ) {
const std : : string str =
2023-11-25 10:29:06 +01:00
" data: " +
result . result_json . dump ( - 1 , ' ' , false , json : : error_handler_t : : replace ) +
" \n \n " ;
2023-10-22 21:53:08 +02:00
LOG_VERBOSE ( " data stream " , {
{ " to_send " , str }
} ) ;
if ( ! sink . write ( str . c_str ( ) , str . size ( ) ) )
{
2024-01-26 13:42:20 +01:00
llama . queue_results . remove_waiting_task_id ( task_id ) ;
2023-10-22 21:53:08 +02:00
return false ;
}
2023-10-24 22:08:20 +02:00
if ( result . stop ) {
2023-10-22 21:53:08 +02:00
break ;
}
} else {
2023-11-19 17:54:10 +01:00
const std : : string str =
2023-11-25 10:29:06 +01:00
" error: " +
result . result_json . dump ( - 1 , ' ' , false , json : : error_handler_t : : replace ) +
" \n \n " ;
2023-11-19 17:54:10 +01:00
LOG_VERBOSE ( " data stream " , {
{ " to_send " , str }
} ) ;
if ( ! sink . write ( str . c_str ( ) , str . size ( ) ) )
{
2024-01-26 13:42:20 +01:00
llama . queue_results . remove_waiting_task_id ( task_id ) ;
2023-11-19 17:54:10 +01:00
return false ;
}
2023-10-22 21:53:08 +02:00
break ;
2023-08-25 12:32:45 +02:00
}
}
2024-01-26 13:42:20 +01:00
llama . queue_results . remove_waiting_task_id ( task_id ) ;
2023-10-22 21:53:08 +02:00
sink . done ( ) ;
return true ;
} ;
2023-08-25 12:32:45 +02:00
2023-10-22 21:53:08 +02:00
auto on_complete = [ task_id , & llama ] ( bool )
{
// cancel
llama . request_cancel ( task_id ) ;
2024-01-26 13:42:20 +01:00
llama . queue_results . remove_waiting_task_id ( task_id ) ;
2023-10-22 21:53:08 +02:00
} ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2023-10-22 21:53:08 +02:00
res . set_chunked_content_provider ( " text/event-stream " , chunked_content_provider , on_complete ) ;
}
} ) ;
2023-07-02 23:38:44 +02:00
2024-01-11 19:02:48 +01:00
svr . Get ( " /v1/models " , [ & params ] ( const httplib : : Request & req , httplib : : Response & res )
2023-11-25 10:29:06 +01:00
{
2024-01-11 19:02:48 +01:00
res . set_header ( " Access-Control-Allow-Origin " , req . get_header_value ( " Origin " ) ) ;
2023-11-25 10:29:06 +01:00
std : : time_t t = std : : time ( 0 ) ;
json models = {
{ " object " , " list " } ,
{ " data " , {
{
{ " id " , params . model_alias } ,
{ " object " , " model " } ,
{ " created " , t } ,
{ " owned_by " , " llamacpp " }
} ,
} }
} ;
2023-12-17 15:56:09 +01:00
res . set_content ( models . dump ( ) , " application/json; charset=utf-8 " ) ;
2023-11-25 10:29:06 +01:00
} ) ;
2024-02-28 09:39:15 +01:00
const auto chat_completions = [ & llama , & validate_api_key , & sparams ] ( const httplib : : Request & req , httplib : : Response & res )
{
res . set_header ( " Access-Control-Allow-Origin " , req . get_header_value ( " Origin " ) ) ;
if ( ! validate_api_key ( req , res ) ) {
return ;
}
json data = oaicompat_completion_params_parse ( llama . model , json : : parse ( req . body ) , sparams . chat_template ) ;
2024-01-11 19:02:48 +01:00
2024-02-28 09:39:15 +01:00
const int task_id = llama . queue_tasks . get_new_id ( ) ;
llama . queue_results . add_waiting_task_id ( task_id ) ;
llama . request_completion ( task_id , data , false , false , - 1 ) ;
2023-11-25 10:29:06 +01:00
2024-02-28 09:39:15 +01:00
if ( ! json_value ( data , " stream " , false ) ) {
std : : string completion_text ;
task_result result = llama . queue_results . recv ( task_id ) ;
2023-11-25 10:29:06 +01:00
2024-02-28 09:39:15 +01:00
if ( ! result . error & & result . stop ) {
json oaicompat_result = format_final_response_oaicompat ( data , result ) ;
2023-11-25 10:29:06 +01:00
2024-02-28 09:39:15 +01:00
res . set_content ( oaicompat_result . dump ( - 1 , ' ' , false ,
json : : error_handler_t : : replace ) ,
" application/json; charset=utf-8 " ) ;
} else {
res . status = 500 ;
res . set_content ( result . result_json [ " content " ] , " text/plain; charset=utf-8 " ) ;
}
llama . queue_results . remove_waiting_task_id ( task_id ) ;
} else {
const auto chunked_content_provider = [ task_id , & llama ] ( size_t , httplib : : DataSink & sink ) {
while ( true ) {
task_result llama_result = llama . queue_results . recv ( task_id ) ;
if ( ! llama_result . error ) {
std : : vector < json > result_array = format_partial_response_oaicompat ( llama_result ) ;
2023-11-25 10:29:06 +01:00
2024-02-28 09:39:15 +01:00
for ( auto it = result_array . begin ( ) ; it ! = result_array . end ( ) ; + + it )
{
if ( ! it - > empty ( ) ) {
2023-11-25 10:29:06 +01:00
const std : : string str =
2024-02-28 09:39:15 +01:00
" data: " +
it - > dump ( - 1 , ' ' , false , json : : error_handler_t : : replace ) +
2023-11-25 10:29:06 +01:00
" \n \n " ;
LOG_VERBOSE ( " data stream " , { { " to_send " , str } } ) ;
if ( ! sink . write ( str . c_str ( ) , str . size ( ) ) ) {
2024-01-26 13:42:20 +01:00
llama . queue_results . remove_waiting_task_id ( task_id ) ;
2023-11-25 10:29:06 +01:00
return false ;
}
}
}
2024-02-28 09:39:15 +01:00
if ( llama_result . stop ) {
break ;
}
} else {
const std : : string str =
" error: " +
llama_result . result_json . dump ( - 1 , ' ' , false ,
json : : error_handler_t : : replace ) +
" \n \n " ;
LOG_VERBOSE ( " data stream " , { { " to_send " , str } } ) ;
if ( ! sink . write ( str . c_str ( ) , str . size ( ) ) ) {
llama . queue_results . remove_waiting_task_id ( task_id ) ;
return false ;
}
break ;
}
}
sink . done ( ) ;
llama . queue_results . remove_waiting_task_id ( task_id ) ;
return true ;
} ;
2023-11-25 10:29:06 +01:00
2024-02-28 09:39:15 +01:00
auto on_complete = [ task_id , & llama ] ( bool ) {
// cancel request
llama . request_cancel ( task_id ) ;
llama . queue_results . remove_waiting_task_id ( task_id ) ;
} ;
2023-11-25 10:29:06 +01:00
2024-02-28 09:39:15 +01:00
res . set_chunked_content_provider ( " text/event-stream " , chunked_content_provider , on_complete ) ;
}
} ;
svr . Post ( " /chat/completions " , chat_completions ) ;
svr . Post ( " /v1/chat/completions " , chat_completions ) ;
2023-11-25 10:29:06 +01:00
2023-12-15 12:49:01 +01:00
svr . Post ( " /infill " , [ & llama , & validate_api_key ] ( const httplib : : Request & req , httplib : : Response & res )
2023-10-22 21:53:08 +02:00
{
2024-01-11 19:02:48 +01:00
res . set_header ( " Access-Control-Allow-Origin " , req . get_header_value ( " Origin " ) ) ;
2023-12-15 12:49:01 +01:00
if ( ! validate_api_key ( req , res ) ) {
return ;
}
2023-10-22 21:53:08 +02:00
json data = json : : parse ( req . body ) ;
2024-01-26 13:42:20 +01:00
const int task_id = llama . queue_tasks . get_new_id ( ) ;
llama . queue_results . add_waiting_task_id ( task_id ) ;
llama . request_completion ( task_id , data , true , false , - 1 ) ;
2023-10-22 21:53:08 +02:00
if ( ! json_value ( data , " stream " , false ) ) {
std : : string completion_text ;
2024-01-26 13:42:20 +01:00
task_result result = llama . queue_results . recv ( task_id ) ;
2023-10-22 21:53:08 +02:00
if ( ! result . error & & result . stop )
{
2023-12-17 15:56:09 +01:00
res . set_content ( result . result_json . dump ( - 1 , ' ' , false , json : : error_handler_t : : replace ) , " application/json; charset=utf-8 " ) ;
2023-07-02 23:38:44 +02:00
}
2023-10-22 21:53:08 +02:00
else
{
res . status = 404 ;
2023-12-17 15:56:09 +01:00
res . set_content ( result . result_json [ " content " ] , " text/plain; charset=utf-8 " ) ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
}
2024-01-26 13:42:20 +01:00
llama . queue_results . remove_waiting_task_id ( task_id ) ;
2023-10-02 09:42:02 +02:00
} else {
2023-10-22 21:53:08 +02:00
const auto chunked_content_provider = [ task_id , & llama ] ( size_t , httplib : : DataSink & sink ) {
while ( true )
{
2024-01-26 13:42:20 +01:00
task_result result = llama . queue_results . recv ( task_id ) ;
2023-10-22 21:53:08 +02:00
if ( ! result . error ) {
const std : : string str =
" data: " +
result . result_json . dump ( - 1 , ' ' , false , json : : error_handler_t : : replace ) +
" \n \n " ;
LOG_VERBOSE ( " data stream " , {
{ " to_send " , str }
} ) ;
if ( ! sink . write ( str . c_str ( ) , str . size ( ) ) )
{
2024-01-26 13:42:20 +01:00
llama . queue_results . remove_waiting_task_id ( task_id ) ;
2023-10-22 21:53:08 +02:00
return false ;
}
if ( result . stop )
{
break ;
}
}
else
{
break ;
}
2023-10-02 09:42:02 +02:00
}
2024-01-26 13:42:20 +01:00
llama . queue_results . remove_waiting_task_id ( task_id ) ;
2023-10-22 21:53:08 +02:00
sink . done ( ) ;
return true ;
} ;
2023-10-02 09:42:02 +02:00
2023-10-22 21:53:08 +02:00
auto on_complete = [ task_id , & llama ] ( bool )
{
// cancel
llama . request_cancel ( task_id ) ;
} ;
2023-10-02 09:42:02 +02:00
2023-10-22 21:53:08 +02:00
res . set_chunked_content_provider ( " text/event-stream " , chunked_content_provider , on_complete ) ;
2023-10-02 09:42:02 +02:00
}
2023-10-22 21:53:08 +02:00
} ) ;
2023-10-02 09:42:02 +02:00
2023-10-22 21:53:08 +02:00
svr . Options ( R " (/.*) " , [ ] ( const httplib : : Request & , httplib : : Response & res )
2023-12-17 15:56:09 +01:00
{ return res . set_content ( " " , " application/json; charset=utf-8 " ) ; } ) ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
2023-10-22 21:53:08 +02:00
svr . Post ( " /tokenize " , [ & llama ] ( const httplib : : Request & req , httplib : : Response & res )
{
2024-01-11 19:02:48 +01:00
res . set_header ( " Access-Control-Allow-Origin " , req . get_header_value ( " Origin " ) ) ;
2023-10-22 21:53:08 +02:00
const json body = json : : parse ( req . body ) ;
std : : vector < llama_token > tokens ;
if ( body . count ( " content " ) ! = 0 )
{
tokens = llama . tokenize ( body [ " content " ] , false ) ;
}
const json data = format_tokenizer_response ( tokens ) ;
2023-12-17 15:56:09 +01:00
return res . set_content ( data . dump ( ) , " application/json; charset=utf-8 " ) ;
2023-10-22 21:53:08 +02:00
} ) ;
2023-10-20 20:07:23 +02:00
2023-10-22 21:53:08 +02:00
svr . Post ( " /detokenize " , [ & llama ] ( const httplib : : Request & req , httplib : : Response & res )
{
2024-01-11 19:02:48 +01:00
res . set_header ( " Access-Control-Allow-Origin " , req . get_header_value ( " Origin " ) ) ;
2023-10-22 21:53:08 +02:00
const json body = json : : parse ( req . body ) ;
std : : string content ;
if ( body . count ( " tokens " ) ! = 0 )
{
const std : : vector < llama_token > tokens = body [ " tokens " ] ;
content = tokens_to_str ( llama . ctx , tokens . cbegin ( ) , tokens . cend ( ) ) ;
}
2023-10-20 20:07:23 +02:00
2023-10-22 21:53:08 +02:00
const json data = format_detokenized_response ( content ) ;
2023-12-17 15:56:09 +01:00
return res . set_content ( data . dump ( ) , " application/json; charset=utf-8 " ) ;
2023-10-22 21:53:08 +02:00
} ) ;
2023-06-20 00:12:39 +02:00
2023-10-22 21:53:08 +02:00
svr . Post ( " /embedding " , [ & llama ] ( const httplib : : Request & req , httplib : : Response & res )
{
2024-01-11 19:02:48 +01:00
res . set_header ( " Access-Control-Allow-Origin " , req . get_header_value ( " Origin " ) ) ;
2023-10-22 21:53:08 +02:00
const json body = json : : parse ( req . body ) ;
json prompt ;
if ( body . count ( " content " ) ! = 0 )
{
prompt = body [ " content " ] ;
}
else
{
prompt = " " ;
}
2023-12-29 15:22:10 +01:00
json image_data ;
if ( body . count ( " image_data " ) ! = 0 ) {
image_data = body [ " image_data " ] ;
}
else
{
image_data = " " ;
}
2024-01-26 13:42:20 +01:00
// create and queue the task
const int task_id = llama . queue_tasks . get_new_id ( ) ;
llama . queue_results . add_waiting_task_id ( task_id ) ;
llama . request_completion ( task_id , { { " prompt " , prompt } , { " n_predict " , 0 } , { " image_data " , image_data } } , false , true , - 1 ) ;
// get the result
task_result result = llama . queue_results . recv ( task_id ) ;
llama . queue_results . remove_waiting_task_id ( task_id ) ;
// send the result
2023-12-17 15:56:09 +01:00
return res . set_content ( result . result_json . dump ( ) , " application/json; charset=utf-8 " ) ;
2023-10-22 21:53:08 +02:00
} ) ;
2023-06-20 00:12:39 +02:00
2024-01-29 14:48:10 +01:00
svr . Post ( " /v1/embeddings " , [ & llama ] ( const httplib : : Request & req , httplib : : Response & res )
{
res . set_header ( " Access-Control-Allow-Origin " , req . get_header_value ( " Origin " ) ) ;
const json body = json : : parse ( req . body ) ;
json prompt ;
if ( body . count ( " input " ) ! = 0 )
{
prompt = body [ " input " ] ;
// batch
if ( prompt . is_array ( ) ) {
json data = json : : array ( ) ;
int i = 0 ;
for ( const json & elem : prompt ) {
const int task_id = llama . queue_tasks . get_new_id ( ) ;
llama . queue_results . add_waiting_task_id ( task_id ) ;
llama . request_completion ( task_id , { { " prompt " , elem } , { " n_predict " , 0 } } , false , true , - 1 ) ;
// get the result
task_result result = llama . queue_results . recv ( task_id ) ;
llama . queue_results . remove_waiting_task_id ( task_id ) ;
json embedding = json {
{ " embedding " , json_value ( result . result_json , " embedding " , json : : array ( ) ) } ,
{ " index " , i + + } ,
{ " object " , " embedding " }
} ;
data . push_back ( embedding ) ;
}
json result = format_embeddings_response_oaicompat ( body , data ) ;
return res . set_content ( result . dump ( ) , " application/json; charset=utf-8 " ) ;
}
}
else
{
prompt = " " ;
}
// create and queue the task
const int task_id = llama . queue_tasks . get_new_id ( ) ;
llama . queue_results . add_waiting_task_id ( task_id ) ;
llama . request_completion ( task_id , { { " prompt " , prompt } , { " n_predict " , 0 } } , false , true , - 1 ) ;
// get the result
task_result result = llama . queue_results . recv ( task_id ) ;
llama . queue_results . remove_waiting_task_id ( task_id ) ;
json data = json : : array ( { json {
{ " embedding " , json_value ( result . result_json , " embedding " , json : : array ( ) ) } ,
{ " index " , 0 } ,
{ " object " , " embedding " }
} }
) ;
json root = format_embeddings_response_oaicompat ( body , data ) ;
// send the result
return res . set_content ( root . dump ( ) , " application/json; charset=utf-8 " ) ;
} ) ;
2023-10-22 21:53:08 +02:00
// GG: if I put the main loop inside a thread, it crashes on the first request when build in Debug!?
// "Bus error: 10" - this is on macOS, it does not crash on Linux
//std::thread t2([&]()
2024-01-26 13:42:20 +01:00
/*{
2023-10-22 21:53:08 +02:00
bool running = true ;
while ( running )
{
running = llama . update_slots ( ) ;
}
2024-01-26 13:42:20 +01:00
} */
2023-10-22 21:53:08 +02:00
//);
2023-05-21 19:51:18 +02:00
2024-03-01 10:08:08 +01:00
if ( sparams . n_threads_http > 0 ) {
log_data [ " n_threads_http " ] = std : : to_string ( sparams . n_threads_http ) ;
svr . new_task_queue = [ & sparams ] { return new httplib : : ThreadPool ( sparams . n_threads_http ) ; } ;
}
2024-02-24 12:28:55 +01:00
LOG_INFO ( " HTTP server listening " , log_data ) ;
// run the HTTP server in a thread - see comment below
std : : thread t ( [ & ] ( )
{
if ( ! svr . listen_after_bind ( ) )
{
state . store ( SERVER_STATE_ERROR ) ;
return 1 ;
}
return 0 ;
} ) ;
2024-01-26 13:42:20 +01:00
llama . queue_tasks . on_new_task ( std : : bind (
& llama_server_context : : process_single_task , & llama , std : : placeholders : : _1 ) ) ;
llama . queue_tasks . on_finish_multitask ( std : : bind (
& llama_server_context : : on_finish_multitask , & llama , std : : placeholders : : _1 ) ) ;
2024-02-29 21:42:11 +01:00
llama . queue_tasks . on_run_slots ( std : : bind (
& llama_server_context : : update_slots , & llama ) ) ;
2024-01-26 13:42:20 +01:00
llama . queue_results . on_multitask_update ( std : : bind (
& llama_server_queue : : update_multitask ,
& llama . queue_tasks ,
std : : placeholders : : _1 ,
std : : placeholders : : _2 ,
std : : placeholders : : _3
) ) ;
2024-02-18 17:23:16 +01:00
shutdown_handler = [ & ] ( int ) {
llama . queue_tasks . terminate ( ) ;
} ;
# if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__))
struct sigaction sigint_action ;
sigint_action . sa_handler = signal_handler ;
sigemptyset ( & sigint_action . sa_mask ) ;
sigint_action . sa_flags = 0 ;
sigaction ( SIGINT , & sigint_action , NULL ) ;
# elif defined (_WIN32)
auto console_ctrl_handler = + [ ] ( DWORD ctrl_type ) - > BOOL {
return ( ctrl_type = = CTRL_C_EVENT ) ? ( signal_handler ( SIGINT ) , true ) : false ;
} ;
SetConsoleCtrlHandler ( reinterpret_cast < PHANDLER_ROUTINE > ( console_ctrl_handler ) , true ) ;
# endif
llama . queue_tasks . start_loop ( ) ;
svr . stop ( ) ;
2023-10-22 21:53:08 +02:00
t . join ( ) ;
2023-07-10 17:49:56 +02:00
2023-10-22 21:53:08 +02:00
llama_backend_free ( ) ;
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.
Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.
This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.
Summary of the changes:
- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables
---------
Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 13:53:04 +02:00
return 0 ;
2023-05-21 19:51:18 +02:00
}