Server benchmark tools
The benchmark uses k6.
Install k6
Follow the instructions at: https://k6.io/docs/get-started/installation/
Example for Ubuntu:
snap install k6
Download a dataset
This dataset was originally proposed in vLLM benchmarks.
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
Download a model
Example for PHI-2:
../../../scripts/hf.sh --repo ggml-org/models --file phi-2/ggml-model-q4_0.gguf
Start the server
The server must answer OAI chat completion requests on http://localhost:8080/v1, or on the URL prefix set in the environment variable SERVER_BENCH_URL.
Example:
server --host localhost --port 8080 \
--model ggml-model-q4_0.gguf \
--cont-batching \
--metrics \
--parallel 8 \
--batch-size 512 \
--ctx-size 4096 \
--log-format text \
-ngl 33
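As a rough illustration of what "answering OAI chat completion requests" means, the sketch below builds the URL and JSON body of one such request. The helper name `buildChatRequest` and its option names are hypothetical and not taken from script.js. The defaults mirror the benchmark defaults listed in this document, and the `/chat/completions` path and body fields follow the OAI chat completions format.

```javascript
// Hypothetical helper (not part of script.js): builds one OAI-style
// chat completion request for a given user prompt.
function buildChatRequest(prompt, opts = {}) {
  const {
    baseUrl = "http://localhost:8080/v1", // same default as SERVER_BENCH_URL
    model = "my-model",                   // same default as SERVER_BENCH_MODEL_ALIAS
    maxTokens = 512,                      // same default as SERVER_BENCH_MAX_TOKENS
  } = opts;

  return {
    // The chat completions endpoint sits under the /v1 prefix.
    url: `${baseUrl}/chat/completions`,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: model,
      messages: [{ role: "user", content: prompt }],
      max_tokens: maxTokens,
    }),
  };
}
```

The returned object can be passed to any HTTP client to check by hand that the server responds before starting a full benchmark run.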
Run the benchmark
To run 500 chat completion requests with 8 concurrent users for at most 10 minutes:
k6 run script.js --duration 10m --iterations 500 --vus 8
The benchmark values can be overridden with:
SERVER_BENCH_URL
server url prefix for chat completions, default http://localhost:8080/v1
SERVER_BENCH_N_PROMPTS
total prompts to randomly select in the benchmark, default 480
SERVER_BENCH_MODEL_ALIAS
model alias to pass in the completion request, default my-model
SERVER_BENCH_MAX_TOKENS
max tokens to predict, default 512
SERVER_BENCH_DATASET
path to the benchmark dataset file
SERVER_BENCH_MAX_PROMPT_TOKENS
maximum prompt tokens to filter out in the dataset, default 1024
SERVER_BENCH_MAX_CONTEXT
maximum context size of the completions request to filter out in the dataset (prompt + predicted tokens), default 2048
Note: the local tokenizer is a simple split of the string on spaces, so the real number of tokens will differ.
Or with k6 options:
SERVER_BENCH_N_PROMPTS=500 k6 run script.js --duration 10m --iterations 500 --vus 8
To debug HTTP requests, use --http-debug="full".
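The dataset filtering controlled by SERVER_BENCH_MAX_PROMPT_TOKENS, SERVER_BENCH_MAX_CONTEXT and SERVER_BENCH_N_PROMPTS can be sketched roughly as follows. This is a minimal illustration, not the code of script.js: the function names are hypothetical, and the `conversations`, `from` and `value` field names are assumed from the ShareGPT dataset layout. Token counts use the approximate space-split tokenizer mentioned in the note above.

```javascript
// Approximate token count: split on whitespace and ignore empty pieces.
// Real tokenizers will give different numbers.
function countTokens(text) {
  return text.split(/\s+/).filter(Boolean).length;
}

// Hypothetical sketch of the dataset filtering.
// Assumes ShareGPT-style entries: { conversations: [{ from, value }, ...] }.
function selectPrompts(dataset, opts = {}) {
  const {
    maxPromptTokens = 1024, // SERVER_BENCH_MAX_PROMPT_TOKENS default
    maxContext = 2048,      // SERVER_BENCH_MAX_CONTEXT default
    nPrompts = 480,         // SERVER_BENCH_N_PROMPTS default
  } = opts;

  return dataset
    // Keep conversations with at least one question and one answer.
    .filter((d) => Array.isArray(d.conversations) && d.conversations.length >= 2)
    // Keep conversations that start with a user message.
    .filter((d) => d.conversations[0].from === "human")
    .map((d) => ({
      prompt: d.conversations[0].value,
      nPromptTokens: countTokens(d.conversations[0].value),
      nCompletionTokens: countTokens(d.conversations[1].value),
    }))
    // Drop prompts that are too long, alone or together with the reference answer.
    .filter((c) =>
      c.nPromptTokens <= maxPromptTokens &&
      c.nPromptTokens + c.nCompletionTokens <= maxContext)
    // Keep only the first nPrompts candidates.
    .slice(0, nPrompts);
}
```

Because the token count is only an approximation, a prompt that passes this filter may still exceed the limits once the server tokenizes it.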
Metrics
The following metrics are computed from the usage field of the OAI chat completions response:
llamacpp_tokens_second
Trend of usage.total_tokens / request duration
llamacpp_prompt_tokens
Trend of usage.prompt_tokens
llamacpp_prompt_tokens_total_counter
Counter of usage.prompt_tokens
llamacpp_completion_tokens
Trend of usage.completion_tokens
llamacpp_completion_tokens_total_counter
Counter of usage.completion_tokens
llamacpp_completions_truncated_rate
Rate of completions truncated, i.e. if finish_reason === 'length'
llamacpp_completions_stop_rate
Rate of completions stopped by the model, i.e. if finish_reason === 'stop'
The script will fail if too many completions are truncated, see llamacpp_completions_truncated_rate.
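To make the metric definitions above concrete, here is a minimal sketch of how they could be derived from parsed responses, written as plain JavaScript without the k6 metric classes. The function names are hypothetical, and reading `finish_reason` from `choices[0]` is assumed from the OAI response format. A k6 Trend also reports minimum, maximum and percentiles; only the average is computed here.

```javascript
// Per-request values taken from one parsed chat completion response.
// durationSeconds is the measured request duration, in seconds.
function completionSample(response, durationSeconds) {
  const usage = response.usage;
  // Assumption: finish_reason is read from the first choice.
  const finishReason = response.choices[0].finish_reason;
  return {
    tokensPerSecond: usage.total_tokens / durationSeconds,
    promptTokens: usage.prompt_tokens,
    completionTokens: usage.completion_tokens,
    truncated: finishReason === "length",
    stopped: finishReason === "stop",
  };
}

// Aggregate a non-empty list of samples into averages, counters and rates.
function summarize(samples) {
  const n = samples.length;
  const sum = (pick) => samples.reduce((acc, s) => acc + pick(s), 0);
  return {
    llamacpp_tokens_second_avg: sum((s) => s.tokensPerSecond) / n,
    llamacpp_prompt_tokens_total_counter: sum((s) => s.promptTokens),
    llamacpp_completion_tokens_total_counter: sum((s) => s.completionTokens),
    llamacpp_completions_truncated_rate: sum((s) => (s.truncated ? 1 : 0)) / n,
    llamacpp_completions_stop_rate: sum((s) => (s.stopped ? 1 : 0)) / n,
  };
}
```

Note that tokensPerSecond divides total tokens (prompt plus completion) by the whole request duration, so it is not a pure generation speed.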
The k6 metrics can be compared against the server metrics with:
curl http://localhost:8080/metrics