Pierrick Hymbert a016026a3a
server: continuous performance monitoring and PR comment (#6283)
* server: bench: init

* server: bench: reduce list of GPU nodes

* server: bench: fix graph, fix output artifact

* ci: bench: add mermaid in case of image cannot be uploaded

* ci: bench: more resilient, more metrics

* ci: bench: trigger build

* ci: bench: fix duration

* ci: bench: fix typo

* ci: bench: fix mermaid values, markdown generated

* typo on the step name

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* ci: bench: trailing spaces

* ci: bench: move images in a details section

* ci: bench: reduce bullet point size

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-03-27 20:26:49 +01:00

Server benchmark tools

The benchmark uses k6.

Install k6

Follow the instructions at: https://k6.io/docs/get-started/installation/

Example for Ubuntu:

snap install k6

Download a dataset

This dataset was originally proposed in vLLM benchmarks.

wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

Download a model

Example for Phi-2:

../../../scripts/hf.sh --repo ggml-org/models --file phi-2/ggml-model-q4_0.gguf

Start the server

The server must answer OAI (OpenAI-compatible) chat completion requests at http://localhost:8080/v1, or at the URL set in the SERVER_BENCH_URL environment variable.

Example:

server --host localhost --port 8080 \
  --model ggml-model-q4_0.gguf \
  --cont-batching \
  --metrics \
  --parallel 8 \
  --batch-size 512 \
  --ctx-size 4096 \
  --log-format text \
  -ngl 33
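Once the server is up, the kind of request it is expected to accept can be sketched as follows. This is only an illustration in plain JavaScript, not code taken from script.js: buildChatRequest is a hypothetical helper, the /chat/completions path is the usual OAI route appended to the URL prefix, and the fallback values are the defaults documented in this README.

```javascript
// Sketch only: builds (but does not send) an OAI-style chat completion request.
// buildChatRequest is a hypothetical helper, not a function from script.js.
// In a k6 script the env object would typically be __ENV; here it is a plain object.
function buildChatRequest(prompt, env = {}) {
  const baseUrl = env.SERVER_BENCH_URL || "http://localhost:8080/v1";
  const model = env.SERVER_BENCH_MODEL_ALIAS || "my-model";
  const maxTokens = Number(env.SERVER_BENCH_MAX_TOKENS || 512);
  return {
    url: `${baseUrl}/chat/completions`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model,
      messages: [{ role: "user", content: prompt }],
      max_tokens: maxTokens,
    }),
  };
}

// With no overrides, the request targets the default URL prefix.
const exampleRequest = buildChatRequest("Hello");
console.log(exampleRequest.url);
console.log(exampleRequest.body);
```

Passing the environment as an argument keeps the helper easy to exercise outside k6.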

Run the benchmark

To run 500 chat completion requests with 8 concurrent users for at most 10 minutes:

k6 run script.js --duration 10m --iterations 500 --vus 8

The benchmark values can be overridden with the following environment variables:

  • SERVER_BENCH_URL: server URL prefix for chat completions, default http://localhost:8080/v1
  • SERVER_BENCH_N_PROMPTS: total number of prompts to randomly select for the benchmark, default 480
  • SERVER_BENCH_MODEL_ALIAS: model alias to pass in the completion request, default my-model
  • SERVER_BENCH_MAX_TOKENS: maximum number of tokens to predict, default 512
  • SERVER_BENCH_DATASET: path to the benchmark dataset file
  • SERVER_BENCH_MAX_PROMPT_TOKENS: maximum number of prompt tokens; dataset entries with longer prompts are filtered out, default 1024
  • SERVER_BENCH_MAX_CONTEXT: maximum context size of a completion request (prompt + predicted tokens); dataset entries exceeding it are filtered out, default 2048

Note: the local tokenizer simply splits strings on spaces, so the real number of tokens will differ.

Environment variables can also be combined with k6 options:

SERVER_BENCH_N_PROMPTS=500 k6 run script.js --duration 10m --iterations 500 --vus 8

To debug HTTP requests, use --http-debug="full".
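The dataset filtering controlled by SERVER_BENCH_MAX_PROMPT_TOKENS, SERVER_BENCH_MAX_CONTEXT and SERVER_BENCH_N_PROMPTS can be sketched as below. This is a minimal illustration, not the code of script.js: countTokens and selectPrompts are hypothetical names, the { prompt, completion } input shape is an assumption made for brevity, and the sketch keeps the first matching prompts instead of selecting them randomly.

```javascript
// Sketch only: approximate token counts by splitting on whitespace, then keep
// the prompts that fit the documented limits. Not the actual script.js code.
function countTokens(text) {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// conversations: array of { prompt, completion } (hypothetical, simplified shape).
function selectPrompts(conversations, limits = {}) {
  const maxPromptTokens = limits.maxPromptTokens ?? 1024; // SERVER_BENCH_MAX_PROMPT_TOKENS
  const maxContext = limits.maxContext ?? 2048;           // SERVER_BENCH_MAX_CONTEXT
  const nPrompts = limits.nPrompts ?? 480;                // SERVER_BENCH_N_PROMPTS
  return conversations
    .map((c) => ({
      prompt: c.prompt,
      promptTokens: countTokens(c.prompt),
      completionTokens: countTokens(c.completion),
    }))
    // Drop prompts that are too long on their own.
    .filter((c) => c.promptTokens <= maxPromptTokens)
    // Drop entries whose prompt plus completion would exceed the context size.
    .filter((c) => c.promptTokens + c.completionTokens <= maxContext)
    // Simplification: take the first nPrompts rather than a random selection.
    .slice(0, nPrompts);
}

const sampleConversations = [
  { prompt: "What is k6?", completion: "A load testing tool." },
  { prompt: "word ".repeat(2000), completion: "ok" }, // 2000 "tokens": filtered out
];
console.log(selectPrompts(sampleConversations).length); // prints 1
```

Because the count is a whitespace split, the limits act as a rough pre-filter; the server's tokenizer decides the real token usage.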

Metrics

The following metrics are computed from the usage field of the OAI chat completions response:

  • llamacpp_tokens_second Trend of usage.total_tokens / request duration
  • llamacpp_prompt_tokens Trend of usage.prompt_tokens
  • llamacpp_prompt_tokens_total_counter Counter of usage.prompt_tokens
  • llamacpp_completion_tokens Trend of usage.completion_tokens
  • llamacpp_completion_tokens_total_counter Counter of usage.completion_tokens
  • llamacpp_completions_truncated_rate Rate of completions truncated, i.e. if finish_reason === 'length'
  • llamacpp_completions_stop_rate Rate of completions stopped by the model, i.e. if finish_reason === 'stop'

The script fails if too many completions are truncated; see llamacpp_completions_truncated_rate.
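How these metrics derive from each response can be sketched in plain JavaScript. The summarize function and the { usage, finishReason, durationSeconds } result shape are hypothetical; in the real benchmark, k6 Trend, Counter and Rate objects do the aggregation, and the duration unit used here (seconds) is an assumption.

```javascript
// Sketch only: aggregates the documented metrics from a list of results.
// Each result is { usage, finishReason, durationSeconds } (hypothetical shape).
function summarize(results) {
  const summary = {
    tokensSecond: [],          // samples for llamacpp_tokens_second
    promptTokensTotal: 0,      // llamacpp_prompt_tokens_total_counter
    completionTokensTotal: 0,  // llamacpp_completion_tokens_total_counter
    truncatedRate: 0,          // llamacpp_completions_truncated_rate
    stopRate: 0,               // llamacpp_completions_stop_rate
  };
  let truncated = 0;
  let stopped = 0;
  for (const r of results) {
    summary.tokensSecond.push(r.usage.total_tokens / r.durationSeconds);
    summary.promptTokensTotal += r.usage.prompt_tokens;
    summary.completionTokensTotal += r.usage.completion_tokens;
    if (r.finishReason === "length") truncated += 1;
    if (r.finishReason === "stop") stopped += 1;
  }
  if (results.length > 0) {
    summary.truncatedRate = truncated / results.length;
    summary.stopRate = stopped / results.length;
  }
  return summary;
}

const exampleSummary = summarize([
  { usage: { prompt_tokens: 10, completion_tokens: 30, total_tokens: 40 }, finishReason: "stop", durationSeconds: 2 },
  { usage: { prompt_tokens: 20, completion_tokens: 512, total_tokens: 532 }, finishReason: "length", durationSeconds: 4 },
]);
console.log(exampleSummary);
```

A high truncatedRate means many completions hit the max tokens limit, which is the condition the failure check above guards against.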

k6 metrics can be compared against the server metrics with:

curl http://localhost:8080/metrics
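To compare the two sides programmatically, the text returned by the command above can be turned into a name/value map. The sketch below assumes the endpoint returns the Prometheus text exposition format without timestamps; parseMetrics is a hypothetical helper and the sample metric name is made up for illustration, so check the actual /metrics output for real names.

```javascript
// Sketch only: parses Prometheus text exposition output into a name -> value map.
// Comment lines (# HELP, # TYPE) are skipped; labels, if any, stay part of the key.
// Assumes each sample line is "<name> <value>" with no trailing timestamp.
function parseMetrics(text) {
  const metrics = {};
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "" || trimmed.startsWith("#")) continue;
    const idx = trimmed.lastIndexOf(" ");
    if (idx < 0) continue;
    const value = Number(trimmed.slice(idx + 1));
    if (!Number.isNaN(value)) metrics[trimmed.slice(0, idx)] = value;
  }
  return metrics;
}

// Made-up sample; real metric names come from the server's /metrics output.
const sampleMetricsText = [
  "# HELP example_prompt_tokens_total Illustrative counter.",
  "# TYPE example_prompt_tokens_total counter",
  "example_prompt_tokens_total 1234",
].join("\n");
console.log(parseMetrics(sampleMetricsText));
```

The resulting counters can then be set side by side with the k6 counters, such as llamacpp_prompt_tokens_total_counter.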