llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2025-02-02 23:03:37 +01:00

History

Olivier Chafik 1c641e6aac `build`: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809 ) * `main`/`server`: rename to `llama` / `llama-server` for consistency w/ homebrew * server: update refs -> llama-server gitignore llama-server * server: simplify nix package * main: update refs -> llama fix examples/main ref * main/server: fix targets * update more names * Update build.yml * rm accidentally checked in bins * update straggling refs * Update .gitignore * Update server-llm.sh * main: target name -> llama-cli * Prefix all example bins w/ llama- * fix main refs * rename {main->llama}-cmake-pkg binary * prefix more cmake targets w/ llama- * add/fix gbnf-validator subfolder to cmake * sort cmake example subdirs * rm bin files * fix llama-lookup-* Makefile rules * gitignore /llama-* * rename Dockerfiles * rename llama\|main -> llama-cli; consistent RPM bin prefixes * fix some missing -cli suffixes * rename dockerfile w/ llama-cli * rename(make): llama-baby-llama * update dockerfile refs * more llama-cli(.exe) * fix test-eval-callback * rename: llama-cli-cmake-pkg(.exe) * address gbnf-validator unused fread warning (switched to C++ / ifstream) * add two missing llama- prefixes * Updating docs for eval-callback binary to use new `llama-` prefix. * Updating a few lingering doc references for rename of main to llama-cli * Updating `run-with-preset.py` to use new binary names. Updating docs around `perplexity` binary rename. * Updating documentation references for lookup-merge and export-lora * Updating two small `main` references missed earlier in the finetune docs. * Update apps.nix * update grammar/README.md w/ new llama-* names * update llama-rpc-server bin name + doc * Revert "update llama-rpc-server bin name + doc" This reverts commit `e474ef1df4`. * add hot topic notice to README.md * Update README.md * Update README.md * rename gguf-split & quantize bins refs in **/tests.sh --------- Co-authored-by: HanClinto <hanclinto@gmail.com>		2024-06-13 00:41:52 +01:00
..
bench.py	`build`: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809 )	2024-06-13 00:41:52 +01:00
prometheus.yml	server: continuous performance monitoring and PR comment (#6283 )	2024-03-27 20:26:49 +01:00
README.md	`build`: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809 )	2024-06-13 00:41:52 +01:00
requirements.txt	server: continuous performance monitoring and PR comment (#6283 )	2024-03-27 20:26:49 +01:00
script.js	bench: server add stop word for PHI-2 (#6916 )	2024-04-26 09:26:16 +02:00

README.md

Server benchmark tools

Benchmark is using k6.

Install k6 and sse extension

SSE is not supported by default in k6, you have to build k6 with the xk6-sse extension.

Example:

go install go.k6.io/xk6/cmd/xk6@latest
xk6 build master \
--with github.com/phymbert/xk6-sse

Download a dataset

This dataset was originally proposed in vLLM benchmarks.

wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

Download a model

Example for PHI-2

../../../scripts/hf.sh --repo ggml-org/models --file phi-2/ggml-model-q4_0.gguf

Start the server

The server must answer OAI Chat completion requests on http://localhost:8080/v1 or according to the environment variable SERVER_BENCH_URL.

Example:

server --host localhost --port 8080 \
  --model ggml-model-q4_0.gguf \
  --cont-batching \
  --metrics \
  --parallel 8 \
  --batch-size 512 \
  --ctx-size 4096 \
  --log-format text \
  -ngl 33

Run the benchmark

For 500 chat completions request with 8 concurrent users during maximum 10 minutes, run:

./k6 run script.js --duration 10m --iterations 500 --vus 8

The benchmark values can be overridden with:

SERVER_BENCH_URL server url prefix for chat completions, default http://localhost:8080/v1
SERVER_BENCH_N_PROMPTS total prompts to randomly select in the benchmark, default 480
SERVER_BENCH_MODEL_ALIAS model alias to pass in the completion request, default my-model
SERVER_BENCH_MAX_TOKENS max tokens to predict, default: 512
SERVER_BENCH_DATASET path to the benchmark dataset file
SERVER_BENCH_MAX_PROMPT_TOKENS maximum prompt tokens to filter out in the dataset: default 1024
SERVER_BENCH_MAX_CONTEXT maximum context size of the completions request to filter out in the dataset: prompt + predicted tokens, default 2048

Note: the local tokenizer is just a string space split, real number of tokens will differ.

Or with k6 options:

SERVER_BENCH_N_PROMPTS=500 k6 run script.js --duration 10m --iterations 500 --vus 8

To debug http request use --http-debug="full".

Metrics

Following metrics are available computed from the OAI chat completions response usage:

llamacpp_tokens_second Trend of usage.total_tokens / request duration
llamacpp_prompt_tokens Trend of usage.prompt_tokens
llamacpp_prompt_tokens_total_counter Counter of usage.prompt_tokens
llamacpp_completion_tokens Trend of usage.completion_tokens
llamacpp_completion_tokens_total_counter Counter of usage.completion_tokens
llamacpp_completions_truncated_rate Rate of completions truncated, i.e. if finish_reason === 'length'
llamacpp_completions_stop_rate Rate of completions stopped by the model, i.e. if finish_reason === 'stop'

The script will fail if too many completions are truncated, see llamacpp_completions_truncated_rate.

K6 metrics might be compared against server metrics, with:

curl http://localhost:8080/metrics

Using the CI python script

The bench.py script does several steps:

start the server
define good variable for k6
run k6 script
extract metrics from prometheus

It aims to be used in the CI, but you can run it manually:

LLAMA_SERVER_BIN_PATH=../../../cmake-build-release/bin/llama-server python bench.py \
              --runner-label local \
              --name local \
              --branch `git rev-parse --abbrev-ref HEAD` \
              --commit `git rev-parse HEAD` \
              --scenario script.js \
              --duration 5m \
              --hf-repo ggml-org/models	 \
              --hf-file phi-2/ggml-model-q4_0.gguf \
              --model-path-prefix models \
              --parallel 4 \
              -ngl 33 \
              --batch-size 2048 \
              --ubatch-size	256 \
              --ctx-size 4096 \
              --n-prompts 200 \
              --max-prompt-tokens 256 \
              --max-tokens 256