mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-11-01 15:40:21 +01:00

llama.cpp/examples/batched-bench

Benchmark the batched decoding performance of llama.cpp

Usage

There are two modes of operation:

prompt not shared - each batch has its own prompt of size PP (i.e. N_KV = B*(PP + TG))
prompt is shared - a single common prompt of size PP is used by all batches (i.e. N_KV = PP + B*TG); enabled with -pps
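
The two N_KV formulas above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not code from the tool; the function name `required_kv` and its parameters are made up for this example.

```python
def required_kv(pp: int, tg: int, b: int, prompt_shared: bool) -> int:
    """KV cache size (in tokens) needed for one benchmark configuration.

    pp, tg, b mirror PP, TG and B from the text; the function itself is
    a hypothetical helper, not part of batched-bench.
    """
    if prompt_shared:
        # one common prompt, plus TG generated tokens for each of the B batches
        return pp + b * tg
    # every batch stores its own prompt and its own generated tokens
    return b * (pp + tg)


if __name__ == "__main__":
    for shared in (False, True):
        n_kv = required_kv(pp=128, tg=128, b=32, prompt_shared=shared)
        print(f"prompt_shared={shared}: N_KV = {n_kv}")
```

With PP = 128, TG = 128 and B = 32, sharing the prompt roughly halves the required cache, because the prompt is stored once rather than 32 times.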

./batched-bench -m model.gguf -c 2048 -b 2048 -ub 512 -npp 128,256,512 -ntg 128,256 -npl 1,2,4,8,16,32 [-pps]

# LLaMA 7B, F16, N_KV_MAX = 16384 (8GB), prompt not shared
./batched-bench -m ./models/llama-7b/ggml-model-f16.gguf -c 16384 -b 2048 -ub 512 -ngl 99

# LLaMA 7B, Q8_0, N_KV_MAX = 16384 (8GB), prompt is shared
./batched-bench -m ./models/llama-7b/ggml-model-q8_0.gguf -c 16384 -b 2048 -ub 512 -ngl 99 -pps

# custom set of batches
./batched-bench -m ./models/llama-7b/ggml-model-q8_0.gguf -c 2048 -b 512 -ub 512 -ngl 999 -npp 128,256,512 -ntg 128,256 -npl 1,2,4,8,16,32
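
Before running a custom set of batches, it can help to check which combinations fit in the context size passed with -c (labeled N_KV_MAX in the comments above). The sketch below enumerates the -npp, -ntg and -npl values from the last command and applies the N_KV formulas from the Usage section. The helper names are hypothetical, and how the tool itself treats a combination that does not fit is not covered here.

```python
from itertools import product


def parse_list(arg):
    """Turn a comma-separated CLI value such as "128,256" into a list of ints."""
    return [int(x) for x in arg.split(",")]


def plan(npp, ntg, npl, n_ctx, prompt_shared=False):
    """Return (PP, TG, B, N_KV, fits) for every combination of the three lists."""
    rows = []
    for pp, tg, b in product(parse_list(npp), parse_list(ntg), parse_list(npl)):
        n_kv = pp + b * tg if prompt_shared else b * (pp + tg)
        rows.append((pp, tg, b, n_kv, n_kv <= n_ctx))
    return rows


if __name__ == "__main__":
    # values taken from the "custom set of batches" command (-c 2048, no -pps)
    rows = plan("128,256,512", "128,256", "1,2,4,8,16,32", n_ctx=2048)
    for pp, tg, b, n_kv, fits in rows:
        print(f"PP={pp:<4} TG={tg:<4} B={b:<3} N_KV={n_kv:<6} {'ok' if fits else 'exceeds -c'}")
    print(sum(fits for *_, fits in rows), "of", len(rows), "combinations fit")
```

Raising -c or adding -pps changes which rows fit, so the same helper can be used to size the context before a long run.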

Sample results

PP - prompt tokens per batch (set with -npp)
TG - generated tokens per batch (set with -ntg)
B - number of batches (set with -npl)
N_KV - required KV cache size
T_PP - prompt processing time (i.e. time to first token)
S_PP - prompt processing speed ((B*PP)/T_PP when the prompt is not shared, PP/T_PP when it is shared)
T_TG - time to generate all batches
S_TG - text generation speed ((B*TG)/T_TG)
T - total time
S - total speed (i.e. all tokens / total time)
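
As a rough sketch, the derived columns can be recomputed from PP, TG, B and the two measured times using the definitions above. The function name `summarize` is hypothetical, and taking "all tokens" to equal N_KV is an assumption of this example.

```python
def summarize(pp, tg, b, t_pp, t_tg, prompt_shared=False):
    """Recompute N_KV, S_PP, S_TG, T and S from the measured times (seconds)."""
    # the prompt is processed once if shared, otherwise once per batch
    prompt_tokens = pp if prompt_shared else b * pp
    n_kv = prompt_tokens + b * tg
    t = t_pp + t_tg
    return {
        "N_KV": n_kv,
        "S_PP": prompt_tokens / t_pp,  # prompt tokens per second
        "S_TG": (b * tg) / t_tg,       # generated tokens per second
        "T": t,
        "S": n_kv / t,                 # assumption: "all tokens" is N_KV
    }


if __name__ == "__main__":
    # PP = 128, TG = 128, B = 8 with T_PP = 0.751 s and T_TG = 7.344 s
    for name, value in summarize(128, 128, 8, t_pp=0.751, t_tg=7.344).items():
        print(f"{name:>5} = {value:.2f}")
```

Values recomputed from the rounded times in a results table can differ slightly in the last digits from the printed speeds.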

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
128	128	1	256	0.108	1186.64	3.079	41.57	3.187	80.32
128	128	2	512	0.198	1295.19	5.029	50.90	5.227	97.95
128	128	4	1024	0.373	1373.96	6.878	74.44	7.251	141.23
128	128	8	2048	0.751	1363.27	7.344	139.43	8.095	252.99
128	128	16	4096	1.570	1304.68	8.455	242.23	10.024	408.60
128	128	32	8192	3.408	1201.73	8.801	465.40	12.209	670.96
128	256	1	384	0.107	1196.70	6.329	40.45	6.436	59.67
128	256	2	768	0.194	1317.45	10.239	50.00	10.433	73.61
128	256	4	1536	0.366	1399.03	13.960	73.35	14.326	107.22
128	256	8	3072	0.751	1363.92	15.110	135.54	15.861	193.69
128	256	16	6144	1.569	1304.93	18.073	226.64	19.642	312.80
128	256	32	12288	3.409	1201.35	19.223	426.15	22.633	542.93
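
One way to read the table is to compare S_TG at each B against the B = 1 row. The sketch below does this for the PP = 128, TG = 128 rows, using the values copied from the table above; the helper name `scaling` is made up for this example.

```python
# S_TG (t/s) for PP = 128, TG = 128, copied from the sample results table
s_tg_by_batches = {1: 41.57, 2: 50.90, 4: 74.44, 8: 139.43, 16: 242.23, 32: 465.40}


def scaling(s_tg):
    """Map B -> (speedup over B = 1, generation speed per batch in t/s)."""
    base = s_tg[1]
    return {b: (speed / base, speed / b) for b, speed in s_tg.items()}


if __name__ == "__main__":
    for b, (speedup, per_batch) in scaling(s_tg_by_batches).items():
        print(f"B={b:>2}  speedup x{speedup:5.2f}  per-batch {per_batch:6.2f} t/s")
```

In these rows the aggregate generation speed keeps rising as B grows, while the speed seen by each individual batch falls.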