mirror of https://github.com/ggerganov/llama.cpp.git synced 2025-01-11 21:10:24 +01:00

History

Daniel Bevenius 23b5e12eb5

simple : update error message for KV cache check (#4324 )

This commit updates the error message that is printed when the
KV cache is not big enough to hold all the prompt and generated
tokens. Specifically it removes the reference to n_parallel and
replaces it with n_len.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

2023-12-04 18:04:21 +02:00

CMakeLists.txt

examples : add compiler version and target to build info (#2998 )

2023-09-15 16:59:49 -04:00

README.md

llama : custom attention mask + parallel decoding + no context swaps (#3228 )

2023-09-28 19:04:36 +03:00

simple.cpp

simple : update error message for KV cache check (#4324 )

2023-12-04 18:04:21 +02:00

README.md

llama.cpp/example/simple

The purpose of this example is to demonstrate a minimal usage of llama.cpp for generating text with a given prompt.

./simple ./models/llama-7b-v2/ggml-model-f16.gguf "Hello my name is"

...

main: n_len = 32, n_ctx = 2048, n_parallel = 1, n_kv_req = 32

 Hello my name is Shawn and I'm a 20 year old male from the United States. I'm a 20 year old

main: decoded 27 tokens in 2.31 s, speed: 11.68 t/s

llama_print_timings:        load time =   579.15 ms
llama_print_timings:      sample time =     0.72 ms /    28 runs   (    0.03 ms per token, 38888.89 tokens per second)
llama_print_timings: prompt eval time =   655.63 ms /    10 tokens (   65.56 ms per token,    15.25 tokens per second)
llama_print_timings:        eval time =  2180.97 ms /    27 runs   (   80.78 ms per token,    12.38 tokens per second)
llama_print_timings:       total time =  2891.13 ms