@llama.cpp
@results
Feature: Results

  Background: Server startup
    Given a server listening on localhost:8080
    And   a model file tinyllamas/split/stories15M-00001-of-00003.gguf from HF repo ggml-org/models
    And   a model file test-model-00001-of-00003.gguf
    And   128 as batch size
    And   1024 KV cache size
    And   128 max tokens to predict
    And   continuous batching

  Scenario Outline: consistent results with same seed
    Given <n_slots> slots
    Then  the server is starting
    Then  the server is healthy

    Given 4 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42
    Given concurrent completion requests
    Then the server is busy
    Then the server is idle
    And  all slots are idle
    Then all predictions are equal
    Examples:
      | n_slots |
      | 1       |
      | 2       |

  Scenario Outline: different results with different seed
    Given <n_slots> slots
    Then  the server is starting
    Then  the server is healthy

    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 43
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 44
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 45
    Given concurrent completion requests
    Then the server is busy
    Then the server is idle
    And  all slots are idle
    Then all predictions are different
    Examples:
      | n_slots |
      | 1       |
      | 2       |

  Scenario Outline: consistent results with same seed and varying batch size
    Given 4 slots
    And   <temp> temperature
    # And   0 as draft
    Then  the server is starting
    Then  the server is healthy

    Given 1 prompts "Write a very long story about AI." with seed 42
    And   concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then  the server is idle
    And   all slots are idle

    Given <n_parallel> prompts "Write a very long story about AI." with seed 42
    And   concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then  the server is idle
    And   all slots are idle

    Then all predictions are equal
    Examples:
      | n_parallel | temp |
      | 1          | 0.0  |
      | 2          | 0.0  |
      | 4          | 0.0  |
      | 1          | 1.0  |
      # FIXME: These tests fail on master.
      # Problems: unified KV cache (except for CPU backend with LLAMA_NO_LLAMAFILE=1), SIMD nondeterminism.
      # See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227
      # and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574
      # and https://github.com/ggerganov/llama.cpp/pull/7347 .
      # | 2          | 1.0  |
      # | 4          | 1.0  |

  Scenario Outline: consistent token probs with same seed and prompt
    Given <n_slots> slots
    And   <n_kv> KV cache size
    And   1.0 temperature
    And   <n_predict> max tokens to predict
    Then  the server is starting
    Then  the server is healthy

    Given 1 prompts "The meaning of life is" with seed 42
    And   concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then  the server is idle
    And   all slots are idle

    Given <n_parallel> prompts "The meaning of life is" with seed 42
    And   concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then  the server is idle
    And   all slots are idle

    Then all token probabilities are equal
    Examples:
      | n_slots | n_kv | n_predict | n_parallel |
      | 4       | 1024 | 1         | 1          |
      | 4       | 1024 | 1         | 4          |
      # FIXME: These tests fail on master.
      # Problems: unified KV cache (except for CPU backend with LLAMA_NO_LLAMAFILE=1), SIMD nondeterminism.
      # See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227
      # and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574
      # and https://github.com/ggerganov/llama.cpp/pull/7347 .
      # | 4       | 1024 | 100       | 1          |
      # This test still fails even with the above patches; the first token probabilities are already different.
      # | 4       | 1024 | 100       | 4          |
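
# Illustrative sketch (comments only, not executed by the test runner): the Background above
# roughly corresponds to launching the completion server with flags along the following lines.
# The binary name, the handling of the split GGUF and of the HF-repo download, and the
# per-scenario slot count are assumptions about the step definitions, not something this
# feature file guarantees:
#
#   ./llama-server --host localhost --port 8080 \
#       -m tinyllamas/split/stories15M-00001-of-00003.gguf \
#       -b 128 -c 1024 -n 128 --cont-batching --parallel <n_slots>
#
# Likewise, each 'prompts "..." with seed N' step presumably maps to a POST to the server's
# /completion endpoint with a JSON body roughly like the one below; n_probs is assumed to be
# set only for the token-probability scenario, whose "all token probabilities are equal" check
# would compare the returned completion_probabilities arrays:
#
#   {"prompt": "The meaning of life is", "seed": 42,
#    "temperature": 1.0, "n_predict": 1, "n_probs": 10}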