@llama.cpp
@results
Feature: Results

  Background: Server startup
    Given a server listening on localhost:8080
    And a model file tinyllamas/split/stories15M-00001-of-00003.gguf from HF repo ggml-org/models
    And a model file test-model-00001-of-00003.gguf
    And 128 as batch size
    And 1024 KV cache size
    And 128 max tokens to predict
    And continuous batching
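
  # For reference, the Background above maps onto server launch options; a
  # rough sketch of the equivalent command line (flag spellings are assumed
  # from the llama.cpp server README and may differ between versions):
  #
  #   ./server -m stories15M-00001-of-00003.gguf \
  #            --batch-size 128 --ctx-size 1024 --n-predict 128 --cont-batching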

  Scenario Outline: consistent results with same seed
    Given <n_slots> slots
    Then the server is starting
    Then the server is healthy

    Given 4 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42

    Given concurrent completion requests
    Then the server is busy
    Then the server is idle
    And all slots are idle
    Then all predictions are equal
    Examples:
      | n_slots |
      | 1       |
      | 2       |
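
  # A minimal sketch of how a step such as "all predictions are equal" can be
  # implemented in the behave step definitions. The response shape and the
  # context.tasks_result attribute are assumptions for illustration, not the
  # actual llama.cpp test code:
  #
  #   from behave import step
  #
  #   @step('all predictions are equal')
  #   def step_all_predictions_equal(context):
  #       contents = [res['content'] for res in context.tasks_result]
  #       # With a fixed seed, every completion must match the first exactly.
  #       assert all(c == contents[0] for c in contents), contents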

  Scenario Outline: different results with different seed
    Given <n_slots> slots
    Then the server is starting
    Then the server is healthy

    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 43
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 44
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 45

    Given concurrent completion requests
    Then the server is busy
    Then the server is idle
    And all slots are idle
    Then all predictions are different
    Examples:
      | n_slots |
      | 1       |
      | 2       |
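
  # Correspondingly, "all predictions are different" can be sketched as a
  # uniqueness check over the collected completions (same assumed names as in
  # the sketch above):
  #
  #   @step('all predictions are different')
  #   def step_all_predictions_different(context):
  #       contents = [res['content'] for res in context.tasks_result]
  #       # With distinct seeds, no two completions should be identical.
  #       assert len(set(contents)) == len(contents), contents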

  Scenario Outline: consistent results with same seed and varying batch size
    Given 4 slots
    And <temp> temperature
    # And 0 as draft
    Then the server is starting
    Then the server is healthy

    Given 1 prompts "Write a very long story about AI." with seed 42
    And concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then the server is idle
    And all slots are idle

    Given <n_parallel> prompts "Write a very long story about AI." with seed 42
    And concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then the server is idle
    And all slots are idle

    Then all predictions are equal
    Examples:
      | n_parallel | temp |
      | 1          | 0.0  |
      | 2          | 0.0  |
      | 4          | 0.0  |
      | 1          | 1.0  |
      # FIXME: These tests fail on master.
      # Problems: unified KV cache (except for CPU backend with LLAMA_NO_LLAMAFILE=1), SIMD nondeterminism.
      # See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227
      # and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574
      # and https://github.com/ggerganov/llama.cpp/pull/7347 .
      # | 2          | 1.0  |
      # | 4          | 1.0  |
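
  # Background on the SIMD nondeterminism mentioned in the FIXME above:
  # floating-point addition is not associative, so reducing logits in a
  # different order (as can happen when batches are split differently across
  # slots) may flip the sampled token. A plain-Python illustration:
  #
  #   >>> (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)
  #   False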

  Scenario Outline: consistent token probs with same seed and prompt
    Given <n_slots> slots
    And <n_kv> KV cache size
    And 1.0 temperature
    And <n_predict> max tokens to predict
    Then the server is starting
    Then the server is healthy

    Given 1 prompts "The meaning of life is" with seed 42
    And concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then the server is idle
    And all slots are idle

    Given <n_parallel> prompts "The meaning of life is" with seed 42
    And concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then the server is idle
    And all slots are idle

    Then all token probabilities are equal
    Examples:
      | n_slots | n_kv | n_predict | n_parallel |
      | 4       | 1024 | 1         | 1          |
      | 4       | 1024 | 1         | 4          |
      # FIXME: These tests fail on master.
      # Problems: unified KV cache (except for CPU backend with LLAMA_NO_LLAMAFILE=1), SIMD nondeterminism.
      # See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227
      # and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574
      # and https://github.com/ggerganov/llama.cpp/pull/7347 .
      # | 4       | 1024 | 100       | 1          |
      # This test still fails even with the above patches; the first token probabilities are already different.
      # | 4       | 1024 | 100       | 4          |
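
  # A sketch of the "all token probabilities are equal" check, assuming the
  # completion responses carry per-token probabilities (the field and
  # attribute names are assumptions, not the actual llama.cpp test code):
  #
  #   from behave import step
  #
  #   @step('all token probabilities are equal')
  #   def step_all_token_probs_equal(context):
  #       probs = [res['completion_probabilities'] for res in context.tasks_result]
  #       for other in probs[1:]:
  #           # Compare position by position against the first response.
  #           assert other == probs[0]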