llama: Add generic abort to token_decode_internal
Aborting a generation is required if a user wants to decode requests sequentially; otherwise the second request segfaults because the first one has not finished yet. llama.cpp already exposes an abort callback that is checked during token decode, but it is only wired into the GGML graph compute for the CPU and Metal backends, so other backends such as CUDA never see it.

Therefore, add a backend-agnostic check that runs once per batch. This lets users cancel a request without waiting for the entire prompt-processing pass to finish. For example, when decoding an 8000-token prompt with a batch size of 2048 and then aborting, the abort takes effect at the next batch boundary instead of only after all 8000 tokens have been processed.

Signed-off-by: kingbri <bdashore3@proton.me>
parent dc22344088
commit 6893f3ac5d
@@ -17561,6 +17561,12 @@ static int llama_decode_internal(
     };
 
     while (lctx.sbatch.n_tokens > 0) {
+        // If aborted, break out
+        if (lctx.abort_callback != nullptr && lctx.abort_callback(lctx.abort_callback_data)) {
+            LLAMA_LOG_ERROR("%s: token decode aborted\n", __func__);
+            return -1;
+        }
+
         llama_ubatch ubatch;
         if (kv_self.recurrent) {
             if (embd_pooled) {
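A minimal caller-side sketch of how the abort callback might be used, assuming the llama.h API of this period (llama_set_abort_callback, llama_load_model_from_file, llama_new_context_with_model); it is not part of this commit. The model path and the cancellation flag are placeholders: another thread sets the flag, and llama_decode then returns a negative value at the next batch boundary instead of processing the rest of the prompt.

#include <atomic>
#include "llama.h"

// Cancellation flag that another thread (e.g. an HTTP request handler) can set.
static std::atomic<bool> g_cancel_requested{false};

// Matches ggml_abort_callback: returning true asks the decode loop to abort.
static bool abort_cb(void * /*user_data*/) {
    return g_cancel_requested.load();
}

int main() {
    llama_backend_init();

    llama_model * model = llama_load_model_from_file("model.gguf", llama_model_default_params()); // placeholder path

    llama_context_params cparams = llama_context_default_params();
    cparams.n_batch = 2048;   // prompt is processed in chunks of at most 2048 tokens
    llama_context * ctx = llama_new_context_with_model(model, cparams);

    // Register the callback; with this patch it is checked before every batch
    // on any backend, not only where the GGML graph callback is honored.
    llama_set_abort_callback(ctx, abort_cb, nullptr);

    // ... build a llama_batch with the prompt tokens, then:
    // if (llama_decode(ctx, batch) < 0) {
    //     // negative return value: decode failed or was aborted mid-prompt
    // }

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}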