From 6893f3ac5d1d9bf789be512b7469e3da41ad318d Mon Sep 17 00:00:00 2001
From: kingbri
Date: Thu, 28 Nov 2024 15:42:22 -0500
Subject: [PATCH] llama: Add generic abort to token_decode_internal

Aborting a generation is required if a user wants to decode requests
sequentially. Otherwise, there is a segfault on the second request
because the first request has not finished yet.

Fortunately, llama.cpp already has a callback to check whether a user
has aborted during token decode. However, it is only honored in the
GGML backend for CPU and Metal; other backends such as CUDA are out of
luck.

Therefore, add a backend-agnostic check that runs once per batch. This
allows users to cancel their requests without having to wait for the
entire prompt processing operation to finish.

An example test is decoding an 8000-token prompt with a batch size of
2048 and then aborting. In this case, the abort takes effect sooner
because it is checked at every batch boundary instead of only after all
8000 tokens have been processed.

Signed-off-by: kingbri
---
 src/llama.cpp | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/src/llama.cpp b/src/llama.cpp
index 22b951ba2..25e3ae84d 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -17561,6 +17561,12 @@ static int llama_decode_internal(
     };
 
     while (lctx.sbatch.n_tokens > 0) {
+        // If aborted, break out
+        if (lctx.abort_callback != nullptr && lctx.abort_callback(lctx.abort_callback_data)) {
+            LLAMA_LOG_ERROR("%s: token decode aborted\n", __func__);
+            return -1;
+        }
+
         llama_ubatch ubatch;
         if (kv_self.recurrent) {
             if (embd_pooled) {
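
For reference, below is a minimal sketch of how a caller could drive this
per-batch check through the existing llama_set_abort_callback() API. The
atomic flag, the helper names, and the surrounding context/batch setup are
illustrative assumptions, not part of the patch.

// Sketch: wiring an abort flag into llama_decode() via the existing
// abort callback. Flag and helper names are illustrative.
#include <atomic>

#include "llama.h"

static std::atomic<bool> g_abort_requested{false};

// Matches the ggml_abort_callback signature: return true to request an abort.
static bool abort_cb(void * /*user_data*/) {
    return g_abort_requested.load();
}

// Call once after the context has been created.
static void install_abort_handler(llama_context * ctx) {
    llama_set_abort_callback(ctx, abort_cb, nullptr);
}

// Call from another thread, e.g. when the client cancels its request.
static void cancel_generation() {
    g_abort_requested.store(true);
}

// With this patch applied, llama_decode() returns a negative value as soon
// as the callback reports an abort at a batch boundary, instead of only
// after the whole prompt has been processed.
static int32_t decode_or_abort(llama_context * ctx, llama_batch batch) {
    const int32_t ret = llama_decode(ctx, batch);
    if (ret < 0) {
        // aborted by the callback, or another decode error
    }
    return ret;
}

Polling a single atomic flag keeps the callback cheap, which matters because
it is now invoked at every batch boundary on the decode thread regardless of
backend.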