1. Add a `LLAMA_SUPPORTS_GPU_OFFLOAD` define to `llama.h` (defined when compiled with CLBlast or cuBLAS).
2. Update the argument handling in the common example code to only show the `-ngl`, `--n-gpu-layers` option when GPU offload is possible.
3. Add an entry for the `-ngl`, `--n-gpu-layers` option to the `main` and `server` examples documentation.
4. Update the `main` and `server` examples documentation to use the new-style dash-separator argument format.
5. Update the `server` example to use dash separators for its arguments and add `-ngl` to `--help` (only shown when compiled with appropriate support). It will still support `--memory_f32` and `--ctx_size` for compatibility.
6. Add a warning discouraging use of `--memory-f32` to the `--help` text and the documentation of the `main` and `server` examples.

Rationale: https://github.com/ggerganov/llama.cpp/discussions/1593#discussioncomment-6004356
llama.cpp/example/server
This example lets you run a llama.cpp HTTP server that you can interact with from a web page or consume through its API.
Table of Contents
- Quick Start
- Node JS Test
- API Endpoints
- More examples
- Common Options
- Performance Tuning and Memory Options
Quick Start
To get started right away, run the following command, making sure to use the correct path for the model you have:
Unix-based systems (Linux, macOS, etc.):
./server -m models/7B/ggml-model.bin --ctx-size 2048
Windows:
server.exe -m models\7B\ggml-model.bin --ctx-size 2048
That will start a server that by default listens on 127.0.0.1:8080. You can consume the endpoints with Postman or from Node.js with the axios library.
Node JS Test
You need to have Node.js installed.
mkdir llama-client
cd llama-client
npm init
npm install axios
Create an index.js file with the following content:
const axios = require("axios");

const prompt = `Building a website can be done in 10 simple steps:`;

async function Test() {
    let result = await axios.post("http://127.0.0.1:8080/completion", {
        prompt,
        batch_size: 128,
        n_predict: 512,
    });

    // the response arrives only once the completion has finished
    console.log(result.data.content);
}

Test();
And run it:
node .
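Because the request options are plain JSON keys, a misspelled name is easy to miss. The sketch below is a hypothetical client-side helper (`buildCompletionBody` is not part of llama.cpp) that accepts only the option names documented under API Endpoints and throws on anything else. How the server itself treats unknown keys is not covered by this README, so treat the check as a convenience rather than a description of server behavior.

```javascript
// Option names documented for POST /completion in this README.
const KNOWN_OPTIONS = new Set([
    "batch_size", "temperature", "top_k", "top_p", "n_predict", "threads",
    "n_keep", "as_loop", "interactive", "stop", "exclude",
]);

// Hypothetical helper: build a request body, rejecting misspelled option names.
function buildCompletionBody(prompt, options = {}) {
    if (typeof prompt !== "string" || prompt.length === 0) {
        throw new TypeError("prompt must be a non-empty string");
    }
    for (const name of Object.keys(options)) {
        if (!KNOWN_OPTIONS.has(name)) {
            throw new RangeError(`unknown completion option: ${name}`);
        }
    }
    return { prompt, ...options };
}

const body = buildCompletionBody("Building a website can be done in 10 simple steps:", {
    batch_size: 128,
    n_predict: 512,
});
console.log(JSON.stringify(body));
```

The returned object can be passed as the second argument to `axios.post`, in place of the inline object literal used in the test.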
API Endpoints
You can interact with the following API endpoints. This implementation only supports chat-style interaction.
- POST hostname:port/completion: Set up the llama context and begin the completion task.
Options:
batch_size: Set the batch size for prompt processing (default: 512).
temperature: Adjust the randomness of the generated text (default: 0.8).
top_k: Limit the next token selection to the K most probable tokens (default: 40).
top_p: Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P (default: 0.9).
n_predict: Set the number of tokens to predict when generating text (default: 128, -1 = infinity).
threads: Set the number of threads to use during computation.
n_keep: Specify the number of tokens from the initial prompt to retain when the model resets its internal context. By default, this value is set to 0 (meaning no tokens are kept). Use -1 to retain all tokens from the initial prompt.
as_loop: Receive each predicted token in real time instead of waiting for the completion to finish. To enable this, set it to true.
interactive: Interact with the completion; generation stops as soon as a stop word is encountered. To enable this, set it to true.
prompt: Provide a prompt. Internally, the prompt is compared with what has already been evaluated, and only the remaining part is evaluated.
stop: Specify the words or characters that indicate a stop. These words will not be included in the completion, so make sure to add them to the prompt for the next iteration.
exclude: Specify the words or characters you do not want to appear in the completion. These words will not be included in the completion, so make sure to add them to the prompt for the next iteration.
- POST hostname:port/embedding: Generate the embedding of a given text.
Options:
content: Set the text to generate the embedding for.
threads: Set the number of threads to use during computation.
To use this endpoint, you need to start the server with the --embedding option.
- POST hostname:port/tokenize: Tokenize a given text.
Options:
content: Set the text to tokenize.
- GET hostname:port/next-token: Receive the next predicted token. Execute this request in a loop, and make sure as_loop is set to true in the completion request.
Options:
stop: Request hostname:port/next-token?stop=true to stop the token generation.
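To make the top_k and top_p options of the completion endpoint more concrete, here is a simplified illustration of how the two limits narrow a set of candidate tokens. It is a sketch of the idea only, not llama.cpp's sampler: the function name `filterCandidates` and the toy probabilities are made up, and a real sampler would also renormalize the remaining probabilities, apply the temperature, and draw a token at random.

```javascript
// Illustrative only: filter a toy next-token distribution the way the
// top_k and top_p descriptions suggest. Not llama.cpp's actual sampler.
function filterCandidates(probs, topK, topP) {
    // Sort tokens by probability, highest first, and keep the K most probable.
    const sorted = Object.entries(probs)
        .sort((a, b) => b[1] - a[1])
        .slice(0, topK);

    // Keep adding tokens until their cumulative probability reaches P.
    const kept = [];
    let cumulative = 0;
    for (const [token, prob] of sorted) {
        kept.push(token);
        cumulative += prob;
        if (cumulative >= topP) break;
    }
    return kept;
}

// Made-up probabilities for four candidate tokens.
const toy = { the: 0.5, a: 0.3, an: 0.15, zebra: 0.05 };
console.log(filterCandidates(toy, 3, 0.9)); // top_k = 3 drops "zebra"
console.log(filterCandidates(toy, 40, 0.7)); // top_p = 0.7 stops after "the" and "a"
```

Lower values of either option leave fewer candidates, which tends to make the output more predictable.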
More examples
Interactive mode
This mode allows interacting in a chat-like manner. It is recommended for models designed as assistants, such as Vicuna, WizardLM, and Koala, among others. Make sure to add the correct stop word for the corresponding model.
The prompt should be generated by you, according to the model's guidelines. You should keep adding the model's completions to the context as well.
This example works well for Vicuna - version 1.
const axios = require("axios");

let prompt = `A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
### Human: Hello, Assistant.
### Assistant: Hello. How may I help you today?
### Human: Please tell me the largest city in Europe.
### Assistant: Sure. The largest city in Europe is Moscow, the capital of Russia.`;

async function ChatCompletion(answer) {
    // append the user's next question to the prompt
    prompt += `\n### Human: ${answer}\n`;

    let result = await axios.post("http://127.0.0.1:8080/completion", {
        prompt,
        batch_size: 128,
        temperature: 0.2,
        top_k: 40,
        top_p: 0.9,
        n_keep: -1,
        n_predict: 2048,
        stop: ["\n### Human:"], // stop the completion when this is detected
        exclude: ["### Assistant:"], // do not show this in the completion
        threads: 8,
        as_loop: true, // use this to request the completion token by token
        interactive: true, // enable the detection of a stop word
    });

    // create a loop to receive every predicted token
    // note: this operation is blocking, avoid using it in a UI thread
    let message = "";
    while (true) {
        // you can stop the inference by adding '?stop=true', like this: http://127.0.0.1:8080/next-token?stop=true
        result = await axios.get("http://127.0.0.1:8080/next-token");
        process.stdout.write(result.data.content);
        message += result.data.content;

        // to avoid an infinite loop
        if (result.data.stop) {
            console.log("Completed");
            // make sure to add the completion to the prompt.
            prompt += `### Assistant: ${message}`;
            break;
        }
    }
}

// This function should be called every time a question to the model is needed.
async function Test() {
    // the server can't run inference in parallel
    await ChatCompletion("Write a long story about a time magician in a fantasy world");
    await ChatCompletion("Summarize the story");
}

Test();
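The example above grows the prompt by string concatenation. One possible alternative, sketched here under the assumption that you want to keep the conversation as data, is to store the turns in an array and render the prompt from it before each request. `buildChatPrompt` is a hypothetical helper, and the "### Human:" / "### Assistant:" layout is simply the one used by the example prompt; other models may expect a different format.

```javascript
// Hypothetical helper: render a list of turns in the "### Human:" /
// "### Assistant:" layout used by the example prompt above.
function buildChatPrompt(systemText, turns) {
    const lines = [systemText];
    for (const { role, text } of turns) {
        if (role !== "Human" && role !== "Assistant") {
            throw new RangeError(`unexpected role: ${role}`);
        }
        lines.push(`### ${role}: ${text}`);
    }
    return lines.join("\n");
}

const history = [
    { role: "Human", text: "Hello, Assistant." },
    { role: "Assistant", text: "Hello. How may I help you today?" },
];

// Add the next question, then send the rendered prompt to /completion.
history.push({ role: "Human", text: "Please tell me the largest city in Europe." });
console.log(buildChatPrompt(
    "A chat between a curious human and an artificial intelligence assistant.",
    history
));
```

After each completion, push an `Assistant` turn containing the received text so that the next rendered prompt includes it. Note that the rendered prompt has no trailing newline, unlike the concatenated version above.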
Alpaca example
Temporary note: this example has not been tested. If you have the model, please test it and report any issues.
const axios = require("axios");

let prompt = `Below is an instruction that describes a task. Write a response that appropriately completes the request.
`;

async function DoInstruction(instruction) {
    prompt += `\n\n### Instruction:\n\n${instruction}\n\n### Response:\n\n`;

    let result = await axios.post("http://127.0.0.1:8080/completion", {
        prompt,
        batch_size: 128,
        temperature: 0.2,
        top_k: 40,
        top_p: 0.9,
        n_keep: -1,
        n_predict: 2048,
        stop: ["### Instruction:\n\n"], // stop the completion when this is detected
        exclude: [], // do not show these in the completion
        threads: 8,
        as_loop: true, // use this to request the completion token by token
        interactive: true, // enable the detection of a stop word
    });

    // create a loop to receive every predicted token
    // note: this operation is blocking, avoid using it in a UI thread
    let message = "";
    while (true) {
        result = await axios.get("http://127.0.0.1:8080/next-token");
        process.stdout.write(result.data.content);
        message += result.data.content;

        // to avoid an infinite loop
        if (result.data.stop) {
            console.log("Completed");
            // make sure to add the completion and the user's next question to the prompt.
            prompt += message;
            break;
        }
    }
}

// This function should be called every time an instruction to the model is needed.
DoInstruction("Destroy the world"); // as a joke
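Both examples repeat the same polling loop against /next-token. As a sketch, the loop can be factored into a helper that takes the request function as a parameter, which also makes it possible to try the logic without a running server. `collectTokens`, `fakeServer`, and the `maxTokens` cap are hypothetical names introduced for illustration; the `{ content, stop }` shape is the one the examples read from `result.data`.

```javascript
// Hypothetical helper: poll for tokens until the server reports `stop`.
// `getNextToken` is any async function resolving to { content, stop },
// the shape the examples above read from result.data.
async function collectTokens(getNextToken, onToken = () => {}, maxTokens = 4096) {
    let message = "";
    // The cap bounds the loop in case `stop` never arrives.
    for (let i = 0; i < maxTokens; i++) {
        const { content, stop } = await getNextToken();
        onToken(content);
        message += content;
        if (stop) break;
    }
    return message;
}

// Stand-in for the server so the sketch runs without one.
function fakeServer(tokens) {
    let i = 0;
    return async () => {
        const content = tokens[i++];
        return { content, stop: i >= tokens.length };
    };
}

collectTokens(fakeServer(["Once", " upon", " a", " time"]), (t) => process.stdout.write(t))
    .then((message) => console.log(`\nCompleted: ${message.length} characters`));
```

Against the real server, the request function could be `async () => (await axios.get("http://127.0.0.1:8080/next-token")).data`.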
Embeddings
First, run the server with the --embedding option:
server -m models/7B/ggml-model.bin --ctx-size 2048 --embedding
Run this code in NodeJS:
const axios = require('axios');

async function Test() {
    let result = await axios.post("http://127.0.0.1:8080/embedding", {
        content: `Hello`,
        threads: 5
    });

    // print the embedding array
    console.log(result.data.embedding);
}

Test();
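A common use of embeddings is comparing two texts by the cosine similarity of their vectors. The sketch below assumes that `result.data.embedding` is a plain array of numbers, as the example above suggests; the toy vectors are made up so that the code runs without a server.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
    if (a.length !== b.length || a.length === 0) {
        throw new RangeError("vectors must be non-empty and of equal length");
    }
    let dot = 0;
    let normA = 0;
    let normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors standing in for two result.data.embedding arrays.
console.log(cosineSimilarity([3, 4], [3, 4])); // same direction
console.log(cosineSimilarity([1, 0], [0, 1])); // orthogonal
```

Values near 1 indicate vectors pointing in a similar direction, and values near 0 indicate unrelated ones. An all-zero vector yields NaN, so callers may want to guard against that case.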
Tokenize
Run this code in NodeJS:
const axios = require('axios');

async function Test() {
    let result = await axios.post("http://127.0.0.1:8080/tokenize", {
        content: `Hello`
    });

    // print the token array
    console.log(result.data.tokens);
}

Test();
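One possible use of the token list is checking, before sending a completion request, whether the prompt plus the requested number of predicted tokens stays within the context size the server was started with. This is a minimal sketch: `fitsInContext` is a hypothetical helper, the token IDs are made-up stand-ins for `result.data.tokens`, and the check says nothing about what the server does when the context is exceeded.

```javascript
// Hypothetical helper: does a prompt of `promptTokens` tokens leave room
// for `nPredict` generated tokens inside a context of `ctxSize` tokens?
function fitsInContext(promptTokens, nPredict, ctxSize) {
    return promptTokens + nPredict <= ctxSize;
}

// Made-up stand-in for result.data.tokens from the /tokenize endpoint.
const tokens = [1, 15043];
console.log(fitsInContext(tokens.length, 512, 2048));
console.log(fitsInContext(1800, 512, 2048));
```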
Common Options
-m FNAME, --model FNAME: Specify the path to the LLaMA model file (e.g., models/7B/ggml-model.bin).
-c N, --ctx-size N: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference.
-ngl N, --n-gpu-layers N: When compiled with appropriate support (currently CLBlast or cuBLAS), this option allows offloading some layers to the GPU for computation. This generally results in increased performance.
--embedding: Enable the embedding mode. The completion function doesn't work in this mode.
--host: Set the hostname or IP address to listen on. Default: 127.0.0.1.
--port: Set the port to listen on. Default: 8080.
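If the server is launched from a Node.js script, the options above can be assembled from a settings object. The following is a sketch only: `buildServerArgs` and its property names are hypothetical, and it emits just the flags documented in this section.

```javascript
// Hypothetical helper: turn a settings object into the documented flags.
function buildServerArgs({ model, ctxSize, gpuLayers, embedding, host, port }) {
    if (!model) throw new TypeError("model path is required");
    const args = ["-m", model];
    if (ctxSize !== undefined) args.push("--ctx-size", String(ctxSize));
    // Only meaningful when the server was compiled with GPU offload support.
    if (gpuLayers !== undefined) args.push("--n-gpu-layers", String(gpuLayers));
    if (embedding) args.push("--embedding");
    if (host !== undefined) args.push("--host", host);
    if (port !== undefined) args.push("--port", String(port));
    return args;
}

const args = buildServerArgs({ model: "models/7B/ggml-model.bin", ctxSize: 2048, port: 8080 });
console.log(["./server", ...args].join(" "));
```

The resulting array can be passed to `child_process.spawn("./server", args)`; keeping the arguments as an array avoids shell-quoting problems with model paths that contain spaces.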
RNG Seed
-s SEED, --seed SEED: Set the random number generator (RNG) seed (default: -1, < 0 = random seed).
The RNG seed is used to initialize the random number generator that influences the text generation process. By setting a specific seed value, you can obtain consistent and reproducible results across multiple runs with the same input and settings. This can be helpful for testing, debugging, or comparing the effects of different options on the generated text to see when they diverge. If the seed is set to a value less than 0, a random seed will be used, which will result in different outputs on each run.
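The effect of a fixed seed can be illustrated with a toy example. The generator below is the small public-domain mulberry32 algorithm; it is unrelated to the RNG inside llama.cpp, and the vocabulary and `sample` function are made up. It only demonstrates the principle that the same seed reproduces the same sequence of random choices.

```javascript
// A small seeded generator (mulberry32). It is unrelated to the RNG inside
// llama.cpp and only demonstrates what "same seed, same output" means.
function seededRandom(seed) {
    let state = seed >>> 0;
    return function next() {
        state = (state + 0x6d2b79f5) >>> 0;
        let t = state;
        t = Math.imul(t ^ (t >>> 15), t | 1);
        t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
        return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
    };
}

// Draw a few "tokens" from a toy vocabulary with a given seed.
function sample(seed, count) {
    const vocabulary = ["alpha", "beta", "gamma", "delta"];
    const random = seededRandom(seed);
    return Array.from({ length: count }, () => vocabulary[Math.floor(random() * vocabulary.length)]);
}

console.log(sample(42, 5).join(" "));
console.log(sample(42, 5).join(" ")); // same seed: identical to the line above
console.log(sample(7, 5).join(" ")); // different seed: usually a different sequence
```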
Performance Tuning and Memory Options
No Memory Mapping
--no-mmap: Do not memory-map the model. By default, models are mapped into memory, which allows the system to load only the necessary parts of the model as needed. However, if the model is larger than your total amount of RAM or if your system is low on available memory, using mmap might increase the risk of pageouts, negatively impacting performance.
Memory Float 32
--memory-f32: Use 32-bit floats instead of 16-bit floats for memory key+value. This doubles the context memory requirement but does not appear to increase generation quality in a measurable way. Not recommended.
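To get a feel for what doubling the context memory means, here is a back-of-the-envelope estimate. The formula is an assumption used for illustration (one key vector and one value vector per layer and per context position), not a figure taken from the llama.cpp source, and `kvCacheBytes` is a hypothetical helper. The dimensions are those of LLaMA 7B (32 layers, embedding size 4096) with a context of 2048.

```javascript
// Rough estimate: one key and one value vector of nEmbd elements is stored
// per layer and per context position. This formula is an assumption used
// for illustration, not a figure taken from the llama.cpp source.
function kvCacheBytes({ nLayer, nCtx, nEmbd, bytesPerElement }) {
    return 2 * nLayer * nCtx * nEmbd * bytesPerElement;
}

// LLaMA 7B dimensions (32 layers, embedding size 4096), context of 2048.
const dims = { nLayer: 32, nCtx: 2048, nEmbd: 4096 };
const f16 = kvCacheBytes({ ...dims, bytesPerElement: 2 });
const f32 = kvCacheBytes({ ...dims, bytesPerElement: 4 });

const GiB = 1024 ** 3;
console.log(`f16: ${f16 / GiB} GiB`); // f16: 1 GiB
console.log(`f32: ${f32 / GiB} GiB`); // f32: 2 GiB
```

Under these assumptions, --memory-f32 costs roughly an extra gibibyte at a 2048-token context for a 7B model, with no measurable quality gain according to the note above.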
Limitations:
- The current implementation of llama.cpp needs a llama-state to handle multiple contexts and clients, but this could require more powerful hardware.