mirror of
https://github.com/ggerganov/llama.cpp.git
synced 2025-01-23 09:59:18 +01:00
# llama.cpp/example/server-parallel

This example demonstrates a proof-of-concept (PoC) HTTP API server that handles simultaneous requests. Long prompts are not supported.

## Quick Start

To get started right away, run the following command, making sure to use the correct path for the model you have:

### Unix-based systems (Linux, macOS, etc.):

```bash
./server-parallel -m models/7B/ggml-model.gguf --ctx_size 2048 -t 4 -ngl 33 --batch-size 512 --parallel 3 -n 512 --cont-batching
```

### Windows:

```powershell
server-parallel.exe -m models\7B\ggml-model.gguf --ctx_size 2048 -t 4 -ngl 33 --batch-size 512 --parallel 3 -n 512 --cont-batching
```

The above command will start a server that by default listens on `127.0.0.1:8080`.
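
Before sending requests from a script, it can help to wait until the default address accepts connections. The following is a minimal sketch in Python, not part of the example itself. The helper name `wait_for_server` and its timeout values are hypothetical, and a successful TCP connect only shows that something is listening on the port, not that the model has finished loading.

```python
import socket
import time


def wait_for_server(host: str = "127.0.0.1", port: int = 8080,
                    timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect only proves that a listener exists
            # on the port; it says nothing about model load state.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet (or the connect timed out); retry.
            time.sleep(interval)
    return False


if __name__ == "__main__":
    print("server reachable:", wait_for_server())
```
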
## API Endpoints

- **GET** `/props`: Returns the user and assistant names used to generate the prompt.

*Response:*
```json
{
    "user_name": "User:",
    "assistant_name": "Assistant:"
}
```
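
As an illustration, a client could read these two names and use them to frame a chat turn. This is a minimal sketch that assumes the server runs at the default address from the Quick Start. The helper names are hypothetical, and the layout produced by `format_turn` is an assumption, not the template the server uses internally.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # default address from the Quick Start


def parse_props(body: bytes) -> tuple[str, str]:
    """Extract (user_name, assistant_name) from a /props response body."""
    props = json.loads(body)
    return props["user_name"], props["assistant_name"]


def fetch_props(base_url: str = BASE_URL) -> tuple[str, str]:
    """GET /props from a running server and return the two names."""
    with urllib.request.urlopen(f"{base_url}/props", timeout=10) as resp:
        return parse_props(resp.read())


def format_turn(user_name: str, assistant_name: str, message: str) -> str:
    # Hypothetical layout: the exact prompt template is an assumption.
    return f"{user_name} {message}\n{assistant_name}"


if __name__ == "__main__":
    user, assistant = fetch_props()
    print(format_turn(user, assistant, "Hello"))
```
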

- **POST** `/completion`: Given a prompt, returns the predicted completion. Only streaming mode is supported.

*Options:*

`temperature`: Adjusts the randomness of the generated text (default: 0.1).

`prompt`: The prompt, as a string. It should be a coherent continuation of the system prompt.

`system_prompt`: The system prompt, as a string.

`anti_prompt`: The name of the user, consistent with the system prompt.

`assistant_name`: The name of the assistant, consistent with the system prompt.

*Example request:*

The `system_prompt`, `anti_prompt`, and `assistant_name` fields change the system prompt at runtime; `prompt` and `temperature` are the required options.

```json
{
    "system_prompt": "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n\nHuman: Hello\nAssistant: Hi, how may I help you?\nHuman:",
    "anti_prompt": "Human:",
    "assistant_name": "Assistant:",
    "prompt": "When is Independence Day in the US?",
    "temperature": 0.2
}
```

*Response:*
```json
{
    "content": "<token_str>"
}
```
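
The request and response shapes above can be exercised with a small streaming client. This is a sketch under stated assumptions, not the example's own client: the function names are hypothetical, and the framing of the stream (one JSON object per line, optionally prefixed with `data: `) is an assumption that may need adjusting to match what the server actually sends.

```python
import json
import urllib.request
from typing import Iterable, Iterator

BASE_URL = "http://127.0.0.1:8080"  # default address from the Quick Start


def build_request(prompt: str, temperature: float = 0.1, **extra: str) -> bytes:
    """Encode a /completion body.

    `extra` may carry system_prompt, anti_prompt, and assistant_name.
    """
    payload = {"prompt": prompt, "temperature": temperature, **extra}
    return json.dumps(payload).encode("utf-8")


def iter_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the `content` field of each streamed chunk.

    Assumption: one JSON object per line, optionally prefixed with "data: ".
    """
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if line.startswith("data:"):
            line = line[len("data:"):].strip()
        if not line:
            continue  # skip blank separator lines
        yield json.loads(line)["content"]


def complete(prompt: str, base_url: str = BASE_URL, **options) -> str:
    """POST to /completion and join the streamed tokens into one string."""
    req = urllib.request.Request(
        f"{base_url}/completion",
        data=build_request(prompt, **options),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Iterating over the response yields the body line by line.
    with urllib.request.urlopen(req) as resp:
        return "".join(iter_tokens(resp))


if __name__ == "__main__":
    print(complete("When is Independence Day in the US?", temperature=0.2))
```

Since the server is meant to handle simultaneous requests, several `complete` calls could be issued from separate threads to exercise that behavior.
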

# This example is a proof of concept: it has some bugs and unexpected behaviors, and it does not support long prompts.