# llama.cpp/example/server-parallel
This example demonstrates a proof-of-concept HTTP API server that handles simultaneous requests. Long prompts are not supported.
## Quick Start
To get started right away, run the following command, making sure to use the correct path for the model you have:
### Unix-based systems (Linux, macOS, etc.):
```bash
./server-parallel -m models/7B/ggml-model.gguf --ctx_size 2048 -t 4 -ngl 33 --batch-size 512 --parallel 3 -n 512 --cont-batching
```
### Windows:
```powershell
server-parallel.exe -m models\7B\ggml-model.gguf --ctx_size 2048 -t 4 -ngl 33 --batch-size 512 --parallel 3 -n 512 --cont-batching
```
The above command will start a server that listens on `127.0.0.1:8080` by default.
## API Endpoints
- **GET** `/props` : Returns the user and assistant names used to generate the prompt.
*Response:*
```json
{
"user_name": "User:",
"assistant_name": "Assistant:"
}
```
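For a quick check, the endpoint can be queried with `curl` (a minimal sketch, assuming the server is running on the default `127.0.0.1:8080` address from the Quick Start):

```bash
# Query the server properties; host/port assume the Quick Start defaults.
curl http://127.0.0.1:8080/props
```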
- **POST** `/completion` : Given a prompt, it returns the predicted completion. Only streaming mode is supported.
*Options:*
`temperature` : Adjust the randomness of the generated text (default: 0.1).
`prompt` : Provide the prompt as a string. It should be a coherent continuation of the system prompt.
`system_prompt` : Provide a system prompt as a string.
`anti_prompt` : Provide the name of the user, consistent with the system prompt.
`assistant_name` : Provide the name of the assistant, consistent with the system prompt.
*Example request:*
```json
{
"system_prompt": "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n\nHuman: Hello\nAssistant: Hi, how may I help you?\nHuman:",
"anti_prompt": "Human:",
"assistant_name": "Assistant:",
"prompt": "When is the day of independency of US?",
"temperature": 0.2
}
```
*Response:*
```json
{
"content": "< token_str > "
}
```
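As a minimal sketch, the example request above (saved to a hypothetical `request.json` file) could be sent with `curl`, again assuming the default `127.0.0.1:8080` address; the response is streamed token by token:

```bash
# Send the example request body (saved as request.json) to the completion endpoint.
# Host/port assume the Quick Start defaults; the predicted tokens stream back.
curl -X POST http://127.0.0.1:8080/completion \
    -H "Content-Type: application/json" \
    -d @request.json
```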
**Note:** This example is a proof of concept. It has some bugs and unexpected behaviors, and it does not support long prompts.