mirror of
https://github.com/ggerganov/llama.cpp.git
synced 2025-01-27 04:23:06 +01:00
docs: update server streaming mode documentation (#9519)
Provide more documentation for streaming mode.
parent
973f328b1e
commit
5555c0c1f6
@ -303,23 +303,23 @@ mkdir llama-client
|
|||||||
cd llama-client
|
cd llama-client
|
||||||
```
|
```
|
||||||
|
|
||||||
Create a index.js file and put this inside:
|
Create an index.js file and put this inside:
|
||||||
|
|
||||||
```javascript
|
```javascript
|
||||||
const prompt = `Building a website can be done in 10 simple steps:`;
|
const prompt = "Building a website can be done in 10 simple steps:"
|
||||||
|
|
||||||
async function Test() {
|
async function test() {
|
||||||
let response = await fetch("http://127.0.0.1:8080/completion", {
|
let response = await fetch("http://127.0.0.1:8080/completion", {
|
||||||
method: 'POST',
|
method: "POST",
|
||||||
body: JSON.stringify({
|
body: JSON.stringify({
|
||||||
prompt,
|
prompt,
|
||||||
n_predict: 512,
|
n_predict: 64,
|
||||||
})
|
})
|
||||||
})
|
})
|
||||||
console.log((await response.json()).content)
|
console.log((await response.json()).content)
|
||||||
}
|
}
|
||||||
|
|
||||||
Test()
|
test()
|
||||||
```
|
```
|
||||||
|
|
||||||
And run it:
|
And run it:
|
||||||
@ -381,7 +381,7 @@ Multiple prompts are also supported. In this case, the completion result will be
|
|||||||
`n_keep`: Specify the number of tokens from the prompt to retain when the context size is exceeded and tokens need to be discarded. The number excludes the BOS token.
|
`n_keep`: Specify the number of tokens from the prompt to retain when the context size is exceeded and tokens need to be discarded. The number excludes the BOS token.
|
||||||
By default, this value is set to `0`, meaning no tokens are kept. Use `-1` to retain all tokens from the prompt.
|
By default, this value is set to `0`, meaning no tokens are kept. Use `-1` to retain all tokens from the prompt.
|
||||||
|
|
||||||
`stream`: It allows receiving each predicted token in real-time instead of waiting for the completion to finish. To enable this, set to `true`.
|
`stream`: Allows receiving each predicted token in real-time instead of waiting for the completion to finish (uses a different response format). To enable this, set to `true`.
|
||||||
|
|
||||||
`stop`: Specify a JSON array of stopping strings.
|
`stop`: Specify a JSON array of stopping strings.
|
||||||
These words will not be included in the completion, so make sure to add them to the prompt for the next iteration. Default: `[]`
|
These words will not be included in the completion, so make sure to add them to the prompt for the next iteration. Default: `[]`
|
||||||
@ -446,7 +446,7 @@ These words will not be included in the completion, so make sure to add them to
|
|||||||
|
|
||||||
**Response format**
|
**Response format**
|
||||||
|
|
||||||
- Note: When using streaming mode (`stream`), only `content` and `stop` will be returned until end of completion.
|
- Note: In streaming mode (`stream`), only `content` and `stop` will be returned until end of completion. Responses are sent using the [Server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html) standard. Note: the browser's `EventSource` interface cannot be used due to its lack of `POST` request support.
|
||||||
|
|
||||||
- `completion_probabilities`: An array of token probabilities for each completion. The array's length is `n_predict`. Each item in the array has the following structure:
|
- `completion_probabilities`: An array of token probabilities for each completion. The array's length is `n_predict`. Each item in the array has the following structure:
|
||||||
|
|
||||||
|
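The updated note says that streamed responses follow the Server-sent events format and that the browser's `EventSource` cannot be used because it does not support `POST`. One way to consume the stream is to read the `fetch` response body by hand, as in the rough sketch below. It makes several assumptions that the diff does not confirm: a runtime with global `fetch` and `TextDecoder` (for example Node 18 or later), events separated by a blank line with `\n` line endings, and each event carrying a single `data:` line whose JSON payload has the `content` and `stop` fields mentioned above. The helper names `parseSseBuffer` and `streamCompletion` are hypothetical and not part of the server or its client code.

```javascript
// Hypothetical helper: split a text buffer into complete SSE events.
// Returns the parsed JSON payloads plus the unconsumed remainder.
// Assumes events are separated by a blank line ("\n\n").
function parseSseBuffer(buffer) {
    const events = []
    const parts = buffer.split("\n\n")
    // The last piece may be an incomplete event; keep it for the next read.
    const rest = parts.pop()
    for (const part of parts) {
        for (const line of part.split("\n")) {
            if (line.startsWith("data:")) {
                // JSON.parse ignores the optional space after "data:".
                events.push(JSON.parse(line.slice(5)))
            }
        }
    }
    return { events, rest }
}

// Hypothetical helper: POST a prompt with `stream: true` and call
// `onToken` with each piece of generated text as it arrives.
async function streamCompletion(prompt, onToken) {
    const response = await fetch("http://127.0.0.1:8080/completion", {
        method: "POST",
        body: JSON.stringify({
            prompt,
            n_predict: 64,
            stream: true,
        })
    })

    const reader = response.body.getReader()
    const decoder = new TextDecoder()
    let buffer = ""

    while (true) {
        const { done, value } = await reader.read()
        if (done) break
        // A network chunk can end mid-event, so accumulate before parsing.
        buffer += decoder.decode(value, { stream: true })
        const { events, rest } = parseSseBuffer(buffer)
        buffer = rest
        for (const event of events) {
            onToken(event.content)
            // Assumed: `stop` is true on the final event of the completion.
            if (event.stop) return
        }
    }
}
```

With a server running, something like `streamCompletion(prompt, (text) => process.stdout.write(text))` would print tokens as they arrive. The parsing is kept separate from the network code so it can be checked without a server.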