Fix some documentation typos/grammar mistakes (#4032)
* typos

* Update examples/parallel/README.md

Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>

---------

Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
parent e86fc56f75
commit 532dd74e38
````diff
@@ -424,7 +424,7 @@ Building the program with BLAS support may lead to some performance improvements
 ```
 
 The environment variable [`HIP_VISIBLE_DEVICES`](https://rocm.docs.amd.com/en/latest/understand/gpu_isolation.html#hip-visible-devices) can be used to specify which GPU(s) will be used.
-If your GPU is not officialy supported you can use the environment variable [`HSA_OVERRIDE_GFX_VERSION`] set to a similar GPU, for example 10.3.0 on RDNA2 or 11.0.0 on RDNA3.
+If your GPU is not officially supported you can use the environment variable [`HSA_OVERRIDE_GFX_VERSION`] set to a similar GPU, for example 10.3.0 on RDNA2 or 11.0.0 on RDNA3.
 The following compilation options are also available to tweak performance (yes, they refer to CUDA, not HIP, because it uses the same code as the cuBLAS version above):
 
 | Option | Legal values | Default | Description |
````
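For context, a minimal sketch of how these two ROCm environment variables are typically combined when running a hipBLAS build; the binary name, model path, `-ngl` value, and prompt below are placeholders, not taken from the patch:

```bash
# Restrict llama.cpp to the first ROCm device and, if the GPU is not
# officially supported, report it as a gfx1030 (RDNA2) part.
export HIP_VISIBLE_DEVICES=0
export HSA_OVERRIDE_GFX_VERSION=10.3.0

# Hypothetical invocation: offload 32 layers to the GPU.
./main -m ./models/7B/ggml-model-q4_0.gguf -ngl 32 \
    -p "Building a website can be done in 10 simple steps:"
```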
````diff
@@ -17,7 +17,7 @@ llama_model_load_internal: [cublas] total VRAM used: 17223 MB
 If you see these lines, then the GPU is being used.
 
 ## Verifying that the CPU is not oversaturated
-llama accepts a `-t N` (or `--threads N`) parameter. It's extremely important that this parameter is not too large. If your token generation is extremely slow, try setting this number to 1. If this significantly improves your token generation speed, then your CPU is being oversaturated and you need to explicitly set this parameter to the number of the physicial CPU cores on your machine (even if you utilize a GPU). If in doubt, start with 1 and double the amount until you hit a performance bottleneck, then scale the number down.
+llama accepts a `-t N` (or `--threads N`) parameter. It's extremely important that this parameter is not too large. If your token generation is extremely slow, try setting this number to 1. If this significantly improves your token generation speed, then your CPU is being oversaturated and you need to explicitly set this parameter to the number of the physical CPU cores on your machine (even if you utilize a GPU). If in doubt, start with 1 and double the amount until you hit a performance bottleneck, then scale the number down.
 
 # Example of runtime flags effect on inference speed benchmark
 These runs were tested on the following machine:
````
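A rough illustration of the tuning loop described in that paragraph, assuming a Linux machine with 8 physical cores and 16 hyper-threads; the model path is a placeholder:

```bash
# Find the number of physical cores; hyper-threads usually do not help.
lscpu | grep -E '^(Socket|Core)'

# Start at 1 thread and double -t until throughput stops improving,
# then back off; on an 8-core machine the sweet spot is typically -t 8, not -t 16.
./main -m ./models/7B/ggml-model-q4_0.gguf -t 1 -n 64 -p "Hello"
./main -m ./models/7B/ggml-model-q4_0.gguf -t 4 -n 64 -p "Hello"
./main -m ./models/7B/ggml-model-q4_0.gguf -t 8 -n 64 -p "Hello"
```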
````diff
@@ -142,7 +142,7 @@ The `--ctx-size` option allows you to set the size of the prompt context used by
 
 ### Extended Context Size
 
-Some fine-tuned models have extened the context length by scaling RoPE. For example, if the original pretrained model have a context length (max sequence length) of 4096 (4k) and the fine-tuned model have 32k. That is a scaling factor of 8, and should work by setting the above `--ctx-size` to 32768 (32k) and `--rope-scale` to 8.
+Some fine-tuned models have extended the context length by scaling RoPE. For example, if the original pre-trained model have a context length (max sequence length) of 4096 (4k) and the fine-tuned model have 32k. That is a scaling factor of 8, and should work by setting the above `--ctx-size` to 32768 (32k) and `--rope-scale` to 8.
 
 - `--rope-scale N`: Where N is the linear scaling factor used by the fine-tuned model.
 
````
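The arithmetic in that paragraph, written out as a hypothetical invocation; the model file name is made up:

```bash
# Fine-tuned context 32768 / original context 4096 = linear RoPE scaling factor 8
./main -m ./models/my-32k-finetune.gguf \
    --ctx-size 32768 --rope-scale 8 \
    -p "Summarize the following document:"
```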
````diff
@@ -1,3 +1,3 @@
 # llama.cpp/example/parallel
 
-Simplified simluation for serving incoming requests in parallel
+Simplified simulation of serving incoming requests in parallel
````
````diff
@@ -55,7 +55,7 @@ The order of symbols in a sequence matter. For example, in `"1. " move " " move
 
 Alternatives, denoted by `|`, give different sequences that are acceptable. For example, in `move ::= pawn | nonpawn | castle`, `move` can be a `pawn` move, a `nonpawn` move, or a `castle`.
 
-Parentheses `()` can be used to group sequences, which allows for embedding alternatives in a larger rule or applying repetition and optptional symbols (below) to a sequence.
+Parentheses `()` can be used to group sequences, which allows for embedding alternatives in a larger rule or applying repetition and optional symbols (below) to a sequence.
 
 ## Repetition and Optional Symbols
 
````
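To make the alternatives and grouping rules concrete, a toy hand-written GBNF grammar and a hypothetical run of the `main` example with its `--grammar-file` option; the file name, model path, and prompt are made up:

```bash
# `move` uses alternatives (|); the check suffix shows grouping with ()
# combined with the optional operator ? described in the section below.
cat > chess.gbnf <<'EOF'
root   ::= move ("+" | "#")?
move   ::= pawn | castle
pawn   ::= [a-h] [1-8]
castle ::= "O-O" | "O-O-O"
EOF

# Hypothetical invocation constraining generation to the grammar.
./main -m ./models/7B/ggml-model-q4_0.gguf --grammar-file chess.gbnf \
    -p "White's best move is "
```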
````diff
@@ -67,7 +67,7 @@ Parentheses `()` can be used to group sequences, which allows for embedding alte
 
 Comments can be specified with `#`:
 ```
-# defines optional whitspace
+# defines optional whitespace
 ws ::= [ \t\n]+
 ```
 
````