Mirror of https://github.com/oobabooga/text-generation-webui.git, synced 2024-11-25 09:19:23 +01:00
Update 04 - Model Tab.md
commit 82c11be067
parent 306d764ff6
@@ -86,7 +86,7 @@ Loads: GGUF models. Note: GGML models have been deprecated and do not work anymore
Example: https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF
* **n-gpu-layers**: The number of layers to allocate to the GPU. If set to 0, only the CPU will be used. If you want to offload all layers, you can simply set this to the maximum value.
-* **n-ctx**: Context length of the model. In llama.cpp, the context is preallocated, so the higher this value, the higher the RAM/VRAM usage will be. It gets automatically updated with the value in the GGUF metadata for the model when you select it in the Model dropdown.
+* **n_ctx**: Context length of the model. In llama.cpp, the cache is preallocated, so the higher this value, the higher the VRAM usage. It is automatically set to the maximum sequence length for the model based on the metadata inside the GGUF file, but you may need to lower this value to be able to fit the model into your GPU. After loading the model, the "Truncate the prompt up to this length" parameter under "Parameters" > "Generation" is automatically set to your chosen "n_ctx" so that you don't have to set the same thing twice.
* **threads**: Number of threads. Recommended value: your number of physical cores.
* **threads_batch**: Number of threads for batch processing. Recommended value: your total number of cores (physical + virtual).
* **n_batch**: Batch size for prompt processing. Higher values are supposed to make generation faster, but I have never obtained any benefit from changing this value.
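For reference, below is a minimal sketch (not part of the commit above) of how these loader options map onto the llama-cpp-python bindings. The GGUF file name and the concrete values are illustrative assumptions only.

```python
# Minimal sketch, assuming the llama-cpp-python bindings.
# File name and values are placeholders, not recommendations.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b-chat.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_gpu_layers=33,     # layers offloaded to the GPU; 0 means CPU only
    n_ctx=4096,          # context length; the cache is preallocated at this size
    n_threads=8,         # recommended: number of physical cores
    n_threads_batch=16,  # recommended: total cores (physical + virtual)
    n_batch=512,         # batch size for prompt processing
)

out = llm("Q: What is a GGUF file? A:", max_tokens=64)
print(out["choices"][0]["text"])
```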