Mirror of https://github.com/oobabooga/text-generation-webui.git, synced 2024-11-25 09:19:23 +01:00
Update 04 - Model Tab.md
commit 82c11be067
parent 306d764ff6
@@ -86,7 +86,7 @@ Loads: GGUF models. Note: GGML models have been deprecated and do not work anymore
Example: https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF
* **n-gpu-layers**: The number of layers to allocate to the GPU. If set to 0, only the CPU will be used. If you want to offload all layers, you can simply set this to the maximum value.
-* **n-ctx**: Context length of the model. In llama.cpp, the context is preallocated, so the higher this value, the higher the RAM/VRAM usage will be. It gets automatically updated with the value in the GGUF metadata for the model when you select it in the Model dropdown.
+* **n_ctx**: Context length of the model. In llama.cpp, the cache is preallocated, so the higher this value, the higher the VRAM usage. It is automatically set to the maximum sequence length for the model based on the metadata inside the GGUF file, but you may need to lower this value to be able to fit the model into your GPU. After loading the model, the "Truncate the prompt up to this length" parameter under "Parameters" > "Generation" is automatically set to your chosen "n_ctx" so that you don't have to set the same thing twice.
* **threads**: Number of threads. Recommended value: your number of physical cores.
* **threads_batch**: Number of threads for batch processing. Recommended value: your total number of cores (physical + virtual).
* **n_batch**: Batch size for prompt processing. Higher values are supposed to make generation faster, but I have never obtained any benefit from changing this value.
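For reference, below is a minimal sketch (not part of the commit above) of how these loader options map onto the llama-cpp-python bindings. The GGUF file name and the concrete values are illustrative assumptions only.

```python
# Minimal sketch, assuming the llama-cpp-python bindings.
# File name and values are placeholders, not recommendations.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b-chat.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_gpu_layers=33,     # layers offloaded to the GPU; 0 means CPU only
    n_ctx=4096,          # context length; the cache is preallocated at this size
    n_threads=8,         # recommended: number of physical cores
    n_threads_batch=16,  # recommended: total cores (physical + virtual)
    n_batch=512,         # batch size for prompt processing
)

out = llm("Q: What is a GGUF file? A:", max_tokens=64)
print(out["choices"][0]["text"])
```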