From 82c11be067657d5786ab272b3121b2a8afe6d8b5 Mon Sep 17 00:00:00 2001
From: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date: Mon, 23 Oct 2023 12:49:07 -0700
Subject: [PATCH] Update 04 - Model Tab.md

---
 docs/04 ‐ Model Tab.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/04 ‐ Model Tab.md b/docs/04 ‐ Model Tab.md
index 05046fb3..be8ad392 100644
--- a/docs/04 ‐ Model Tab.md
+++ b/docs/04 ‐ Model Tab.md
@@ -86,7 +86,7 @@ Loads: GGUF models. Note: GGML models have been deprecated and do not work anymo
 Example: https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF
 
 * **n-gpu-layers**: The number of layers to allocate to the GPU. If set to 0, only the CPU will be used. If you want to offload all layers, you can simply set this to the maximum value.
-* **n-ctx**: Context length of the model. In llama.cpp, the context is preallocated, so the higher this value, the higher the RAM/VRAM usage will be. It gets automatically updated with the value in the GGUF metadata for the model when you select it in the Model dropdown.
+* **n_ctx**: Context length of the model. In llama.cpp, the cache is preallocated, so the higher this value, the higher the VRAM usage. It is automatically set to the maximum sequence length for the model based on the metadata inside the GGUF file, but you may need to lower this value to be able to fit the model into your GPU. After loading the model, the "Truncate the prompt up to this length" parameter under "Parameters" > "Generation" is automatically set to your chosen "n_ctx" so that you don't have to set the same thing twice.
 * **threads**: Number of threads. Recommended value: your number of physical cores.
 * **threads_batch**: Number of threads for batch processing. Recommended value: your total number of cores (physical + virtual).
 * **n_batch**: Batch size for prompt processing. Higher values are supposed to make generation faster, but I have never obtained any benefit from changing this value.
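
For reference, a minimal sketch of how the loader options documented in this patch map onto the `Llama` constructor in llama-cpp-python, assuming that backend is used for GGUF models. The model path and all numeric values are placeholders, and `n_threads_batch` is only available in relatively recent llama-cpp-python releases.

```python
# Sketch only: placeholder values, not recommended settings.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-7b-chat.Q4_K_M.gguf",  # hypothetical local GGUF path
    n_gpu_layers=35,      # n-gpu-layers: layers offloaded to the GPU (0 = CPU only)
    n_ctx=4096,           # n_ctx: context length; the cache is preallocated, so VRAM grows with this
    n_threads=8,          # threads: number of physical cores
    n_threads_batch=16,   # threads_batch: total cores (physical + virtual)
    n_batch=512,          # n_batch: batch size for prompt processing
)

# Prompts longer than n_ctx must be truncated before generation, which is what the
# "Truncate the prompt up to this length" parameter handles in the UI.
output = llm("Q: What is the capital of France? A:", max_tokens=32)
print(output["choices"][0]["text"])
```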