README updates and improvements (#3198)

2024-11-22 16:17:57 +01:00 · 2023-07-25 17:58:13 -04:00
commit f653546484
parent b09e4f10fd
5 changed files with 38 additions and 37 deletions
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # Text generation web UI
-A gradio web UI for running Large Language Models like LLaMA, llama.cpp, GPT-J, Pythia, OPT, and GALACTICA.
+A gradio web UI for running Large Language Models like LLaMA (v1 and v2), GPT-J, Pythia, OPT, and GALACTICA.
 Its goal is to become the [AUTOMATIC1111/stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui) of text generation.
@@ -71,9 +71,11 @@ conda activate textgen
 | System | GPU | Command |
 |--------|---------|---------|
 | Linux/WSL | NVIDIA | `pip3 install torch torchvision torchaudio` |
 | Linux/WSL | CPU only | `pip3 install torch torchvision torchaudio` |
 | Linux | AMD | `pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2` |
 | MacOS + MPS (untested) | Any | `pip3 install torch torchvision torchaudio` |
 | Windows | NVIDIA | `pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117` |
 | Windows | CPU only | `pip3 install torch torchvision torchaudio` |
 The up-to-date commands can be found here: https://pytorch.org/get-started/locally/. 
@@ -139,36 +141,7 @@ For example:
 To download a protected model, set env vars `HF_USER` and `HF_PASS` to your Hugging Face username and password (or [User Access Token](https://huggingface.co/settings/tokens)). The model's terms must first be accepted on the HF website.
-#### GGML models
+Many types of models and quantizations, such as RWKV, GGML, and GPTQ, are supported. For most users, quantization is highly recommended due to the performance and memory benefits it provides. For detailed instructions, [check out the specific documentation for each type](docs/README.md).
 You can drop these directly into the `models/` folder, making sure that the file name contains `ggml` somewhere and ends in `.bin`.
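The naming rule above can be checked before copying a file into `models/`. The following is a minimal sketch, not the web UI's own detection code: the helper name `looks_like_ggml` is hypothetical, and treating the match as case-insensitive is an assumption.

```python
from pathlib import Path


def looks_like_ggml(path: str) -> bool:
    """Return True if the file name contains 'ggml' and ends in '.bin'.

    Hypothetical helper; case-insensitive matching is an assumption.
    Only the file name is inspected, not the parent directories.
    """
    name = Path(path).name.lower()
    return "ggml" in name and name.endswith(".bin")


if __name__ == "__main__":
    for candidate in ("models/llama-7b.ggmlv3.q4_0.bin", "models/llama-7b.safetensors"):
        print(candidate, looks_like_ggml(candidate))
```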
 #### GPT-4chan
 <details>
 <summary>
 Instructions
 </summary>
 [GPT-4chan](https://huggingface.co/ykilcher/gpt-4chan) has been shut down from Hugging Face, so you need to download it elsewhere. You have two options:
 * Torrent: [16-bit](https://archive.org/details/gpt4chan_model_float16) / [32-bit](https://archive.org/details/gpt4chan_model)
 * Direct download: [16-bit](https://theswissbay.ch/pdf/_notpdf_/gpt4chan_model_float16/) / [32-bit](https://theswissbay.ch/pdf/_notpdf_/gpt4chan_model/)
 The 32-bit version is only relevant if you intend to run the model in CPU mode. Otherwise, you should use the 16-bit version.
 After downloading the model, follow these steps:
 1. Place the files under `models/gpt4chan_model_float16` or `models/gpt4chan_model`.
 2. Place GPT-J 6B's config.json file in that same folder: [config.json](https://huggingface.co/EleutherAI/gpt-j-6B/raw/main/config.json).
 3. Download GPT-J 6B's tokenizer files (they will be automatically detected when you attempt to load GPT-4chan):
 ```
 python download-model.py EleutherAI/gpt-j-6B --text-only
 ```
 When you load this model in default or notebook modes, the "HTML" tab will show the generated text in 4chan format.
 </details>
 ## Starting the web UI
@@ -266,8 +239,6 @@ Optionally, you can use the following command-line flags:
 |------------------|-------------|
 |`--gpu-split`     | Comma-separated list of VRAM (in GB) to use per GPU device for model layers, e.g. `20,7,7` |
 |`--max_seq_len MAX_SEQ_LEN`           | Maximum sequence length. |
 |`--compress_pos_emb COMPRESS_POS_EMB` | Positional embeddings compression factor. Should typically be set to max_seq_len / 2048. |
 |`--alpha_value ALPHA_VALUE`           | Positional embeddings alpha factor for NTK RoPE scaling. Same as above. Use either this or compress_pos_emb, not both. |
 #### GPTQ-for-LLaMa
@@ -306,6 +277,13 @@ Optionally, you can use the following command-line flags:
 | `--rwkv-strategy RWKV_STRATEGY` | RWKV: The strategy to use while loading the model. Examples: "cpu fp32", "cuda fp16", "cuda fp16i8". |
 | `--rwkv-cuda-on`                | RWKV: Compile the CUDA kernel for better performance. |
 #### RoPE (for llama.cpp and ExLlama only)
 | Flag             | Description |
 |------------------|-------------|
 |`--compress_pos_emb COMPRESS_POS_EMB` | Positional embeddings compression factor. Should typically be set to max_seq_len / 2048. |
 |`--alpha_value ALPHA_VALUE`           | Positional embeddings alpha factor for NTK RoPE scaling. Scaling is not identical to embedding compression. Use either this or compress_pos_emb, not both. |
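As a worked example of the `max_seq_len / 2048` rule of thumb, the sketch below derives a compression factor from a target context length. The helper name `suggested_compress_pos_emb` is hypothetical, the 2048 base is taken from the flag description, and rounding up to a whole number is an assumption made because the flag is declared with `type=int` in `modules/shared.py`.

```python
import math

BASE_CONTEXT = 2048  # base length named in the flag description


def suggested_compress_pos_emb(max_seq_len: int) -> int:
    """Return max_seq_len / 2048, rounded up to a whole number (assumption)."""
    if max_seq_len <= 0:
        raise ValueError("max_seq_len must be positive")
    return max(1, math.ceil(max_seq_len / BASE_CONTEXT))


if __name__ == "__main__":
    for length in (2048, 4096, 8192):
        print(length, suggested_compress_pos_emb(length))
```

With these assumptions, a target length of 4096 gives a factor of 2 and 8192 gives 4. The sketch covers `--compress_pos_emb` only; the table notes that `--alpha_value` scaling is not identical to embedding compression.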
 #### Gradio
 | Flag                                  | Description |
@@ -333,7 +311,7 @@ Optionally, you can use the following command-line flags:
 |---------------------------------------|-------------|
 | `--multimodal-pipeline PIPELINE`      | The multimodal pipeline to use. Examples: `llava-7b`, `llava-13b`. |
-Out of memory errors? [Check the low VRAM guide](docs/Low-VRAM-guide.md).
+Out of memory errors? Try out [GGML](docs/GGML-llama.cpp-models.md) and [GPTQ](docs/GPTQ-models-(4-bit-mode).md) quantizations. Alternatively, check out [the low VRAM guide](docs/Low-VRAM-guide.md).
 ## Presets
--- a/docs/GGML-llama.cpp-models.md
+++ b/docs/GGML-llama.cpp-models.md
--- a/docs/GPT-4chan-model.md
+++ b/docs/GPT-4chan-model.md
@@ -0,0 +1,20 @@
 ## GPT-4chan
 [GPT-4chan](https://huggingface.co/ykilcher/gpt-4chan) has been shut down from Hugging Face, so you need to download it elsewhere. You have two options:
 * Torrent: [16-bit](https://archive.org/details/gpt4chan_model_float16) / [32-bit](https://archive.org/details/gpt4chan_model)
 * Direct download: [16-bit](https://theswissbay.ch/pdf/_notpdf_/gpt4chan_model_float16/) / [32-bit](https://theswissbay.ch/pdf/_notpdf_/gpt4chan_model/)
 The 32-bit version is only relevant if you intend to run the model in CPU mode. Otherwise, you should use the 16-bit version.
 After downloading the model, follow these steps:
 1. Place the files under `models/gpt4chan_model_float16` or `models/gpt4chan_model`.
 2. Place GPT-J 6B's config.json file in that same folder: [config.json](https://huggingface.co/EleutherAI/gpt-j-6B/raw/main/config.json).
 3. Download GPT-J 6B's tokenizer files (they will be automatically detected when you attempt to load GPT-4chan):
 ```
 python download-model.py EleutherAI/gpt-j-6B --text-only
 ```
 When you load this model in default or notebook modes, the "HTML" tab will show the generated text in 4chan format.
--- a/docs/README.md
+++ b/docs/README.md
@@ -10,8 +10,9 @@
 * [Extensions](Extensions.md)
 * [FlexGen](FlexGen.md)
 * [Generation parameters](Generation-parameters.md)
 * [GGML (llama.cpp) models](GGML-llama.cpp-models.md)
 * [GPT-4chan model](GPT-4chan-model.md)
 * [GPTQ models (4 bit mode)](GPTQ-models-(4-bit-mode).md)
 * [llama.cpp models](llama.cpp-models.md)
 * [LLaMA model](LLaMA-model.md)
 * [LoRA](LoRA.md)
 * [Low VRAM guide](Low-VRAM-guide.md)
--- a/modules/shared.py
+++ b/modules/shared.py
@@ -153,8 +153,6 @@ parser.add_argument('--desc_act', action='store_true', help='For models that don
 # ExLlama
 parser.add_argument('--gpu-split', type=str, help="Comma-separated list of VRAM (in GB) to use per GPU device for model layers, e.g. 20,7,7")
 parser.add_argument('--max_seq_len', type=int, default=2048, help="Maximum sequence length.")
 parser.add_argument('--compress_pos_emb', type=int, default=1, help="Positional embeddings compression factor. Should typically be set to max_seq_len / 2048.")
 parser.add_argument('--alpha_value', type=int, default=1, help="Positional embeddings alpha factor for NTK RoPE scaling. Same as above. Use either this or compress_pos_emb, not both.")
 # FlexGen
 parser.add_argument('--flexgen', action='store_true', help='DEPRECATED')
@@ -171,6 +169,10 @@ parser.add_argument('--local_rank', type=int, default=0, help='DeepSpeed: Option
 parser.add_argument('--rwkv-strategy', type=str, default=None, help='RWKV: The strategy to use while loading the model. Examples: "cpu fp32", "cuda fp16", "cuda fp16i8".')
 parser.add_argument('--rwkv-cuda-on', action='store_true', help='RWKV: Compile the CUDA kernel for better performance.')
 # RoPE
 parser.add_argument('--compress_pos_emb', type=int, default=1, help="Positional embeddings compression factor. Should typically be set to max_seq_len / 2048.")
 parser.add_argument('--alpha_value', type=int, default=1, help="Positional embeddings alpha factor for NTK RoPE scaling. Scaling is not identical to embedding compression. Use either this or compress_pos_emb, not both.")
 # Gradio
 parser.add_argument('--listen', action='store_true', help='Make the web UI reachable from your local network.')
 parser.add_argument('--listen-host', type=str, help='The hostname that the server will use.')
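The relocated RoPE arguments can be tried outside the web UI. The sketch below reproduces only the flag names, types, and defaults of the two `add_argument` calls shown in the diff, on a standalone parser. The `parse_rope_args` wrapper and the check that rejects setting both flags are hypothetical additions illustrating the "use either this or compress_pos_emb, not both" advice; this commit does not show the project enforcing it.

```python
import argparse


def parse_rope_args(argv):
    """Parse the two RoPE flags on a standalone parser (hypothetical wrapper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--compress_pos_emb', type=int, default=1)
    parser.add_argument('--alpha_value', type=int, default=1)
    args = parser.parse_args(argv)
    # Assumption: a value other than the default of 1 means the flag is in use.
    if args.compress_pos_emb != 1 and args.alpha_value != 1:
        # parser.error prints a usage message and exits with status 2.
        parser.error("use either --compress_pos_emb or --alpha_value, not both")
    return args


if __name__ == "__main__":
    print(parse_rope_args(['--compress_pos_emb', '2']))
```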