README updates and improvements (#3198)

Eve 2023-07-25 17:58:13 -04:00 committed by GitHub
parent b09e4f10fd
commit f653546484
GPG Key ID: 4AEE18F83AFDEB23
5 changed files with 38 additions and 37 deletions

@ -1,6 +1,6 @@
 # Text generation web UI
-A gradio web UI for running Large Language Models like LLaMA, llama.cpp, GPT-J, Pythia, OPT, and GALACTICA.
+A gradio web UI for running Large Language Models like LLaMA (v1 and v2), GPT-J, Pythia, OPT, and GALACTICA.
 Its goal is to become the [AUTOMATIC1111/stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui) of text generation.
@ -71,9 +71,11 @@ conda activate textgen
 | System | GPU | Command |
 |--------|---------|---------|
 | Linux/WSL | NVIDIA | `pip3 install torch torchvision torchaudio` |
+| Linux/WSL | CPU only | `pip3 install torch torchvision torchaudio` |
 | Linux | AMD | `pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2` |
 | MacOS + MPS (untested) | Any | `pip3 install torch torchvision torchaudio` |
 | Windows | NVIDIA | `pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117` |
+| Windows | CPU only | `pip3 install torch torchvision torchaudio` |
 The up-to-date commands can be found here: https://pytorch.org/get-started/locally/.
@ -139,36 +141,7 @@ For example:
 To download a protected model, set env vars `HF_USER` and `HF_PASS` to your Hugging Face username and password (or [User Access Token](https://huggingface.co/settings/tokens)). The model's terms must first be accepted on the HF website.
-#### GGML models
-You can drop these directly into the `models/` folder, making sure that the file name contains `ggml` somewhere and ends in `.bin`.
-#### GPT-4chan
-<details>
-<summary>
-Instructions
-</summary>
-[GPT-4chan](https://huggingface.co/ykilcher/gpt-4chan) has been shut down from Hugging Face, so you need to download it elsewhere. You have two options:
-* Torrent: [16-bit](https://archive.org/details/gpt4chan_model_float16) / [32-bit](https://archive.org/details/gpt4chan_model)
-* Direct download: [16-bit](https://theswissbay.ch/pdf/_notpdf_/gpt4chan_model_float16/) / [32-bit](https://theswissbay.ch/pdf/_notpdf_/gpt4chan_model/)
-The 32-bit version is only relevant if you intend to run the model in CPU mode. Otherwise, you should use the 16-bit version.
-After downloading the model, follow these steps:
-1. Place the files under `models/gpt4chan_model_float16` or `models/gpt4chan_model`.
-2. Place GPT-J 6B's config.json file in that same folder: [config.json](https://huggingface.co/EleutherAI/gpt-j-6B/raw/main/config.json).
-3. Download GPT-J 6B's tokenizer files (they will be automatically detected when you attempt to load GPT-4chan):
-```
-python download-model.py EleutherAI/gpt-j-6B --text-only
-```
-When you load this model in default or notebook modes, the "HTML" tab will show the generated text in 4chan format.
-</details>
+Many types of models and quantizations such as RWKV, GGML, and GPTQ are supported. For most users quantization is highly recommended due to the performance and memory benefits it provides. For detailed instructions [check out the specific documentation for each type](docs/README.md).
 ## Starting the web UI
@ -266,8 +239,6 @@ Optionally, you can use the following command-line flags:
 |------------------|-------------|
 |`--gpu-split` | Comma-separated list of VRAM (in GB) to use per GPU device for model layers, e.g. `20,7,7` |
 |`--max_seq_len MAX_SEQ_LEN` | Maximum sequence length. |
-|`--compress_pos_emb COMPRESS_POS_EMB` | Positional embeddings compression factor. Should typically be set to max_seq_len / 2048. |
-|`--alpha_value ALPHA_VALUE` | Positional embeddings alpha factor for NTK RoPE scaling. Same as above. Use either this or compress_pos_emb, not both. `
 #### GPTQ-for-LLaMa
@ -306,6 +277,13 @@ Optionally, you can use the following command-line flags:
 | `--rwkv-strategy RWKV_STRATEGY` | RWKV: The strategy to use while loading the model. Examples: "cpu fp32", "cuda fp16", "cuda fp16i8". |
 | `--rwkv-cuda-on` | RWKV: Compile the CUDA kernel for better performance. |
+#### RoPE (for llama.cpp and ExLlama only)
+| Flag | Description |
+|------------------|-------------|
+|`--compress_pos_emb COMPRESS_POS_EMB` | Positional embeddings compression factor. Should typically be set to max_seq_len / 2048. |
+|`--alpha_value ALPHA_VALUE` | Positional embeddings alpha factor for NTK RoPE scaling. Scaling is not identical to embedding compression. Use either this or compress_pos_emb, not both. |
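
As a rough illustration of the `max_seq_len / 2048` rule in the `--compress_pos_emb` description above, the sketch below derives a compression factor from a target sequence length. The helper name `suggested_compress_pos_emb` and its `base_len` parameter are hypothetical and not part of the web UI. Only the 2048 divisor comes from the flag description; rounding down with integer division is an assumption.

```python
def suggested_compress_pos_emb(max_seq_len: int, base_len: int = 2048) -> int:
    """Return max_seq_len / base_len as an integer factor, never below 1.

    Hypothetical helper: the name and the rounding choice are assumptions;
    only the max_seq_len / 2048 rule comes from the flag description.
    """
    if max_seq_len <= 0 or base_len <= 0:
        raise ValueError("lengths must be positive")
    # Integer division (an assumption); the floor of 1 keeps the "no compression" default.
    return max(1, max_seq_len // base_len)


if __name__ == "__main__":
    for length in (2048, 4096, 8192):
        print(length, suggested_compress_pos_emb(length))
```

Under this rule, `--max_seq_len 4096` would pair with `--compress_pos_emb 2`.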
 #### Gradio
 | Flag | Description |
@ -333,7 +311,7 @@ Optionally, you can use the following command-line flags:
 |---------------------------------------|-------------|
 | `--multimodal-pipeline PIPELINE` | The multimodal pipeline to use. Examples: `llava-7b`, `llava-13b`. |
-Out of memory errors? [Check the low VRAM guide](docs/Low-VRAM-guide.md).
+Out of memory errors? Try out [GGML](docs/GGML-llama.cpp-models.md) and [GPTQ](docs/GPTQ-models-(4-bit-mode).md) quantizations. Alternatively check out [the low VRAM guide](docs/Low-VRAM-guide.md).
 ## Presets

20
docs/GPT-4chan-model.md Normal file

@ -0,0 +1,20 @@
## GPT-4chan
[GPT-4chan](https://huggingface.co/ykilcher/gpt-4chan) has been shut down from Hugging Face, so you need to download it elsewhere. You have two options:
* Torrent: [16-bit](https://archive.org/details/gpt4chan_model_float16) / [32-bit](https://archive.org/details/gpt4chan_model)
* Direct download: [16-bit](https://theswissbay.ch/pdf/_notpdf_/gpt4chan_model_float16/) / [32-bit](https://theswissbay.ch/pdf/_notpdf_/gpt4chan_model/)
The 32-bit version is only relevant if you intend to run the model in CPU mode. Otherwise, you should use the 16-bit version.
After downloading the model, follow these steps:
1. Place the files under `models/gpt4chan_model_float16` or `models/gpt4chan_model`.
2. Place GPT-J 6B's config.json file in that same folder: [config.json](https://huggingface.co/EleutherAI/gpt-j-6B/raw/main/config.json).
3. Download GPT-J 6B's tokenizer files (they will be automatically detected when you attempt to load GPT-4chan):
```
python download-model.py EleutherAI/gpt-j-6B --text-only
```
When you load this model in default or notebook modes, the "HTML" tab will show the generated text in 4chan format.

@ -10,8 +10,9 @@
 * [Extensions](Extensions.md)
 * [FlexGen](FlexGen.md)
 * [Generation parameters](Generation-parameters.md)
+* [GGML (llama.cpp) models](GGML-llama.cpp-models.md)
+* [GPT-4chan model](GPT-4chan-model.md)
 * [GPTQ models (4 bit mode)](GPTQ-models-(4-bit-mode).md)
-* [llama.cpp models](llama.cpp-models.md)
 * [LLaMA model](LLaMA-model.md)
 * [LoRA](LoRA.md)
 * [Low VRAM guide](Low-VRAM-guide.md)

@ -153,8 +153,6 @@ parser.add_argument('--desc_act', action='store_true', help='For models that don
 # ExLlama
 parser.add_argument('--gpu-split', type=str, help="Comma-separated list of VRAM (in GB) to use per GPU device for model layers, e.g. 20,7,7")
 parser.add_argument('--max_seq_len', type=int, default=2048, help="Maximum sequence length.")
-parser.add_argument('--compress_pos_emb', type=int, default=1, help="Positional embeddings compression factor. Should typically be set to max_seq_len / 2048.")
-parser.add_argument('--alpha_value', type=int, default=1, help="Positional embeddings alpha factor for NTK RoPE scaling. Same as above. Use either this or compress_pos_emb, not both.")
 # FlexGen
 parser.add_argument('--flexgen', action='store_true', help='DEPRECATED')
@ -171,6 +169,10 @@ parser.add_argument('--local_rank', type=int, default=0, help='DeepSpeed: Option
 parser.add_argument('--rwkv-strategy', type=str, default=None, help='RWKV: The strategy to use while loading the model. Examples: "cpu fp32", "cuda fp16", "cuda fp16i8".')
 parser.add_argument('--rwkv-cuda-on', action='store_true', help='RWKV: Compile the CUDA kernel for better performance.')
+# RoPE
+parser.add_argument('--compress_pos_emb', type=int, default=1, help="Positional embeddings compression factor. Should typically be set to max_seq_len / 2048.")
+parser.add_argument('--alpha_value', type=int, default=1, help="Positional embeddings alpha factor for NTK RoPE scaling. Scaling is not identical to embedding compression. Use either this or compress_pos_emb, not both.")
 # Gradio
 parser.add_argument('--listen', action='store_true', help='Make the web UI reachable from your local network.')
 parser.add_argument('--listen-host', type=str, help='The hostname that the server will use.')
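
The help text for the two RoPE arguments above says to use one or the other, not both. A minimal sketch of how a launcher could enforce that is shown below. The argument definitions mirror the ones in this diff, while the `parse_rope_args` helper and the explicit check are assumptions for illustration and are not part of this commit.

```python
import argparse


def parse_rope_args(argv=None):
    """Parse the RoPE flags and reject setting both at once.

    Hypothetical helper: the check below is an assumption, not code from the commit.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument('--max_seq_len', type=int, default=2048,
                        help="Maximum sequence length.")
    parser.add_argument('--compress_pos_emb', type=int, default=1,
                        help="Positional embeddings compression factor.")
    parser.add_argument('--alpha_value', type=int, default=1,
                        help="Positional embeddings alpha factor for NTK RoPE scaling.")
    args = parser.parse_args(argv)

    # Both flags default to 1, so a value other than 1 means the user changed it.
    if args.compress_pos_emb != 1 and args.alpha_value != 1:
        # parser.error prints a usage message and exits with status 2.
        parser.error("use either --compress_pos_emb or --alpha_value, not both")
    return args


if __name__ == "__main__":
    args = parse_rope_args(["--max_seq_len", "4096", "--compress_pos_emb", "2"])
    print(args.max_seq_len, args.compress_pos_emb, args.alpha_value)
```

Comparing against the default keeps the sketch short, at the cost of not detecting a flag that was explicitly passed as `1`.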