Merge pull request #2535 from oobabooga/dev

Dev branch merge
oobabooga 2023-06-05 17:07:54 -03:00 committed by GitHub
commit 60bfd0b722
12 changed files with 112 additions and 117 deletions

**README.md**
@@ -10,28 +10,23 @@ Its goal is to become the [AUTOMATIC1111/stable-diffusion-webui](https://github.
## Features

Before:

* Dropdown menu for switching between models
* Notebook mode that resembles OpenAI's playground
* Chat mode for conversation and role-playing
* Instruct mode compatible with various formats, including Alpaca, Vicuna, Open Assistant, Dolly, Koala, ChatGLM, MOSS, RWKV-Raven, Galactica, StableLM, WizardLM, Baize, Ziya, Chinese-Vicuna, MPT, INCITE, Wizard Mega, KoAlpaca, Vigogne, Bactrian, h2o, and OpenBuddy
* [Multimodal pipelines, including LLaVA and MiniGPT-4](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/multimodal)
* Markdown output with LaTeX rendering, to use for instance with [GALACTICA](https://github.com/paperswithcode/galai)
* Nice HTML output for GPT-4chan
* [Custom chat characters](docs/Chat-mode.md)
* Advanced chat features (send images, get audio responses with TTS)
* Very efficient text streaming
* Parameter presets
* [LLaMA model](docs/LLaMA-model.md)
* [4-bit GPTQ mode](docs/GPTQ-models-(4-bit-mode).md)
* [LoRA (loading and training)](docs/Using-LoRAs.md)
* [llama.cpp](docs/llama.cpp-models.md)
* 8-bit and 4-bit through bitsandbytes
* Layers splitting across GPU(s), CPU, and disk
* CPU mode
* [FlexGen](docs/FlexGen.md)
* [DeepSpeed ZeRO-3](docs/DeepSpeed.md)
* API [with](https://github.com/oobabooga/text-generation-webui/blob/main/api-example-stream.py) streaming and [without](https://github.com/oobabooga/text-generation-webui/blob/main/api-example.py) streaming
* [Extensions](docs/Extensions.md) - see the [user extensions list](https://github.com/oobabooga/text-generation-webui-extensions)

After:

* 3 interface modes: default, notebook, and chat
* Multiple model backends: transformers, llama.cpp, AutoGPTQ, GPTQ-for-LLaMa, RWKV, FlexGen
* Dropdown menu for quickly switching between different models
* LoRA: load and unload LoRAs on the fly, load multiple LoRAs at the same time, train a new LoRA
* Precise instruction templates for chat mode, including Alpaca, Vicuna, Open Assistant, Dolly, Koala, ChatGLM, MOSS, RWKV-Raven, Galactica, StableLM, WizardLM, Baize, Ziya, Chinese-Vicuna, MPT, INCITE, Wizard Mega, KoAlpaca, Vigogne, Bactrian, h2o, and OpenBuddy
* [Multimodal pipelines, including LLaVA and MiniGPT-4](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/multimodal)
* 8-bit and 4-bit inference through bitsandbytes
* CPU mode for transformers models
* [DeepSpeed ZeRO-3 inference](docs/DeepSpeed.md)
* [Extensions](docs/Extensions.md)
* [Custom chat characters](docs/Chat-mode.md)
* Very efficient text streaming
* Markdown output with LaTeX rendering, to use for instance with [GALACTICA](https://github.com/paperswithcode/galai)
* Nice HTML output for GPT-4chan
* API, including endpoints for websocket streaming ([see the examples](https://github.com/oobabooga/text-generation-webui/blob/main/api-examples))

To learn how to use the various features, check out the Documentation: https://github.com/oobabooga/text-generation-webui/tree/main/docs
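The new feature list mentions an API with websocket streaming endpoints. As a rough sketch only (the port 5005, the `/api/v1/stream` path, and the event names are assumptions taken from the linked api-examples, not something this diff guarantees), a minimal streaming client could look like:

```
# Minimal streaming-API client sketch. Host, port, endpoint path and event
# names are assumptions based on the api-examples directory; adjust as needed.
import asyncio
import json

import websockets  # pip install websockets


async def stream(prompt):
    async with websockets.connect("ws://localhost:5005/api/v1/stream") as ws:
        await ws.send(json.dumps({"prompt": prompt, "max_new_tokens": 200}))
        while True:
            event = json.loads(await ws.recv())
            if event.get("event") == "text_stream":
                print(event.get("text", ""), end="", flush=True)
            elif event.get("event") == "stream_end":
                break


asyncio.run(stream("Write a haiku about GPUs."))
```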
## Installation
@@ -95,14 +90,6 @@ cd text-generation-webui
pip install -r requirements.txt
```
#### 4. Install GPTQ
The base installation covers [transformers](https://github.com/huggingface/transformers) models (`AutoModelForCausalLM` and `AutoModelForSeq2SeqLM` specifically) and [llama.cpp](https://github.com/ggerganov/llama.cpp) (GGML) models.
To use GPTQ models, the additional installation steps below are necessary:
[GPTQ models (4 bit mode)](https://github.com/oobabooga/text-generation-webui/blob/main/docs/GPTQ-models-(4-bit-mode).md)
#### llama.cpp with GPU acceleration

Requires the additional compilation step described here: [GPU acceleration](https://github.com/oobabooga/text-generation-webui/blob/main/docs/llama.cpp-models.md#gpu-acceleration).
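For context on what that compilation step buys you: once llama-cpp-python is built with GPU support, layers can be offloaded to the GPU. A hedged sketch of using the library directly, outside the web UI (the model filename is a placeholder):

```
# Illustrative only; requires a GPU-enabled build of llama-cpp-python.
# The GGML file name is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="models/wizard-7b.ggmlv3.q4_0.bin",  # placeholder GGML model
    n_gpu_layers=32,  # number of layers offloaded to the GPU
    n_ctx=2048,       # prompt context size (maps to --n_ctx)
)
out = llm("Q: What is 2 + 2? A:", max_tokens=8)
print(out["choices"][0]["text"])
```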
@@ -154,9 +141,7 @@ For example:
python download-model.py facebook/opt-1.3b
* If you want to download a model manually, note that all you need are the json, txt, and pytorch\*.bin (or model*.safetensors) files. The remaining files are not necessary.
* Set env vars `HF_USER` and `HF_PASS` to your Hugging Face username and password (or [User Access Token](https://huggingface.co/settings/tokens)) to download a protected model. The model's terms must first be accepted on the HF website.

To download a protected model, set env vars `HF_USER` and `HF_PASS` to your Hugging Face username and password (or [User Access Token](https://huggingface.co/settings/tokens)). The model's terms must first be accepted on the HF website.
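A sketch of that same flow scripted from Python (the username, token, and model name are placeholders):

```
# Sketch: set the documented env vars and call the downloader script.
# Username, token and repo name are placeholders.
import os
import subprocess

os.environ["HF_USER"] = "your-username"
os.environ["HF_PASS"] = "hf_your_access_token"  # a User Access Token also works
subprocess.run(
    ["python", "download-model.py", "some-org/some-protected-model"],
    check=True,
)
```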
#### GGML models
@@ -164,6 +149,11 @@ You can drop these directly into the `models/` folder, making sure that the file

#### GPT-4chan
<details>
<summary>
Instructions
</summary>
[GPT-4chan](https://huggingface.co/ykilcher/gpt-4chan) has been shut down from Hugging Face, so you need to download it elsewhere. You have two options:

* Torrent: [16-bit](https://archive.org/details/gpt4chan_model_float16) / [32-bit](https://archive.org/details/gpt4chan_model)
@@ -181,6 +171,9 @@ After downloading the model, follow these steps:
python download-model.py EleutherAI/gpt-j-6B --text-only
```
When you load this model in default or notebook modes, the "HTML" tab will show the generated text in 4chan format.
</details>
## Starting the web UI

conda activate textgen
@@ -252,10 +245,18 @@ Optionally, you can use the following command-line flags:
| `--n_ctx N_CTX` | Size of the prompt context. |
| `--llama_cpp_seed SEED` | Seed for llama-cpp models. Default 0 (random). |
#### GPTQ

#### AutoGPTQ
| Flag | Description |
|------------------|-------------|
| `--triton` | Use triton. |
| `--desc_act` | For models that don't have a quantize_config.json, this parameter is used to define whether to set desc_act or not in BaseQuantizeConfig. |
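For reference, `--desc_act` (together with `--wbits` and `--groupsize`) only matters when a model ships without a `quantize_config.json`; roughly, these flags end up in a `BaseQuantizeConfig` like the following sketch (not the web UI's actual code, and the model folder is a placeholder):

```
# Sketch of the AutoGPTQ objects these flags roughly correspond to.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,          # --wbits 4
    group_size=128,  # --groupsize 128
    desc_act=False,  # becomes True when --desc_act is passed
)
model = AutoGPTQForCausalLM.from_quantized(
    "models/some-gptq-model",       # placeholder folder under models/
    quantize_config=quantize_config,
    use_triton=False,               # --triton
)
```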
#### GPTQ-for-LLaMa
| Flag | Description |
|---------------------------|-------------|
| `--gptq-for-llama` | Use GPTQ-for-LLaMa to load the GPTQ model instead of AutoGPTQ. |
| `--wbits WBITS` | Load a pre-quantized model with specified precision in bits. 2, 3, 4 and 8 are supported. |
| `--model_type MODEL_TYPE` | Model type of pre-quantized model. Currently LLaMA, OPT, and GPT-J are supported. |
| `--groupsize GROUPSIZE` | Group size. |
@@ -266,14 +267,6 @@ Optionally, you can use the following command-line flags:
| `--warmup_autotune` | (triton) Enable warmup autotune. |
| `--fused_mlp` | (triton) Enable fused mlp. |
#### AutoGPTQ
| Flag | Description |
|------------------|-------------|
| `--autogptq` | Use AutoGPTQ for loading quantized models instead of the internal GPTQ loader. |
| `--triton` | Use triton. |
| `--desc_act` | For models that don't have a quantize_config.json, this parameter is used to define whether to set desc_act or not in BaseQuantizeConfig. |
#### FlexGen

| Flag | Description |
@@ -331,15 +324,7 @@ Out of memory errors? [Check the low VRAM guide](docs/Low-VRAM-guide.md).

Inference settings presets can be created under `presets/` as yaml files. These files are detected automatically at startup.

By default, 10 presets based on NovelAI and KoboldAI presets are included. These were selected out of a sample of 43 presets after applying a K-Means clustering algorithm and selecting the elements closest to the average of each cluster: [tSNE visualization](https://user-images.githubusercontent.com/112222186/228956352-1addbdb9-2456-465a-b51d-089f462cd385.png).
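As a toy illustration of that selection procedure (made-up preset values, not the actual script that produced the bundled presets):

```
# Cluster preset parameter vectors with K-Means, then keep the preset closest
# to each cluster centre. All numbers below are made up for illustration.
import numpy as np
from sklearn.cluster import KMeans

presets = {
    "preset-a": [0.7, 1.10, 0.90, 40],   # temperature, repetition_penalty, top_p, top_k
    "preset-b": [1.3, 1.20, 0.60, 100],
    "preset-c": [0.5, 1.00, 1.00, 0],
    "preset-d": [1.2, 1.15, 0.65, 90],
}
names = list(presets)
X = np.array([presets[n] for n in names], dtype=float)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for center in kmeans.cluster_centers_:
    closest = names[int(np.argmin(np.linalg.norm(X - center, axis=1)))]
    print("keep:", closest)
```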
## Documentation
Make sure to check out the documentation for an in-depth guide on how to use the web UI.
https://github.com/oobabooga/text-generation-webui/tree/main/docs
## Contributing

**docs/GPTQ-models-(4-bit-mode).md**
@@ -6,14 +6,6 @@ GPTQ is a clever quantization algorithm that lightly reoptimizes the weights dur

There are two ways of loading GPTQ models in the web UI at the moment:
* Using GPTQ-for-LLaMa directly:
  * faster CPU offloading
  * faster multi-GPU inference
  * supports loading LoRAs using a monkey patch
  * included by default in the one-click installers
  * requires you to manually figure out the wbits/groupsize/model_type parameters for the model to be able to load it
  * supports either only cuda or only triton depending on the branch
* Using AutoGPTQ:
  * supports more models
  * standardized (no need to guess any parameter)
@@ -21,8 +13,59 @@ There are two ways of loading GPTQ models in the web UI at the moment:
  * ~no wheels are presently available so it requires manual compilation~
  * supports loading both triton and cuda models
* Using GPTQ-for-LLaMa directly:
  * faster CPU offloading
  * faster multi-GPU inference
  * supports loading LoRAs using a monkey patch
  * requires you to manually figure out the wbits/groupsize/model_type parameters for the model to be able to load it
  * supports either only cuda or only triton depending on the branch
For creating new quantizations, I recommend using AutoGPTQ: https://github.com/PanQiWei/AutoGPTQ
## AutoGPTQ
### Installation
No additional steps are necessary as AutoGPTQ is already in the `requirements.txt` for the webui. If you still want or need to install it manually for whatever reason, these are the commands:
```
conda activate textgen
git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
pip install .
```
The last command requires `nvcc` to be installed (see the [instructions above](https://github.com/oobabooga/text-generation-webui/blob/main/docs/GPTQ-models-(4-bit-mode).md#step-1-install-nvcc)).
### Usage
When you quantize a model using AutoGPTQ, a folder containing a file called `quantize_config.json` will be generated. Place that folder inside your `models/` folder and load it with the `--autogptq` flag:
```
python server.py --autogptq --model model_name
```
Alternatively, check the `autogptq` box in the "Model" tab of the UI before loading the model.
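For completeness, this is roughly how such a folder (including its `quantize_config.json`) gets produced in the first place; a condensed sketch based on AutoGPTQ's documented API, with a placeholder model and calibration text:

```
# Condensed quantization sketch: the resulting folder can be dropped into
# models/. Model name, output path and calibration text are placeholders.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

pretrained = "facebook/opt-125m"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(pretrained)
examples = [tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")]

model = AutoGPTQForCausalLM.from_pretrained(pretrained, BaseQuantizeConfig(bits=4, group_size=128))
model.quantize(examples)
model.save_quantized("models/opt-125m-4bit-128g")  # also writes quantize_config.json
tokenizer.save_pretrained("models/opt-125m-4bit-128g")
```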
### Offloading
In order to do CPU offloading or multi-GPU inference with AutoGPTQ, use the `--gpu-memory` flag. It is currently somewhat slower than offloading with the `--pre_layer` option in GPTQ-for-LLaMa.
For CPU offloading:
```
python server.py --autogptq --gpu-memory 3000MiB --model model_name
```
For multi-GPU inference:
```
python server.py --autogptq --gpu-memory 3000MiB 6000MiB --model model_name
```
### Using LoRAs with AutoGPTQ
Not supported yet.
## GPTQ-for-LLaMa

GPTQ-for-LLaMa is the original adaptation of GPTQ for the LLaMA model. It was made possible by [@qwopqwop200](https://github.com/qwopqwop200/GPTQ-for-LLaMa): https://github.com/qwopqwop200/GPTQ-for-LLaMa
@@ -108,23 +151,21 @@ These are models that you can simply download and place in your `models` folder.

### Starting the web UI:
Use the `--gptq-for-llama` flag.
For the models converted without `group-size`:

Before:

```
python server.py --model llama-7b-4bit
```

After:

```
python server.py --model llama-7b-4bit --gptq-for-llama
```

For the models converted with `group-size`:

Before:

```
python server.py --model llama-13b-4bit-128g
```

After:

```
python server.py --model llama-13b-4bit-128g --gptq-for-llama --wbits 4 --groupsize 128
```

The command-line flags `--wbits` and `--groupsize` are automatically detected based on the folder names, but you can also specify them manually like

```
python server.py --model llama-13b-4bit-128g --wbits 4 --groupsize 128
```

The command-line flags `--wbits` and `--groupsize` are automatically detected based on the folder names in many cases.
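The automatic detection mentioned above boils down to pattern matching on the folder name; a simplified sketch of the idea (not the web UI's exact implementation):

```
# Simplified inference of wbits/groupsize from names like "llama-13b-4bit-128g".
import re

def infer_gptq_params(model_name):
    name = model_name.lower()
    wbits_match = re.search(r"(\d+)bit", name)
    group_match = re.search(r"(\d+)g\b", name)
    wbits = int(wbits_match.group(1)) if wbits_match else 0
    groupsize = int(group_match.group(1)) if group_match else -1
    return wbits, groupsize

print(infer_gptq_params("llama-7b-4bit"))        # (4, -1)
print(infer_gptq_params("llama-13b-4bit-128g"))  # (4, 128)
```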
### CPU offloading
@@ -171,46 +212,4 @@ pip install git+https://github.com/sterlind/GPTQ-for-LLaMa.git@lora_4bit
python server.py --model llama-7b-4bit-128g --listen --lora tloen_alpaca-lora-7b --monkey-patch
```
## AutoGPTQ
### Installation
No additional steps are necessary as AutoGPTQ is already in the `requirements.txt` for the webui. If you still want or need to install it manually for whatever reason, these are the commands:
```
conda activate textgen
git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
pip install .
```
The last command requires `nvcc` to be installed (see the [instructions above](https://github.com/oobabooga/text-generation-webui/blob/main/docs/GPTQ-models-(4-bit-mode).md#step-1-install-nvcc)).
### Usage
When you quantize a model using AutoGPTQ, a folder containing a file called `quantize_config.json` will be generated. Place that folder inside your `models/` folder and load it with the `--autogptq` flag:
```
python server.py --autogptq --model model_name
```
Alternatively, check the `autogptq` box in the "Model" tab of the UI before loading the model.
### Offloading
In order to do CPU offloading or multi-GPU inference with AutoGPTQ, use the `--gpu-memory` flag. It is currently somewhat slower than offloading with the `--pre_layer` option in GPTQ-for-LLaMa.
For CPU offloading:
```
python server.py --autogptq --gpu-memory 3000MiB --model model_name
```
For multi-GPU inference:
```
python server.py --autogptq --gpu-memory 3000MiB 6000MiB --model model_name
```
### Using LoRAs with AutoGPTQ
Not supported yet.

**models/config.yaml**
@@ -180,3 +180,11 @@ llama-65b-gptq-3bit:

```
.*bluemoonrp-(30|13)b:
  mode: 'instruct'
  instruction_template: 'Bluemoon'
  truncation_length: 4096
.*Nous-Hermes-13b:
  mode: 'instruct'
  instruction_template: 'Alpaca'
.*airoboros-13b-gpt4:
  mode: 'instruct'
  instruction_template: 'Vicuna-v1.1'
  truncation_length: 4096
```
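These entries are keyed by regular expressions that get matched against the model's folder name, with later, more specific patterns overriding the generic defaults. A small illustrative resolver (the `.*` defaults shown here are placeholders, not the file's real values):

```
# Illustrative resolution of regex-keyed model settings like the entries above.
import re

model_settings = {
    ".*": {"mode": "chat", "truncation_length": 2048},  # placeholder defaults
    ".*Nous-Hermes-13b": {"mode": "instruct", "instruction_template": "Alpaca"},
    ".*airoboros-13b-gpt4": {"mode": "instruct", "instruction_template": "Vicuna-v1.1", "truncation_length": 4096},
}

def resolve(model_name):
    settings = {}
    for pattern, values in model_settings.items():
        if re.match(pattern.lower(), model_name.lower()):
            settings.update(values)
    return settings

print(resolve("airoboros-13b-gpt4-1.4"))
# {'mode': 'instruct', 'truncation_length': 4096, 'instruction_template': 'Vicuna-v1.1'}
```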
**modules/chat.py**
@@ -325,6 +325,10 @@ def generate_chat_reply(text, history, state, regenerate=False, _continue=False,

```
# Same as above but returns HTML for the UI
def generate_chat_reply_wrapper(text, start_with, state, regenerate=False, _continue=False):
    if start_with != '' and _continue == False:
        if regenerate == True:
            text = remove_last_message()
            regenerate = False

        _continue = True
        send_dummy_message(text)
        send_dummy_reply(start_with)
```

**modules/models.py**
@@ -81,10 +81,10 @@ def load_model(model_name):

Before:

```
        logger.error('The path to the model does not exist. Exiting.')
        return None, None

    if shared.args.autogptq:
        load_func = AutoGPTQ_loader
    elif shared.args.wbits > 0:
        load_func = GPTQ_loader
    elif shared.model_type == 'llamacpp':
        load_func = llamacpp_loader
    elif shared.model_type == 'rwkv':
```

After:

```
        logger.error('The path to the model does not exist. Exiting.')
        return None, None

    if shared.args.gptq_for_llama:
        load_func = GPTQ_loader
    elif Path(f'{shared.args.model_dir}/{model_name}/quantize_config.json').exists() or shared.args.wbits > 0:
        load_func = AutoGPTQ_loader
    elif shared.model_type == 'llamacpp':
        load_func = llamacpp_loader
    elif shared.model_type == 'rwkv':
```

**modules/shared.py**
@@ -141,7 +141,8 @@ parser.add_argument('--warmup_autotune', action='store_true', help='(triton) Ena

Before:

```
parser.add_argument('--fused_mlp', action='store_true', help='(triton) Enable fused mlp.')

# AutoGPTQ
parser.add_argument('--autogptq', action='store_true', help='Use AutoGPTQ for loading quantized models instead of the internal GPTQ loader.')
parser.add_argument('--triton', action='store_true', help='Use triton.')
parser.add_argument('--desc_act', action='store_true', help='For models that don\'t have a quantize_config.json, this parameter is used to define whether to set desc_act or not in BaseQuantizeConfig.')
```

After:

```
parser.add_argument('--fused_mlp', action='store_true', help='(triton) Enable fused mlp.')

# AutoGPTQ
parser.add_argument('--gptq-for-llama', action='store_true', help='Use GPTQ-for-LLaMa to load the GPTQ model instead of AutoGPTQ.')
parser.add_argument('--autogptq', action='store_true', help='DEPRECATED')
parser.add_argument('--triton', action='store_true', help='Use triton.')
parser.add_argument('--desc_act', action='store_true', help='For models that don\'t have a quantize_config.json, this parameter is used to define whether to set desc_act or not in BaseQuantizeConfig.')
```
@@ -181,12 +182,9 @@ parser.add_argument('--multimodal-pipeline', type=str, default=None, help='The m

Before:

```
args = parser.parse_args()
args_defaults = parser.parse_args([])

# Deprecation warnings for parameters that have been renamed
deprecated_dict = {}
for k in deprecated_dict:
    if getattr(args, k) != deprecated_dict[k][1]:
        logger.warning(f"--{k} is deprecated and will be removed. Use --{deprecated_dict[k][0]} instead.")
        setattr(args, deprecated_dict[k][0], getattr(args, k))

# Security warnings
if args.trust_remote_code:
```

After:

```
args = parser.parse_args()
args_defaults = parser.parse_args([])

# Deprecation warnings
if args.autogptq:
    logger.warning('--autogptq has been deprecated and will be removed soon. AutoGPTQ is now used by default for GPTQ models.')

# Security warnings
if args.trust_remote_code:
```

**modules/ui.py**
@@ -30,7 +30,7 @@ theme = gr.themes.Default(

Before:

```
def list_model_elements():
    elements = ['cpu_memory', 'auto_devices', 'disk', 'cpu', 'bf16', 'load_in_8bit', 'trust_remote_code', 'load_in_4bit', 'compute_dtype', 'quant_type', 'use_double_quant', 'wbits', 'groupsize', 'model_type', 'pre_layer', 'autogptq', 'triton', 'desc_act', 'threads', 'n_batch', 'no_mmap', 'mlock', 'n_gpu_layers', 'n_ctx', 'llama_cpp_seed']
    for i in range(torch.cuda.device_count()):
        elements.append(f'gpu_memory_{i}')
```

After:

```
def list_model_elements():
    elements = ['cpu_memory', 'auto_devices', 'disk', 'cpu', 'bf16', 'load_in_8bit', 'trust_remote_code', 'load_in_4bit', 'compute_dtype', 'quant_type', 'use_double_quant', 'gptq_for_llama', 'wbits', 'groupsize', 'model_type', 'pre_layer', 'triton', 'desc_act', 'threads', 'n_batch', 'no_mmap', 'mlock', 'n_gpu_layers', 'n_ctx', 'llama_cpp_seed']
    for i in range(torch.cuda.device_count()):
        elements.append(f'gpu_memory_{i}')
```

**server.py**
@@ -393,12 +393,12 @@ def create_model_menus():

Before:

```
with gr.Row():
    with gr.Column():
        gr.Markdown('AutoGPTQ')
        shared.gradio['autogptq'] = gr.Checkbox(label="autogptq", value=shared.args.autogptq, info='Activate AutoGPTQ loader. gpu-memory should be used for CPU offloading instead of pre_layer.')
        shared.gradio['triton'] = gr.Checkbox(label="triton", value=shared.args.triton)
        shared.gradio['desc_act'] = gr.Checkbox(label="desc_act", value=shared.args.desc_act, info='\'desc_act\', \'wbits\', and \'groupsize\' are used for old models without a quantize_config.json.')

    with gr.Column():
        gr.Markdown('GPTQ-for-LLaMa')
        with gr.Row():
            shared.gradio['wbits'] = gr.Dropdown(label="wbits", choices=["None", 1, 2, 3, 4, 8], value=shared.args.wbits if shared.args.wbits > 0 else "None")
            shared.gradio['groupsize'] = gr.Dropdown(label="groupsize", choices=["None", 32, 64, 128, 1024], value=shared.args.groupsize if shared.args.groupsize > 0 else "None")
```

After:

```
with gr.Row():
    with gr.Column():
        gr.Markdown('AutoGPTQ')
        shared.gradio['triton'] = gr.Checkbox(label="triton", value=shared.args.triton)
        shared.gradio['desc_act'] = gr.Checkbox(label="desc_act", value=shared.args.desc_act, info='\'desc_act\', \'wbits\', and \'groupsize\' are used for old models without a quantize_config.json.')

    with gr.Column():
        gr.Markdown('GPTQ-for-LLaMa')
        shared.gradio['gptq_for_llama'] = gr.Checkbox(label="gptq-for-llama", value=shared.args.gptq_for_llama, info='Use GPTQ-for-LLaMa to load the GPTQ model instead of AutoGPTQ. pre_layer should be used for CPU offloading instead of gpu-memory.')
        with gr.Row():
            shared.gradio['wbits'] = gr.Dropdown(label="wbits", choices=["None", 1, 2, 3, 4, 8], value=shared.args.wbits if shared.args.wbits > 0 else "None")
            shared.gradio['groupsize'] = gr.Dropdown(label="groupsize", choices=["None", 32, 64, 128, 1024], value=shared.args.groupsize if shared.args.groupsize > 0 else "None")
```
@@ -1049,6 +1049,7 @@ if __name__ == "__main__":

```
        'mode': shared.settings['mode'],
        'skip_special_tokens': shared.settings['skip_special_tokens'],
        'custom_stopping_strings': shared.settings['custom_stopping_strings'],
        'truncation_length': shared.settings['truncation_length'],
    }
    shared.model_config.move_to_end('.*', last=False)  # Move to the beginning
```