Merge branch 'oobabooga:main' into searx_integration_bs4

2024-11-25 01:09:22 +01:00 · 2023-04-22 23:28:07 -07:00 · 2023-04-22 23:28:07 -07:00 · 9208682065
commit 9208682065
parent 1a91a3cdad 7ff645899e
31 changed files with 1284 additions and 93 deletions
--- a/README.md
+++ b/README.md
@ -16,23 +16,23 @@ Its goal is to become the [AUTOMATIC1111/stable-diffusion-webui](https://github.
 * Instruct mode compatible with Alpaca, Vicuna, Open Assistant, Dolly, Koala, and ChatGLM formats 
 * Nice HTML output for GPT-4chan
 * Markdown output for [GALACTICA](https://github.com/paperswithcode/galai), including LaTeX rendering
-* [Custom chat characters](https://github.com/oobabooga/text-generation-webui/wiki/Custom-chat-characters)
+* [Custom chat characters](docs/Custom-chat-characters.md)
 * Advanced chat features (send images, get audio responses with TTS)
 * Very efficient text streaming
 * Parameter presets
 * 8-bit mode
 * Layers splitting across GPU(s), CPU, and disk
 * CPU mode
-* [FlexGen](https://github.com/oobabooga/text-generation-webui/wiki/FlexGen)
+* [FlexGen](docs/FlexGen.md)
-* [DeepSpeed ZeRO-3](https://github.com/oobabooga/text-generation-webui/wiki/DeepSpeed)
+* [DeepSpeed ZeRO-3](docs/DeepSpeed.md)
 * API [with](https://github.com/oobabooga/text-generation-webui/blob/main/api-example-stream.py) streaming and [without](https://github.com/oobabooga/text-generation-webui/blob/main/api-example.py) streaming
-* [LLaMA model](https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model)
+* [LLaMA model](docs/LLaMA-model.md)
-* [4-bit GPTQ mode](https://github.com/oobabooga/text-generation-webui/wiki/GPTQ-models-(4-bit-mode))
+* [4-bit GPTQ mode](docs/GPTQ-models-(4-bit-mode).md)
-* [llama.cpp](https://github.com/oobabooga/text-generation-webui/wiki/llama.cpp-models)
+* [llama.cpp](docs/llama.cpp-models.md)
-* [RWKV model](https://github.com/oobabooga/text-generation-webui/wiki/RWKV-model)
+* [RWKV model](docs/RWKV-model.md)
-* [LoRA (loading and training)](https://github.com/oobabooga/text-generation-webui/wiki/Using-LoRAs)
+* [LoRA (loading and training)](docs/Using-LoRAs.md)
 * Softprompts
-* [Extensions](https://github.com/oobabooga/text-generation-webui/wiki/Extensions)
+* [Extensions](docs/Extensions.md) - see the [user extensions list](https://github.com/oobabooga/text-generation-webui-extensions)
 ## Installation
@ -52,7 +52,7 @@ Just download the zip above, extract it, and double click on "start". The web UI
 Recommended if you have some experience with the command-line.
-On Windows, I additionally recommend carrying out the installation on WSL instead of the base system: [WSL installation guide](https://github.com/oobabooga/text-generation-webui/wiki/WSL-installation-guide).
+On Windows, I additionally recommend carrying out the installation on WSL instead of the base system: [WSL installation guide](https://github.com/oobabooga/text-generation-webui/blob/main/docs/WSL-installation-guide.md).
 #### 0. Install Conda
@ -105,7 +105,7 @@ pip install -r requirements.txt
 ### Alternative: manual Windows installation
-As an alternative to the recommended WSL method, you can install the web UI natively on Windows using this guide. It will be a lot harder and the performance may be slower: [Windows installation guide](https://github.com/oobabooga/text-generation-webui/wiki/Windows-installation-guide).
+As an alternative to the recommended WSL method, you can install the web UI natively on Windows using this guide. It will be a lot harder and the performance may be slower: [Windows installation guide](https://github.com/oobabooga/text-generation-webui/blob/main/docs/Windows-installation-guide.md).
 ### Alternative: Docker
@ -116,7 +116,7 @@ cp docker/.env.example .env
 docker compose up --build
 ```
-You need to have docker compose v2.17 or higher installed in your system. To see how to install docker compose itself, see the guide in https://github.com/oobabooga/text-generation-webui/tree/main/docker.
+You need to have docker compose v2.17 or higher installed in your system. To see how to install docker compose itself, see the guide in [here](https://github.com/oobabooga/text-generation-webui/blob/main/docs/Docker.md).
 Contributed by [@loeken](https://github.com/loeken) in [#633](https://github.com/oobabooga/text-generation-webui/pull/633)
@ -230,9 +230,9 @@ Optionally, you can use the following command-line flags:
 | `--groupsize GROUPSIZE`   | Group size. |
 | `--pre_layer PRE_LAYER`   | The number of layers to allocate to the GPU. Setting this parameter enables CPU offloading for 4-bit models. |
 | `--monkey-patch`          | Apply the monkey patch for using LoRAs with quantized models.
-| `--no-quant_attn`         | (triton) Disable quant attention. If you encounter incoherent results try disabling this.
+| `--quant_attn`         | (triton) Enable quant attention.
-| `--no-warmup_autotune`    | (triton) Disable warmup autotune.
+| `--warmup_autotune`    | (triton) Enable warmup autotune.
-| `--no-fused_mlp`          | (triton) Disable fused mlp. If you encounter "Unexpected mma -> mma layout conversion" try disabling this.
+| `--fused_mlp`          | (triton) Enable fused mlp.
 #### FlexGen
@ -269,7 +269,7 @@ Optionally, you can use the following command-line flags:
 | `--auto-launch`                       | Open the web UI in the default browser upon launch. |
 | `--gradio-auth-path GRADIO_AUTH_PATH` | Set the gradio authentication file path. The file should contain one or more user:password pairs in this format: "u1:p1,u2:p2,u3:p3" |
-Out of memory errors? [Check the low VRAM guide](https://github.com/oobabooga/text-generation-webui/wiki/Low-VRAM-guide).
+Out of memory errors? [Check the low VRAM guide](docs/Low-VRAM-guide.md).
 ## Presets
@ -281,7 +281,7 @@ By default, 10 presets by NovelAI and KoboldAI are included. These were selected
 ## System requirements
-Check the [wiki](https://github.com/oobabooga/text-generation-webui/wiki/System-requirements) for some examples of VRAM and RAM usage in both GPU and CPU mode.
+Check the [wiki](docs/System-requirements.md) for some examples of VRAM and RAM usage in both GPU and CPU mode.
 ## Contributing
--- a/css/main.css
+++ b/css/main.css
@ -107,3 +107,33 @@ footer {
 button {
  font-size: 14px !important;
 }
 .small-button {
  max-width: 171px;
 }
 /* Align the elements for SD_api_picture extension */
 .SDAP #sampler_box {
  padding-top: var(--spacing-sm);
  padding-bottom: var(--spacing-sm);
 }
 .SDAP #seed_box,
 .SDAP #cfg_box {
  padding-top: var(--spacing-md);
 }
 .SDAP #sampler_box span,
 .SDAP #seed_box span,
 .SDAP #cfg_box span{
  margin-bottom: var(--spacing-sm);
 }
 .SDAP svg.dropdown-arrow {
  flex-shrink: 0 !important;
  margin: 0px !important;
 }
 .SDAP .hires_opts input[type="number"] {
  width: 6em !important;
 }
--- a/docs/Custom-chat-characters.md
+++ b/docs/Custom-chat-characters.md
@ -0,0 +1,30 @@
 Custom chat mode characters are defined by `.yaml` files inside the `characters` folder. An example is included: [Example.yaml](https://github.com/oobabooga/text-generation-webui/blob/main/characters/Example.yaml)
 The following fields may be defined:
 | Field | Description |
 |-------|-------------|
 | `name` | The character's name. |
 | `context` | A string that appears at the top of the prompt. It usually contains a description of the character's personality. |
 | `greeting` (optional) | The character's opening message when a new conversation is started. |
 | `example_dialogue` (optional) | A few example messages to guide the model. |
 | `your_name` (optional) | Your name. This overwrites what you had previously written in the `Your name` field in the interface. |
 #### Special tokens
 * `{{char}}` or `<BOT>`: are replaced with the character's name
 * `{{user}}` or `<USER>`: are replaced with your name
 These replacements happen when the character is loaded, and they apply to the `context`, `greeting`, and `example_dialogue` fields.
 #### How do I add a profile picture for my character?
 Put an image with the same name as your character's yaml file into the `characters` folder. For example, if your bot is `Character.yaml`, add `Character.jpg` or `Character.png` to the folder.
 #### Is the chat history truncated in the prompt?
 Once your prompt reaches the 2048 token limit, old messages will be removed one at a time. The context string will always stay at the top of the prompt and will never get truncated.
 #### Pygmalion format characters
 These are also supported out of the box. Simply put the JSON file in the `characters` folder, or upload it directly from the web UI by clicking on the "Upload character" tab at the bottom.
--- a/docs/DeepSpeed.md
+++ b/docs/DeepSpeed.md
@ -0,0 +1,23 @@
 An alternative way of reducing the GPU memory usage of models is to use the `DeepSpeed ZeRO-3` optimization.
 With this, I have been able to load a 6b model (GPT-J 6B) with less than 6GB of VRAM. The speed of text generation is very decent and much better than what would be accomplished with `--auto-devices --gpu-memory 6`.
 As far as I know, DeepSpeed is only available for Linux at the moment.
 ### How to use it
 1. Install DeepSpeed: 
 ```
 pip install deepspeed
 ```
 2. Start the web UI replacing `python` with `deepspeed --num_gpus=1` and adding the `--deepspeed` flag. Example:
 ```
 deepspeed --num_gpus=1 server.py --deepspeed --chat --model gpt-j-6B
 ```
 ### Learn more
 For more information, check out [this comment](https://github.com/oobabooga/text-generation-webui/issues/40#issuecomment-1412038622) by 81300, who came up with the DeepSpeed support in this web UI.
--- a/docker/README.md
+++ b/docker/README.md
--- a/docs/Extensions.md
+++ b/docs/Extensions.md
@ -0,0 +1,157 @@
 This web UI supports extensions. They are simply files under 
 ```
 extensions/your_extension_name/script.py
 ```
 which can be invoked with the 
 ```
 --extension your_extension_name
 ```
 command-line flag.
 ## [text-generation-webui-extensions](https://github.com/oobabooga/text-generation-webui-extensions)
 The link above contains a directory of user extensions for text-generation-webui.
 ## Built-in extensions
 |Extension|Description|
 |---------|-----------|
 |[google_translate](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/google_translate)| Automatically translates inputs and outputs using Google Translate.|
 |[character_bias](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/character_bias)| Just a very simple example that biases the bot's responses in chat mode.|
 |[gallery](https://github.com/oobabooga/text-generation-webui/blob/main/extensions/gallery/)| Creates a gallery with the chat characters and their pictures. |
 |[silero_tts](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/silero_tts)| Text-to-speech extension using [Silero](https://github.com/snakers4/silero-models). When used in chat mode, it replaces the responses with an audio widget. |
 |[elevenlabs_tts](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/elevenlabs_tts)| Text-to-speech extension using the [ElevenLabs](https://beta.elevenlabs.io/) API. You need an API key to use it. Author: [@MetaIX](https://github.com/MetaIX). |
 |[send_pictures](https://github.com/oobabooga/text-generation-webui/blob/main/extensions/send_pictures/)| Creates an image upload field that can be used to send images to the bot in chat mode. Captions are automatically generated using BLIP. Author: [@SillyLossy](https://github.com/sillylossy).|
 |[api](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/api)| Creates an API similar to the one provided by KoboldAI. Works with TavernAI: start the web UI with `python server.py --no-stream --extensions api` and set the API URL to `http://127.0.0.1:5000/api`. Author: [@mayaeary](https://github.com/mayaeary).|
 |[whisper_stt](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/whisper_stt)| Allows you to enter your inputs in chat mode using your microphone. Author: [@EliasVincent](https://github.com/EliasVincent).|
 |[sd_api_pictures](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/sd_api_pictures)| Allows you to request pictures from the bot in chat mode, which will be generated using the AUTOMATIC1111 Stable Diffusion API. See examples [here](https://github.com/oobabooga/text-generation-webui/pull/309). Author: [@Brawlence](https://github.com/Brawlence).|
 ## How to write an extension
 `script.py` has access to all variables in the UI through the `modules.shared` module, and it may define the following functions:
 | Function        | Description |
 |-------------|-------------|
 | `def ui()` | Creates custom gradio elements when the UI is launched. | 
 | `def input_modifier(string)`  | Modifies the input string before it enters the model. In chat mode, it is applied to the user message. Otherwise, it is applied to the entire prompt. |
 | `def output_modifier(string)`  | Modifies the output string before it is presented in the UI. In chat mode, it is applied to the bot's reply. Otherwise, it is applied to the entire output. |
 | `def bot_prefix_modifier(string)`  | Applied in chat mode to the prefix for the bot's reply (more on that below). |
 | `def custom_generate_chat_prompt(...)` | Overrides the prompt generator in chat mode. |
 Additionally, the script may define two special global variables:
 #### `params` dictionary
 ```python
 params = {
    "language string": "ja",
 }
 ```
 This dicionary can be used to make the extension parameters customizable by adding entries to a `settings.json` file like this:
 ```python
 "google_translate-language string": "fr",
 ``` 
 #### `input_hijack` dictionary
 ```python
 input_hijack = {
    'state': False,
    'value': ["", ""]
 }
 ```
 This is only relevant in chat mode. If your extension sets `input_hijack['state']` to `True` at any moment, the next call to `modules.chat.chatbot_wrapper` will use the vales inside `input_hijack['value']` as the user input for text generation. See the `send_pictures` extension above for an example.
 ## The `bot_prefix_modifier`
 In chat mode, this function modifies the prefix for a new bot message. For instance, if your bot is named `Marie Antoinette`, the default prefix for a new message will be
 ```
 Marie Antoinette:
 ```
 Using `bot_prefix_modifier`, you can change it to:
 ```
 Marie Antoinette: *I am very enthusiastic*
 ```
 Marie Antoinette will become very enthusiastic in all her messages.
 ## Using multiple extensions at the same time
 In order to use your extension, you must start the web UI with the `--extensions` flag followed by the name of your extension (the folder under `text-generation-webui/extension` where `script.py` resides).
 You can activate more than one extension at a time by providing their names separated by spaces. The input, output and bot prefix modifiers will be applied in the specified order. For `custom_generate_chat_prompt`, only the first declaration encountered will be used and the rest will be ignored.
 ```
 python server.py --extensions enthusiasm translate # First apply enthusiasm, then translate
 python server.py --extensions translate enthusiasm # First apply translate, then enthusiasm
 ```
 ## `custom_generate_chat_prompt` example
 Below is an extension that just reproduces the default prompt generator in `modules/chat.py`. You can modify it freely to come up with your own prompts in chat mode.
 ```python
 def custom_generate_chat_prompt(user_input, state, **kwargs):
    impersonate = kwargs['impersonate'] if 'impersonate' in kwargs else False
    _continue = kwargs['_continue'] if '_continue' in kwargs else False
    also_return_rows = kwargs['also_return_rows'] if 'also_return_rows' in kwargs else False
    is_instruct = state['mode'] == 'instruct'
    rows = [f"{state['context'].strip()}\n"]
    # Finding the maximum prompt size
    chat_prompt_size = state['chat_prompt_size']
    if shared.soft_prompt:
        chat_prompt_size -= shared.soft_prompt_tensor.shape[1]
    max_length = min(get_max_prompt_length(state), chat_prompt_size)
    if is_instruct:
        prefix1 = f"{state['name1']}\n"
        prefix2 = f"{state['name2']}\n"
    else:
        prefix1 = f"{state['name1']}: "
        prefix2 = f"{state['name2']}: "
    i = len(shared.history['internal']) - 1
    while i >= 0 and len(encode(''.join(rows))[0]) < max_length:
        if _continue and i == len(shared.history['internal']) - 1:
            rows.insert(1, f"{prefix2}{shared.history['internal'][i][1]}")
        else:
            rows.insert(1, f"{prefix2}{shared.history['internal'][i][1].strip()}{state['end_of_turn']}\n")
        string = shared.history['internal'][i][0]
        if string not in ['', '<|BEGIN-VISIBLE-CHAT|>']:
            rows.insert(1, f"{prefix1}{string.strip()}{state['end_of_turn']}\n")
        i -= 1
    if impersonate:
        rows.append(f"{prefix1.strip() if not is_instruct else prefix1}")
        limit = 2
    elif _continue:
        limit = 3
    else:
        # Adding the user message
        user_input = fix_newlines(user_input)
        if len(user_input) > 0:
            rows.append(f"{prefix1}{user_input}{state['end_of_turn']}\n")
        # Adding the Character prefix
        rows.append(apply_extensions(f"{prefix2.strip() if not is_instruct else prefix2}", "bot_prefix"))
        limit = 3
    while len(rows) > limit and len(encode(''.join(rows))[0]) >= max_length:
        rows.pop(1)
    prompt = ''.join(rows)
    if also_return_rows:
        return prompt, rows
    else:
        return prompt
 ```
--- a/docs/FlexGen.md
+++ b/docs/FlexGen.md
@ -0,0 +1,64 @@
 >FlexGen is a high-throughput generation engine for running large language models with limited GPU memory (e.g., a 16GB T4 GPU or a 24GB RTX3090 gaming card!).
 https://github.com/FMInference/FlexGen
 ## Installation
 No additional installation steps are necessary. FlexGen is in the `requirements.txt` file for this project.
 ## Converting a model
 FlexGen only works with the OPT model, and it needs to be converted to numpy format before starting the web UI:
 ```
 python convert-to-flexgen.py models/opt-1.3b/
 ```
 The output will be saved to `models/opt-1.3b-np/`.
 ## Usage
 The basic command is the following:
 ```
 python server.py --model opt-1.3b  --flexgen
 ```
 For large models, the RAM usage may be too high and your computer may freeze. If that happens, you can try this:
 ```
 python server.py --model opt-1.3b  --flexgen --compress-weight
 ```
 With this second command, I was able to run both OPT-6.7b and OPT-13B with **2GB VRAM**, and the speed was good in both cases.
 You can also manually set the offload strategy with
 ```
 python server.py --model opt-1.3b  --flexgen --percent 0 100 100 0 100 0
 ```
 where the six numbers after `--percent` are:
 ```
 the percentage of weight on GPU
 the percentage of weight on CPU
 the percentage of attention cache on GPU
 the percentage of attention cache on CPU
 the percentage of activations on GPU
 the percentage of activations on CPU
 ```
 You should typically only change the first two numbers. If their sum is less than 100, the remaining layers will be offloaded to the disk, by default into the `text-generation-webui/cache` folder.
 ## Performance
 In my experiments with OPT-30B using a RTX 3090 on Linux, I have obtained these results:
 * `--flexgen --compress-weight --percent 0 100 100 0 100 0`: 0.99 seconds per token.
 * `--flexgen --compress-weight --percent 100 0 100 0 100 0`: 0.765 seconds per token.
 ## Limitations
 * Only works with the OPT models.
 * Only two generation parameters are available: `temperature` and `do_sample`.
--- a/docs/GPTQ-models-(4-bit-mode).md
+++ b/docs/GPTQ-models-(4-bit-mode).md
@ -0,0 +1,140 @@
 In 4-bit mode, models are loaded with just 25% of their regular VRAM usage. So LLaMA-7B fits into a 6GB GPU, and LLaMA-30B fits into a 24GB GPU.
 This is possible thanks to [@qwopqwop200](https://github.com/qwopqwop200/GPTQ-for-LLaMa)'s adaptation of the GPTQ algorithm for LLaMA: https://github.com/qwopqwop200/GPTQ-for-LLaMa
 GPTQ is a clever quantization algorithm that lightly reoptimizes the weights during quantization so that the accuracy loss is compensated relative to a round-to-nearest quantization. See the paper for more details: https://arxiv.org/abs/2210.17323
 ## GPTQ-for-LLaMa branches
 Different branches of GPTQ-for-LLaMa are available:
 | Branch | Comment |
 |----|----|
 | [Old CUDA branch (recommended)](https://github.com/oobabooga/GPTQ-for-LLaMa/) | The fastest branch, works on Windows and Linux. |
 | [Up-to-date triton branch](https://github.com/qwopqwop200/GPTQ-for-LLaMa) | Slightly more precise than the old CUDA branch from 13b upwards, significantly more precise for 7b. 2x slower for small context size and only works on Linux. |
 | [Up-to-date CUDA branch](https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda) | As precise as the up-to-date triton branch, 10x slower than the old cuda branch for small context size. |
 Overall, I recommend using the old CUDA branch. It is included by default in the one-click-installer for this web UI.
 ## Installation
 ### Step 0: install nvcc
 ```
 conda activate textgen
 conda install -c conda-forge cudatoolkit-dev
 ```
 The command above takes some 10 minutes to run and shows no progress bar or updates along the way.
 See this issue for more details: https://github.com/oobabooga/text-generation-webui/issues/416#issuecomment-1475078571
 ### Step 1: install GPTQ-for-LLaMa
 Clone the GPTQ-for-LLaMa repository into the `text-generation-webui/repositories` subfolder and install it:
 ```
 mkdir repositories
 cd repositories
 git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda
 cd GPTQ-for-LLaMa
 python setup_cuda.py install
 ```
 You are going to need to have a C++ compiler installed into your system for the last command. On Linux, `sudo apt install build-essential` or equivalent is enough.
 If you want to you to use the up-to-date CUDA or triton branches instead of the old CUDA branch, use these commands:
 ```
 cd repositories
 rm -r GPTQ-for-LLaMa
 pip uninstall -y quant-cuda
 git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b cuda
 ...
 ```
 ```
 cd repositories
 rm -r GPTQ-for-LLaMa
 pip uninstall -y quant-cuda
 git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b triton
 ...
 ```
 https://github.com/qwopqwop200/GPTQ-for-LLaMa
 ### Step 2: get the pre-converted weights
 * Converted without `group-size` (better for the 7b model): https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483891617
 * Converted with `group-size` (better from 13b upwards): https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483941105 
 ⚠️ The tokenizer files in the sources above may be outdated. Make sure to obtain the universal LLaMA tokenizer as described [here](https://github.com/oobabooga/text-generation-webui/blob/main/docs/LLaMA-model.md#option-1-pre-converted-weights).
 ### Step 3: Start the web UI:
 For the models converted without `group-size`:
 ```
 python server.py --model llama-7b-4bit 
 ```
 For the models converted with `group-size`:
 ```
 python server.py --model llama-13b-4bit-128g 
 ```
 The command-line flags `--wbits` and `--groupsize` are automatically detected based on the folder names, but you can also specify them manually like 
 ```
 python server.py --model llama-13b-4bit-128g --wbits 4 --groupsize 128
 ```
 ## CPU offloading
 It is possible to offload part of the layers of the 4-bit model to the CPU with the `--pre_layer` flag. The higher the number after `--pre_layer`, the more layers will be allocated to the GPU.
 With this command, I can run llama-7b with 4GB VRAM:
 ```
 python server.py --model llama-7b-4bit --pre_layer 20
 ```
 This is the performance:
 ```
 Output generated in 123.79 seconds (1.61 tokens/s, 199 tokens)
 ```
 ## Using LoRAs in 4-bit mode
 At the moment, this feature is not officially supported by the relevant libraries, but a patch exists and is supported by this web UI: https://github.com/johnsmith0031/alpaca_lora_4bit
 In order to use it:
 1. Make sure that your requirements are up to date:
 ```
 cd text-generation-webui
 pip install -r requirements.txt --upgrade
 ```
 2. Clone `johnsmith0031/alpaca_lora_4bit` into the repositories folder:
 ```
 cd text-generation-webui/repositories
 git clone https://github.com/johnsmith0031/alpaca_lora_4bit
 ```
 3. Install https://github.com/sterlind/GPTQ-for-LLaMa with this command:
 ```
 pip install git+https://github.com/sterlind/GPTQ-for-LLaMa.git@lora_4bit
 ```
 4. Start the UI with the `--monkey-patch` flag:
 ```
 python server.py --model llama-7b-4bit-128g --listen --lora tloen_alpaca-lora-7b --monkey-patch
 ```
--- a/docs/LLaMA-model.md
+++ b/docs/LLaMA-model.md
@ -0,0 +1,45 @@
 LLaMA is a Large Language Model developed by Meta AI. 
 It was trained on more tokens than previous models. The result is that the smallest version with 7 billion parameters has similar performance to GPT-3 with 175 billion parameters.
 This guide will cover usage through the official `transformers` implementation. For 4-bit mode, head over to [GPTQ models (4 bit mode)
 ](GPTQ-models-(4-bit-mode).md).
 ## Getting the weights
 ### Option 1: pre-converted weights
 * Torrent: https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1484235789
 * Direct download: https://huggingface.co/Neko-Institute-of-Science
 ⚠️ The tokenizers for the sources above and also for many LLaMA fine-tunes available on Hugging Face may be outdated, so I recommend downloading the following universal LLaMA tokenizer: 
 ```
 python download-model.py oobabooga/llama-tokenizer
 ```
 Once downloaded, it will be automatically applied to **every** `LlamaForCausalLM` model that you try to load.
 ### Option 2: convert the weights yourself
 1. Install the `protobuf` library:
 ```
 pip install protobuf
 ```
 2. Use the script below to convert the model in `.pth` format that you, a fellow academic, downloaded using Meta's official link:
 ### [convert_llama_weights_to_hf.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py)
 ```
 python convert_llama_weights_to_hf.py --input_dir /path/to/LLaMA --model_size 7B --output_dir /tmp/outputs/llama-7b
 ```
 3. Move the `llama-7b` folder inside your `text-generation-webui/models` folder.
 ## Starting the web UI
 ```python
 python server.py --model llama-7b
 ```
--- a/docs/Low-VRAM-guide.md
+++ b/docs/Low-VRAM-guide.md
@ -0,0 +1,51 @@
 If you GPU is not large enough to fit a model, try these in the following order:
 ### Load the model in 8-bit mode
 ```
 python server.py --load-in-8bit
 ```
 This reduces the memory usage by half with no noticeable loss in quality. Only newer GPUs support 8-bit mode.
 ### Split the model across your GPU and CPU
 ```
 python server.py --auto-devices
 ```
 If you can load the model with this command but it runs out of memory when you try to generate text, try increasingly limiting the amount of memory allocated to the GPU until the error stops happening:
 ```
 python server.py --auto-devices --gpu-memory 10
 python server.py --auto-devices --gpu-memory 9
 python server.py --auto-devices --gpu-memory 8
 ...
 ```
 where the number is in GiB.
 For finer control, you can also specify the unit in MiB explicitly:
 ```
 python server.py --auto-devices --gpu-memory 8722MiB
 python server.py --auto-devices --gpu-memory 4725MiB
 python server.py --auto-devices --gpu-memory 3500MiB
 ...
 ```
 Additionally, you can also set the `--no-cache` value to reduce the GPU usage while generating text at a performance cost. This may allow you to set a higher value for `--gpu-memory`, resulting in a net performance gain.
 ### Send layers to a disk cache
 As a desperate last measure, you can split the model across your GPU, CPU, and disk:
 ```
 python server.py --auto-devices --disk
 ```
 With this, I am able to load a 30b model into my RTX 3090, but it takes 10 seconds to generate 1 word.
 ### DeepSpeed (experimental)
 An experimental alternative to all of the above is to use DeepSpeed: [guide](DeepSpeed.md).
--- a/docs/README.md
+++ b/docs/README.md
@ -0,0 +1,19 @@
 # text-generation-webui manual
 ## Table of contents
 * [Custom-chat-characters](Custom-chat-characters.md)
 * [Docker Compose](Docker.md)
 * [DeepSpeed](DeepSpeed.md)
 * [Extensions](Extensions.md)
 * [FlexGen](FlexGen.md)
 * [GPTQ-models-(4-bit-mode)](GPTQ-models-(4-bit-mode).md)
 * [llama.cpp-models](llama.cpp-models.md)
 * [LLaMA-model](LLaMA-model.md)
 * [Low-VRAM-guide](Low-VRAM-guide.md)
 * [RWKV-model](RWKV-model.md)
 * [Spell-book](Spell-book.md)
 * [System-requirements](System-requirements.md)
 * [Using-LoRAs](Using-LoRAs.md)
 * [Windows-installation-guide](Windows-installation-guide.md)
 * [WSL-installation-guide](WSL-installation-guide.md)
--- a/docs/RWKV-model.md
+++ b/docs/RWKV-model.md
@ -0,0 +1,54 @@
 > RWKV: RNN with Transformer-level LLM Performance
 >
 > It combines the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding (using the final hidden state).
 https://github.com/BlinkDL/RWKV-LM
 https://github.com/BlinkDL/ChatRWKV
 ## Using RWKV in the web UI
 #### 1. Download the model
 It is available in different sizes:
 * https://huggingface.co/BlinkDL/rwkv-4-pile-3b/
 * https://huggingface.co/BlinkDL/rwkv-4-pile-7b/
 * https://huggingface.co/BlinkDL/rwkv-4-pile-14b/
 There are also older releases with smaller sizes like:
 * https://huggingface.co/BlinkDL/rwkv-4-pile-169m/resolve/main/RWKV-4-Pile-169M-20220807-8023.pth
 Download the chosen `.pth` and put it directly in the `models` folder. 
 #### 2. Download the tokenizer
 [20B_tokenizer.json](https://raw.githubusercontent.com/BlinkDL/ChatRWKV/main/v2/20B_tokenizer.json)
 Also put it directly in the `models` folder. Make sure to not rename it. It should be called `20B_tokenizer.json`.
 #### 3. Launch the web UI
 No additional steps are required. Just launch it as you would with any other model.
 ```
 python server.py --listen  --no-stream --model RWKV-4-Pile-169M-20220807-8023.pth
 ```
 ## Setting a custom strategy
 It is possible to have very fine control over the offloading and precision for the model with the `--rwkv-strategy` flag. Possible values include:
 ```
 "cpu fp32" # CPU mode
 "cuda fp16" # GPU mode with float16 precision
 "cuda fp16 *30 -> cpu fp32" # GPU+CPU offloading. The higher the number after *, the higher the GPU allocation.
 "cuda fp16i8" # GPU mode with 8-bit precision
 ```
 See the README for the PyPl package for more details: https://pypi.org/project/rwkv/
 ## Compiling the CUDA kernel
 You can compile the CUDA kernel for the model with `--rwkv-cuda-on`. This should improve the performance a lot but I haven't been able to get it to work yet.
--- a/docs/Spell-book.md
+++ b/docs/Spell-book.md
@ -0,0 +1,107 @@
 You have now entered a hidden corner of the internet.
 A confusing yet intriguing realm of paradoxes and contradictions.
 A place where you will find out that what you thought you knew, you in fact didn't know, and what you didn't know was in front of you all along.
 ![](https://i.pinimg.com/originals/6e/e2/7b/6ee27bad351d3aca470d80f1033ba9c6.jpg)
 *In other words, here I will document little-known facts about this web UI that I could not find another place for in the wiki.*
 #### You can train LoRAs in CPU mode
 Load the web UI with
 ```
 python server.py --cpu
 ```
 and start training the LoRA from the training tab as usual.
 #### 8-bit mode works with CPU offloading
 ```
 python server.py --load-in-8bit --gpu-memory 4000MiB
 ```
 #### `--pre_layer`, and not `--gpu-memory`, is the right way to do CPU offloading with 4-bit models
 ```
 python server.py --wbits 4 --groupsize 128 --pre_layer 20
 ```
 #### Models can be loaded in 32-bit, 16-bit, 8-bit, and 4-bit modes
 ```
 python server.py --cpu
 python server.py
 python server.py --load-in-8bit
 python server.py --wbits 4
 ```
 #### The web UI works with any version of GPTQ-for-LLaMa
 Including the up to date triton and cuda branches. But you have to delete the `repositories/GPTQ-for-LLaMa` folder and reinstall the new one every time:
 ```
 cd text-generation-webui/repositories
 rm -r GPTQ-for-LLaMa
 pip uninstall quant-cuda
 git clone https://github.com/oobabooga/GPTQ-for-LLaMa -b cuda # or any other repository and branch
 cd GPTQ-for-LLaMa
 python setup_cuda.py install
 ```
 #### Instruction-following templates are represented as chat characters
 https://github.com/oobabooga/text-generation-webui/tree/main/characters/instruction-following
 #### The right way to run Alpaca, Open Assistant, Vicuna, etc is Instruct mode, not normal chat mode
 Otherwise the prompt will not be formatted correctly.
 1. Start the web UI with
 ```
 python server.py --chat
 ```
 2. Click on the "instruct" option under "Chat modes"
 3. Select the correct template in the hidden dropdown menu that will become visible. 
 #### Notebook mode is best mode
 Ascended individuals have realized that notebook mode is the superset of chat mode and can do chats with ultimate flexibility, including group chats, editing replies, starting a new bot reply in a given way, and impersonating.
 #### RWKV is a RNN 
 Most models are transformers, but not RWKV, which is a RNN. It's a great model.
 #### `--gpu-memory` is not a hard limit on the GPU memory
 It is simply a parameter that is passed to the `accelerate` library while loading the model. More memory will be allocated during generation. That's why this parameter has to be set to less than your total GPU memory.
 #### Contrastive search perhaps the best preset
 But it uses a ton of VRAM.
 #### You can check the sha256sum of downloaded models with the download script
 ```
 python download-model.py facebook/galactica-125m --check
 ```
 #### The download script continues interrupted downloads by default
 It doesn't start over.
 #### You can download models with multiple threads
 ```
 python download-model.py facebook/galactica-125m --threads 8
 ```
 #### LoRAs work in 4-bit mode
 You need to follow [these instructions](GPTQ-models-(4-bit-mode).md#using-loras-in-4-bit-mode) and then start the web UI with the `--monkey-patch` flag.
--- a/docs/System-requirements.md
+++ b/docs/System-requirements.md
@ -0,0 +1,42 @@
 These are the VRAM and RAM requirements (in MiB) to run some examples of models **in 16-bit (default) precision**:
 | model                  |   VRAM (GPU) |     RAM |
 |:-----------------------|-------------:|--------:|
 | arxiv_ai_gpt2          |      1512.37 | 5824.2  |
 | blenderbot-1B-distill  |      2441.75 | 4425.91 |
 | opt-1.3b               |      2509.61 | 4427.79 |
 | gpt-neo-1.3b           |      2605.27 | 5851.58 |
 | opt-2.7b               |      5058.05 | 4863.95 |
 | gpt4chan_model_float16 |     11653.7  | 4437.71 |
 | gpt-j-6B               |     11653.7  | 5633.79 |
 | galactica-6.7b         |     12697.9  | 4429.89 |
 | opt-6.7b               |     12700    | 4368.66 |
 | bloomz-7b1-p3          |     13483.1  | 4470.34 |
 #### GPU mode with 8-bit precision
 Allows you to load models that would not normally fit into your GPU. Enabled by default for 13b and 20b models in this web UI.
 | model          |   VRAM (GPU) |     RAM |
 |:---------------|-------------:|--------:|
 | opt-13b        |      12528.1 | 1152.39 |
 | gpt-neox-20b   |      20384   | 2291.7  |
 #### CPU mode (32-bit precision)
 A lot slower, but does not require a GPU. 
 On my i5-12400F, 6B models take around 10-20 seconds to respond in chat mode, and around 5 minutes to generate a 200 tokens completion. 
 | model                  |      RAM |
 |:-----------------------|---------:|
 | arxiv_ai_gpt2          |  4430.82 |
 | gpt-neo-1.3b           |  6089.31 |
 | opt-1.3b               |  8411.12 |
 | blenderbot-1B-distill  |  8508.16 |
 | opt-2.7b               | 14969.3  |
 | bloomz-7b1-p3          | 21371.2  |
 | gpt-j-6B               | 24200.3  |
 | gpt4chan_model         | 24246.3  |
 | galactica-6.7b         | 26561.4  |
 | opt-6.7b               | 29596.6  |
--- a/docs/Using-LoRAs.md
+++ b/docs/Using-LoRAs.md
@ -0,0 +1,89 @@
 Based on https://github.com/tloen/alpaca-lora
 ## Instructions
 1. Download a LoRA, for instance:
 ```
 python download-model.py tloen/alpaca-lora-7b
 ```
 2. Load the LoRA. 16-bit, 8-bit, and CPU modes work:
 ```
 python server.py --model llama-7b-hf --lora alpaca-lora-7b
 python server.py --model llama-7b-hf --lora alpaca-lora-7b --load-in-8bit
 python server.py --model llama-7b-hf --lora alpaca-lora-7b --cpu
 ```
 * For using LoRAs in 4-bit mode, follow [these special instructions](GPTQ-models-(4-bit-mode).md#using-loras-in-4-bit-mode).
 * Instead of using the `--lora` command-line flag, you can also select the LoRA in the "Parameters" tab of the interface.
 ## Prompt
 For the Alpaca LoRA in particular, the prompt must be formatted like this:
 ```
 Below is an instruction that describes a task. Write a response that appropriately completes the request.
 ### Instruction:
 Write a Python script that generates text using the transformers library.
 ### Response:
 ```
 Sample output:
 ```
 Below is an instruction that describes a task. Write a response that appropriately completes the request.
 ### Instruction:
 Write a Python script that generates text using the transformers library.
 ### Response:
 import transformers
 from transformers import AutoTokenizer, AutoModelForCausalLM
 tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
 model = AutoModelForCausalLM.from_pretrained("bert-base-uncased")
 texts = ["Hello world", "How are you"]
 for sentence in texts:
 sentence = tokenizer(sentence)
 print(f"Generated {len(sentence)} tokens from '{sentence}'")
 output = model(sentences=sentence).predict()
 print(f"Predicted {len(output)} tokens for '{sentence}':\n{output}")
 ```
 ## Training a LoRA
 The Training tab in the interface can be used to train a LoRA. The parameters are self-documenting and good defaults are included.
 You can interrupt and resume LoRA training in this tab. If the name and rank are the same, training will resume using the `adapter_model.bin` in your LoRA folder. You can resume from a past checkpoint by replacing this file using the contents of one of the checkpoint folders. Note that the learning rate and steps will be reset, and you may want to set the learning rate to the last reported rate in the console output.
 LoRA training was contributed by [mcmonkey4eva](https://github.com/mcmonkey4eva) in PR [#570](https://github.com/oobabooga/text-generation-webui/pull/570).
 #### Using the original alpaca-lora code
 Kept here for reference. The Training tab has much more features than this method.
 ```
 conda activate textgen
 git clone https://github.com/tloen/alpaca-lora
 ```
 Edit those two lines in `alpaca-lora/finetune.py` to use your existing model folder instead of downloading everything from decapoda:
 ```
 model = LlamaForCausalLM.from_pretrained(
    "models/llama-7b",
    load_in_8bit=True,
    device_map="auto",
 )
 tokenizer = LlamaTokenizer.from_pretrained(
    "models/llama-7b", add_eos_token=True
 )
 ```
 Run the script with:
 ```
 python finetune.py
 ```
 It just works. It runs at 22.32s/it, with 1170 iterations in total, so about 7 hours and a half for training a LoRA. RTX 3090, 18153MiB VRAM used, drawing maximum power (350W, room heater mode).
--- a/docs/WSL-installation-guide.md
+++ b/docs/WSL-installation-guide.md
@ -0,0 +1,73 @@
 Guide created by [@jfryton](https://github.com/jfryton). Thank you jfryton.
 -----
 Here's an easy-to-follow, step-by-step guide for installing Windows Subsystem for Linux (WSL) with Ubuntu on Windows 10/11:
 ## Step 1: Enable WSL
 1. Press the Windows key + X and click on "Windows PowerShell (Admin)" or "Windows Terminal (Admin)" to open PowerShell or Terminal with administrator privileges.
 2. In the PowerShell window, type the following command and press Enter:
 ```
 wsl --install
 ```
 If this command doesn't work, you can enable WSL with the following command for Windows 10:
 ```
 wsl --set-default-version 1
 ```
 For Windows 11, you can use:
 ```
 wsl --set-default-version 2
 ```
 You may be prompted to restart your computer. If so, save your work and restart.
 ## Step 2: Install Ubuntu
 1. Open the Microsoft Store.
 2. Search for "Ubuntu" in the search bar.
 3. Choose the desired Ubuntu version (e.g., Ubuntu 20.04 LTS) and click "Get" or "Install" to download and install the Ubuntu app.
 4. Once the installation is complete, click "Launch" or search for "Ubuntu" in the Start menu and open the app.
 ## Step 3: Set up Ubuntu
 1. When you first launch the Ubuntu app, it will take a few minutes to set up. Be patient as it installs the necessary files and sets up your environment.
 2. Once the setup is complete, you will be prompted to create a new UNIX username and password. Choose a username and password, and make sure to remember them, as you will need them for future administrative tasks within the Ubuntu environment.
 ## Step 4: Update and upgrade packages
 1. After setting up your username and password, it's a good idea to update and upgrade your Ubuntu system. Run the following commands in the Ubuntu terminal:
 ```
 sudo apt update
 sudo apt upgrade
 ```
 2. Enter your password when prompted. This will update the package list and upgrade any outdated packages.
 Congratulations! You have now installed WSL with Ubuntu on your Windows 10/11 system. You can use the Ubuntu terminal for various tasks, like running Linux commands, installing packages, or managing files.
 You can launch your WSL Ubuntu installation by selecting the Ubuntu app (like any other program installed on your computer) or typing 'ubuntu' into Powershell or Terminal.
 ## Step 5: Proceed with Linux instructions
 1. You can now follow the Linux setup instructions. If you receive any error messages about a missing tool or package, just install them using apt:
 ```
 sudo apt install [missing package]
 ```
 If you face any issues or need to troubleshoot, you can always refer to the official Microsoft documentation for WSL: https://docs.microsoft.com/en-us/windows/wsl/
 ## Bonus: Port Forwarding
 By default, you won't be able to access the webui from another device on your local network. You will need to setup the appropriate port forwarding using the following command (using PowerShell or Terminal with administrator privileges). 
 ```
 netsh interface portproxy add v4tov4 listenaddress=0.0.0.0 listenport=7860 connectaddress=localhost connectport=7860
 ```
--- a/docs/Windows-installation-guide.md
+++ b/docs/Windows-installation-guide.md
@ -0,0 +1,9 @@
 If you are having trouble following the installation instructions in the README, Reddit user [Technical_Leather949](https://www.reddit.com/user/Technical_Leather949/) has created a more detailed, step-by-step guide covering:
 * Windows installation
 * 8-bit mode on Windows
 * LLaMA
 * LLaMA 4-bit
 The guide can be found here: https://www.reddit.com/r/LocalLLaMA/comments/11o6o3f/how_to_install_llama_8bit_and_4bit/
--- a/docs/llama.cpp-models.md
+++ b/docs/llama.cpp-models.md
@ -0,0 +1,17 @@
 ## Using llama.cpp in the web UI
 #### Pre-converted models
 Simply place the model in the `models` folder, making sure that its name contains `ggml` somewhere and ends in `.bin`.
 #### Convert LLaMA yourself
 Follow the instructions in the llama.cpp README to generate the `ggml-model-q4_0.bin` file: https://github.com/ggerganov/llama.cpp#usage
 ## Performance
 This was the performance of llama-7b int4 on my i5-12400F:
 > Output generated in 33.07 seconds (6.05 tokens/s, 200 tokens, context 17)
 You can change the number of threads with `--threads N`.
--- a/download-model.py
+++ b/download-model.py
@ -41,13 +41,17 @@ def select_model_from_default_options():
        char = chr(ord('A') + i)
        choices[char] = name
        print(f"{char}) {name}")
-    char = chr(ord('A') + len(models))
+    char_hugging = chr(ord('A') + len(models))
-    print(f"{char}) None of the above")
+    print(f"{char_hugging}) Manually specify a Hugging Face model")
    char_exit = chr(ord('A') + len(models) + 1)
    print(f"{char_exit}) Do not download a model")
    print()
    print("Input> ", end='')
    choice = input()[0].strip().upper()
-    if choice == char:
+    if choice == char_exit:
        exit()
    elif choice == char_hugging:
        print("""\nThen type the name of your desired Hugging Face model in the format organization/name.
 Examples:
--- a/extensions/api/script.py
+++ b/extensions/api/script.py
@ -59,6 +59,7 @@ class Handler(BaseHTTPRequestHandler):
                'truncation_length': int(body.get('truncation_length', 2048)),
                'ban_eos_token': bool(body.get('ban_eos_token', False)),
                'skip_special_tokens': bool(body.get('skip_special_tokens', True)),
                'custom_stopping_strings': '',  # leave this blank
                'stopping_strings': body.get('stopping_strings', []),
            }
            stopping_strings = generate_params.pop('stopping_strings')
@ -72,10 +73,11 @@ class Handler(BaseHTTPRequestHandler):
            response = json.dumps({
                'results': [{
-                    'text': answer[len(prompt):]
+                    'text': answer if shared.is_chat() else answer[len(prompt):]
                }]
            })
            self.wfile.write(response.encode('utf-8'))
        elif self.path == '/api/v1/token-count':
            # Not compatible with KoboldAI api
            self.send_response(200)
@ -89,6 +91,7 @@ class Handler(BaseHTTPRequestHandler):
                }]
            })
            self.wfile.write(response.encode('utf-8'))
        else:
            self.send_error(404)
--- a/extensions/sd_api_pictures/README.MD
+++ b/extensions/sd_api_pictures/README.MD
@ -1,7 +1,7 @@
 ## Description:
 TL;DR: Lets the bot answer you with a picture!  
-Stable Diffusion API pictures for TextGen, v.1.1.1  
+Stable Diffusion API pictures for TextGen, v.1.2.0  
 An extension to [oobabooga's textgen-webui](https://github.com/oobabooga/text-generation-webui) allowing you to receive pics generated by [Automatic1111's SD-WebUI API](https://github.com/AUTOMATIC1111/stable-diffusion-webui)
 <details>
--- a/extensions/sd_api_pictures/script.py
+++ b/extensions/sd_api_pictures/script.py
@ -25,7 +25,11 @@ params = {
    'negative_prompt': '(worst quality, low quality:1.3)',
    'width': 512,
    'height': 512,
    'denoising_strength': 0.61,
    'restore_faces': False,
    'enable_hr': False,
    'hr_upscaler': 'ESRGAN_4x',
    'hr_scale': '1.0',
    'seed': -1,
    'sampler_name': 'DDIM',
    'steps': 32,
@ -74,7 +78,6 @@ SD_models = ['NeverEndingDream']  # TODO: get with http://{address}}/sdapi/v1/sd
 streaming_state = shared.args.no_stream  # remember if chat streaming was enabled
 picture_response = False  # specifies if the next model response should appear as a picture
 def remove_surrounded_chars(string):
    # this expression matches to 'as few symbols as possible (0 upwards) between any asterisks' OR
    # 'as few symbols as possible (0 upwards) between an asterisk and the end of the string'
@ -123,11 +126,16 @@ def get_SD_pictures(description):
        "prompt": params['prompt_prefix'] + description,
        "seed": params['seed'],
        "sampler_name": params['sampler_name'],
        "enable_hr": params['enable_hr'],
        "hr_scale": params['hr_scale'],
        "hr_upscaler": params['hr_upscaler'],
        "denoising_strength": params['denoising_strength'],
        "steps": params['steps'],
        "cfg_scale": params['cfg_scale'],
        "width": params['width'],
        "height": params['height'],
        "restore_faces": params['restore_faces'],
        "override_settings_restore_afterwards": True,
        "negative_prompt": params['negative_prompt']
    }
@ -246,15 +254,15 @@ def SD_api_address_update(address):
    return gr.Textbox.update(label=msg)
 def ui():
    # Gradio elements
    # gr.Markdown('### Stable Diffusion API Pictures') # Currently the name of extension is shown as the title
-    with gr.Accordion("Parameters", open=True):
+    with gr.Accordion("Parameters", open=True, elem_classes="SDAP"):
        with gr.Row():
            address = gr.Textbox(placeholder=params['address'], value=params['address'], label='Auto1111\'s WebUI address')
-            mode = gr.Dropdown(["Manual", "Immersive/Interactive", "Picturebook/Adventure"], value="Manual", label="Mode of operation", type="index")
+            modes_list = ["Manual", "Immersive/Interactive", "Picturebook/Adventure"]
            mode = gr.Dropdown(modes_list, value=modes_list[params['mode']], label="Mode of operation", type="index")
            with gr.Column(scale=1, min_width=300):
                manage_VRAM = gr.Checkbox(value=params['manage_VRAM'], label='Manage VRAM')
                save_img = gr.Checkbox(value=params['save_img'], label='Keep original images and use them in chat')
@ -264,17 +272,25 @@ def ui():
        with gr.Accordion("Generation parameters", open=False):
            prompt_prefix = gr.Textbox(placeholder=params['prompt_prefix'], value=params['prompt_prefix'], label='Prompt Prefix (best used to describe the look of the character)')
            with gr.Row():
                with gr.Column():
            negative_prompt = gr.Textbox(placeholder=params['negative_prompt'], value=params['negative_prompt'], label='Negative Prompt')
-                    sampler_name = gr.Textbox(placeholder=params['sampler_name'], value=params['sampler_name'], label='Sampler')
+            with gr.Row():
                with gr.Column():
                    width = gr.Slider(256, 768, value=params['width'], step=64, label='Width')
                    height = gr.Slider(256, 768, value=params['height'], step=64, label='Height')
                with gr.Column():
                    sampler_name = gr.Textbox(placeholder=params['sampler_name'], value=params['sampler_name'], label='Sampling method', elem_id="sampler_box")
                    steps = gr.Slider(1, 150, value=params['steps'], step=1, label="Sampling steps")
            with gr.Row():
-                steps = gr.Number(label="Steps:", value=params['steps'])
+                seed = gr.Number(label="Seed", value=params['seed'], elem_id="seed_box")
-                seed = gr.Number(label="Seed:", value=params['seed'])
+                cfg_scale = gr.Number(label="CFG Scale", value=params['cfg_scale'], elem_id="cfg_box")
-                cfg_scale = gr.Number(label="CFG Scale:", value=params['cfg_scale'])
+                with gr.Column() as hr_options:
                    restore_faces = gr.Checkbox(value=params['restore_faces'], label='Restore faces')
                    enable_hr = gr.Checkbox(value=params['enable_hr'], label='Hires. fix')                    
            with gr.Row(visible=params['enable_hr'], elem_classes="hires_opts") as hr_options:
                    hr_scale = gr.Slider(1, 4, value=params['hr_scale'], step=0.1, label='Upscale by')
                    denoising_strength = gr.Slider(0, 1, value=params['denoising_strength'], step=0.01, label='Denoising strength')
                    hr_upscaler = gr.Textbox(placeholder=params['hr_upscaler'], value=params['hr_upscaler'], label='Upscaler')                    
    # Event functions to update the parameters in the backend
    address.change(lambda x: params.update({"address": filter_address(x)}), address, None)
@ -289,6 +305,12 @@ def ui():
    negative_prompt.change(lambda x: params.update({"negative_prompt": x}), negative_prompt, None)
    width.change(lambda x: params.update({"width": x}), width, None)
    height.change(lambda x: params.update({"height": x}), height, None)
    hr_scale.change(lambda x: params.update({"hr_scale": x}), hr_scale, None)
    denoising_strength.change(lambda x: params.update({"denoising_strength": x}), denoising_strength, None)
    restore_faces.change(lambda x: params.update({"restore_faces": x}), restore_faces, None)
    hr_upscaler.change(lambda x: params.update({"hr_upscaler": x}), hr_upscaler, None)
    enable_hr.change(lambda x: params.update({"enable_hr": x}), enable_hr, None)
    enable_hr.change(lambda x: hr_options.update(visible=params["enable_hr"]), enable_hr, hr_options)
    sampler_name.change(lambda x: params.update({"sampler_name": x}), sampler_name, None)
    steps.change(lambda x: params.update({"steps": x}), steps, None)
--- a/modules/GPTQ_loader.py
+++ b/modules/GPTQ_loader.py
@ -12,7 +12,11 @@ import modules.shared as shared
 sys.path.insert(0, str(Path("repositories/GPTQ-for-LLaMa")))
 import llama_inference_offload
 try:
    from modelutils import find_layers
 except ImportError:
    from utils import find_layers
 try:
    from quant import make_quant
@ -75,14 +79,14 @@ def _load_quant(model, checkpoint, wbits, groupsize=-1, faster_kernel=False, exc
        model.load_state_dict(torch.load(checkpoint), strict=False)
    if is_triton:
-        if not shared.args.no_quant_attn:
+        if shared.args.quant_attn:
            quant.make_quant_attn(model)
-        if eval and not shared.args.no_fused_mlp:
+        if eval and shared.args.fused_mlp:
            quant.make_fused_mlp(model)
-        if not shared.args.no_warmup_autotune:
+        if shared.args.warmup_autotune:
            quant.autotune_warmup_linear(model, transpose=not eval)
-            if eval and not shared.args.no_fused_mlp:
+            if eval and shared.args.fused_mlp:
                quant.autotune_warmup_fused(model)
    model.seqlen = 2048
--- a/modules/evaluate.py
+++ b/modules/evaluate.py
@ -0,0 +1,142 @@
 import datetime
 import traceback
 from pathlib import Path
 import pandas as pd
 import torch
 from datasets import load_dataset
 from tqdm import tqdm
 from modules import shared
 from modules.models import load_model, unload_model
 from modules.text_generation import encode
 from server import get_model_specific_settings, update_model_parameters
 def load_past_evaluations():
    if Path('logs/evaluations.csv').exists():
        df = pd.read_csv(Path('logs/evaluations.csv'), dtype=str)
        df['Perplexity'] = pd.to_numeric(df['Perplexity'])
        return df
    else:
        return pd.DataFrame(columns=['Model', 'LoRAs', 'Dataset', 'Perplexity', 'stride', 'max_length', 'Date', 'Comment'])
 past_evaluations = load_past_evaluations()
 def save_past_evaluations(df):
    global past_evaluations
    past_evaluations = df
    df.to_csv(Path('logs/evaluations.csv'), index=False)
 def calculate_perplexity(models, input_dataset, stride, _max_length):
    '''
    Based on:
    https://huggingface.co/docs/transformers/perplexity#calculating-ppl-with-fixedlength-models
    '''
    global past_evaluations
    cumulative_log = ''
    cumulative_log += "Loading the input dataset...\n"
    yield cumulative_log
    # Copied from https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/triton/utils/datautils.py
    if input_dataset == 'wikitext':
        data = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
        text = "\n\n".join(data['text'])
    elif input_dataset == 'ptb':
        data = load_dataset('ptb_text_only', 'penn_treebank', split='validation')
        text = "\n\n".join(data['sentence'])
    elif input_dataset == 'ptb_new':
        data = load_dataset('ptb_text_only', 'penn_treebank', split='test')
        text = " ".join(data['sentence'])
    else:
        with open(Path(f'training/datasets/{input_dataset}.txt'), 'r', encoding='utf-8') as f:
            text = f.read()
    for model in models:
        if is_in_past_evaluations(model, input_dataset, stride, _max_length):
            cumulative_log += f"{model} has already been tested. Ignoring.\n"
            yield cumulative_log
            continue
        if model != 'current model':
            try:
                yield cumulative_log + f"Loading {model}...\n"
                model_settings = get_model_specific_settings(model)
                shared.settings.update(model_settings)  # hijacking the interface defaults
                update_model_parameters(model_settings)  # hijacking the command-line arguments
                shared.model_name = model
                unload_model()
                shared.model, shared.tokenizer = load_model(shared.model_name)
            except:
                cumulative_log += f"Failed to load {model}. Moving on.\n"
                yield cumulative_log
                continue
        cumulative_log += f"Processing {model}...\n"
        yield cumulative_log + "Tokenizing the input dataset...\n"
        encodings = encode(text, add_special_tokens=False)
        seq_len = encodings.shape[1]
        max_length = _max_length or shared.model.config.max_position_embeddings
        nlls = []
        prev_end_loc = 0
        for begin_loc in tqdm(range(0, seq_len, stride)):
            yield cumulative_log + f"Evaluating... {100*begin_loc/seq_len:.2f}%"
            end_loc = min(begin_loc + max_length, seq_len)
            trg_len = end_loc - prev_end_loc  # may be different from stride on last loop
            input_ids = encodings[:, begin_loc:end_loc]
            target_ids = input_ids.clone()
            target_ids[:, :-trg_len] = -100
            with torch.no_grad():
                outputs = shared.model(input_ids, labels=target_ids)
                # loss is calculated using CrossEntropyLoss which averages over valid labels
                # N.B. the model only calculates loss over trg_len - 1 labels, because it internally shifts the labels
                # to the left by 1.
                neg_log_likelihood = outputs.loss
            nlls.append(neg_log_likelihood)
            prev_end_loc = end_loc
            if end_loc == seq_len:
                break
        ppl = torch.exp(torch.stack(nlls).mean())
        add_entry_to_past_evaluations(float(ppl), shared.model_name, input_dataset, stride, _max_length)
        save_past_evaluations(past_evaluations)
        cumulative_log += f"Done. The perplexity is: {float(ppl)}\n\n"
        yield cumulative_log
 def add_entry_to_past_evaluations(perplexity, model, dataset, stride, max_length):
    global past_evaluations
    entry = {
        'Model': model,
        'LoRAs': ', '.join(shared.lora_names) or '-',
        'Dataset': dataset,
        'Perplexity': perplexity,
        'stride': str(stride),
        'max_length': str(max_length),
        'Date': datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
        'Comment': ''
    }
    past_evaluations = pd.concat([past_evaluations, pd.DataFrame([entry])], ignore_index=True)
 def is_in_past_evaluations(model, dataset, stride, max_length):
    entries = past_evaluations[(past_evaluations['Model'] == model) &
                               (past_evaluations['Dataset'] == dataset) &
                               (past_evaluations['max_length'] == str(max_length)) &
                               (past_evaluations['stride'] == str(stride))]
    if entries.shape[0] > 0:
        return True
    else:
        return False
 def generate_markdown_table():
    sorted_df = past_evaluations.sort_values(by=['Dataset', 'stride', 'Perplexity', 'Date'])
    return sorted_df
--- a/modules/models.py
+++ b/modules/models.py
@ -38,13 +38,30 @@ if shared.args.deepspeed:
    dschf = HfDeepSpeedConfig(ds_config)  # Keep this object alive for the Transformers integration
 def find_model_type(model_name):
    model_name = model_name.lower()
    if 'rwkv-' in model_name:
        return 'rwkv'
    elif len(list(Path(f'{shared.args.model_dir}/{model_name}').glob('*ggml*.bin'))) > 0:
        return 'llamacpp'
    elif re.match('.*ggml.*\.bin', model_name):
        return 'llamacpp'
    elif 'chatglm' in model_name:
        return 'chatglm'
    elif 'galactica' in model_name:
        return 'galactica'
    elif any((k in model_name for k in ['gpt4chan', 'gpt-4chan'])):
        return 'gpt4chan'
    else:
        return 'HF_generic'
 def load_model(model_name):
    print(f"Loading {model_name}...")
    t0 = time.time()
-    shared.is_RWKV = 'rwkv-' in model_name.lower()
+    shared.model_type = find_model_type(model_name)
-    shared.is_llamacpp = len(list(Path(f'{shared.args.model_dir}/{model_name}').glob('ggml*.bin'))) > 0
+    if shared.model_type == 'chatglm':
    if 'chatglm' in model_name.lower():
        LoaderClass = AutoModel
        trust_remote_code = shared.args.trust_remote_code
    else:
@ -52,8 +69,8 @@ def load_model(model_name):
        trust_remote_code = False
    # Load the model in simple 16-bit mode by default
-    if not any([shared.args.cpu, shared.args.load_in_8bit, shared.args.wbits, shared.args.auto_devices, shared.args.disk, shared.args.gpu_memory is not None, shared.args.cpu_memory is not None, shared.args.deepspeed, shared.args.flexgen, shared.is_RWKV, shared.is_llamacpp]):
+    if not any([shared.args.cpu, shared.args.load_in_8bit, shared.args.wbits, shared.args.auto_devices, shared.args.disk, shared.args.gpu_memory is not None, shared.args.cpu_memory is not None, shared.args.deepspeed, shared.args.flexgen, shared.model_type in ['rwkv', 'llamacpp']]):
-        model = LoaderClass.from_pretrained(Path(f"{shared.args.model_dir}/{shared.model_name}"), low_cpu_mem_usage=True, torch_dtype=torch.bfloat16 if shared.args.bf16 else torch.float16, trust_remote_code=trust_remote_code)
+        model = LoaderClass.from_pretrained(Path(f"{shared.args.model_dir}/{model_name}"), low_cpu_mem_usage=True, torch_dtype=torch.bfloat16 if shared.args.bf16 else torch.float16, trust_remote_code=trust_remote_code)
        if torch.has_mps:
            device = torch.device('mps')
            model = model.to(device)
@ -81,17 +98,17 @@ def load_model(model_name):
                            num_bits=4, group_size=64,
                            group_dim=2, symmetric=False))
-        model = OptLM(f"facebook/{shared.model_name}", env, shared.args.model_dir, policy)
+        model = OptLM(f"facebook/{model_name}", env, shared.args.model_dir, policy)
    # DeepSpeed ZeRO-3
    elif shared.args.deepspeed:
-        model = LoaderClass.from_pretrained(Path(f"{shared.args.model_dir}/{shared.model_name}"), torch_dtype=torch.bfloat16 if shared.args.bf16 else torch.float16)
+        model = LoaderClass.from_pretrained(Path(f"{shared.args.model_dir}/{model_name}"), torch_dtype=torch.bfloat16 if shared.args.bf16 else torch.float16)
        model = deepspeed.initialize(model=model, config_params=ds_config, model_parameters=None, optimizer=None, lr_scheduler=None)[0]
        model.module.eval()  # Inference
        print(f"DeepSpeed ZeRO-3 is enabled: {is_deepspeed_zero3_enabled()}")
    # RMKV model (not on HuggingFace)
-    elif shared.is_RWKV:
+    elif shared.model_type == 'rwkv':
        from modules.RWKV import RWKVModel, RWKVTokenizer
        model = RWKVModel.from_pretrained(Path(f'{shared.args.model_dir}/{model_name}'), dtype="fp32" if shared.args.cpu else "bf16" if shared.args.bf16 else "fp16", device="cpu" if shared.args.cpu else "cuda")
@ -100,12 +117,16 @@ def load_model(model_name):
        return model, tokenizer
    # llamacpp model
-    elif shared.is_llamacpp:
+    elif shared.model_type == 'llamacpp':
        from modules.llamacpp_model_alternative import LlamaCppModel
-        model_file = list(Path(f'{shared.args.model_dir}/{model_name}').glob('ggml*.bin'))[0]
+        path = Path(f'{shared.args.model_dir}/{model_name}')
-        print(f"llama.cpp weights detected: {model_file}\n")
+        if path.is_file():
            model_file = path
        else:
            model_file = list(Path(f'{shared.args.model_dir}/{model_name}').glob('*ggml*.bin'))[0]
        print(f"llama.cpp weights detected: {model_file}\n")
        model, tokenizer = LlamaCppModel.from_pretrained(model_file)
        return model, tokenizer
@ -169,7 +190,7 @@ def load_model(model_name):
            if shared.args.disk:
                params["offload_folder"] = shared.args.disk_cache_dir
-        checkpoint = Path(f'{shared.args.model_dir}/{shared.model_name}')
+        checkpoint = Path(f'{shared.args.model_dir}/{model_name}')
        if shared.args.load_in_8bit and params.get('max_memory', None) is not None and params['device_map'] == 'auto':
            config = AutoConfig.from_pretrained(checkpoint)
@ -190,7 +211,7 @@ def load_model(model_name):
        llama_attn_hijack.hijack_llama_attention()
    # Loading the tokenizer
-    if any((k in shared.model_name.lower() for k in ['gpt4chan', 'gpt-4chan'])) and Path(f"{shared.args.model_dir}/gpt-j-6B/").exists():
+    if shared.model_type == 'gpt4chan' and Path(f"{shared.args.model_dir}/gpt-j-6B/").exists():
        tokenizer = AutoTokenizer.from_pretrained(Path(f"{shared.args.model_dir}/gpt-j-6B/"))
    elif type(model) is transformers.LlamaForCausalLM:
        tokenizer = None
@ -205,7 +226,7 @@ def load_model(model_name):
        # Otherwise, load it from the model folder and hope that these
        # are not outdated tokenizer files.
        if tokenizer is None:
-            tokenizer = LlamaTokenizer.from_pretrained(Path(f"{shared.args.model_dir}/{shared.model_name}/"), clean_up_tokenization_spaces=True)
+            tokenizer = LlamaTokenizer.from_pretrained(Path(f"{shared.args.model_dir}/{model_name}/"), clean_up_tokenization_spaces=True)
            try:
                tokenizer.eos_token_id = 2
                tokenizer.bos_token_id = 1
@ -213,7 +234,7 @@ def load_model(model_name):
            except:
                pass
    else:
-        tokenizer = AutoTokenizer.from_pretrained(Path(f"{shared.args.model_dir}/{shared.model_name}/"), trust_remote_code=trust_remote_code)
+        tokenizer = AutoTokenizer.from_pretrained(Path(f"{shared.args.model_dir}/{model_name}/"), trust_remote_code=trust_remote_code)
    print(f"Loaded the model in {(time.time()-t0):.2f} seconds.")
    return model, tokenizer
--- a/modules/shared.py
+++ b/modules/shared.py
@ -6,11 +6,10 @@ import yaml
 model = None
 tokenizer = None
 model_name = "None"
 model_type = None
 lora_names = []
 soft_prompt_tensor = None
 soft_prompt = False
 is_RWKV = False
 is_llamacpp = False
 # Chat variables
 history = {'internal': [], 'visible': []}
@ -124,9 +123,9 @@ parser.add_argument('--model_type', type=str, help='Model type of pre-quantized
 parser.add_argument('--groupsize', type=int, default=-1, help='Group size.')
 parser.add_argument('--pre_layer', type=int, default=0, help='The number of layers to allocate to the GPU. Setting this parameter enables CPU offloading for 4-bit models.')
 parser.add_argument('--monkey-patch', action='store_true', help='Apply the monkey patch for using LoRAs with quantized models.')
-parser.add_argument('--no-quant_attn', action='store_true', help='(triton) Disable quant attention. If you encounter incoherent results try disabling this.')
+parser.add_argument('--quant_attn', action='store_true', help='(triton) Enable quant attention.')
-parser.add_argument('--no-warmup_autotune', action='store_true', help='(triton) Disable warmup autotune.')
+parser.add_argument('--warmup_autotune', action='store_true', help='(triton) Enable warmup autotune.')
-parser.add_argument('--no-fused_mlp', action='store_true', help='(triton) Disable fused mlp. If you encounter "Unexpected mma -> mma layout conversion" try disabling this.')
+parser.add_argument('--fused_mlp', action='store_true', help='(triton) Enable fused mlp.')
 # FlexGen
 parser.add_argument('--flexgen', action='store_true', help='Enable the use of FlexGen offloading.')
--- a/modules/text_generation.py
+++ b/modules/text_generation.py
@ -24,7 +24,7 @@ def get_max_prompt_length(state):
 def encode(prompt, add_special_tokens=True, add_bos_token=True, truncation_length=None):
-    if any((shared.is_RWKV, shared.is_llamacpp)):
+    if shared.model_type in ['rwkv', 'llamacpp']:
        input_ids = shared.tokenizer.encode(str(prompt))
        input_ids = np.array(input_ids).reshape(1, len(input_ids))
        return input_ids
@ -44,7 +44,7 @@ def encode(prompt, add_special_tokens=True, add_bos_token=True, truncation_lengt
    if truncation_length is not None:
        input_ids = input_ids[:, -truncation_length:]
-    if any((shared.is_RWKV, shared.is_llamacpp, shared.args.cpu)):
+    if shared.model_type in ['rwkv', 'llamacpp'] or shared.args.cpu:
        return input_ids
    elif shared.args.flexgen:
        return input_ids.numpy()
@ -97,10 +97,10 @@ def fix_galactica(s):
 def formatted_outputs(reply, model_name):
    if not shared.is_chat():
-        if 'galactica' in model_name.lower():
+        if shared.model_type == 'galactica':
            reply = fix_galactica(reply)
            return reply, reply, generate_basic_html(reply)
-        elif any((k in shared.model_name.lower() for k in ['gpt4chan', 'gpt-4chan'])):
+        elif shared.model_type == 'gpt4chan':
            reply = fix_gpt4chan(reply)
            return reply, 'Only applicable for GALACTICA models.', generate_4chan_html(reply)
        else:
@ -142,7 +142,7 @@ def generate_reply(question, state, eos_token=None, stopping_strings=[]):
    # These models are not part of Hugging Face, so we handle them
    # separately and terminate the function call earlier
-    if any((shared.is_RWKV, shared.is_llamacpp)):
+    if shared.model_type in ['rwkv', 'llamacpp']:
        if shared.args.verbose:
            print(f'\n\n{question}\n--------------------\n')
--- a/modules/training.py
+++ b/modules/training.py
@ -10,9 +10,12 @@ import gradio as gr
 import torch
 import transformers
 from datasets import Dataset, load_dataset
-from peft import LoraConfig, get_peft_model, set_peft_model_state_dict, prepare_model_for_int8_training
+from peft import (LoraConfig, get_peft_model, prepare_model_for_int8_training,
                  set_peft_model_state_dict)
 from modules import shared, ui
 from modules.evaluate import calculate_perplexity, generate_markdown_table, save_past_evaluations
 from server import get_available_loras, get_available_models
 # This mapping is from a very recent commit, not yet released.
 # If not available, default to a backup map for the 3 safe model types.
@ -40,10 +43,6 @@ def get_datasets(path: str, ext: str):
    return ['None'] + sorted(set([k.stem for k in Path(path).glob(f'*.{ext}') if k.stem != 'put-trainer-datasets-here']), key=str.lower)
 def get_available_loras():
    return ['None'] + sorted([item.name for item in list(Path(shared.args.lora_dir).glob('*')) if not item.name.endswith(('.txt', '-np', '.pt', '.json'))], key=str.lower)
 def create_train_interface():
    with gr.Tab('Train LoRA', elem_id='lora-train-tab'):
        with gr.Row():
@ -82,9 +81,9 @@ def create_train_interface():
            eval_steps = gr.Number(label='Evaluate every n steps', value=100, info='If an evaluation dataset is given, test it every time this many steps pass.')
-        with gr.Tab(label='Raw Text File'):
+        with gr.Tab(label="Raw text file"):
            with gr.Row():
-                raw_text_file = gr.Dropdown(choices=get_datasets('training/datasets', 'txt'), value='None', label='Text File', info='The raw text file to use for training.')
+                raw_text_file = gr.Dropdown(choices=get_datasets('training/datasets', 'txt'), value='None', label='Text file', info='The raw text file to use for training.')
                ui.create_refresh_button(raw_text_file, lambda: None, lambda: {'choices': get_datasets('training/datasets', 'txt')}, 'refresh-button')
            with gr.Row():
@ -106,12 +105,49 @@ def create_train_interface():
        output = gr.Markdown(value="Ready")
    with gr.Tab('Perplexity evaluation', elem_id='evaluate-tab'):
        with gr.Row():
            with gr.Column():
                models = gr.Dropdown(get_available_models(), label='Models', multiselect=True)
                evaluate_text_file = gr.Dropdown(choices=['wikitext', 'ptb', 'ptb_new'] + get_datasets('training/datasets', 'txt')[1:], value='wikitext', label='Input dataset', info='The raw text file on which the model will be evaluated. The first options are automatically downloaded: wikitext, ptb, and ptb_new. The next options are your local text files under training/datasets.')
                with gr.Row():
                    stride_length = gr.Slider(label='Stride', minimum=1, maximum=2048, value=512, step=1, info='Used to make the evaluation faster at the cost of accuracy. 1 = slowest but most accurate. 512 is a common value.')
                    max_length = gr.Slider(label='max_length', minimum=0, maximum=8096, value=0, step=1, info='The context for each evaluation. If set to 0, the maximum context length for the model will be used.')
                with gr.Row():
                    start_current_evaluation = gr.Button("Evaluate loaded model")
                    start_evaluation = gr.Button("Evaluate selected models")
                    stop_evaluation = gr.Button("Interrupt")
            with gr.Column():
                evaluation_log = gr.Markdown(value = '')
        evaluation_table = gr.Dataframe(value=generate_markdown_table(), interactive=True)
        save_comments = gr.Button('Save comments')
    # Training events
    all_params = [lora_name, always_override, save_steps, micro_batch_size, batch_size, epochs, learning_rate, lr_scheduler_type, lora_rank, lora_alpha, lora_dropout, cutoff_len, dataset, eval_dataset, format, eval_steps, raw_text_file, overlap_len, newline_favor_len, do_shuffle, higher_rank_limit, warmup_steps, optimizer]
    copy_from.change(do_copy_params, [copy_from] + all_params, all_params)
    start_button.click(do_train, all_params, output)
    stop_button.click(do_interrupt, None, None, queue=False)
    higher_rank_limit.change(change_rank_limit, [higher_rank_limit], [lora_rank, lora_alpha])
    # Evaluation events. For some reason, the interrupt event
    # doesn't work with the .then() syntax, so I write them one
    # by one in this ugly but functional way.
    ev = start_evaluation.click(calculate_perplexity, [models, evaluate_text_file, stride_length, max_length], evaluation_log, show_progress=False)
    start_evaluation.click(generate_markdown_table, None, evaluation_table, show_progress=False)
    tmp = gr.State('')
    start_current_evaluation.click(lambda: ['current model'], None, tmp)
    ev_cur = start_current_evaluation.click(calculate_perplexity, [tmp, evaluate_text_file, stride_length, max_length], evaluation_log, show_progress=False)
    start_current_evaluation.click(generate_markdown_table, None, evaluation_table, show_progress=False)
    stop_evaluation.click(None, None, None, cancels=[ev, ev_cur], queue=False)
    save_comments.click(
        save_past_evaluations, evaluation_table, None).then(
        lambda: "Comments saved.", None, evaluation_log, show_progress=False)
 def do_interrupt():
    global WANT_INTERRUPT
@ -133,6 +169,7 @@ def do_copy_params(lora_name: str, *args):
            result.append(params[key])
        else:
            result.append(args[i])
    return result
@ -155,7 +192,8 @@ def clean_path(base_path: str, path: str):
 def do_train(lora_name: str, always_override: bool, save_steps: int, micro_batch_size: int, batch_size: int, epochs: int, learning_rate: str, lr_scheduler_type: str, lora_rank: int, lora_alpha: int, lora_dropout: float, cutoff_len: int, dataset: str, eval_dataset: str, format: str, eval_steps: int, raw_text_file: str, overlap_len: int, newline_favor_len: int, do_shuffle: bool, higher_rank_limit: bool, warmup_steps: int, optimizer: str):
    if shared.args.monkey_patch:
-        from monkeypatch.peft_tuners_lora_monkey_patch import replace_peft_model_with_gptq_lora_model
+        from monkeypatch.peft_tuners_lora_monkey_patch import \
            replace_peft_model_with_gptq_lora_model
        replace_peft_model_with_gptq_lora_model()
    global WANT_INTERRUPT
@ -300,6 +338,7 @@ def do_train(lora_name: str, always_override: bool, save_steps: int, micro_batch
            if '4bit' in str(type(m)):
                if m.is_v1_model:
                    m.zeros = m.zeros.half()
                m.scales = m.scales.half()
    class Tracked():
--- a/modules/ui.py
+++ b/modules/ui.py
@ -20,7 +20,9 @@ theme = gr.themes.Default(
    font_mono=['IBM Plex Mono', 'ui-monospace', 'Consolas', 'monospace'],
 ).set(
    border_color_primary='#c5c5d2',
-    button_large_padding='6px 12px'
+    button_large_padding='6px 12px',
    body_text_color_subdued='#484848',
    background_fill_secondary='#eaeaea'
 )
 def list_model_elements():
--- a/requirements.txt
+++ b/requirements.txt
@ -5,12 +5,13 @@ flexgen==0.1.7
 gradio==3.25.0
 markdown
 numpy
 pandas
 Pillow>=9.5.0
 pyyaml
 requests
 rwkv==0.7.3
 safetensors==0.3.0
 sentencepiece
 pyyaml
 tqdm
 git+https://github.com/huggingface/peft
 transformers==4.28.1
--- a/server.py
+++ b/server.py
@ -44,7 +44,8 @@ from modules import api, chat, shared, training, ui
 from modules.html_generator import chat_html_wrapper
 from modules.LoRA import add_lora_to_model
 from modules.models import load_model, load_soft_prompt, unload_model
-from modules.text_generation import generate_reply, stop_everything_event
+from modules.text_generation import (encode, generate_reply,
                                     stop_everything_event)
 def get_available_models():
@ -172,6 +173,11 @@ def load_prompt(fname):
            return text
 def count_tokens(text):
    tokens = len(encode(text)[0])
    return f'{tokens} tokens in the input.'
 def download_model_wrapper(repo_id):
    try:
        downloader = importlib.import_module("download-model")
@ -615,15 +621,10 @@ def create_interface():
                            shared.gradio['html'] = gr.HTML()
                        with gr.Row():
-                            with gr.Column():
+                            shared.gradio['Generate'] = gr.Button('Generate', variant='primary', elem_classes="small-button")
-                                with gr.Row():
+                            shared.gradio['Stop'] = gr.Button('Stop', elem_classes="small-button")
-                                    shared.gradio['Generate'] = gr.Button('Generate', variant='primary')
+                            shared.gradio['Undo'] = gr.Button('Undo', elem_classes="small-button")
-                                    shared.gradio['Stop'] = gr.Button('Stop')
+                            shared.gradio['Regenerate'] = gr.Button('Regenerate', elem_classes="small-button")
                                    shared.gradio['Undo'] = gr.Button('Undo')
                                    shared.gradio['Regenerate'] = gr.Button('Regenerate')
                            with gr.Column():
                                pass
                    with gr.Column(scale=1):
                        gr.HTML('<div style="padding-bottom: 13px"></div>')
@ -633,6 +634,7 @@ def create_interface():
                            ui.create_refresh_button(shared.gradio['prompt_menu'], lambda: None, lambda: {'choices': get_available_prompts()}, 'refresh-button')
                        shared.gradio['save_prompt'] = gr.Button('Save prompt')
                        shared.gradio['count_tokens'] = gr.Button('Count tokens')
                        shared.gradio['status'] = gr.Markdown('')
            with gr.Tab("Parameters", elem_id="parameters"):
@ -649,10 +651,11 @@ def create_interface():
                        shared.gradio['textbox'] = gr.Textbox(value=default_text, elem_classes="textbox_default", lines=27, label='Input')
                        shared.gradio['max_new_tokens'] = gr.Slider(minimum=shared.settings['max_new_tokens_min'], maximum=shared.settings['max_new_tokens_max'], step=1, label='max_new_tokens', value=shared.settings['max_new_tokens'])
                        with gr.Row():
-                            shared.gradio['Generate'] = gr.Button('Generate', variant='primary')
+                            shared.gradio['Generate'] = gr.Button('Generate', variant='primary', elem_classes="small-button")
-                            shared.gradio['Stop'] = gr.Button('Stop')
+                            shared.gradio['Stop'] = gr.Button('Stop', elem_classes="small-button")
-                            shared.gradio['Continue'] = gr.Button('Continue')
+                            shared.gradio['Continue'] = gr.Button('Continue', elem_classes="small-button")
-                            shared.gradio['save_prompt'] = gr.Button('Save prompt')
+                            shared.gradio['save_prompt'] = gr.Button('Save prompt', elem_classes="small-button")
                            shared.gradio['count_tokens'] = gr.Button('Count tokens', elem_classes="small-button")
                        with gr.Row():
                            with gr.Column():
@ -843,8 +846,9 @@ def create_interface():
                )
            shared.gradio['Stop'].click(stop_everything_event, None, None, queue=False, cancels=gen_events if shared.args.no_stream else None)
-            shared.gradio['prompt_menu'].change(load_prompt, [shared.gradio['prompt_menu']], [shared.gradio['textbox']], show_progress=False)
+            shared.gradio['prompt_menu'].change(load_prompt, shared.gradio['prompt_menu'], shared.gradio['textbox'], show_progress=False)
-            shared.gradio['save_prompt'].click(save_prompt, [shared.gradio['textbox']], [shared.gradio['status']], show_progress=False)
+            shared.gradio['save_prompt'].click(save_prompt, shared.gradio['textbox'], shared.gradio['status'], show_progress=False)
            shared.gradio['count_tokens'].click(count_tokens, shared.gradio['textbox'], shared.gradio['status'], show_progress=False)
            shared.gradio['interface'].load(None, None, None, _js=f"() => {{{ui.main_js}}}")
    # Launch the interface