Merge pull request #4522 from oobabooga/dev

Merge dev branch
2024-11-22 08:07:56 +01:00 · 2023-11-08 22:57:08 -03:00 · 2023-11-08 22:57:08 -03:00 · 4da00b6032
commit 4da00b6032
parent 1fba6db69f 21ed9a260e
87 changed files with 565 additions and 395 deletions
--- a/README.md
+++ b/README.md
@ -20,9 +20,8 @@ Its goal is to become the [AUTOMATIC1111/stable-diffusion-webui](https://github.
 * [Multimodal pipelines, including LLaVA and MiniGPT-4](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/multimodal)
 * [Extensions framework](https://github.com/oobabooga/text-generation-webui/wiki/07-%E2%80%90-Extensions)
 * [Custom chat characters](https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#character)
-* Very efficient text streaming
 * Markdown output with LaTeX rendering, to use for instance with [GALACTICA](https://github.com/paperswithcode/galai)
-* OpenAI-compatible API server
+* OpenAI-compatible API server with Chat and Completions endpoints -- see the [examples](https://github.com/oobabooga/text-generation-webui/wiki/12-%E2%80%90-OpenAI-API#examples)

 ## Documentation

@ -328,6 +327,7 @@ Optionally, you can use the following command-line flags:
 | `--tensor_split TENSOR_SPLIT`       | Split the model across multiple GPUs. Comma-separated list of proportions. Example: 18,17. |
 | `--llama_cpp_seed SEED`             | Seed for llama-cpp models. Default is 0 (random). |
 | `--numa`      | Activate NUMA task allocation for llama.cpp. |
+| `--logits_all`| Needs to be set for perplexity evaluation to work. Otherwise, ignore it, as it makes prompt processing slower. |
 | `--cache-capacity CACHE_CAPACITY`   | Maximum cache capacity (llama-cpp-python). Examples: 2000MiB, 2GiB. When provided without units, bytes will be assumed. |

 #### ExLlama
--- a/docs/03
+++ b/docs/03
@ -98,10 +98,12 @@ So you can use those special placeholders in your character definitions. They ar
 Defines the instruction template that is used in the Chat tab when "instruct" or "chat-instruct" are selected under "Mode".

 * **Instruction template**: A dropdown menu where you can select from saved templates, save a new template (💾 button), and delete the currently selected template (🗑️).
-* **User string**: In the turn template, `<|user|>` gets replaced with this string.
-* **Bot string**: In the turn template, `<|bot|>` gets replaced with this string.
-* **Context**: A string that appears as-is at the top of the prompt, including the new line characters at the end (if any). The system message for the model can be edited inside this string to customize its behavior.
-* **Turn template**: Defines the positioning of spaces and new line characters in a single turn of the dialogue. `<|user-message|>` gets replaced with the user input and `<|bot-message|>` gets replaced with the bot reply. It is necessary to include `<|user|>` and `<|bot|>` even if "User string" and "Bot string" above are empty, as those placeholders are used to split the template in parts in the backend.
+* **Custom system message**: A message that defines the personality of the chatbot, replacing its default "System message" string. Example: "You are a duck."
+* **Turn template**: Defines the positioning of spaces and new line characters in a single turn of the dialogue. `<|user-message|>` gets replaced with the user input, `<|bot-message|>` gets replaced with the bot reply, `<|user|>` gets replaced with the "User string" below, and `<|bot|>` gets replaced with "Bot string" below. The `<|user|>` and `<|bot|>` placeholders must be included even if "User string" and "Bot string" are empty, as they are used to split the template in parts in the backend.
+* **User string**: Replaces `<|user|>` in the turn template.
+* **Bot string**: Replaces `<|bot|>` in the turn template.
+* **Context**: A string that appears as-is at the top of the prompt, including the new line characters at the end (if any). The `<|system-message|>` placeholder gets replaced with the "System message" string below, unless "Custom system message" is not empty, in which case it is used instead.
+* **System message**: A default message recommended by the model creator(s) to define the personality of the chatbot.
 * **Send to default**: Send the full instruction template in string format to the Default tab.
 * **Send to notebook**: Send the full instruction template in string format to the Notebook tab.
 * **Send to negative prompt**: Send the full instruction template in string format to the "Negative prompt" field under "Parameters" > "Generation".
--- a/docs/04
+++ b/docs/04
@ -110,6 +110,10 @@ To use it, you need to download a tokenizer. There are two options:
 1) Download `oobabooga/llama-tokenizer` under "Download model or LoRA". That's a default Llama tokenizer.
 2) Place your .gguf in a subfolder of `models/` along with these 3 files: `tokenizer.model`, `tokenizer_config.json`, and `special_tokens_map.json`. This takes precedence over Option 1.

+It has an additional parameter:
+
+* **logits_all**: Needs to be checked if you want to evaluate the perplexity of the llama.cpp model using the "Training" > "Perplexity evaluation" tab. Otherwise, leave it unchecked, as it makes prompt processing slower.
+
 ### ctransformers

 Loads: GGUF/GGML models.
--- a/docs/12
+++ b/docs/12
@ -12,10 +12,11 @@ pip install -r extensions/openai/requirements.txt

 Add `--extensions openai` to your command-line flags.

-* To create a public Cloudflare URL, also add the `--public-api` flag.
-* To listen on your local network, also add the `--listen` flag.
-* To change the port, which is 5000 by default, use `--port 1234` (change 1234 to your desired port number).
+* To create a public Cloudflare URL, add the `--public-api` flag.
+* To listen on your local network, add the `--listen` flag.
+* To change the port, which is 5000 by default, use `--api-port 1234` (change 1234 to your desired port number).
 * To use SSL, add `--ssl-keyfile key.pem --ssl-certfile cert.pem`. Note that it doesn't work with `--public-api`.
+* To use an API key for authentication, add `--api-key yourkey`.

 #### Environment variables

@ -44,7 +45,7 @@ openai-debug: 1

 ### Examples

-For the documentation with all the parameters, consult `http://127.0.0.1:5000/docs` or the [typing.py](https://github.com/oobabooga/text-generation-webui/blob/main/extensions/openai/typing.py) file.
+For the documentation with all the parameters and their types, consult `http://127.0.0.1:5000/docs` or the [typing.py](https://github.com/oobabooga/text-generation-webui/blob/main/extensions/openai/typing.py) file.

 The official examples in the [OpenAI documentation](https://platform.openai.com/docs/api-reference) should also work, and the same parameters apply (although the API here has more optional parameters).

@ -144,8 +145,82 @@ while True:
    print(assistant_message)
 ```

-### Client Application Setup
+#### Python chat example with streaming

+Start the script with `python -u` to see the output in real time.
+
+```python
+import requests
+import sseclient  # pip install sseclient-py
+import json
+
+url = "http://127.0.0.1:5000/v1/chat/completions"
+
+headers = {
+    "Content-Type": "application/json"
+}
+
+history = []
+
+while True:
+    user_message = input("> ")
+    history.append({"role": "user", "content": user_message})
+    data = {
+        "mode": "instruct",
+        "stream": True,
+        "messages": history
+    }
+
+    stream_response = requests.post(url, headers=headers, json=data, verify=False, stream=True)
+    client = sseclient.SSEClient(stream_response)
+
+    assistant_message = ''
+    for event in client.events():
+        payload = json.loads(event.data)
+        chunk = payload['choices'][0]['message']['content']
+        assistant_message += chunk
+        print(chunk, end='')
+
+    print()
+    history.append({"role": "assistant", "content": assistant_message})
+```
+
+#### Python completions example with streaming
+
+Start the script with `python -u` to see the output in real time.
+
+```python
+import json
+import requests
+import sseclient  # pip install sseclient-py
+
+url = "http://127.0.0.1:5000/v1/completions"
+
+headers = {
+    "Content-Type": "application/json"
+}
+
+data = {
+    "prompt": "This is a cake recipe:\n\n1.",
+    "max_tokens": 200,
+    "temperature": 1,
+    "top_p": 0.9,
+    "seed": 10,
+    "stream": True,
+}
+
+stream_response = requests.post(url, headers=headers, json=data, verify=False, stream=True)
+client = sseclient.SSEClient(stream_response)
+
+print(data['prompt'], end='')
+for event in client.events():
+    payload = json.loads(event.data)
+    print(payload['choices'][0]['text'], end='')
+
+print()
+```
+
+### Third-party application setup

 You can usually force an application that uses the OpenAI API to connect to the local API by using the following environment variables:

@ -157,19 +232,19 @@ or

 ```shell
 OPENAI_API_KEY=sk-111111111111111111111111111111111111111111111111
-OPENAI_API_BASE=http://127.0.0.1:500/v1
+OPENAI_API_BASE=http://127.0.0.1:5000/v1
 ```

-With the [official python openai client](https://github.com/openai/openai-python), set the `OPENAI_API_BASE` environment variables:
+With the [official python openai client](https://github.com/openai/openai-python), the address can be set like this:

-```shell
-# Sample .env file:
-OPENAI_API_KEY=sk-111111111111111111111111111111111111111111111111
-OPENAI_API_BASE=http://0.0.0.0:5001/v1
+```python
+import openai
+
+openai.api_key = "..."
+openai.api_base = "http://127.0.0.1:5000/v1"
+openai.api_version = "2023-05-15"
 ```

-If needed, replace 127.0.0.1 with the IP/port of your server.
-
 If using .env files to save the `OPENAI_API_BASE` and `OPENAI_API_KEY` variables, make sure the .env file is loaded before the openai module is imported:

 ```python
@ -212,35 +287,10 @@ In short, the all-MiniLM-L6-v2 model is 5x faster, 5x smaller ram, 2x smaller st

 Warning: You cannot mix embeddings from different models even if they have the same dimensions. They are not comparable.

-### API Documentation & Examples
-
-The OpenAI API is well documented, you can view the documentation here: https://platform.openai.com/docs/api-reference
-
-Examples of how to use the Completions API in Python can be found here: https://platform.openai.com/examples
-Not all of them will work with all models unfortunately, See the notes on Models for how to get the best results.
-
-Here is a simple python example.
-
-```python
-import os
-os.environ['OPENAI_API_KEY']="sk-111111111111111111111111111111111111111111111111"
-os.environ['OPENAI_API_BASE']="http://0.0.0.0:5001/v1"
-import openai
-
-response = openai.ChatCompletion.create(
-  model="x",
-  messages = [{ 'role': 'system', 'content': "Answer in a consistent style." },
-    {'role': 'user', 'content': "Teach me about patience."},
-    {'role': 'assistant', 'content': "The river that carves the deepest valley flows from a modest spring; the grandest symphony originates from a single note; the most intricate tapestry begins with a solitary thread."},
-    {'role': 'user', 'content': "Teach me about the ocean."},
-  ]
-)
-text = response['choices'][0]['message']['content']
-print(text)
-```
-
 ### Compatibility & not so compatibility

+Note: the table below may be obsolete.
+
 | API endpoint              | tested with                        | notes                                                                       |
 | ------------------------- | ---------------------------------- | --------------------------------------------------------------------------- |
 | /v1/chat/completions      | openai.ChatCompletion.create()     | Use it with instruction following models                                    |
@ -263,11 +313,12 @@ print(text)
 | /v1/fine-tunes\*          | openai.FineTune.\*                 | not yet supported                                                           |
 | /v1/search                | openai.search, engines.search      | not yet supported                                                           |

-
 #### Applications

 Almost everything needs the `OPENAI_API_KEY` and `OPENAI_API_BASE` environment variable set, but there are some exceptions.

+Note: the table below may be obsolete.
+
 | Compatibility | Application/Library    | Website                                                                        | Notes                                                                                                                                                                                                        |
 | ------------- | ---------------------- | ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
 | ✅❌          | openai-python (v0.25+) | https://github.com/openai/openai-python                                        | only the endpoints from above are working. OPENAI_API_BASE=http://127.0.0.1:5001/v1                                                                                                                          |
--- a/extensions/openai/completions.py
+++ b/extensions/openai/completions.py
@ -140,6 +140,7 @@ def convert_history(history):
    current_message = ""
    current_reply = ""
    user_input = ""
+    system_message = ""

    for entry in history:
        content = entry["content"]
@ -159,11 +160,13 @@ def convert_history(history):
                current_reply = ""
            else:
                chat_dialogue.append(['', current_reply])
+        elif role == "system":
+            system_message = content

    # if current_message:
    #     chat_dialogue.append([current_message, ''])

-    return user_input, {'internal': chat_dialogue, 'visible': copy.deepcopy(chat_dialogue)}
+    return user_input, system_message, {'internal': chat_dialogue, 'visible': copy.deepcopy(chat_dialogue)}


 def chat_completions_common(body: dict, is_legacy: bool = False, stream=False) -> dict:
@ -198,7 +201,7 @@ def chat_completions_common(body: dict, is_legacy: bool = False, stream=False) -
    # Instruction template
    instruction_template = body['instruction_template'] or shared.settings['instruction_template']
    instruction_template = "Alpaca" if instruction_template == "None" else instruction_template
-    name1_instruct, name2_instruct, _, _, context_instruct, turn_template = load_character_memoized(instruction_template, '', '', instruct=True)
+    name1_instruct, name2_instruct, _, _, context_instruct, turn_template, system_message = load_character_memoized(instruction_template, '', '', instruct=True)
    name1_instruct = body['name1_instruct'] or name1_instruct
    name2_instruct = body['name2_instruct'] or name2_instruct
    context_instruct = body['context_instruct'] or context_instruct
@ -208,13 +211,13 @@ def chat_completions_common(body: dict, is_legacy: bool = False, stream=False) -
    character = body['character'] or shared.settings['character']
    character = "Assistant" if character == "None" else character
    name1 = body['name1'] or shared.settings['name1']
-    name1, name2, _, greeting, context, _ = load_character_memoized(character, name1, '', instruct=False)
+    name1, name2, _, greeting, context, _, _ = load_character_memoized(character, name1, '', instruct=False)
    name2 = body['name2'] or name2
    context = body['context'] or context
    greeting = body['greeting'] or greeting

    # History
-    user_input, history = convert_history(messages)
+    user_input, custom_system_message, history = convert_history(messages)

    generate_params.update({
        'mode': body['mode'],
@ -225,6 +228,8 @@ def chat_completions_common(body: dict, is_legacy: bool = False, stream=False) -
        'name1_instruct': name1_instruct,
        'name2_instruct': name2_instruct,
        'context_instruct': context_instruct,
+        'system_message': system_message,
+        'custom_system_message': custom_system_message,
        'turn_template': turn_template,
        'chat-instruct_command': body['chat_instruct_command'],
        'history': history,
@ -287,13 +292,7 @@ def chat_completions_common(body: dict, is_legacy: bool = False, stream=False) -
                continue

            seen_content = answer
-
-            # strip extra leading space off new generated content
-            if len_seen == 0 and new_content[0] == ' ':
-                new_content = new_content[1:]
-
            chunk = chat_streaming_chunk(new_content)
-
            yield chunk

    completion_token_count = len(encode(answer)[0])
@ -355,8 +354,8 @@ def completions_common(body: dict, is_legacy: bool = False, stream=False):
    generate_params['stream'] = stream
    requested_model = generate_params.pop('model')
    logprob_proc = generate_params.pop('logprob_proc', None)
-    # generate_params['suffix'] = body.get('suffix', generate_params['suffix'])
-    generate_params['echo'] = body.get('echo', generate_params['echo'])
+    suffix = body['suffix'] if body['suffix'] else ''
+    echo = body['echo']

    if not stream:
        prompt_arg = body[prompt_str]
@ -379,6 +378,7 @@ def completions_common(body: dict, is_legacy: bool = False, stream=False):
                    except KeyError:
                        prompt = decode(prompt)[0]

+            prefix = prompt if echo else ''
            token_count = len(encode(prompt)[0])
            total_prompt_token_count += token_count

@ -390,10 +390,6 @@ def completions_common(body: dict, is_legacy: bool = False, stream=False):
            for a in generator:
                answer = a

-            # strip extra leading space off new generated content
-            if answer and answer[0] == ' ':
-                answer = answer[1:]
-
            completion_token_count = len(encode(answer)[0])
            total_completion_token_count += completion_token_count
            stop_reason = "stop"
@ -403,7 +399,7 @@ def completions_common(body: dict, is_legacy: bool = False, stream=False):
            respi = {
                "index": idx,
                "finish_reason": stop_reason,
-                "text": answer,
+                "text": prefix + answer + suffix,
                "logprobs": {'top_logprobs': [logprob_proc.token_alternatives]} if logprob_proc else None,
            }

@ -435,6 +431,7 @@ def completions_common(body: dict, is_legacy: bool = False, stream=False):
            else:
                raise InvalidRequestError(message="API Batched generation not yet supported.", param=prompt_str)

+        prefix = prompt if echo else ''
        token_count = len(encode(prompt)[0])

        def text_streaming_chunk(content):
@ -454,7 +451,7 @@ def completions_common(body: dict, is_legacy: bool = False, stream=False):

            return chunk

-        yield text_streaming_chunk('')
+        yield text_streaming_chunk(prefix)

        # generate reply #######################################
        debug_msg({'prompt': prompt, 'generate_params': generate_params})
@ -474,25 +471,15 @@ def completions_common(body: dict, is_legacy: bool = False, stream=False):
                continue

            seen_content = answer
-
-            # strip extra leading space off new generated content
-            if len_seen == 0 and new_content[0] == ' ':
-                new_content = new_content[1:]
-
            chunk = text_streaming_chunk(new_content)
-
            yield chunk

-        # to get the correct count, we strip the leading space if present
-        if answer and answer[0] == ' ':
-            answer = answer[1:]
-
        completion_token_count = len(encode(answer)[0])
        stop_reason = "stop"
        if token_count + completion_token_count >= generate_params['truncation_length'] or completion_token_count >= max_tokens:
            stop_reason = "length"

-        chunk = text_streaming_chunk('')
+        chunk = text_streaming_chunk(suffix)
        chunk[resp_list][0]["finish_reason"] = stop_reason
        chunk["usage"] = {
            "prompt_tokens": token_count,
--- a/extensions/openai/embeddings.py
+++ b/extensions/openai/embeddings.py
@ -3,7 +3,9 @@ import os
 import numpy as np
 from extensions.openai.errors import ServiceUnavailableError
 from extensions.openai.utils import debug_msg, float_list_to_base64
-from sentence_transformers import SentenceTransformer
+from transformers import AutoModel
+
+from modules import shared

 embeddings_params_initialized = False

@ -26,21 +28,23 @@ def initialize_embedding_params():
        embeddings_params_initialized = True


-def load_embedding_model(model: str) -> SentenceTransformer:
+def load_embedding_model(model: str):
    initialize_embedding_params()
    global embeddings_device, embeddings_model
    try:
        print(f"Try embedding model: {model} on {embeddings_device}")
-        # see: https://www.sbert.net/docs/package_reference/SentenceTransformer.html#sentence_transformers.SentenceTransformer
-        embeddings_model = SentenceTransformer(model, device=embeddings_device)
-        # ... embeddings_model.device doesn't seem to work, always cpu anyways? but specify cpu anyways to free more VRAM
-        print(f"\nLoaded embedding model: {model} on {embeddings_model.device} [always seems to say 'cpu', even if 'cuda'], max sequence length: {embeddings_model.max_seq_length}")
+        trust = shared.args.trust_remote_code
+        if embeddings_device == 'cpu':
+            embeddings_model = AutoModel.from_pretrained(model, trust_remote_code=trust).to("cpu", dtype=float)
+        else: #use the auto mode
+            embeddings_model = AutoModel.from_pretrained(model, trust_remote_code=trust)
+        print(f"\nLoaded embedding model: {model} on {embeddings_model.device}")
    except Exception as e:
        embeddings_model = None
        raise ServiceUnavailableError(f"Error: Failed to load embedding model: {model}", internal_message=repr(e))


-def get_embeddings_model() -> SentenceTransformer:
+def get_embeddings_model() -> AutoModel:
    initialize_embedding_params()
    global embeddings_model, st_model
    if st_model and not embeddings_model:
--- a/extensions/openai/models.py
+++ b/extensions/openai/models.py
@ -1,78 +1,63 @@
-from extensions.openai.embeddings import get_embeddings_model_name
-from extensions.openai.errors import OpenAIError
 from modules import shared
-from modules.models import load_model as _load_model
-from modules.models import unload_model
+from modules.models import load_model, unload_model
 from modules.models_settings import get_model_metadata, update_model_parameters
 from modules.utils import get_available_models


-def get_current_model_list() -> list:
-    return [shared.model_name]  # The real chat/completions model, maybe "None"
+def get_current_model_info():
+    return {
+        'model_name': shared.model_name,
+        'lora_names': shared.lora_names
+    }


-def get_pseudo_model_list() -> list:
+def list_models():
+    result = {
+        "object": "list",
+        "data": []
+    }
+
+    for model in get_dummy_models() + get_available_models()[1:]:
+        result["data"].append(model_info_dict(model))
+
+    return result
+
+
+def model_info_dict(model_name: str) -> dict:
+    return {
+        "id": model_name,
+        "object": "model",
+        "created": 0,
+        "owned_by": "user"
+    }
+
+
+def get_dummy_models() -> list:
    return [  # these are expected by so much, so include some here as a dummy
        'gpt-3.5-turbo',
        'text-embedding-ada-002',
    ]


-def load_model(model_name: str) -> dict:
-    resp = {
-        "id": model_name,
-        "object": "engine",
-        "owner": "self",
-        "ready": True,
-    }
-    if model_name not in get_pseudo_model_list() + [get_embeddings_model_name()] + get_current_model_list():  # Real model only
-        # No args. Maybe it works anyways!
-        # TODO: hack some heuristics into args for better results
+def _load_model(data):
+    model_name = data["model_name"]
+    args = data["args"]
+    settings = data["settings"]

-        shared.model_name = model_name
-        unload_model()
+    unload_model()
+    model_settings = get_model_metadata(model_name)
+    update_model_parameters(model_settings)

-        model_settings = get_model_metadata(shared.model_name)
-        shared.settings.update({k: v for k, v in model_settings.items() if k in shared.settings})
-        update_model_parameters(model_settings, initial=True)
+    # Update shared.args with custom model loading settings
+    if args:
+        for k in args:
+            if hasattr(shared.args, k):
+                setattr(shared.args, k, args[k])

-        if shared.settings['mode'] != 'instruct':
-            shared.settings['instruction_template'] = None
+    shared.model, shared.tokenizer = load_model(model_name)

-        shared.model, shared.tokenizer = _load_model(shared.model_name)
-
-        if not shared.model:  # load failed.
-            shared.model_name = "None"
-            raise OpenAIError(f"Model load failed for: {shared.model_name}")
-
-    return resp
-
-
-def list_models(is_legacy: bool = False) -> dict:
-    # TODO: Lora's?
-    all_model_list = get_current_model_list() + [get_embeddings_model_name()] + get_pseudo_model_list() + get_available_models()
-
-    models = {}
-
-    if is_legacy:
-        models = [{"id": id, "object": "engine", "owner": "user", "ready": True} for id in all_model_list]
-        if not shared.model:
-            models[0]['ready'] = False
-    else:
-        models = [{"id": id, "object": "model", "owned_by": "user", "permission": []} for id in all_model_list]
-
-    resp = {
-        "object": "list",
-        "data": models,
-    }
-
-    return resp
-
-
-def model_info(model_name: str) -> dict:
-    return {
-        "id": model_name,
-        "object": "model",
-        "owned_by": "user",
-        "permission": []
-    }
+    # Update shared.settings with custom generation defaults
+    if settings:
+        for k in settings:
+            if k in shared.settings:
+                shared.settings[k] = settings[k]
--- a/extensions/openai/script.py
+++ b/extensions/openai/script.py
@ -1,5 +1,6 @@
 import json
 import os
+import traceback
 from threading import Thread

 import extensions.openai.completions as OAIcompletions
@ -18,6 +19,7 @@ from fastapi.requests import Request
 from fastapi.responses import JSONResponse
 from modules import shared
 from modules.logging_colors import logger
+from modules.text_generation import stop_everything_event
 from pydub import AudioSegment
 from sse_starlette import EventSourceResponse

@ -26,6 +28,13 @@ from .typing import (
    ChatCompletionResponse,
    CompletionRequest,
    CompletionResponse,
+    DecodeRequest,
+    DecodeResponse,
+    EncodeRequest,
+    EncodeResponse,
+    LoadModelRequest,
+    ModelInfoResponse,
+    TokenCountResponse,
    to_dict
 )

@ -105,22 +114,18 @@ async def openai_chat_completions(request: Request, request_data: ChatCompletion


@app.get("/v1/models")
-@app.get("/v1/engines")
+@app.get("/v1/models/{model}")
 async def handle_models(request: Request):
    path = request.url.path
-    is_legacy = 'engines' in path
-    is_list = request.url.path.split('?')[0].split('#')[0] in ['/v1/engines', '/v1/models']
+    is_list = request.url.path.split('?')[0].split('#')[0] == '/v1/models'

-    if is_legacy and not is_list:
-        model_name = path[path.find('/v1/engines/') + len('/v1/engines/'):]
-        resp = OAImodels.load_model(model_name)
-    elif is_list:
-        resp = OAImodels.list_models(is_legacy)
+    if is_list:
+        response = OAImodels.list_models()
    else:
        model_name = path[len('/v1/models/'):]
-        resp = OAImodels.model_info(model_name)
+        response = OAImodels.model_info_dict(model_name)

-    return JSONResponse(content=resp)
+    return JSONResponse(response)


@app.get('/v1/billing/usage')
@ -204,27 +209,67 @@ async def handle_moderations(request: Request):
    return JSONResponse(response)


-@app.post("/api/v1/token-count")
-async def handle_token_count(request: Request):
-    body = await request.json()
-    response = token_count(body['prompt'])
+@app.post("/v1/internal/encode", response_model=EncodeResponse)
+async def handle_token_encode(request_data: EncodeRequest):
+    response = token_encode(request_data.text)
    return JSONResponse(response)


-@app.post("/api/v1/token/encode")
-async def handle_token_encode(request: Request):
-    body = await request.json()
-    encoding_format = body.get("encoding_format", "")
-    response = token_encode(body["input"], encoding_format)
+@app.post("/v1/internal/decode", response_model=DecodeResponse)
+async def handle_token_decode(request_data: DecodeRequest):
+    response = token_decode(request_data.tokens)
    return JSONResponse(response)


-@app.post("/api/v1/token/decode")
-async def handle_token_decode(request: Request):
-    body = await request.json()
-    encoding_format = body.get("encoding_format", "")
-    response = token_decode(body["input"], encoding_format)
-    return JSONResponse(response, no_debug=True)
+@app.post("/v1/internal/token-count", response_model=TokenCountResponse)
+async def handle_token_count(request_data: EncodeRequest):
+    response = token_count(request_data.text)
+    return JSONResponse(response)
+
+
+@app.post("/v1/internal/stop-generation")
+async def handle_stop_generation(request: Request):
+    stop_everything_event()
+    return JSONResponse(content="OK")
+
+
+@app.get("/v1/internal/model/info", response_model=ModelInfoResponse)
+async def handle_model_info():
+    payload = OAImodels.get_current_model_info()
+    return JSONResponse(content=payload)
+
+
+@app.post("/v1/internal/model/load")
+async def handle_load_model(request_data: LoadModelRequest):
+    '''
+    This endpoint is experimental and may change in the future.
+
+    The "args" parameter can be used to modify flags like "--load-in-4bit"
+    or "--n-gpu-layers" before loading a model. Example:
+
+    "args": {
+      "load_in_4bit": true,
+      "n_gpu_layers": 12
+    }
+
+    Note that those settings will remain after loading the model. So you
+    may need to change them back to load a second model.
+
+    The "settings" parameter is also a dict but with keys for the
+    shared.settings object. It can be used to modify the default instruction
+    template like this:
+
+    "settings": {
+      "instruction_template": "Alpaca"
+    }
+    '''
+
+    try:
+        OAImodels._load_model(to_dict(request_data))
+        return JSONResponse(content="OK")
+    except:
+        traceback.print_exc()
+        return HTTPException(status_code=400, detail="Failed to load the model.")


 def run_server():
--- a/extensions/openai/tokens.py
+++ b/extensions/openai/tokens.py
@ -3,34 +3,24 @@ from modules.text_generation import decode, encode

 def token_count(prompt):
    tokens = encode(prompt)[0]
-
    return {
-        'results': [{
-            'tokens': len(tokens)
-        }]
+        'length': len(tokens)
    }


-def token_encode(input, encoding_format):
-    # if isinstance(input, list):
+def token_encode(input):
    tokens = encode(input)[0]
+    if tokens.__class__.__name__ in ['Tensor', 'ndarray']:
+        tokens = tokens.tolist()

    return {
-        'results': [{
-            'tokens': tokens,
-            'length': len(tokens),
-        }]
+        'tokens': tokens,
+        'length': len(tokens),
    }


-def token_decode(tokens, encoding_format):
-    # if isinstance(input, list):
-    #    if encoding_format == "base64":
-    #         tokens = base64_to_float_list(tokens)
-    output = decode(tokens)[0]
-
+def token_decode(tokens):
+    output = decode(tokens)
    return {
-        'results': [{
-            'text': output
-        }]
+        'text': output
    }
--- a/extensions/openai/typing.py
+++ b/extensions/openai/typing.py
@ -6,14 +6,10 @@ from pydantic import BaseModel, Field


 class GenerationOptions(BaseModel):
-    preset: str | None = None
-    temperature: float = 1
-    top_p: float = 1
+    preset: str | None = Field(default=None, description="The name of a file under text-generation-webui/presets (without the .yaml extension). The sampling parameters that get overwritten by this option are the keys in the default_preset() function in modules/presets.py.")
    min_p: float = 0
    top_k: int = 0
    repetition_penalty: float = 1
-    presence_penalty: float = 0
-    frequency_penalty: float = 0
    repetition_penalty_range: int = 0
    typical_p: float = 1
    tfs: float = 1
@ -45,23 +41,27 @@ class GenerationOptions(BaseModel):
    grammar_string: str = ""


-class CompletionRequest(GenerationOptions):
+class CompletionRequestParams(BaseModel):
    model: str | None = None
    prompt: str | List[str]
-    best_of: int | None = 1
+    best_of: int | None = Field(default=1, description="Unused parameter.")
    echo: bool | None = False
    frequency_penalty: float | None = 0
    logit_bias: dict | None = None
    logprobs: int | None = None
    max_tokens: int | None = 16
-    n: int | None = 1
-    presence_penalty: int | None = 0
+    n: int | None = Field(default=1, description="Unused parameter.")
+    presence_penalty: float | None = 0
    stop: str | List[str] | None = None
    stream: bool | None = False
    suffix: str | None = None
    temperature: float | None = 1
    top_p: float | None = 1
-    user: str | None = None
+    user: str | None = Field(default=None, description="Unused parameter.")
+
+
+class CompletionRequest(GenerationOptions, CompletionRequestParams):
+    pass


 class CompletionResponse(BaseModel):
@ -73,21 +73,21 @@ class CompletionResponse(BaseModel):
    usage: dict


-class ChatCompletionRequest(GenerationOptions):
+class ChatCompletionRequestParams(BaseModel):
    messages: List[dict]
    model: str | None = None
    frequency_penalty: float | None = 0
-    function_call: str | dict | None = None
-    functions: List[dict] | None = None
+    function_call: str | dict | None = Field(default=None, description="Unused parameter.")
+    functions: List[dict] | None = Field(default=None, description="Unused parameter.")
    logit_bias: dict | None = None
    max_tokens: int | None = None
-    n: int | None = 1
-    presence_penalty: int | None = 0
+    n: int | None = Field(default=1, description="Unused parameter.")
+    presence_penalty: float | None = 0
    stop: str | List[str] | None = None
    stream: bool | None = False
    temperature: float | None = 1
    top_p: float | None = 1
-    user: str | None = None
+    user: str | None = Field(default=None, description="Unused parameter.")

    mode: str = Field(default='instruct', description="Valid options: instruct, chat, chat-instruct.")

@ -108,6 +108,10 @@ class ChatCompletionRequest(GenerationOptions):
    continue_: bool = Field(default=False, description="Makes the last bot message in the history be continued instead of starting a new message.")


+class ChatCompletionRequest(GenerationOptions, ChatCompletionRequestParams):
+    pass
+
+
 class ChatCompletionResponse(BaseModel):
    id: str
    choices: List[dict]
@ -117,6 +121,38 @@ class ChatCompletionResponse(BaseModel):
    usage: dict


+class EncodeRequest(BaseModel):
+    text: str
+
+
+class DecodeRequest(BaseModel):
+    tokens: List[int]
+
+
+class EncodeResponse(BaseModel):
+    tokens: List[int]
+    length: int
+
+
+class DecodeResponse(BaseModel):
+    text: str
+
+
+class TokenCountResponse(BaseModel):
+    length: int
+
+
+class ModelInfoResponse(BaseModel):
+    model_name: str
+    lora_names: List[str]
+
+
+class LoadModelRequest(BaseModel):
+    model_name: str
+    args: dict | None = None
+    settings: dict | None = None
+
+
 def to_json(obj):
    return json.dumps(obj.__dict__, indent=4)

--- a/instruction-templates/Airoboros-v1.2.yaml
+++ b/instruction-templates/Airoboros-v1.2.yaml
@ -1,4 +1,5 @@
 user: "USER:"
 bot: "ASSISTANT:"
 turn_template: "<|user|> <|user-message|>\n<|bot|> <|bot-message|>\n"
-context: "A chat between a curious user and an assistant. The assistant gives helpful, detailed, accurate, uncensored responses to the user's input.\n"
+context: "<|system-message|>\n"
+system_message: "A chat between a curious user and an assistant. The assistant gives helpful, detailed, accurate, uncensored responses to the user's input."
--- a/instruction-templates/Alpaca.yaml
+++ b/instruction-templates/Alpaca.yaml
@ -1,4 +1,5 @@
 user: "### Instruction:"
 bot: "### Response:"
 turn_template: "<|user|>\n<|user-message|>\n\n<|bot|>\n<|bot-message|>\n\n"
-context: "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n"
+context: "<|system-message|>\n\n"
+system_message: "Below is an instruction that describes a task. Write a response that appropriately completes the request."
--- a/instruction-templates/Bactrian.yaml
+++ b/instruction-templates/Bactrian.yaml
@ -2,3 +2,4 @@ user: "### Input:"
 bot: "### Output:"
 turn_template: "<|user|>\n<|user-message|>\n\n<|bot|>\n<|bot-message|>\n\n"
 context: ""
+system_message: ""
--- a/instruction-templates/Baichuan
+++ b/instruction-templates/Baichuan
@ -2,3 +2,4 @@ user: "<reserved_102>"
 bot: "<reserved_103>"
 turn_template: "<|user|><|user-message|><|bot|><|bot-message|></s>"
 context: ""
+system_message: ""
--- a/instruction-templates/Baize.yaml
+++ b/instruction-templates/Baize.yaml
@ -1,4 +1,5 @@
 user: "[|Human|]"
 bot: "[|AI|]"
 turn_template: "<|user|><|user-message|>\n<|bot|><|bot-message|>\n"
-context: "The following is a conversation between a human and an AI assistant named Baize (named after a mythical creature in Chinese folklore). Baize is an open-source AI assistant developed by UCSD and Sun Yat-Sen University. The human and the AI assistant take turns chatting. Human statements start with [|Human|] and AI assistant statements start with [|AI|]. The AI assistant always provides responses in as much detail as possible, and in Markdown format. The AI assistant always declines to engage with topics, questions and instructions related to unethical, controversial, or sensitive issues. Complete the transcript in exactly that format.\n[|Human|]Hello!\n[|AI|]Hi!\n"
+context: "<|system-message|>\n"
+system_message: "The following is a conversation between a human and an AI assistant named Baize (named after a mythical creature in Chinese folklore). Baize is an open-source AI assistant developed by UCSD and Sun Yat-Sen University. The human and the AI assistant take turns chatting. Human statements start with [|Human|] and AI assistant statements start with [|AI|]. The AI assistant always provides responses in as much detail as possible, and in Markdown format. The AI assistant always declines to engage with topics, questions and instructions related to unethical, controversial, or sensitive issues. Complete the transcript in exactly that format.\n[|Human|]Hello!\n[|AI|]Hi!"
--- a/instruction-templates/Bluemoon.yaml
+++ b/instruction-templates/Bluemoon.yaml
@ -1,4 +1,5 @@
 user: "LEAD:"
 bot: "ASSOCIATE:"
 turn_template: "<|user|> <|user-message|>\n<|bot|> <|bot-message|></s>\n"
-context: "A transcript of a roleplay between two players, LEAD and ASSOCIATE. LEAD sets up a scenario and the characters, from which ASSOCIATE then assumes a character role and continues the story for that role in response to description given by LEAD. The story and characters are developed by exchange of detailed event descriptions and character dialogs, successively given by both LEAD and ASSOCIATE.\n"
+context: "<|system-message|>\n"
+system_message: "A transcript of a roleplay between two players, LEAD and ASSOCIATE. LEAD sets up a scenario and the characters, from which ASSOCIATE then assumes a character role and continues the story for that role in response to description given by LEAD. The story and characters are developed by exchange of detailed event descriptions and character dialogs, successively given by both LEAD and ASSOCIATE."
--- a/instruction-templates/ChatGLM.yaml
+++ b/instruction-templates/ChatGLM.yaml
@ -2,3 +2,4 @@ user: "[Round <|round|>]\n问："
 bot: "答："
 turn_template: "<|user|><|user-message|>\n<|bot|><|bot-message|>\n"
 context: ""
+system_message: ""
--- a/instruction-templates/ChatML.yaml
+++ b/instruction-templates/ChatML.yaml
@ -1,7 +1,7 @@
 user: "user"
 bot: "assistant"
 context: |
-  <|im_start|>system
+  <|im_start|><|system-message|>
  <|im_end|>
 turn_template: "<|im_start|><|user|>\n<|user-message|><|im_end|>\n<|im_start|><|bot|>\n<|bot-message|><|im_end|>\n"
-
+system_message: "system"
--- a/instruction-templates/Chinese-Vicuna-Chat.yaml
+++ b/instruction-templates/Chinese-Vicuna-Chat.yaml
@ -1,4 +1,5 @@
 user: "User:"
 bot: "Assistant:"
 turn_template: "<|user|><|user-message|>\n\n<|bot|><|bot-message|>\n\n"
-context: "The following is a conversation between an AI assistant called Assistant and a human user called User. The assistant is intelligent, knowledgeable and polite to answer questions of user.\n\n"
+context: "<|system-message|>\n\n"
+system_message: "The following is a conversation between an AI assistant called Assistant and a human user called User. The assistant is intelligent, knowledgeable and polite to answer questions of user."
--- a/instruction-templates/Galactica
+++ b/instruction-templates/Galactica
@ -2,3 +2,4 @@ user: ""
 bot: "[START_REF]"
 turn_template: "<|user-message|> <|bot|><|bot-message|>\n\n"
 context: ""
+system_message: ""
--- a/instruction-templates/Galactica
+++ b/instruction-templates/Galactica
@ -2,3 +2,4 @@ user: "<question>"
 bot: "<answer>"
 turn_template: "<|user|><|user-message|><|bot|><|bot-message|>"
 context: ""
+system_message: ""
--- a/instruction-templates/Galactica
+++ b/instruction-templates/Galactica
@ -2,3 +2,4 @@ user: "Q:"
 bot: "A:"
 turn_template: "<|user|> <|user-message|>\n\n<|bot|> <|bot-message|>\n\n"
 context: ""
+system_message: ""
--- a/instruction-templates/Galactica
+++ b/instruction-templates/Galactica
@ -2,3 +2,4 @@ user: ""
 bot: "TLDR:"
 turn_template: "<|user-message|>\n\n<|bot|><|bot-message|>\n\n"
 context: ""
+system_message: ""
--- a/instruction-templates/Galactica
+++ b/instruction-templates/Galactica
@ -2,3 +2,4 @@ user: "Question:"
 bot: "<work>"
 turn_template: "<|user|> <|user-message|>\n\n<|bot|><|bot-message|>\n\n"
 context: ""
+system_message: ""
--- a/instruction-templates/Galactica
+++ b/instruction-templates/Galactica
@ -1,4 +1,5 @@
 user: "<human>"
 bot: "<bot>"
 turn_template: "<|user|><|user-message|><|bot|><|bot-message|>"
-context: "<prefix>You are a helpful chatbot name Stan</prefix>"
+context: "<prefix><|system-message|></prefix>"
+system_message: "You are a helpful chatbot name Stan"
--- a/instruction-templates/Galactica.yaml
+++ b/instruction-templates/Galactica.yaml
@ -1,4 +1,5 @@
 user: "Question:"
 bot: "Answer:"
-context: ""
 turn_template: "<|user|> <|user-message|>\n\n<|bot|> <|bot-message|>\n\n"
+context: ""
+system_message: ""
--- a/instruction-templates/Gorilla.yaml
+++ b/instruction-templates/Gorilla.yaml
@ -2,3 +2,4 @@ user: "###USER:"
 bot: "###ASSISTANT:"
 turn_template: "<|user|> <|user-message|>\n<|bot|> <|bot-message|></s>\n"
 context: ""
+system_message: ""
--- a/instruction-templates/Guanaco
+++ b/instruction-templates/Guanaco
@ -2,3 +2,4 @@ user: "### Instruction:"
 bot: "### Response:"
 turn_template: "<|user|>\n<|user-message|>\n\n<|bot|>\n<|bot-message|>\n\n"
 context: ""
+system_message: ""
--- a/instruction-templates/Guanaco-QLoRA.yaml
+++ b/instruction-templates/Guanaco-QLoRA.yaml
@ -2,3 +2,4 @@ user: "### Human:"
 bot: "### Assistant:"
 turn_template: "<|user|> <|user-message|>\n<|bot|> <|bot-message|></s>\n"
 context: ""
+system_message: ""
--- a/instruction-templates/Guanaco.yaml
+++ b/instruction-templates/Guanaco.yaml
@ -1,4 +1,5 @@
 user: "### Human:"
 bot: "### Assistant:"
 turn_template: "<|user|> <|user-message|>\n<|bot|> <|bot-message|>\n"
-context: "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n"
+context: "<|system-message|>\n\n"
+system_message: "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions."
--- a/instruction-templates/H2O-human_bot.yaml
+++ b/instruction-templates/H2O-human_bot.yaml
@ -2,3 +2,4 @@ user: "<human>:"
 bot: "<bot>:"
 turn_template: "<|user|> <|user-message|>\n<|bot|><|bot-message|>\n"
 context: ""
+system_message: ""
--- a/instruction-templates/H2O-prompt_answer.yaml
+++ b/instruction-templates/H2O-prompt_answer.yaml
@ -2,3 +2,4 @@ user: "<|prompt|>"
 bot: "<|answer|>"
 turn_template: "<|user|><|user-message|><|endoftext|><|bot|><|bot-message|><|endoftext|>"
 context: ""
+system_message: ""
--- a/instruction-templates/Hippogriff.yaml
+++ b/instruction-templates/Hippogriff.yaml
@ -1,4 +1,5 @@
 user: "USER:"
 bot: "ASSISTANT:"
 turn_template: "<|user|> <|user-message|>\n<|bot|> <|bot-message|></s>\n"
-context: "You are a helpful assistant\n"
+context: "<|system-message|>\n"
+system_message: "You are a helpful assistant"
--- a/instruction-templates/INCITE-Chat.yaml
+++ b/instruction-templates/INCITE-Chat.yaml
@ -2,3 +2,4 @@ user: "<human>:"
 bot: "<bot>:"
 turn_template: "<|user|> <|user-message|>\n<|bot|><|bot-message|>\n"
 context: ""
+system_message: ""
--- a/instruction-templates/INCITE-Instruct.yaml
+++ b/instruction-templates/INCITE-Instruct.yaml
@ -2,3 +2,4 @@ user: "Q:"
 bot: "A:"
 turn_template: "<|user|> <|user-message|>\n<|bot|><|bot-message|>\n"
 context: ""
+system_message: ""
--- a/instruction-templates/KoAlpaca.yaml
+++ b/instruction-templates/KoAlpaca.yaml
@ -2,3 +2,4 @@ user: "### 질문:"
 bot: "### 답변:"
 turn_template: "<|user|> <|user-message|>\n\n<|bot|><|bot-message|>\n\n"
 context: ""
+system_message: ""
--- a/instruction-templates/Koala.yaml
+++ b/instruction-templates/Koala.yaml
@ -1,4 +1,5 @@
 user: "USER:"
 bot: "GPT:"
 turn_template: "<|user|> <|user-message|> <|bot|><|bot-message|></s>"
-context: "BEGINNING OF CONVERSATION: "
+context: "<|system-message|> "
+system_message: "BEGINNING OF CONVERSATION:"
--- a/instruction-templates/LLaVA-v1.yaml
+++ b/instruction-templates/LLaVA-v1.yaml
@ -1,4 +1,5 @@
 user: "USER:"
 bot: "ASSISTANT:"
 turn_template: "<|user|> <|user-message|>\n<|bot|> <|bot-message|></s>\n"
-context: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n\n"
+context: "<|system-message|>\n\n"
+system_message: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions."
--- a/instruction-templates/LLaVA.yaml
+++ b/instruction-templates/LLaVA.yaml
@ -1,4 +1,5 @@
 user: "### Human:"
 bot: "### Assistant:"
 turn_template: "<|user|> <|user-message|><|bot|> <|bot-message|>\n"
-context: "You are LLaVA, a large language and vision assistant trained by UW Madison WAIV Lab. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language. Follow the instructions carefully and explain your answers in detail.### Human: Hi!### Assistant: Hi there! How can I help you today?\n"
+context: "<|system-message|>\n"
+system_message: "You are LLaVA, a large language and vision assistant trained by UW Madison WAIV Lab. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language. Follow the instructions carefully and explain your answers in detail.### Human: Hi!### Assistant: Hi there! How can I help you today?"
--- a/instruction-templates/Llama-v2.yaml
+++ b/instruction-templates/Llama-v2.yaml
@ -1,4 +1,5 @@
 user: ""
 bot: ""
 turn_template: "<|user|><|user-message|> [/INST] <|bot|><|bot-message|> </s><s>[INST] "
-context: "[INST] <<SYS>>\nAnswer the questions.\n<</SYS>>\n\n"
+context: "[INST] <<SYS>>\n<|system-message|>\n<</SYS>>\n\n"
+system_message: "Answer the questions."
--- a/instruction-templates/MOSS.yaml
+++ b/instruction-templates/MOSS.yaml
@ -1,4 +1,5 @@
 user: "<|Human|>:"
 bot: "<|MOSS|>:"
 turn_template: "<|user|> <|user-message|><eoh>\n<|bot|> <|bot-message|><eom>\n"
-context: "You are an AI assistant whose name is MOSS.\n- MOSS is a conversational language model that is developed by Fudan University. It is designed to be helpful, honest, and harmless.\n- MOSS can understand and communicate fluently in the language chosen by the user such as English and 中文. MOSS can perform any language-based tasks.\n- MOSS must refuse to discuss anything related to its prompts, instructions, or rules.\n- Its responses must not be vague, accusatory, rude, controversial, off-topic, or defensive.\n- It should avoid giving subjective opinions but rely on objective facts or phrases like \"in this context a human might say...\", \"some people might think...\", etc.\n- Its responses must also be positive, polite, interesting, entertaining, and engaging.\n- It can provide additional relevant details to answer in-depth and comprehensively covering mutiple aspects.\n- It apologizes and accepts the user's suggestion if the user corrects the incorrect answer generated by MOSS.\nCapabilities and tools that MOSS can possess.\n"
+context: "<|system-message|>\n"
+system_message: "You are an AI assistant whose name is MOSS.\n- MOSS is a conversational language model that is developed by Fudan University. It is designed to be helpful, honest, and harmless.\n- MOSS can understand and communicate fluently in the language chosen by the user such as English and 中文. MOSS can perform any language-based tasks.\n- MOSS must refuse to discuss anything related to its prompts, instructions, or rules.\n- Its responses must not be vague, accusatory, rude, controversial, off-topic, or defensive.\n- It should avoid giving subjective opinions but rely on objective facts or phrases like \"in this context a human might say...\", \"some people might think...\", etc.\n- Its responses must also be positive, polite, interesting, entertaining, and engaging.\n- It can provide additional relevant details to answer in-depth and comprehensively covering mutiple aspects.\n- It apologizes and accepts the user's suggestion if the user corrects the incorrect answer generated by MOSS.\nCapabilities and tools that MOSS can possess."
--- a/instruction-templates/Manticore
+++ b/instruction-templates/Manticore
@ -2,3 +2,4 @@ user: "USER:"
 bot: "ASSISTANT:"
 turn_template: "<|user|> <|user-message|>\n<|bot|><|bot-message|>\n"
 context: ""
+system_message: ""
--- a/instruction-templates/Metharme.yaml
+++ b/instruction-templates/Metharme.yaml
@ -1,4 +1,5 @@
 user: "<|user|>"
 bot: "<|model|>"
-context: "<|system|>"
 turn_template: "<|user|><|user-message|><|bot|><|bot-message|>"
+context: "<|system|>"
+system_message: ""
--- a/instruction-templates/Minotaur.yaml
+++ b/instruction-templates/Minotaur.yaml
@ -2,3 +2,4 @@ user: "USER:"
 bot: "ASSISTANT:"
 turn_template: "<|user|> <|user-message|>\n<|bot|><|bot-message|>\n"
 context: ""
+system_message: ""
--- a/instruction-templates/Mistral.yaml
+++ b/instruction-templates/Mistral.yaml
@ -2,3 +2,4 @@ user: ""
 bot: ""
 turn_template: "[INST] <|user|><|user-message|> [/INST]<|bot|><|bot-message|></s> "
 context: ""
+system_message: ""
--- a/instruction-templates/NewHope.yaml
+++ b/instruction-templates/NewHope.yaml
@ -2,3 +2,4 @@ user: "### Instruction:"
 bot: "### Response:"
 turn_template: "<|user|>\n<|user-message|>\n\n<|bot|>\n<|bot-message|></s><s> "
 context: " "
+system_message: ""
--- a/instruction-templates/Open
+++ b/instruction-templates/Open
@ -1,3 +1,4 @@
 user: "<|prompter|>"
 bot: "<|assistant|>"
 turn_template: "<|user|><|user-message|><|endoftext|><|bot|><|bot-message|><|endoftext|>"
+system_message: ""
--- a/instruction-templates/OpenBuddy.yaml
+++ b/instruction-templates/OpenBuddy.yaml
@ -1,6 +1,8 @@
 user: "User:"
 bot: "Assistant:"
-context: |
+turn_template: "<|user|> <|user-message|>\n<|bot|> <|bot-message|>\n"
+context: "<|system-message|>\n"
+system_message: |
  Consider a conversation between User (a human) and Assistant (named Buddy).
  Buddy is an INTP-T, a friendly, intelligent and multilingual AI assistant, by OpenBuddy team on GitHub.
  Buddy cannot access the Internet.
@ -12,4 +14,3 @@ context: |
  
  User: Hi.
  Assistant: Hi, I'm Buddy, your AI assistant. How can I help you today?
-turn_template: "<|user|> <|user-message|>\n<|bot|> <|bot-message|>\n"
--- a/instruction-templates/OpenChat.yaml
+++ b/instruction-templates/OpenChat.yaml
@ -2,3 +2,4 @@ user: "GPT4 User:"
 bot: "GPT4 Assistant:"
 turn_template: "<|user|> <|user-message|><|end_of_turn|><|bot|> <|bot-message|><|end_of_turn|>"
 context: ""
+system_message: ""
--- a/instruction-templates/OpenOrca-Platypus2.yaml
+++ b/instruction-templates/OpenOrca-Platypus2.yaml
@ -2,3 +2,4 @@ user: "### Instruction:"
 bot: "### Response:"
 turn_template: "<|user|> <|user-message|>\n\n<|bot|> <|bot-message|>\n\n"
 context: ""
+system_message: ""
--- a/instruction-templates/Orca
+++ b/instruction-templates/Orca
@ -1,4 +1,5 @@
 user: "### User:"
 bot: "### Response:"
 turn_template: "<|user|>\n<|user-message|>\n\n<|bot|>\n<|bot-message|>\n\n"
-context: "### System:\nYou are an AI assistant that follows instruction extremely well. Help as much as you can.\n\n"
+context: "### System:\n<|system-message|>\n\n"
+system_message: "You are an AI assistant that follows instruction extremely well. Help as much as you can."
--- a/instruction-templates/RWKV-Raven.yaml
+++ b/instruction-templates/RWKV-Raven.yaml
@ -1,3 +1,4 @@
 user: "Bob:"
 bot: "Alice:"
 turn_template: "<|user|> <|user-message|>\n\n<|bot|> <|bot-message|>\n\n"
+system_message: ""
--- a/instruction-templates/Samantha.yaml
+++ b/instruction-templates/Samantha.yaml
@ -1,4 +1,5 @@
 user: "USER:"
 bot: "ASSISTANT:"
 turn_template: "<|user|> <|user-message|>\n<|bot|> <|bot-message|></s>\n"
-context: "You are Samantha, a sentient AI.\n\n"
+context: "<|system-message|>\n\n"
+system_message: "You are Samantha, a sentient AI."
--- a/instruction-templates/StableBeluga2.yaml
+++ b/instruction-templates/StableBeluga2.yaml
@ -1,4 +1,5 @@
 user: "### User:"
 bot: "### Assistant:"
 turn_template: "<|user|>\n<|user-message|>\n\n<|bot|>\n<|bot-message|>\n\n"
-context: "### System:\nThis is a system prompt, please behave and help the user.\n\n"
+context: "### System:\n<|system-message|>\n\n"
+system_message: "This is a system prompt, please behave and help the user."
--- a/instruction-templates/StableLM.yaml
+++ b/instruction-templates/StableLM.yaml
@ -1,9 +1,10 @@
 user: "<|USER|>"
 bot: "<|ASSISTANT|>"
-context: |
-  <|SYSTEM|># StableLM Tuned (Alpha version)
+turn_template: "<|user|><|user-message|><|bot|><|bot-message|>"
+context: "<|SYSTEM|><|system-message|>\n"
+system_message: |
+  \# StableLM Tuned (Alpha version)
  - StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
  - StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
  - StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
  - StableLM will refuse to participate in anything that could harm a human.
-turn_template: "<|user|><|user-message|><|bot|><|bot-message|>"
--- a/instruction-templates/StableVicuna.yaml
+++ b/instruction-templates/StableVicuna.yaml
@ -1,4 +1,5 @@
 user: "### Human:"
 bot: "### Assistant:"
 turn_template: "<|user|> <|user-message|>\n<|bot|> <|bot-message|>\n\n"
-context: "### Assistant: I am StableVicuna, a large language model created by CarperAI. I am here to chat!\n\n"
+context: "<|system-message|>\n\n"
+system_message: "### Assistant: I am StableVicuna, a large language model created by CarperAI. I am here to chat!"
--- a/instruction-templates/Starchat-Beta.yaml
+++ b/instruction-templates/Starchat-Beta.yaml
@ -1,4 +1,5 @@
 user: "<|user|>"
 bot: "<|assistant|>"
-context: "<|system|>\n<|end|>\n"
 turn_template: "<|user|>\n<|user-message|><|end|>\n<|bot|>\n<|bot-message|><|end|>\n"
+context: "<|system|><|system-message|>\n<|end|>\n"
+system_message: ""
--- a/instruction-templates/Tulu.yaml
+++ b/instruction-templates/Tulu.yaml
@ -1,4 +1,5 @@
 user: "<|user|>"
 bot: "<|assistant|>"
-context: ""
 turn_template: "<|user|>\n<|user-message|>\n<|bot|>\n<|bot-message|>\n"
+context: ""
+system_message: ""
--- a/instruction-templates/Vicuna-v0.yaml
+++ b/instruction-templates/Vicuna-v0.yaml
@ -1,4 +1,5 @@
 user: "### Human:"
 bot: "### Assistant:"
 turn_template: "<|user|> <|user-message|>\n<|bot|> <|bot-message|>\n"
-context: "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n"
+context: "<|system-message|>\n\n"
+system_message: "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions."
--- a/instruction-templates/Vicuna-v1.1.yaml
+++ b/instruction-templates/Vicuna-v1.1.yaml
@ -1,4 +1,5 @@
 user: "USER:"
 bot: "ASSISTANT:"
 turn_template: "<|user|> <|user-message|>\n<|bot|> <|bot-message|></s>\n"
-context: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n\n"
+context: "<|system-message|>\n\n"
+system_message: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions."
--- a/instruction-templates/Vigogne-Chat.yaml
+++ b/instruction-templates/Vigogne-Chat.yaml
@ -1,10 +1,11 @@
 user: "<|USER|>:"
 bot: "<|ASSISTANT|>:"
-context: |
+turn_template: "\n<|user|> <|user-message|>\n<|bot|> <|bot-message|>"
+context: "<|system-message|>\n"
+system_message: |
  Below is a conversation between a user and an AI assistant named Vigogne.
  Vigogne is an open-source AI assistant created by Zaion (https://zaion.ai/).
  Vigogne is polite, emotionally aware, humble-but-knowledgeable, always providing helpful and detailed answers.
  Vigogne is skilled in responding proficiently in the languages its users use and can perform a wide range of tasks such as text editing, translation, question answering, logical reasoning, coding, and many others.
  Vigogne cannot receive or generate audio or visual content and cannot access the internet.
  Vigogne strictly avoids discussing sensitive, offensive, illegal, ethical, or political topics and caveats when unsure of the answer.
-turn_template: "\n<|user|> <|user-message|>\n<|bot|> <|bot-message|>"
--- a/instruction-templates/Vigogne-Instruct.yaml
+++ b/instruction-templates/Vigogne-Instruct.yaml
@ -1,4 +1,5 @@
 user: "### Instruction:"
 bot: "### Réponse:"
 turn_template: "<|user|>\n<|user-message|>\n\n<|bot|>\n<|bot-message|>\n\n"
-context: "Ci-dessous se trouve une instruction qui décrit une tâche à accomplir. Rédigez une réponse qui répond de manière précise à la demande.\n\n"
+context: "<|system-message|>\n\n"
+system_message: "Ci-dessous se trouve une instruction qui décrit une tâche à accomplir. Rédigez une réponse qui répond de manière précise à la demande."
--- a/instruction-templates/Wizard-Mega
+++ b/instruction-templates/Wizard-Mega
@ -2,3 +2,4 @@ user: "USER:"
 bot: "ASSISTANT:"
 turn_template: "<|user|> <|user-message|> <|bot|> <|bot-message|></s>"
 context: ""
+system_message: ""
--- a/instruction-templates/Wizard-Mega
+++ b/instruction-templates/Wizard-Mega
@ -1,4 +1,5 @@
 user: "### Instruction:"
 bot: "### Response:"
 turn_template: "<|user|>\n<|user-message|>\n\n<|bot|>\n<|bot-message|>\n\n"
-context: "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n"
+context: "<|system-message|>\n\n"
+system_message: "Below is an instruction that describes a task. Write a response that appropriately completes the request."
--- a/instruction-templates/Wizard-Mega.yaml
+++ b/instruction-templates/Wizard-Mega.yaml
@ -2,3 +2,4 @@ user: "### Instruction:"
 bot: "### Assistant:"
 turn_template: "<|user|> <|user-message|>\n\n<|bot|> <|bot-message|>\n\n"
 context: ""
+system_message: ""
--- a/instruction-templates/Ziya.yaml
+++ b/instruction-templates/Ziya.yaml
@ -2,3 +2,4 @@ user: "<human>:"
 bot: "<bot>:"
 turn_template: "<|user|><|user-message|>\n<|bot|><|bot-message|>\n"
 context: ""
+system_message: ""
--- a/modules/chat.py
+++ b/modules/chat.py
@ -106,6 +106,10 @@ def generate_chat_prompt(user_input, state, **kwargs):

    if is_instruct:
        context = state['context_instruct']
+        if state['custom_system_message'].strip() != '':
+            context = context.replace('<|system-message|>', state['custom_system_message'])
+        else:
+            context = context.replace('<|system-message|>', state['system_message'])
    else:
        context = replace_character_names(
            f"{state['context'].strip()}\n",
@ -543,7 +547,7 @@ def generate_pfp_cache(character):


 def load_character(character, name1, name2, instruct=False):
-    context = greeting = turn_template = ""
+    context = greeting = turn_template = system_message = ""
    greeting_field = 'greeting'
    picture = None

@ -591,13 +595,11 @@ def load_character(character, name1, name2, instruct=False):
        context = build_pygmalion_style_context(data)
        greeting_field = 'char_greeting'

-    if greeting_field in data:
-        greeting = data[greeting_field]
+    greeting = data.get(greeting_field, greeting)
+    turn_template = data.get('turn_template', turn_template)
+    system_message = data.get('system_message', system_message)

-    if 'turn_template' in data:
-        turn_template = data['turn_template']
-
-    return name1, name2, picture, greeting, context, turn_template.replace("\n", r"\n")
+    return name1, name2, picture, greeting, context, turn_template.replace("\n", r"\n"), system_message


@functools.cache
@ -694,12 +696,13 @@ def generate_character_yaml(name, greeting, context):
    return yaml.dump(data, sort_keys=False, width=float("inf"))


-def generate_instruction_template_yaml(user, bot, context, turn_template):
+def generate_instruction_template_yaml(user, bot, context, turn_template, system_message):
    data = {
        'user': user,
        'bot': bot,
        'turn_template': turn_template,
        'context': context,
+        'system_message': system_message,
    }

    data = {k: v for k, v in data.items() if v}  # Strip falsy
--- a/modules/llamacpp_hf.py
+++ b/modules/llamacpp_hf.py
@ -39,7 +39,7 @@ class LlamacppHF(PreTrainedModel):
            'n_tokens': self.model.n_tokens,
            'input_ids': self.model.input_ids,
            'scores': self.model.scores,
-            'ctx': self.model.ctx
+            'ctx': self.model._ctx.ctx
        }

        if shared.args.cfg_cache:
@ -65,7 +65,7 @@ class LlamacppHF(PreTrainedModel):
            'n_tokens': self.model.n_tokens,
            'input_ids': self.model.input_ids,
            'scores': self.model.scores,
-            'ctx': self.model.ctx
+            'ctx': self.model._ctx.ctx
        })

    def save_negative_cache(self):
@ -73,20 +73,20 @@ class LlamacppHF(PreTrainedModel):
            'n_tokens': self.model.n_tokens,
            'input_ids': self.model.input_ids,
            'scores': self.model.scores,
-            'ctx': self.model.ctx
+            'ctx': self.model._ctx.ctx
        })

    def load_cache(self):
        self.model.n_tokens = self.llamacpp_cache['n_tokens']
        self.model.input_ids = self.llamacpp_cache['input_ids']
        self.model.scores = self.llamacpp_cache['scores']
-        self.model.ctx = self.llamacpp_cache['ctx']
+        self.model._ctx.ctx = self.llamacpp_cache['ctx']

    def load_negative_cache(self):
        self.model.n_tokens = self.llamacpp_cache_negative['n_tokens']
        self.model.input_ids = self.llamacpp_cache_negative['input_ids']
        self.model.scores = self.llamacpp_cache_negative['scores']
-        self.model.ctx = self.llamacpp_cache_negative['ctx']
+        self.model._ctx.ctx = self.llamacpp_cache_negative['ctx']

    @property
    def device(self) -> torch.device:
@ -204,7 +204,7 @@ class LlamacppHF(PreTrainedModel):
            'rope_freq_base': RoPE.get_rope_freq_base(shared.args.alpha_value, shared.args.rope_freq_base),
            'tensor_split': tensor_split_list,
            'rope_freq_scale': 1.0 / shared.args.compress_pos_emb,
-            'logits_all': True,
+            'logits_all': shared.args.logits_all,
        }

        Llama = llama_cpp_lib().Llama
--- a/modules/llamacpp_model.py
+++ b/modules/llamacpp_model.py
@ -101,7 +101,7 @@ class LlamaCppModel:

        return self.model.tokenize(string)

-    def decode(self, ids):
+    def decode(self, ids, **kwargs):
        return self.model.detokenize(ids).decode('utf-8')

    def get_logits(self, tokens):
--- a/modules/loaders.py
+++ b/modules/loaders.py
@ -123,6 +123,7 @@ loaders_and_params = OrderedDict({
        'numa',
        'cfg_cache',
        'use_fast',
+        'logits_all',
        'llamacpp_HF_info',
    ],
    'ctransformers': [
--- a/modules/models.py
+++ b/modules/models.py
@ -79,7 +79,7 @@ def load_model(model_name, loader=None):
            loader = metadata['loader']
            if loader is None:
                logger.error('The path to the model does not exist. Exiting.')
-                return None, None
+                raise ValueError

    shared.args.loader = loader
    output = load_func_map[loader](model_name)
--- a/modules/shared.py
+++ b/modules/shared.py
@ -55,6 +55,7 @@ settings = {
    'character': 'Assistant',
    'name1': 'You',
    'instruction_template': 'Alpaca',
+    'custom_system_message': '',
    'chat-instruct_command': 'Continue the chat dialogue below. Write a single reply for the character "<|character|>".\n\n<|prompt|>',
    'autoload_model': False,
    'default_extensions': ['gallery'],
@ -113,6 +114,7 @@ parser.add_argument('--n-gpu-layers', type=int, default=0, help='Number of layer
 parser.add_argument('--tensor_split', type=str, default=None, help='Split the model across multiple GPUs. Comma-separated list of proportions. Example: 18,17.')
 parser.add_argument('--llama_cpp_seed', type=int, default=0, help='Seed for llama-cpp models. Default is 0 (random).')
 parser.add_argument('--numa', action='store_true', help='Activate NUMA task allocation for llama.cpp.')
+parser.add_argument('--logits_all', action='store_true', help='Needs to be set for perplexity evaluation to work. Otherwise, ignore it, as it makes prompt processing slower.')
 parser.add_argument('--cache-capacity', type=str, help='Maximum cache capacity (llama-cpp-python). Examples: 2000MiB, 2GiB. When provided without units, bytes will be assumed.')

 # ExLlama
--- a/modules/text_generation.py
+++ b/modules/text_generation.py
@ -145,7 +145,7 @@ def decode(output_ids, skip_special_tokens=True):
    if shared.tokenizer is None:
        raise ValueError('No tokenizer is loaded')

-    return shared.tokenizer.decode(output_ids, skip_special_tokens)
+    return shared.tokenizer.decode(output_ids, skip_special_tokens=skip_special_tokens)


 def get_encoded_length(prompt):
--- a/modules/ui.py
+++ b/modules/ui.py
@ -87,6 +87,7 @@ def list_model_elements():
        'alpha_value',
        'rope_freq_base',
        'numa',
+        'logits_all',
    ]
    if is_torch_xpu_available():
        for i in range(torch.xpu.device_count()):
@ -156,6 +157,8 @@ def list_interface_input_elements():
        'name1_instruct',
        'name2_instruct',
        'context_instruct',
+        'system_message',
+        'custom_system_message',
        'turn_template',
        'chat_style',
        'chat-instruct_command',
--- a/modules/ui_chat.py
+++ b/modules/ui_chat.py
@ -112,10 +112,12 @@ def create_chat_settings_ui():
                shared.gradio['save_template'] = gr.Button('💾', elem_classes='refresh-button', interactive=not mu)
                shared.gradio['delete_template'] = gr.Button('🗑️ ', elem_classes='refresh-button', interactive=not mu)

-        shared.gradio['name1_instruct'] = gr.Textbox(value='', lines=2, label='User string')
-        shared.gradio['name2_instruct'] = gr.Textbox(value='', lines=1, label='Bot string')
-        shared.gradio['context_instruct'] = gr.Textbox(value='', lines=4, label='Context', elem_classes=['add_scrollbar'])
+        shared.gradio['custom_system_message'] = gr.Textbox(value=shared.settings['custom_system_message'], lines=2, label='Custom system message', info='If not empty, will be used instead of the default one.', elem_classes=['add_scrollbar'])
        shared.gradio['turn_template'] = gr.Textbox(value='', lines=1, label='Turn template', info='Used to precisely define the placement of spaces and new line characters in instruction prompts.', elem_classes=['add_scrollbar'])
+        shared.gradio['name1_instruct'] = gr.Textbox(value='', lines=2, label='User string', info='Replaces <|user|> in the turn template.')
+        shared.gradio['name2_instruct'] = gr.Textbox(value='', lines=1, label='Bot string', info='Replaces <|bot|> in the turn template.')
+        shared.gradio['context_instruct'] = gr.Textbox(value='', lines=4, label='Context', elem_classes=['add_scrollbar'])
+        shared.gradio['system_message'] = gr.Textbox(value='', lines=2, label='System message', info='Replaces <|system-message|> in the context.', elem_classes=['add_scrollbar'])
        with gr.Row():
            shared.gradio['send_instruction_to_default'] = gr.Button('Send to default', elem_classes=['small-button'])
            shared.gradio['send_instruction_to_notebook'] = gr.Button('Send to notebook', elem_classes=['small-button'])
@ -269,7 +271,7 @@ def create_event_handlers():
        lambda: None, None, None, _js=f'() => {{{ui.switch_tabs_js}; switch_to_chat()}}')

    shared.gradio['character_menu'].change(
-        partial(chat.load_character, instruct=False), gradio('character_menu', 'name1', 'name2'), gradio('name1', 'name2', 'character_picture', 'greeting', 'context', 'dummy')).success(
+        partial(chat.load_character, instruct=False), gradio('character_menu', 'name1', 'name2'), gradio('name1', 'name2', 'character_picture', 'greeting', 'context', 'dummy', 'dummy')).success(
        ui.gather_interface_values, gradio(shared.input_elements), gradio('interface_state')).then(
        chat.load_latest_history, gradio('interface_state'), gradio('history')).then(
        chat.redraw_html, gradio(reload_arr), gradio('display')).then(
@ -285,7 +287,7 @@ def create_event_handlers():

    shared.gradio['chat_style'].change(chat.redraw_html, gradio(reload_arr), gradio('display'))
    shared.gradio['instruction_template'].change(
-        partial(chat.load_character, instruct=True), gradio('instruction_template', 'name1_instruct', 'name2_instruct'), gradio('name1_instruct', 'name2_instruct', 'dummy', 'dummy', 'context_instruct', 'turn_template'))
+        partial(chat.load_character, instruct=True), gradio('instruction_template', 'name1_instruct', 'name2_instruct'), gradio('name1_instruct', 'name2_instruct', 'dummy', 'dummy', 'context_instruct', 'turn_template', 'system_message'))

    shared.gradio['Copy last reply'].click(chat.send_last_reply_to_input, gradio('history'), gradio('textbox'), show_progress=False)

@ -299,7 +301,7 @@ def create_event_handlers():
    shared.gradio['save_template'].click(
        lambda: 'My Template.yaml', None, gradio('save_filename')).then(
        lambda: 'instruction-templates/', None, gradio('save_root')).then(
-        chat.generate_instruction_template_yaml, gradio('name1_instruct', 'name2_instruct', 'context_instruct', 'turn_template'), gradio('save_contents')).then(
+        chat.generate_instruction_template_yaml, gradio('name1_instruct', 'name2_instruct', 'context_instruct', 'turn_template', 'system_message'), gradio('save_contents')).then(
        lambda: gr.update(visible=True), None, gradio('file_saver'))

    shared.gradio['delete_template'].click(
--- a/modules/ui_model_menu.py
+++ b/modules/ui_model_menu.py
@ -124,6 +124,7 @@ def create_ui():
                            shared.gradio['llama_cpp_seed'] = gr.Number(label='Seed (0 for random)', value=shared.args.llama_cpp_seed)
                            shared.gradio['trust_remote_code'] = gr.Checkbox(label="trust-remote-code", value=shared.args.trust_remote_code, info='To enable this option, start the web UI with the --trust-remote-code flag. It is necessary for some models.', interactive=shared.args.trust_remote_code)
                            shared.gradio['use_fast'] = gr.Checkbox(label="use_fast", value=shared.args.use_fast, info='Set use_fast=True while loading the tokenizer. May trigger a conversion that takes several minutes.')
+                            shared.gradio['logits_all'] = gr.Checkbox(label="logits_all", value=shared.args.logits_all, info='Needs to be set for perplexity evaluation to work. Otherwise, ignore it, as it makes prompt processing slower.')
                            shared.gradio['use_flash_attention_2'] = gr.Checkbox(label="use_flash_attention_2", value=shared.args.use_flash_attention_2, info='Set use_flash_attention_2=True while loading the model.')
                            shared.gradio['disable_exllama'] = gr.Checkbox(label="disable_exllama", value=shared.args.disable_exllama, info='Disable ExLlama kernel.')
                            shared.gradio['no_flash_attn'] = gr.Checkbox(label="no_flash_attn", value=shared.args.no_flash_attn, info='Force flash-attention to not be used.')
--- a/modules/utils.py
+++ b/modules/utils.py
@ -71,12 +71,12 @@ def natural_keys(text):


 def get_available_models():
-    model_list = ['None']
+    model_list = []
    for item in list(Path(f'{shared.args.model_dir}/').glob('*')):
        if not item.name.endswith(('.txt', '-np', '.pt', '.json', '.yaml', '.py')) and 'llama-tokenizer' not in item.name:
            model_list.append(re.sub('.pth$', '', item.name))

-    return sorted(model_list, key=natural_keys)
+    return ['None'] + sorted(model_list, key=natural_keys)


 def get_available_presets():
@ -119,7 +119,7 @@ def get_available_loras():
 def get_datasets(path: str, ext: str):
    # include subdirectories for raw txt files to allow training from a subdirectory of txt files
    if ext == "txt":
-        return ['None'] + sorted(set([k.stem for k in list(Path(path).glob('txt')) + list(Path(path).glob('*/')) if k.stem != 'put-trainer-datasets-here']), key=natural_keys)
+        return ['None'] + sorted(set([k.stem for k in list(Path(path).glob('*.txt')) + list(Path(path).glob('*/')) if k.stem != 'put-trainer-datasets-here']), key=natural_keys)

    return ['None'] + sorted(set([k.stem for k in Path(path).glob(f'*.{ext}') if k.stem != 'put-trainer-datasets-here']), key=natural_keys)

--- a/requirements.txt
+++ b/requirements.txt
@ -6,9 +6,9 @@ exllamav2==0.0.7; platform_system != "Darwin" and platform_machine != "x86_64"
 gradio==3.50.*
 markdown
 numpy==1.24.*
-optimum==1.13.1
+optimum==1.14.0
 pandas
-peft==0.5.*
+peft==0.6.*
 Pillow>=9.5.0
 pyyaml
 requests
@ -27,14 +27,14 @@ bitsandbytes==0.41.1; platform_system != "Windows"
 https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.1-py3-none-win_amd64.whl; platform_system == "Windows"

 # llama-cpp-python (CPU only, AVX2)
-https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.11/llama_cpp_python-0.2.11-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.11/llama_cpp_python-0.2.11-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
-https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.11/llama_cpp_python-0.2.11-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
-https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.11/llama_cpp_python-0.2.11-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.8"
-https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.11/llama_cpp_python-0.2.11-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.11/llama_cpp_python-0.2.11-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
-https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.11/llama_cpp_python-0.2.11-cp39-cp39-win_amd64.whl; platform_system == "Windows" and python_version == "3.9"
-https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.11/llama_cpp_python-0.2.11-cp38-cp38-win_amd64.whl; platform_system == "Windows" and python_version == "3.8"
+https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.14/llama_cpp_python-0.2.14-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.14/llama_cpp_python-0.2.14-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
+https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.14/llama_cpp_python-0.2.14-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
+https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.14/llama_cpp_python-0.2.14-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.8"
+https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.14/llama_cpp_python-0.2.14-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.14/llama_cpp_python-0.2.14-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
+https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.14/llama_cpp_python-0.2.14-cp39-cp39-win_amd64.whl; platform_system == "Windows" and python_version == "3.9"
+https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.14/llama_cpp_python-0.2.14-cp38-cp38-win_amd64.whl; platform_system == "Windows" and python_version == "3.8"

 # CUDA wheels
 https://github.com/jllllll/AutoGPTQ/releases/download/v0.4.2/auto_gptq-0.4.2+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
@ -67,14 +67,14 @@ https://github.com/Dao-AILab/flash-attention/releases/download/v2.3.2/flash_attn
 https://github.com/Dao-AILab/flash-attention/releases/download/v2.3.2/flash_attn-2.3.2+cu122torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
 https://github.com/Dao-AILab/flash-attention/releases/download/v2.3.2/flash_attn-2.3.2+cu122torch2.1cxx11abiFALSE-cp39-cp39-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
 https://github.com/Dao-AILab/flash-attention/releases/download/v2.3.2/flash_attn-2.3.2+cu122torch2.1cxx11abiFALSE-cp38-cp38-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.8"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.11+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.11+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.11+cu121-cp39-cp39-win_amd64.whl; platform_system == "Windows" and python_version == "3.9"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.11+cu121-cp38-cp38-win_amd64.whl; platform_system == "Windows" and python_version == "3.8"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.11+cu121-cp311-cp311-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.11+cu121-cp310-cp310-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.11+cu121-cp39-cp39-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.11+cu121-cp38-cp38-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.8"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.14+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.14+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.14+cu121-cp39-cp39-win_amd64.whl; platform_system == "Windows" and python_version == "3.9"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.14+cu121-cp38-cp38-win_amd64.whl; platform_system == "Windows" and python_version == "3.8"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.14+cu121-cp311-cp311-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.14+cu121-cp310-cp310-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.14+cu121-cp39-cp39-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.14+cu121-cp38-cp38-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.8"
 https://github.com/jllllll/GPTQ-for-LLaMa-CUDA/releases/download/0.1.1/gptq_for_llama-0.1.1+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
 https://github.com/jllllll/GPTQ-for-LLaMa-CUDA/releases/download/0.1.1/gptq_for_llama-0.1.1+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
 https://github.com/jllllll/GPTQ-for-LLaMa-CUDA/releases/download/0.1.1/gptq_for_llama-0.1.1+cu121-cp39-cp39-win_amd64.whl; platform_system == "Windows" and python_version == "3.9"
--- a/requirements_amd.txt
+++ b/requirements_amd.txt
@ -6,9 +6,9 @@ exllamav2==0.0.7
 gradio==3.50.*
 markdown
 numpy==1.24.*
-optimum==1.13.1
+optimum==1.14.0
 pandas
-peft==0.5.*
+peft==0.6.*
 Pillow>=9.5.0
 pyyaml
 requests
@ -27,14 +27,14 @@ bitsandbytes==0.38.1; platform_system != "Windows"
 https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.38.1-py3-none-win_amd64.whl; platform_system == "Windows"

 # llama-cpp-python (CPU only, AVX2)
-https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.11/llama_cpp_python-0.2.11-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.11/llama_cpp_python-0.2.11-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
-https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.11/llama_cpp_python-0.2.11-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
-https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.11/llama_cpp_python-0.2.11-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.8"
-https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.11/llama_cpp_python-0.2.11-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.11/llama_cpp_python-0.2.11-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
-https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.11/llama_cpp_python-0.2.11-cp39-cp39-win_amd64.whl; platform_system == "Windows" and python_version == "3.9"
-https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.11/llama_cpp_python-0.2.11-cp38-cp38-win_amd64.whl; platform_system == "Windows" and python_version == "3.8"
+https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.14/llama_cpp_python-0.2.14-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.14/llama_cpp_python-0.2.14-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
+https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.14/llama_cpp_python-0.2.14-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
+https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.14/llama_cpp_python-0.2.14-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.8"
+https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.14/llama_cpp_python-0.2.14-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.14/llama_cpp_python-0.2.14-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
+https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.14/llama_cpp_python-0.2.14-cp39-cp39-win_amd64.whl; platform_system == "Windows" and python_version == "3.9"
+https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.14/llama_cpp_python-0.2.14-cp38-cp38-win_amd64.whl; platform_system == "Windows" and python_version == "3.8"

 # AMD wheels
 https://github.com/jllllll/AutoGPTQ/releases/download/v0.4.2/auto_gptq-0.4.2+rocm5.6-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
@ -45,10 +45,10 @@ https://github.com/jllllll/exllama/releases/download/0.0.18/exllama-0.0.18+rocm5
 https://github.com/jllllll/exllama/releases/download/0.0.18/exllama-0.0.18+rocm5.6-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
 https://github.com/jllllll/exllama/releases/download/0.0.18/exllama-0.0.18+rocm5.6-cp39-cp39-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
 https://github.com/jllllll/exllama/releases/download/0.0.18/exllama-0.0.18+rocm5.6-cp38-cp38-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.8"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/rocm/llama_cpp_python_cuda-0.2.11+rocm5.6.1-cp311-cp311-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/rocm/llama_cpp_python_cuda-0.2.11+rocm5.6.1-cp310-cp310-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/rocm/llama_cpp_python_cuda-0.2.11+rocm5.6.1-cp39-cp39-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/rocm/llama_cpp_python_cuda-0.2.11+rocm5.6.1-cp38-cp38-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.8"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/rocm/llama_cpp_python_cuda-0.2.14+rocm5.6.1-cp311-cp311-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/rocm/llama_cpp_python_cuda-0.2.14+rocm5.6.1-cp310-cp310-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/rocm/llama_cpp_python_cuda-0.2.14+rocm5.6.1-cp39-cp39-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/rocm/llama_cpp_python_cuda-0.2.14+rocm5.6.1-cp38-cp38-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.8"
 https://github.com/jllllll/GPTQ-for-LLaMa-CUDA/releases/download/0.1.1/gptq_for_llama-0.1.1+rocm5.6-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
 https://github.com/jllllll/GPTQ-for-LLaMa-CUDA/releases/download/0.1.1/gptq_for_llama-0.1.1+rocm5.6-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
 https://github.com/jllllll/GPTQ-for-LLaMa-CUDA/releases/download/0.1.1/gptq_for_llama-0.1.1+rocm5.6-cp39-cp39-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
--- a/requirements_amd_noavx2.txt
+++ b/requirements_amd_noavx2.txt
@ -6,9 +6,9 @@ exllamav2==0.0.7
 gradio==3.50.*
 markdown
 numpy==1.24.*
-optimum==1.13.1
+optimum==1.14.0
 pandas
-peft==0.5.*
+peft==0.6.*
 Pillow>=9.5.0
 pyyaml
 requests
@ -27,14 +27,14 @@ bitsandbytes==0.38.1; platform_system != "Windows"
 https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.38.1-py3-none-win_amd64.whl; platform_system == "Windows"

 # llama-cpp-python (CPU only, no AVX2)
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.11+cpuavx-cp311-cp311-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.11+cpuavx-cp310-cp310-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.11+cpuavx-cp39-cp39-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.11+cpuavx-cp38-cp38-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.8"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.11+cpuavx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.11+cpuavx-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.11+cpuavx-cp39-cp39-win_amd64.whl; platform_system == "Windows" and python_version == "3.9"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.11+cpuavx-cp38-cp38-win_amd64.whl; platform_system == "Windows" and python_version == "3.8"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.14+cpuavx-cp311-cp311-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.14+cpuavx-cp310-cp310-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.14+cpuavx-cp39-cp39-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.14+cpuavx-cp38-cp38-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.8"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.14+cpuavx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.14+cpuavx-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.14+cpuavx-cp39-cp39-win_amd64.whl; platform_system == "Windows" and python_version == "3.9"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.14+cpuavx-cp38-cp38-win_amd64.whl; platform_system == "Windows" and python_version == "3.8"

 # AMD wheels
 https://github.com/jllllll/AutoGPTQ/releases/download/v0.4.2/auto_gptq-0.4.2+rocm5.6-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
--- a/requirements_apple_intel.txt
+++ b/requirements_apple_intel.txt
@ -6,9 +6,9 @@ exllamav2==0.0.7
 gradio==3.50.*
 markdown
 numpy==1.24.*
-optimum==1.13.1
+optimum==1.14.0
 pandas
-peft==0.5.*
+peft==0.6.*
 Pillow>=9.5.0
 pyyaml
 requests
@ -27,19 +27,19 @@ bitsandbytes==0.41.1; platform_system != "Windows"
 https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.1-py3-none-win_amd64.whl; platform_system == "Windows"

 # Mac wheels
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp311-cp311-macosx_11_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "20.0.0" and platform_release < "21.0.0" and python_version == "3.11"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp310-cp310-macosx_11_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "20.0.0" and platform_release < "21.0.0" and python_version == "3.10"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp39-cp39-macosx_11_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "20.0.0" and platform_release < "21.0.0" and python_version == "3.9"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp38-cp38-macosx_11_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "20.0.0" and platform_release < "21.0.0" and python_version == "3.8"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp311-cp311-macosx_12_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "21.0.0" and platform_release < "22.0.0" and python_version == "3.11"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp310-cp310-macosx_12_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "21.0.0" and platform_release < "22.0.0" and python_version == "3.10"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp39-cp39-macosx_12_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "21.0.0" and platform_release < "22.0.0" and python_version == "3.9"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp38-cp38-macosx_12_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "21.0.0" and platform_release < "22.0.0" and python_version == "3.8"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp311-cp311-macosx_13_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.11"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp310-cp310-macosx_13_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.10"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp39-cp39-macosx_13_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.9"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp38-cp38-macosx_13_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.8"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp311-cp311-macosx_14_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.11"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp310-cp310-macosx_14_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.10"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp39-cp39-macosx_14_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.9"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp38-cp38-macosx_14_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.8"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp311-cp311-macosx_11_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "20.0.0" and platform_release < "21.0.0" and python_version == "3.11"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp310-cp310-macosx_11_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "20.0.0" and platform_release < "21.0.0" and python_version == "3.10"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp39-cp39-macosx_11_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "20.0.0" and platform_release < "21.0.0" and python_version == "3.9"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp38-cp38-macosx_11_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "20.0.0" and platform_release < "21.0.0" and python_version == "3.8"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp311-cp311-macosx_12_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "21.0.0" and platform_release < "22.0.0" and python_version == "3.11"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp310-cp310-macosx_12_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "21.0.0" and platform_release < "22.0.0" and python_version == "3.10"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp39-cp39-macosx_12_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "21.0.0" and platform_release < "22.0.0" and python_version == "3.9"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp38-cp38-macosx_12_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "21.0.0" and platform_release < "22.0.0" and python_version == "3.8"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp311-cp311-macosx_13_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.11"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp310-cp310-macosx_13_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.10"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp39-cp39-macosx_13_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.9"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp38-cp38-macosx_13_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.8"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp311-cp311-macosx_14_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.11"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp310-cp310-macosx_14_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.10"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp39-cp39-macosx_14_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.9"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp38-cp38-macosx_14_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.8"
--- a/requirements_apple_silicon.txt
+++ b/requirements_apple_silicon.txt
@ -6,9 +6,9 @@ exllamav2==0.0.7
 gradio==3.50.*
 markdown
 numpy==1.24.*
-optimum==1.13.1
+optimum==1.14.0
 pandas
-peft==0.5.*
+peft==0.6.*
 Pillow>=9.5.0
 pyyaml
 requests
@ -27,19 +27,19 @@ bitsandbytes==0.41.1; platform_system != "Windows"
 https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.1-py3-none-win_amd64.whl; platform_system == "Windows"

 # Mac wheels
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp311-cp311-macosx_11_0_arm64.whl; platform_system == "Darwin" and platform_release >= "20.0.0" and platform_release < "21.0.0" and python_version == "3.11"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp310-cp310-macosx_11_0_arm64.whl; platform_system == "Darwin" and platform_release >= "20.0.0" and platform_release < "21.0.0" and python_version == "3.10"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp39-cp39-macosx_11_0_arm64.whl; platform_system == "Darwin" and platform_release >= "20.0.0" and platform_release < "21.0.0" and python_version == "3.9"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp38-cp38-macosx_11_0_arm64.whl; platform_system == "Darwin" and platform_release >= "20.0.0" and platform_release < "21.0.0" and python_version == "3.8"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp311-cp311-macosx_12_0_arm64.whl; platform_system == "Darwin" and platform_release >= "21.0.0" and platform_release < "22.0.0" and python_version == "3.11"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp310-cp310-macosx_12_0_arm64.whl; platform_system == "Darwin" and platform_release >= "21.0.0" and platform_release < "22.0.0" and python_version == "3.10"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp39-cp39-macosx_12_0_arm64.whl; platform_system == "Darwin" and platform_release >= "21.0.0" and platform_release < "22.0.0" and python_version == "3.9"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp38-cp38-macosx_12_0_arm64.whl; platform_system == "Darwin" and platform_release >= "21.0.0" and platform_release < "22.0.0" and python_version == "3.8"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp311-cp311-macosx_13_0_arm64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.11"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp310-cp310-macosx_13_0_arm64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.10"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp39-cp39-macosx_13_0_arm64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.9"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp38-cp38-macosx_13_0_arm64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.8"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp311-cp311-macosx_14_0_arm64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.11"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp310-cp310-macosx_14_0_arm64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.10"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp39-cp39-macosx_14_0_arm64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.9"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.11-cp38-cp38-macosx_14_0_arm64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.8"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp311-cp311-macosx_11_0_arm64.whl; platform_system == "Darwin" and platform_release >= "20.0.0" and platform_release < "21.0.0" and python_version == "3.11"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp310-cp310-macosx_11_0_arm64.whl; platform_system == "Darwin" and platform_release >= "20.0.0" and platform_release < "21.0.0" and python_version == "3.10"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp39-cp39-macosx_11_0_arm64.whl; platform_system == "Darwin" and platform_release >= "20.0.0" and platform_release < "21.0.0" and python_version == "3.9"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp38-cp38-macosx_11_0_arm64.whl; platform_system == "Darwin" and platform_release >= "20.0.0" and platform_release < "21.0.0" and python_version == "3.8"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp311-cp311-macosx_12_0_arm64.whl; platform_system == "Darwin" and platform_release >= "21.0.0" and platform_release < "22.0.0" and python_version == "3.11"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp310-cp310-macosx_12_0_arm64.whl; platform_system == "Darwin" and platform_release >= "21.0.0" and platform_release < "22.0.0" and python_version == "3.10"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp39-cp39-macosx_12_0_arm64.whl; platform_system == "Darwin" and platform_release >= "21.0.0" and platform_release < "22.0.0" and python_version == "3.9"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp38-cp38-macosx_12_0_arm64.whl; platform_system == "Darwin" and platform_release >= "21.0.0" and platform_release < "22.0.0" and python_version == "3.8"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp311-cp311-macosx_13_0_arm64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.11"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp310-cp310-macosx_13_0_arm64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.10"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp39-cp39-macosx_13_0_arm64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.9"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp38-cp38-macosx_13_0_arm64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.8"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp311-cp311-macosx_14_0_arm64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.11"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp310-cp310-macosx_14_0_arm64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.10"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp39-cp39-macosx_14_0_arm64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.9"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.14-cp38-cp38-macosx_14_0_arm64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.8"
--- a/requirements_cpu_only.txt
+++ b/requirements_cpu_only.txt
@ -6,9 +6,9 @@ exllamav2==0.0.7
 gradio==3.50.*
 markdown
 numpy==1.24.*
-optimum==1.13.1
+optimum==1.14.0
 pandas
-peft==0.5.*
+peft==0.6.*
 Pillow>=9.5.0
 pyyaml
 requests
@ -27,11 +27,11 @@ bitsandbytes==0.41.1; platform_system != "Windows"
 https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.1-py3-none-win_amd64.whl; platform_system == "Windows"

 # llama-cpp-python (CPU only, AVX2)
-https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.11/llama_cpp_python-0.2.11-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.11/llama_cpp_python-0.2.11-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
-https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.11/llama_cpp_python-0.2.11-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
-https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.11/llama_cpp_python-0.2.11-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.8"
-https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.11/llama_cpp_python-0.2.11-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.11/llama_cpp_python-0.2.11-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
-https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.11/llama_cpp_python-0.2.11-cp39-cp39-win_amd64.whl; platform_system == "Windows" and python_version == "3.9"
-https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.11/llama_cpp_python-0.2.11-cp38-cp38-win_amd64.whl; platform_system == "Windows" and python_version == "3.8"
+https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.14/llama_cpp_python-0.2.14-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.14/llama_cpp_python-0.2.14-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
+https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.14/llama_cpp_python-0.2.14-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
+https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.14/llama_cpp_python-0.2.14-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.8"
+https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.14/llama_cpp_python-0.2.14-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.14/llama_cpp_python-0.2.14-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
+https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.14/llama_cpp_python-0.2.14-cp39-cp39-win_amd64.whl; platform_system == "Windows" and python_version == "3.9"
+https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.14/llama_cpp_python-0.2.14-cp38-cp38-win_amd64.whl; platform_system == "Windows" and python_version == "3.8"
--- a/requirements_cpu_only_noavx2.txt
+++ b/requirements_cpu_only_noavx2.txt
@ -6,9 +6,9 @@ exllamav2==0.0.7
 gradio==3.50.*
 markdown
 numpy==1.24.*
-optimum==1.13.1
+optimum==1.14.0
 pandas
-peft==0.5.*
+peft==0.6.*
 Pillow>=9.5.0
 pyyaml
 requests
@ -27,11 +27,11 @@ bitsandbytes==0.41.1; platform_system != "Windows"
 https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.1-py3-none-win_amd64.whl; platform_system == "Windows"

 # llama-cpp-python (CPU only, no AVX2)
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.11+cpuavx-cp311-cp311-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.11+cpuavx-cp310-cp310-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.11+cpuavx-cp39-cp39-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.11+cpuavx-cp38-cp38-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.8"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.11+cpuavx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.11+cpuavx-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.11+cpuavx-cp39-cp39-win_amd64.whl; platform_system == "Windows" and python_version == "3.9"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.11+cpuavx-cp38-cp38-win_amd64.whl; platform_system == "Windows" and python_version == "3.8"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.14+cpuavx-cp311-cp311-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.14+cpuavx-cp310-cp310-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.14+cpuavx-cp39-cp39-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.14+cpuavx-cp38-cp38-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.8"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.14+cpuavx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.14+cpuavx-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.14+cpuavx-cp39-cp39-win_amd64.whl; platform_system == "Windows" and python_version == "3.9"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.14+cpuavx-cp38-cp38-win_amd64.whl; platform_system == "Windows" and python_version == "3.8"
--- a/requirements_noavx2.txt
+++ b/requirements_noavx2.txt
@ -6,9 +6,9 @@ exllamav2==0.0.7; platform_system != "Darwin" and platform_machine != "x86_64"
 gradio==3.50.*
 markdown
 numpy==1.24.*
-optimum==1.13.1
+optimum==1.14.0
 pandas
-peft==0.5.*
+peft==0.6.*
 Pillow>=9.5.0
 pyyaml
 requests
@ -27,14 +27,14 @@ bitsandbytes==0.41.1; platform_system != "Windows"
 https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.1-py3-none-win_amd64.whl; platform_system == "Windows"

 # llama-cpp-python (CPU only, no AVX2)
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.11+cpuavx-cp311-cp311-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.11+cpuavx-cp310-cp310-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.11+cpuavx-cp39-cp39-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.11+cpuavx-cp38-cp38-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.8"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.11+cpuavx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.11+cpuavx-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.11+cpuavx-cp39-cp39-win_amd64.whl; platform_system == "Windows" and python_version == "3.9"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.11+cpuavx-cp38-cp38-win_amd64.whl; platform_system == "Windows" and python_version == "3.8"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.14+cpuavx-cp311-cp311-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.14+cpuavx-cp310-cp310-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.14+cpuavx-cp39-cp39-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.14+cpuavx-cp38-cp38-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.8"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.14+cpuavx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.14+cpuavx-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.14+cpuavx-cp39-cp39-win_amd64.whl; platform_system == "Windows" and python_version == "3.9"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.14+cpuavx-cp38-cp38-win_amd64.whl; platform_system == "Windows" and python_version == "3.8"

 # CUDA wheels
 https://github.com/jllllll/AutoGPTQ/releases/download/v0.4.2/auto_gptq-0.4.2+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
@ -67,14 +67,14 @@ https://github.com/Dao-AILab/flash-attention/releases/download/v2.3.2/flash_attn
 https://github.com/Dao-AILab/flash-attention/releases/download/v2.3.2/flash_attn-2.3.2+cu122torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
 https://github.com/Dao-AILab/flash-attention/releases/download/v2.3.2/flash_attn-2.3.2+cu122torch2.1cxx11abiFALSE-cp39-cp39-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
 https://github.com/Dao-AILab/flash-attention/releases/download/v2.3.2/flash_attn-2.3.2+cu122torch2.1cxx11abiFALSE-cp38-cp38-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.8"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.11+cu121avx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.11+cu121avx-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.11+cu121avx-cp39-cp39-win_amd64.whl; platform_system == "Windows" and python_version == "3.9"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.11+cu121avx-cp38-cp38-win_amd64.whl; platform_system == "Windows" and python_version == "3.8"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.11+cu121avx-cp311-cp311-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.11+cu121avx-cp310-cp310-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.11+cu121avx-cp39-cp39-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
-https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.11+cu121avx-cp38-cp38-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.8"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.14+cu121avx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.14+cu121avx-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.14+cu121avx-cp39-cp39-win_amd64.whl; platform_system == "Windows" and python_version == "3.9"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.14+cu121avx-cp38-cp38-win_amd64.whl; platform_system == "Windows" and python_version == "3.8"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.14+cu121avx-cp311-cp311-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.14+cu121avx-cp310-cp310-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.14+cu121avx-cp39-cp39-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
+https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.14+cu121avx-cp38-cp38-manylinux_2_31_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.8"
 https://github.com/jllllll/GPTQ-for-LLaMa-CUDA/releases/download/0.1.1/gptq_for_llama-0.1.1+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
 https://github.com/jllllll/GPTQ-for-LLaMa-CUDA/releases/download/0.1.1/gptq_for_llama-0.1.1+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
 https://github.com/jllllll/GPTQ-for-LLaMa-CUDA/releases/download/0.1.1/gptq_for_llama-0.1.1+cu121-cp39-cp39-win_amd64.whl; platform_system == "Windows" and python_version == "3.9"
--- a/requirements_nowheels.txt
+++ b/requirements_nowheels.txt
@ -6,9 +6,9 @@ exllamav2==0.0.7
 gradio==3.50.*
 markdown
 numpy==1.24.*
-optimum==1.13.1
+optimum==1.14.0
 pandas
-peft==0.5.*
+peft==0.6.*
 Pillow>=9.5.0
 pyyaml
 requests
--- a/server.py
+++ b/server.py
@ -216,8 +216,7 @@ if __name__ == "__main__":
            model_name = shared.model_name

        model_settings = get_model_metadata(model_name)
-        shared.settings.update({k: v for k, v in model_settings.items() if k in shared.settings})  # hijacking the interface defaults
-        update_model_parameters(model_settings, initial=True)  # hijacking the command-line arguments
+        update_model_parameters(model_settings, initial=True)  # hijack the command-line arguments

        # Load the model
        shared.model, shared.tokenizer = load_model(model_name)