mirror of
https://github.com/oobabooga/text-generation-webui.git
synced 2025-01-27 04:23:21 +01:00
Merge branch 'oobabooga:dev' into dev
commit
ac2dc37c76
48
README.md
@@ -10,33 +10,29 @@ Its goal is to become the [AUTOMATIC1111/stable-diffusion-webui](https://github.

 ## Features

-* Multiple backends for text generation in a single UI and API, including [Transformers](https://github.com/huggingface/transformers), [llama.cpp](https://github.com/ggerganov/llama.cpp) (through [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)), [ExLlamaV2](https://github.com/turboderp/exllamav2), [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), and [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), [HQQ](https://github.com/mobiusml/hqq), and [AQLM](https://github.com/Vahe1994/AQLM) are also supported through the Transformers loader.
-* OpenAI-compatible API server with Chat and Completions endpoints – see the [examples](https://github.com/oobabooga/text-generation-webui/wiki/12-%E2%80%90-OpenAI-API#examples).
-* Automatic prompt formatting for each model using the Jinja2 template in its metadata.
-* Three chat modes: `instruct`, `chat-instruct`, and `chat`, allowing for both instruction-following and casual conversations with characters. `chat-instruct` mode automatically applies the model's template to the chat prompt, ensuring high-quality outputs without manual setup.
-* "Past chats" menu to quickly switch between conversations and start new ones.
-* Free-form generation in the Default/Notebook tabs without being limited to chat turns. Send formatted chat conversations from the Chat tab to these tabs.
-* Multiple sampling parameters and generation options for sophisticated text generation control.
-* Easy switching between different models through the UI without restarting, using the "Model" tab.
-* Simple LoRA fine-tuning tool to customize models with your data.
-* All in one folder. The requirements are installed in a self-contained `installer_files` folder that doesn't interfere with the system's environment.
-* Extensions support, including numerous built-in and user-contributed extensions. See [the wiki](https://github.com/oobabooga/text-generation-webui/wiki/07-%E2%80%90-Extensions) and [the extensions directory](https://github.com/oobabooga/text-generation-webui-extensions) for details.
+- Supports multiple text generation backends in one UI/API, including [Transformers](https://github.com/huggingface/transformers), [llama.cpp](https://github.com/ggerganov/llama.cpp), and [ExLlamaV2](https://github.com/turboderp/exllamav2). [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), [HQQ](https://github.com/mobiusml/hqq), and [AQLM](https://github.com/Vahe1994/AQLM) are also supported but you need to install them manually.
+- OpenAI-compatible API with Chat and Completions endpoints – see [examples](https://github.com/oobabooga/text-generation-webui/wiki/12-%E2%80%90-OpenAI-API#examples).
+- Automatic prompt formatting using Jinja2 templates.
+- Three chat modes: `instruct`, `chat-instruct`, and `chat`, with automatic prompt templates in `chat-instruct`.
+- "Past chats" menu to quickly switch between conversations.
+- Free-form text generation in the Default/Notebook tabs without being limited to chat turns. You can send formatted conversations from the Chat tab to these.
+- Multiple sampling parameters and generation options for sophisticated text generation control.
+- Switch between different models easily in the UI without restarting.
+- Simple LoRA fine-tuning tool.
+- Requirements installed in a self-contained `installer_files` directory that doesn't interfere with the system environment.
+- Extension support, with numerous built-in and user-contributed extensions available. See the [wiki](https://github.com/oobabooga/text-generation-webui/wiki/07-%E2%80%90-Extensions) and [extensions directory](https://github.com/oobabooga/text-generation-webui-extensions) for details.

 ## How to install

-1) Clone or [download](https://github.com/oobabooga/text-generation-webui/archive/refs/heads/main.zip) the repository.
-2) Run the `start_linux.sh`, `start_windows.bat`, `start_macos.sh`, or `start_wsl.bat` script depending on your OS.
+1) Clone or [download the repository](https://github.com/oobabooga/text-generation-webui/archive/refs/heads/main.zip).
+2) Run the script that matches your OS: `start_linux.sh`, `start_windows.bat`, `start_macos.sh`, or `start_wsl.bat`.
 3) Select your GPU vendor when asked.
 4) Once the installation ends, browse to `http://localhost:7860`.
 5) Have fun!

-To restart the web UI in the future, run the `start_` script again.
+To restart the web UI later, just run the same `start_` script. If you need to reinstall, delete the `installer_files` folder created during setup and run the script again.

-This script creates an `installer_files` folder where it sets up the project's requirements. If you need to reinstall the requirements, just delete that folder and start the web UI again.
-
-The script accepts command-line flags, such as `./start_linux.sh --help`. Alternatively, you can edit the `CMD_FLAGS.txt` file with a text editor and add your flags there, such as `--api` in case you need to use the API.
-
-To get updates in the future, run `update_wizard_linux.sh`, `update_wizard_windows.bat`, `update_wizard_macos.sh`, or `update_wizard_wsl.bat`.
+You can use command-line flags, like `./start_linux.sh --help`, or add them to `CMD_FLAGS.txt` (such as `--api` to enable API use). To update the project, run `update_wizard_linux.sh`, `update_wizard_windows.bat`, `update_wizard_macos.sh`, or `update_wizard_wsl.bat`.

 <details>
 <summary>
@@ -80,12 +76,12 @@ conda activate textgen

 | System | GPU | Command |
 |--------|---------|---------|
-| Linux/WSL | NVIDIA | `pip3 install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu121` |
-| Linux/WSL | CPU only | `pip3 install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cpu` |
-| Linux | AMD | `pip3 install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/rocm5.6` |
-| MacOS + MPS | Any | `pip3 install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2` |
-| Windows | NVIDIA | `pip3 install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu121` |
-| Windows | CPU only | `pip3 install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2` |
+| Linux/WSL | NVIDIA | `pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121` |
+| Linux/WSL | CPU only | `pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cpu` |
+| Linux | AMD | `pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/rocm6.1` |
+| MacOS + MPS | Any | `pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1` |
+| Windows | NVIDIA | `pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121` |
+| Windows | CPU only | `pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1` |

 The up-to-date commands can be found here: https://pytorch.org/get-started/locally/.
@@ -150,7 +146,7 @@ Then browse to
 1) For Kepler GPUs and older, you will need to install CUDA 11.8 instead of 12:

 ```
-pip3 install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu118
+pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu118
 conda install -y -c "nvidia/label/cuda-11.8.0" cuda-runtime
 ```
@@ -38,6 +38,8 @@ class GenerationOptions(BaseModel):
     dry_base: float = 1.75
     dry_allowed_length: int = 2
     dry_sequence_breakers: str = '"\\n", ":", "\\"", "*"'
+    xtc_threshold: float = 0.1
+    xtc_probability: float = 0
     truncation_length: int = 0
     max_tokens_second: int = 0
     prompt_lookup_num_tokens: int = 0
|
||||
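As a rough illustration of how a client might use the two new fields, the sketch below builds a request body that carries `xtc_threshold` and `xtc_probability`. The field names and defaults come from the hunk above; the helper name `build_completion_payload`, the prompt, and the `max_tokens` value are assumptions made for the example, and nothing is sent over the network.

```python
import json


def build_completion_payload(prompt: str,
                             xtc_threshold: float = 0.1,
                             xtc_probability: float = 0.0) -> str:
    """Serialize a completions-style request body that carries the XTC fields.

    Hypothetical helper: only the two xtc_* names and their defaults are taken
    from GenerationOptions; the other keys are illustrative.
    """
    body = {
        "prompt": prompt,
        "max_tokens": 64,
        "xtc_threshold": xtc_threshold,
        "xtc_probability": xtc_probability,
    }
    return json.dumps(body)


payload = build_completion_payload("Once upon a time", xtc_probability=0.5)
print(payload)
```

With the default `xtc_probability` of 0 the sampler stays inactive, so a client has to raise it explicitly to see any effect.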
|
@@ -96,7 +96,7 @@ def ui():
             with gr.Accordion("Settings", open=False):
                 auto_submit = gr.Checkbox(label='Submit the transcribed audio automatically', value=params['auto_submit'])
                 device_dropd = gr.Dropdown(label='Device', value=str(startup_device), choices=["cuda", "cpu", "none"])
-                whisper_model_dropd = gr.Dropdown(label='Whisper Model', value=params['whipser_model'], choices=["tiny.en", "base.en", "small.en", "medium.en", "tiny", "base", "small", "medium", "large"])
+                whisper_model_dropd = gr.Dropdown(label='Whisper Model', value=params['whipser_model'], choices=["tiny.en", "base.en", "small.en", "medium.en", "tiny", "base", "small", "medium", "large", "turbo"])
                 whisper_language = gr.Dropdown(label='Whisper Language', value=params['whipser_language'], choices=["english", "chinese", "german", "spanish", "russian", "korean", "french", "japanese", "portuguese", "turkish", "polish", "catalan", "dutch", "arabic", "swedish", "italian", "indonesian", "hindi", "finnish", "vietnamese", "hebrew", "ukrainian", "greek", "malay", "czech", "romanian", "danish", "hungarian", "tamil", "norwegian", "thai", "urdu", "croatian", "bulgarian", "lithuanian", "latin", "maori", "malayalam", "welsh", "slovak", "telugu", "persian", "latvian", "bengali", "serbian", "azerbaijani", "slovenian", "kannada", "estonian", "macedonian", "breton", "basque", "icelandic", "armenian", "nepali", "mongolian", "bosnian", "kazakh", "albanian", "swahili", "galician", "marathi", "punjabi", "sinhala", "khmer", "shona", "yoruba", "somali", "afrikaans", "occitan", "georgian", "belarusian", "tajik", "sindhi", "gujarati", "amharic", "yiddish", "lao", "uzbek", "faroese", "haitian creole", "pashto", "turkmen", "nynorsk", "maltese", "sanskrit", "luxembourgish", "myanmar", "tibetan", "tagalog", "malagasy", "assamese", "tatar", "hawaiian", "lingala", "hausa", "bashkir", "javanese", "sundanese"])

     audio.change(
11
js/main.js
@@ -600,4 +600,15 @@ headerBar.addEventListener("click", (e) => {
   }
 });

+//------------------------------------------------
+// Add a confirmation dialog when leaving the page
+// Useful to avoid data loss
+//------------------------------------------------
+window.addEventListener("beforeunload", function (event) {
+  // Cancel the event
+  event.preventDefault();
+  // Chrome requires returnValue to be set
+  event.returnValue = "";
+});
+
 moveToChatTab();
@@ -1059,7 +1059,12 @@ def handle_start_new_chat_click(state):

     convert_to_markdown.cache_clear()

-    return [history, html, gr.update(choices=histories, value=histories[0][1])]
+    if len(histories) > 0:
+        past_chats_update = gr.update(choices=histories, value=histories[0][1])
+    else:
+        past_chats_update = gr.update(choices=histories)
+
+    return [history, html, past_chats_update]


 def handle_delete_chat_confirm_click(state):
@@ -1110,10 +1115,15 @@ def handle_upload_chat_history(load_chat_history, state):

     convert_to_markdown.cache_clear()

+    if len(histories) > 0:
+        past_chats_update = gr.update(choices=histories, value=histories[0][1])
+    else:
+        past_chats_update = gr.update(choices=histories)
+
     return [
         history,
         html,
-        gr.update(choices=histories, value=histories[0][1])
+        past_chats_update
     ]


@@ -1132,6 +1142,11 @@ def handle_character_menu_change(state):

     convert_to_markdown.cache_clear()

+    if len(histories) > 0:
+        past_chats_update = gr.update(choices=histories, value=histories[0][1])
+    else:
+        past_chats_update = gr.update(choices=histories)
+
     return [
         history,
         html,
@@ -1140,7 +1155,7 @@ def handle_character_menu_change(state):
         picture,
         greeting,
         context,
-        gr.update(choices=histories, value=histories[0][1]),
+        past_chats_update,
     ]


@@ -1151,12 +1166,17 @@ def handle_mode_change(state):

     convert_to_markdown.cache_clear()

+    if len(histories) > 0:
+        past_chats_update = gr.update(choices=histories, value=histories[0][1])
+    else:
+        past_chats_update = gr.update(choices=histories)
+
     return [
         history,
         html,
         gr.update(visible=state['mode'] != 'instruct'),
         gr.update(visible=state['mode'] == 'chat-instruct'),
-        gr.update(choices=histories, value=histories[0][1])
+        past_chats_update
     ]
|
||||
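The four handlers above repeat the same guard: only index `histories[0][1]` when at least one saved chat exists. A minimal sketch of that guard as a standalone helper is shown below. The helper name `past_chats_update_kwargs` is hypothetical, and it returns a plain dict of keyword arguments as a stand-in for the `gr.update(...)` call so that the example runs without Gradio installed.

```python
def past_chats_update_kwargs(histories):
    """Return the keyword arguments the handlers would pass to gr.update().

    Hypothetical helper for illustration. `histories` is assumed to be a list
    of (label, value) pairs, newest first, as implied by histories[0][1].
    """
    if len(histories) > 0:
        # At least one saved chat: refresh the choices and select the first one.
        return {"choices": histories, "value": histories[0][1]}
    # No saved chats: refresh the choices only, avoiding an IndexError.
    return {"choices": histories}


print(past_chats_update_kwargs([]))
print(past_chats_update_kwargs([("Chat A", "20241001-12-00-00")]))
```

Factoring the guard out this way is only one option; the commit itself keeps the `if`/`else` inline in each handler.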
|
@@ -7,6 +7,7 @@ from exllamav2 import (
     ExLlamaV2Cache,
     ExLlamaV2Cache_8bit,
     ExLlamaV2Cache_Q4,
+    ExLlamaV2Cache_TP,
     ExLlamaV2Config,
     ExLlamaV2Tokenizer
 )
@@ -18,14 +19,6 @@ from modules.text_generation import get_max_prompt_length

 try:
     import flash_attn
-except ModuleNotFoundError:
-    logger.warning(
-        'You are running ExLlamaV2 without flash-attention. This will cause the VRAM usage '
-        'to be a lot higher than it could be.\n'
-        'Try installing flash-attention following the instructions here: '
-        'https://github.com/Dao-AILab/flash-attention#installation-and-features'
-    )
-    pass
 except Exception:
     logger.warning('Failed to load flash-attention due to the following error:\n')
     traceback.print_exc()
@@ -54,21 +47,30 @@ class Exllamav2Model:

         model = ExLlamaV2(config)

-        if not shared.args.autosplit:
-            split = None
-            if shared.args.gpu_split:
-                split = [float(alloc) for alloc in shared.args.gpu_split.split(",")]
+        split = None
+        if shared.args.gpu_split:
+            split = [float(alloc) for alloc in shared.args.gpu_split.split(",")]

+        if shared.args.enable_tp:
+            model.load_tp(split)
+        elif not shared.args.autosplit:
             model.load(split)

+        # Determine the correct cache type
         if shared.args.cache_8bit:
-            cache = ExLlamaV2Cache_8bit(model, lazy=shared.args.autosplit)
+            cache_type = ExLlamaV2Cache_8bit
         elif shared.args.cache_4bit:
-            cache = ExLlamaV2Cache_Q4(model, lazy=shared.args.autosplit)
+            cache_type = ExLlamaV2Cache_Q4
         else:
-            cache = ExLlamaV2Cache(model, lazy=shared.args.autosplit)
+            cache_type = ExLlamaV2Cache

-        if shared.args.autosplit:
+        # Use TP if specified
+        if shared.args.enable_tp:
+            cache = ExLlamaV2Cache_TP(model, base=cache_type)
+        else:
+            cache = cache_type(model, lazy=shared.args.autosplit)
+
+        if shared.args.autosplit and not shared.args.enable_tp:
             model.load_autosplit(cache)

         tokenizer = ExLlamaV2Tokenizer(config)
|
||||
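The new branch order in the hunk above is easier to follow as a decision table. The sketch below replays the same conditions without importing exllamav2 and returns the calls that would be made as strings. The function name `plan_exllamav2_load` and the string representation are assumptions for illustration; the branch conditions are copied from the diff.

```python
def plan_exllamav2_load(enable_tp, autosplit, gpu_split=None,
                        cache_8bit=False, cache_4bit=False):
    """Describe, in order, the calls the new loader code would make.

    Hypothetical helper: it only mirrors the if/elif structure of the diff and
    does not touch a real model.
    """
    split = None
    if gpu_split:
        split = [float(alloc) for alloc in gpu_split.split(",")]

    calls = []
    if enable_tp:
        calls.append(f"model.load_tp({split})")
    elif not autosplit:
        calls.append(f"model.load({split})")

    # Determine the cache class first, then decide how to wrap it.
    if cache_8bit:
        cache_type = "ExLlamaV2Cache_8bit"
    elif cache_4bit:
        cache_type = "ExLlamaV2Cache_Q4"
    else:
        cache_type = "ExLlamaV2Cache"

    if enable_tp:
        calls.append(f"cache = ExLlamaV2Cache_TP(model, base={cache_type})")
    else:
        calls.append(f"cache = {cache_type}(model, lazy={autosplit})")

    # Autosplit loading is skipped whenever tensor parallelism is enabled.
    if autosplit and not enable_tp:
        calls.append("model.load_autosplit(cache)")
    return calls


for step in plan_exllamav2_load(enable_tp=False, autosplit=True, cache_4bit=True):
    print(step)
```

One consequence visible in this sketch is that `enable_tp` takes precedence: with both flags set, the model is loaded through `load_tp` and `load_autosplit` is never called.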
|
@@ -9,6 +9,7 @@ from exllamav2 import (
     ExLlamaV2Cache,
     ExLlamaV2Cache_8bit,
     ExLlamaV2Cache_Q4,
+    ExLlamaV2Cache_TP,
     ExLlamaV2Config
 )
 from torch.nn import CrossEntropyLoss
@@ -20,14 +21,6 @@ from modules.logging_colors import logger

 try:
     import flash_attn
-except ModuleNotFoundError:
-    logger.warning(
-        'You are running ExLlamaV2 without flash-attention. This will cause the VRAM usage '
-        'to be a lot higher than it could be.\n'
-        'Try installing flash-attention following the instructions here: '
-        'https://github.com/Dao-AILab/flash-attention#installation-and-features'
-    )
-    pass
 except Exception:
     logger.warning('Failed to load flash-attention due to the following error:\n')
     traceback.print_exc()
@@ -42,21 +35,30 @@ class Exllamav2HF(PreTrainedModel):

         self.ex_model = ExLlamaV2(config)

-        if not shared.args.autosplit:
-            split = None
-            if shared.args.gpu_split:
-                split = [float(alloc) for alloc in shared.args.gpu_split.split(",")]
+        split = None
+        if shared.args.gpu_split:
+            split = [float(alloc) for alloc in shared.args.gpu_split.split(",")]

+        if shared.args.enable_tp:
+            self.ex_model.load_tp(split)
+        elif not shared.args.autosplit:
             self.ex_model.load(split)

+        # Determine the correct cache type
         if shared.args.cache_8bit:
-            self.ex_cache = ExLlamaV2Cache_8bit(self.ex_model, lazy=shared.args.autosplit)
+            cache_type = ExLlamaV2Cache_8bit
         elif shared.args.cache_4bit:
-            self.ex_cache = ExLlamaV2Cache_Q4(self.ex_model, lazy=shared.args.autosplit)
+            cache_type = ExLlamaV2Cache_Q4
         else:
-            self.ex_cache = ExLlamaV2Cache(self.ex_model, lazy=shared.args.autosplit)
+            cache_type = ExLlamaV2Cache

-        if shared.args.autosplit:
+        # Use TP if specified
+        if shared.args.enable_tp:
+            self.ex_cache = ExLlamaV2Cache_TP(self.ex_model, base=cache_type)
+        else:
+            self.ex_cache = cache_type(self.ex_model, lazy=shared.args.autosplit)
+
+        if shared.args.autosplit and not shared.args.enable_tp:
             self.ex_model.load_autosplit(self.ex_cache)

         self.past_seq = None
@@ -2,12 +2,12 @@ import importlib
 import platform
 from typing import Sequence

+import numpy as np
 from tqdm import tqdm

 from modules import shared
 from modules.cache_utils import process_llamacpp_cache

-
 imported_module = None


@@ -57,8 +57,6 @@ def eval_with_progress(self, tokens: Sequence[int]):

     with tqdm to show prompt processing progress.
     """
-    assert self._ctx.ctx is not None
-    assert self._batch.batch is not None
     self._ctx.kv_cache_seq_rm(-1, self.n_tokens, -1)

     if len(tokens) > self.n_batch:
@@ -80,13 +78,20 @@ def eval_with_progress(self, tokens: Sequence[int]):
         if self.context_params.logits_all:
             rows = n_tokens
             cols = self._n_vocab
-            logits = self._ctx.get_logits()[: rows * cols]
-            self.scores[n_past : n_past + n_tokens, :].reshape(-1)[: :] = logits
+            logits = np.ctypeslib.as_array(
+                self._ctx.get_logits(), shape=(rows * cols,)
+            )
+            self.scores[n_past : n_past + n_tokens, :].reshape(-1)[::] = logits
+            self.last_updated_index = n_past + n_tokens - 1
         else:
             rows = 1
             cols = self._n_vocab
-            logits = self._ctx.get_logits()[: rows * cols]
-            self.scores[n_past + n_tokens - 1, :].reshape(-1)[: :] = logits
+            logits = np.ctypeslib.as_array(
+                self._ctx.get_logits(), shape=(rows * cols,)
+            )
+            last_token_index = min(n_past + n_tokens - 1, self.scores.shape[0] - 1)
+            self.scores[last_token_index, :] = logits.reshape(-1)
+            self.last_updated_index = last_token_index
         # Update n_tokens
         self.n_tokens += n_tokens
|
||||
|
||||
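The hunk above swaps Python slicing of the logits pointer for `np.ctypeslib.as_array` and clamps the row index before writing into `scores`. A small self-contained sketch of both ideas follows. It assumes NumPy is installed; the buffer contents, the sizes, and the variable names other than `last_token_index` are made up for the example and do not come from llama-cpp-python.

```python
import ctypes

import numpy as np

n_vocab = 4

# Stand-in for the C buffer a backend would hand back: a float* of n_vocab values.
raw_buffer = (ctypes.c_float * n_vocab)(0.5, -1.0, 2.0, 0.25)
logits_ptr = ctypes.cast(raw_buffer, ctypes.POINTER(ctypes.c_float))

# View the pointer as a NumPy array without copying. A bare pointer carries no
# length, so the shape has to be given explicitly.
logits = np.ctypeslib.as_array(logits_ptr, shape=(n_vocab,))

# A scores matrix with fewer rows than tokens seen so far (illustrative sizes).
scores = np.zeros((3, n_vocab), dtype=np.single)
n_past, n_tokens = 2, 3

# Clamp the target row so the write never lands past the end of `scores`.
last_token_index = min(n_past + n_tokens - 1, scores.shape[0] - 1)
scores[last_token_index, :] = logits.reshape(-1)

print(last_token_index)
print(scores[last_token_index])
```

Because the array is a view over the C buffer, the assignment into `scores` is what actually copies the values out; recording the row that was written (`last_updated_index` in the diff) lets a later reader fetch the right row even when the index was clamped.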
|
@@ -127,7 +127,7 @@ class LlamacppHF(PreTrainedModel):
                 self.model.reset()
                 self.model.eval(seq)

-            logits = torch.tensor(self.model.scores[self.model.n_tokens - 1, :]).view(1, 1, -1).to(input_ids.device)
+            logits = torch.tensor(self.model.scores[self.model.last_updated_index, :]).view(1, 1, -1).to(input_ids.device)
         else:
             self.model.reset()
             self.model.eval(seq)
@@ -205,5 +205,6 @@ class LlamacppHF(PreTrainedModel):

         Llama = llama_cpp_lib().Llama
         model = Llama(**params)
+        model.last_updated_index = -1

         return LlamacppHF(model, model_file)
@@ -90,6 +90,7 @@ loaders_and_params = OrderedDict({
         'cache_8bit',
         'cache_4bit',
         'autosplit',
+        'enable_tp',
         'alpha_value',
         'compress_pos_emb',
         'trust_remote_code',
@@ -105,6 +106,7 @@ loaders_and_params = OrderedDict({
         'cache_8bit',
         'cache_4bit',
         'autosplit',
+        'enable_tp',
         'alpha_value',
         'compress_pos_emb',
         'exllamav2_info',
@@ -168,6 +170,8 @@ def transformers_samplers():
         'dry_base',
         'dry_allowed_length',
         'dry_sequence_breakers',
+        'xtc_threshold',
+        'xtc_probability',
         'seed',
         'do_sample',
         'penalty_alpha',
@@ -242,6 +246,8 @@ loaders_samplers = {
         'dry_base',
         'dry_allowed_length',
         'dry_sequence_breakers',
+        'xtc_threshold',
+        'xtc_probability',
         'seed',
         'do_sample',
         'mirostat_mode',
@@ -304,6 +310,8 @@ loaders_samplers = {
         'dry_base',
         'dry_allowed_length',
         'dry_sequence_breakers',
+        'xtc_threshold',
+        'xtc_probability',
         'seed',
         'do_sample',
         'mirostat_mode',
@@ -70,11 +70,11 @@ def load_model(model_name, loader=None):
     shared.model_name = model_name
     load_func_map = {
         'Transformers': huggingface_loader,
-        'AutoGPTQ': AutoGPTQ_loader,
         'llama.cpp': llamacpp_loader,
         'llamacpp_HF': llamacpp_HF_loader,
         'ExLlamav2': ExLlamav2_loader,
         'ExLlamav2_HF': ExLlamav2_HF_loader,
+        'AutoGPTQ': AutoGPTQ_loader,
         'HQQ': HQQ_loader,
         'TensorRT-LLM': TensorRT_LLM_loader,
     }
@@ -302,12 +302,6 @@ def llamacpp_HF_loader(model_name):
     return model


-def AutoGPTQ_loader(model_name):
-    import modules.AutoGPTQ_loader
-
-    return modules.AutoGPTQ_loader.load_quantized(model_name)
-
-
 def ExLlamav2_loader(model_name):
     from modules.exllamav2 import Exllamav2Model

@@ -321,9 +315,21 @@ def ExLlamav2_HF_loader(model_name):
     return Exllamav2HF.from_pretrained(model_name)


+def AutoGPTQ_loader(model_name):
+    try:
+        import modules.AutoGPTQ_loader
+    except ModuleNotFoundError:
+        raise ModuleNotFoundError("Failed to import 'autogptq'. Please install it manually following the instructions in the AutoGPTQ GitHub repository.")
+
+    return modules.AutoGPTQ_loader.load_quantized(model_name)
+
+
 def HQQ_loader(model_name):
-    from hqq.core.quantize import HQQBackend, HQQLinear
-    from hqq.models.hf.base import AutoHQQHFModel
+    try:
+        from hqq.core.quantize import HQQBackend, HQQLinear
+        from hqq.models.hf.base import AutoHQQHFModel
+    except ModuleNotFoundError:
+        raise ModuleNotFoundError("Failed to import 'hqq'. Please install it manually following the instructions in the HQQ GitHub repository.")

     logger.info(f"Loading HQQ model with backend: \"{shared.args.hqq_backend}\"")

@@ -334,7 +340,10 @@ def HQQ_loader(model_name):


 def TensorRT_LLM_loader(model_name):
-    from modules.tensorrt_llm import TensorRTLLMModel
+    try:
+        from modules.tensorrt_llm import TensorRTLLMModel
+    except ModuleNotFoundError:
+        raise ModuleNotFoundError("Failed to import 'tensorrt_llm'. Please install it manually following the instructions in the TensorRT-LLM GitHub repository.")

     model = TensorRTLLMModel.from_pretrained(model_name)
     return model
|
||||
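The three loaders above share one pattern: defer the import of an optional backend until it is requested, and turn a missing package into an actionable message. A generic sketch of that pattern is below. The helper name `import_optional` and its parameters are hypothetical; the message wording follows the strings in the diff.

```python
import importlib


def import_optional(module_name: str, project: str):
    """Import an optional backend lazily, with an install hint on failure.

    Hypothetical helper for illustration; the commit inlines the try/except
    in each loader instead of sharing a function.
    """
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError:
        raise ModuleNotFoundError(
            f"Failed to import '{module_name}'. Please install it manually "
            f"following the instructions in the {project} GitHub repository."
        )


# A module from the standard library imports normally.
json_module = import_optional("json", "Python")
print(json_module.dumps({"ok": True}))
```

Catching only `ModuleNotFoundError` keeps other import-time failures (for example a broken native extension) visible with their original traceback.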
|
@@ -44,7 +44,9 @@ def default_preset():
         'dry_base': 1.75,
         'dry_allowed_length': 2,
         'dry_sequence_breakers': '"\\n", ":", "\\"", "*"',
-        'sampler_priority': 'temperature\ndynamic_temperature\nquadratic_sampling\ntop_k\ntop_p\ntypical_p\nepsilon_cutoff\neta_cutoff\ntfs\ntop_a\nmin_p\nmirostat'
+        'xtc_threshold': 0.1,
+        'xtc_probability': 0,
+        'sampler_priority': 'repetition_penalty\npresence_penalty\nfrequency_penalty\ndry\ntemperature\ndynamic_temperature\nquadratic_sampling\ntop_k\ntop_p\ntypical_p\nepsilon_cutoff\neta_cutoff\ntfs\ntop_a\nmin_p\nmirostat\nxtc\nencoder_repetition_penalty\nno_repeat_ngram'
     }
@@ -1,6 +1,7 @@
 import json
 import math
 import pprint
+import random

 import torch
 import transformers
@@ -191,6 +192,53 @@ class TopALogitsWarper(LogitsWarper):
         return scores


+# Exclude Top Choices (XTC)
+class XTCLogitsWarper(LogitsWarper):
+    def __init__(self, threshold: float, probability: float, filter_value: float = -float("Inf")):
+        self.threshold = threshold
+        self.probability = probability
+        self.filter_value = filter_value
+        self.special_token_ids = [
+            shared.tokenizer.encode("\n")[-1],
+        ]
+
+        if shared.tokenizer.eos_token_id is not None:
+            self.special_token_ids.append(shared.tokenizer.eos_token_id)
+
+    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
+        # `random` returns values in the half-open range [0, 1), so setting `probability`
+        # to 0 means the sampler never takes action, while setting it to 1 means the sampler
+        # always takes action.
+        #
+        # Note that while XTC is most intuitively described as "if multiple tokens meet
+        # the threshold, then with probability...", reversing the two conditions is logically
+        # equivalent, and improves performance because processing can immediately be stopped
+        # if the random check fails.
+        if random.random() >= self.probability:
+            return scores
+
+        sorted_logits, sorted_indices = torch.sort(scores, descending=True)
+        probs = sorted_logits.softmax(dim=-1)
+
+        sorted_indices_to_remove = torch.full_like(probs, False, dtype=torch.bool)
+
+        # This operation sets exactly those indices to `True` for which the next index has
+        # probability above the threshold. Since `probs` is sorted, those are the indices
+        # of all tokens that meet the threshold, *except* the least probable one.
+        sorted_indices_to_remove[..., :-1] = probs[..., 1:] >= self.threshold
+
+        # Convert sorted_indices_to_remove to the original indices
+        indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
+
+        # If newline or EOS tokens would be removed, return the original scores
+        if indices_to_remove[:, self.special_token_ids].any():
+            return scores
+
+        # Otherwise, remove tokens with the mask
+        scores = scores.masked_fill(indices_to_remove, self.filter_value)
+        return scores
+
+
 class DRYLogitsProcessor(LogitsProcessor):
     def __init__(self, multiplier: float, base: float, allowed_length: int, sequence_breakers: set[int], _range: int):
         self.multiplier = multiplier
|
||||
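To make the masking rule in `XTCLogitsWarper` concrete, here is a dependency-free sketch that applies the same idea to a single row of logits held in a plain list. It is not the code from the commit: the function name `xtc_filter`, the `protected` parameter (standing in for the newline and EOS token ids), and the list-based softmax are assumptions chosen so the example runs without torch.

```python
import math
import random


def xtc_filter(logits, threshold, probability, protected=(), rng=random):
    """Apply an XTC-style filter to one row of logits and return a new list.

    Hypothetical single-row re-statement of the warper above.
    """
    # Random gate first: with probability 0 the filter never acts.
    if rng.random() >= probability:
        return list(logits)

    # Numerically stable softmax over the row.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Token ids that meet the threshold, most probable first.
    qualifying = sorted(
        (i for i, p in enumerate(probs) if p >= threshold),
        key=lambda i: probs[i],
        reverse=True,
    )

    # Remove every qualifying token except the least probable one.
    to_remove = set(qualifying[:-1])

    # Leave the row untouched if a protected token would be removed.
    if any(i in protected for i in to_remove):
        return list(logits)

    return [-math.inf if i in to_remove else x for i, x in enumerate(logits)]


row = [2.0, 1.5, 0.0, -1.0]
print(xtc_filter(row, threshold=0.1, probability=1.0))
```

In this row only tokens 0 and 1 have probability of at least 0.1, so token 0 is masked and token 1, the least probable qualifying token, survives. With a threshold that only one token meets, nothing is removed.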
@@ -323,62 +371,141 @@ class SpyLogitsWarper(LogitsWarper):
|
||||
|
||||
class RepetitionPenaltyLogitsProcessorWithRange(LogitsProcessor):
|
||||
'''
|
||||
Copied from the transformers library
|
||||
'''
|
||||
|
||||
def __init__(self, penalty: float, presence_penalty: float, frequency_penalty: float, _range: int):
|
||||
def __init__(self, penalty: float, _range: int):
|
||||
if not (penalty > 0):
|
||||
raise ValueError(f"`penalty` has to be strictly positive, but is {penalty}")
|
||||
|
||||
self.penalty = penalty
|
||||
self.presence_penalty = presence_penalty
|
||||
self.frequency_penalty = frequency_penalty
|
||||
self._range = _range
|
||||
|
||||
def apply_repetition_penalty(self, input_ids_row, scores_row):
|
||||
unique_ids = torch.unique(input_ids_row)
|
||||
score = torch.gather(scores_row, 0, unique_ids)
|
||||
|
||||
# Apply multiplicative repetition penalty
|
||||
score = torch.where(score < 0, score * self.penalty, score / self.penalty)
|
||||
scores_row.scatter_(0, unique_ids, score)
|
||||
return scores_row
|
||||
|
||||
def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
|
||||
input_ids = input_ids[:, -self._range:]
|
||||
|
||||
# We loop here because torch.unique() needs to process each row separately in the
|
||||
# case that batch_size > 1.
|
||||
for input_ids_row, scores_row in zip(input_ids, scores):
|
||||
unique_ids, counts = torch.unique(input_ids_row, return_counts=True)
|
||||
score = torch.gather(scores_row, 0, unique_ids)
|
||||
|
||||
# multiplicative repetition penalty
|
||||
# if score < 0 then repetition penalty has to be multiplied to reduce the previous token probability
|
||||
score = torch.where(score < 0, score * self.penalty, score / self.penalty)
|
||||
scores_row.scatter_(0, unique_ids, score)
|
||||
|
||||
# presence_penalty and frequency_penalty
|
||||
raw_presence_penalty = (counts > 0).to(scores.dtype)
|
||||
raw_frequency_penalty = counts.to(scores.dtype)
|
||||
additive_penalty = raw_presence_penalty * self.presence_penalty + raw_frequency_penalty * self.frequency_penalty
|
||||
scores_row.scatter_add_(0, unique_ids, -additive_penalty)
|
||||
scores_row = self.apply_repetition_penalty(input_ids_row, scores_row)
|
||||
|
||||
return scores
|
||||
|
||||
|
||||
def get_logits_warper_patch(self, generation_config, **kwargs):
|
||||
+class PresencePenaltyLogitsProcessor(LogitsProcessor):
+    def __init__(self, presence_penalty: float, _range: int):
+        self.presence_penalty = presence_penalty
+        self._range = _range
+
+    def apply_presence_penalty(self, input_ids_row, scores_row):
+        unique_ids, counts = torch.unique(input_ids_row, return_counts=True)
+
+        # Apply presence penalty
+        raw_presence_penalty = (counts > 0).to(scores_row.dtype)
+        presence_penalty = raw_presence_penalty * self.presence_penalty
+        scores_row.scatter_add_(0, unique_ids, -presence_penalty)
+        return scores_row
+
+    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
+        input_ids = input_ids[:, -self._range:]
+        for input_ids_row, scores_row in zip(input_ids, scores):
+            scores_row = self.apply_presence_penalty(input_ids_row, scores_row)
+        return scores
+
+
+class FrequencyPenaltyLogitsProcessor(LogitsProcessor):
+    def __init__(self, frequency_penalty: float, _range: int):
+        self.frequency_penalty = frequency_penalty
+        self._range = _range
+
+    def apply_frequency_penalty(self, input_ids_row, scores_row):
+        unique_ids, counts = torch.unique(input_ids_row, return_counts=True)
+
+        # Apply frequency penalty
+        raw_frequency_penalty = counts.to(scores_row.dtype)
+        frequency_penalty = raw_frequency_penalty * self.frequency_penalty
+        scores_row.scatter_add_(0, unique_ids, -frequency_penalty)
+        return scores_row
+
+    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
+        input_ids = input_ids[:, -self._range:]
+        for input_ids_row, scores_row in zip(input_ids, scores):
+            scores_row = self.apply_frequency_penalty(input_ids_row, scores_row)
+        return scores
|
||||
|
||||
|
||||
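The classes above split what used to be one combined processor into three independent ones. A small sketch of the arithmetic each one performs, written over plain lists so it runs without torch, may help when reading them. The three function names and the list representation are assumptions for illustration; the formulas follow the diff: multiply or divide for the repetition penalty, subtract a constant for presence, subtract a count-scaled amount for frequency.

```python
from collections import Counter


def repetition_penalty(scores, recent_ids, penalty):
    """Multiplicative penalty on every token id seen in `recent_ids`."""
    out = list(scores)
    for tok in set(recent_ids):
        # Negative scores are multiplied and positive ones divided, so the
        # token becomes less likely in both cases.
        out[tok] = out[tok] * penalty if out[tok] < 0 else out[tok] / penalty
    return out


def presence_penalty(scores, recent_ids, penalty):
    """Flat subtraction for every token id that appeared at least once."""
    out = list(scores)
    for tok in set(recent_ids):
        out[tok] -= penalty
    return out


def frequency_penalty(scores, recent_ids, penalty):
    """Subtraction that grows with how often each token id appeared."""
    out = list(scores)
    for tok, count in Counter(recent_ids).items():
        out[tok] -= penalty * count
    return out


scores = [2.0, -1.0, 0.5]
recent = [0, 0, 1]  # token 0 seen twice, token 1 once, token 2 never

step1 = repetition_penalty(scores, recent, 2.0)
step2 = presence_penalty(step1, recent, 0.5)
step3 = frequency_penalty(step2, recent, 0.25)
print(step1, step2, step3)
```

The sketch chains the three steps in one fixed order (repetition, presence, frequency). Keeping them as separate functions, like the separate processor classes in the diff, is what allows a caller to reorder or skip them individually.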
def get_logits_processor_patch(self, **kwargs):
|
||||
generation_config = kwargs['generation_config']
|
||||
|
||||
# Parameter sanitization
|
||||
if isinstance(generation_config.temperature, int):
|
||||
generation_config.temperature = float(generation_config.temperature) # Must be float
|
||||
|
||||
# Get the original warpers
|
||||
warpers = self._get_logits_warper_old(generation_config, **kwargs)
|
||||
warpers = self._get_logits_processor_old(**kwargs)
|
||||
|
||||
# Replace temperature with our modified class.
|
||||
# Currently, it behaves identically to the original.
|
||||
for i in range(len(warpers)):
|
||||
for i in range(len(warpers) - 1, -1, -1):
|
||||
# Replace temperature with our modified class.
|
||||
if warpers[i].__class__.__name__ == 'TemperatureLogitsWarper':
|
||||
warpers[i] = TemperatureLogitsWarperCustom(
|
||||
generation_config.temperature,
|
||||
)
|
||||
|
||||
# Stuff we don't need
|
||||
elif warpers[i].__class__.__name__ in ['SuppressTokensLogitsProcessor', 'RepetitionPenaltyLogitsProcessor']:
|
||||
del warpers[i]
|
||||
|
||||
# Add custom warpers
|
||||
warpers_to_add = LogitsProcessorList()
|
||||
min_tokens_to_keep = 2 if generation_config.num_beams > 1 else 1
|
||||
|
||||
if generation_config.repetition_penalty is not None and generation_config.repetition_penalty != 1.0:
|
||||
warpers_to_add.append(
|
||||
RepetitionPenaltyLogitsProcessorWithRange(
|
||||
penalty=generation_config.repetition_penalty,
|
||||
_range=generation_config.repetition_penalty_range
|
||||
)
|
||||
)
|
||||
|
||||
if generation_config.presence_penalty is not None and generation_config.presence_penalty != 0.0:
|
||||
warpers_to_add.append(
|
||||
PresencePenaltyLogitsProcessor(
|
||||
presence_penalty=generation_config.presence_penalty,
|
||||
_range=generation_config.repetition_penalty_range
|
||||
)
|
||||
)
|
||||
|
||||
if generation_config.frequency_penalty is not None and generation_config.frequency_penalty != 0.0:
|
||||
warpers_to_add.append(
|
||||
FrequencyPenaltyLogitsProcessor(
|
||||
frequency_penalty=generation_config.frequency_penalty,
|
||||
_range=generation_config.repetition_penalty_range
|
||||
)
|
||||
)
|
||||
|
||||
if generation_config.dry_multiplier is not None and generation_config.dry_multiplier > 0.0:
|
||||
dry_sequence_breakers = generation_config.dry_sequence_breakers
|
||||
|
||||
# Support both JSON array notation and comma-separated strings.
|
||||
if not dry_sequence_breakers.startswith("["):
|
||||
dry_sequence_breakers = "[" + dry_sequence_breakers + "]"
|
||||
|
||||
sequence_breaker_strings = json.loads(dry_sequence_breakers)
|
||||
# Prefix with 'a' to get the correct encoding of the token at the end of a text.
|
||||
sequence_breakers = {shared.tokenizer.encode(f'a{s}')[-1] for s in sequence_breaker_strings}
|
||||
|
||||
warpers.append(
|
||||
DRYLogitsProcessor(
|
||||
multiplier=generation_config.dry_multiplier,
|
||||
base=generation_config.dry_base,
|
||||
allowed_length=generation_config.dry_allowed_length,
|
||||
sequence_breakers=sequence_breakers,
|
||||
_range=generation_config.repetition_penalty_range,
|
||||
)
|
||||
)
|
||||
|
||||
if generation_config.tfs is not None and 0.0 <= generation_config.tfs < 1.0:
|
||||
warpers_to_add.append(
|
||||
TailFreeLogitsWarper(
|
||||
@ -395,6 +522,14 @@ def get_logits_warper_patch(self, generation_config, **kwargs):
|
||||
)
|
||||
)
|
||||
|
||||
    if generation_config.xtc_probability is not None and generation_config.xtc_probability > 0:
        warpers_to_add.append(
            XTCLogitsWarper(
                threshold=generation_config.xtc_threshold,
                probability=generation_config.xtc_probability,
            )
        )
|
||||
|
||||
if generation_config.dynamic_temperature:
|
||||
warpers_to_add.append(
|
||||
DynamicTemperatureLogitsWarper(
|
||||
@ -436,11 +571,10 @@ def get_logits_warper_patch(self, generation_config, **kwargs):
|
||||
if generation_config.temperature_last:
|
||||
for param_name in ['temperature', 'dynamic_temperature', 'quadratic_sampling']:
|
||||
if param_name in sampler_priority:
|
||||
if param_name in sampler_priority:
|
||||
index = sampler_priority.index(param_name)
|
||||
sampler_priority.append(sampler_priority.pop(index))
|
||||
else:
|
||||
sampler_priority.append(param_name)
|
||||
index = sampler_priority.index(param_name)
|
||||
sampler_priority.append(sampler_priority.pop(index))
|
||||
else:
|
||||
sampler_priority.append(param_name)
|
||||
|
||||
class_name_to_nickname = {
|
||||
'DynamicTemperatureLogitsWarper': 'dynamic_temperature',
|
||||
@ -454,17 +588,23 @@ def get_logits_warper_patch(self, generation_config, **kwargs):
|
||||
'TopALogitsWarper': 'top_a',
|
||||
'TopKLogitsWarper': 'top_k',
|
||||
'TopPLogitsWarper': 'top_p',
|
||||
'TypicalLogitsWarper': 'typical_p'
|
||||
'TypicalLogitsWarper': 'typical_p',
|
||||
'XTCLogitsWarper': 'xtc',
|
||||
'RepetitionPenaltyLogitsProcessorWithRange': 'repetition_penalty',
|
||||
'PresencePenaltyLogitsProcessor': 'presence_penalty',
|
||||
'FrequencyPenaltyLogitsProcessor': 'frequency_penalty',
|
||||
'DRYLogitsProcessor': 'dry',
|
||||
'EncoderRepetitionPenaltyLogitsProcessor': 'encoder_repetition_penalty',
|
||||
'NoRepeatNGramLogitsProcessor': 'no_repeat_ngram',
|
||||
}
|
||||
|
||||
def custom_sort_key(obj):
|
||||
class_name = obj.__class__.__name__
|
||||
|
||||
# Return a large value if class name is not mapped or if the mapped nickname is not in priority
|
||||
# Return -1 if class_name is not mapped
|
||||
if class_name not in class_name_to_nickname or class_name_to_nickname[class_name] not in sampler_priority:
|
||||
return float('inf')
|
||||
return -1
|
||||
|
||||
# Return the index of the nickname in the priority list for sorting
|
||||
return sampler_priority.index(class_name_to_nickname[class_name])
|
||||
|
||||
# Sort the list using the custom key function
|
||||
@ -482,49 +622,6 @@ def get_logits_warper_patch(self, generation_config, **kwargs):
|
||||
return warpers
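
Reviewer note: the ordering logic in this hunk can be tried in isolation. The sketch below is an illustration under assumptions, not the patched function: the three classes and the two-entry mapping are stand-ins, and it uses the `-1` fallback shown in the diff (the other fallback visible in the hunk is `float('inf')`, which would sort unmapped processors last instead of first).

```python
class Temperature:
    pass


class TopK:
    pass


class Unmapped:
    pass


# Hypothetical, shortened version of the class-name-to-nickname table.
CLASS_NAME_TO_NICKNAME = {'Temperature': 'temperature', 'TopK': 'top_k'}


def move_to_end(sampler_priority, names):
    """Mimic temperature_last: push the given nicknames to the end."""
    priority = list(sampler_priority)
    for name in names:
        if name in priority:
            priority.append(priority.pop(priority.index(name)))
        else:
            priority.append(name)
    return priority


def sort_by_priority(processors, sampler_priority):
    def key(obj):
        nickname = CLASS_NAME_TO_NICKNAME.get(obj.__class__.__name__)
        if nickname is None or nickname not in sampler_priority:
            return -1  # unmapped processors run before everything else
        return sampler_priority.index(nickname)

    # sorted() is stable, so processors with equal keys keep their order.
    return sorted(processors, key=key)


processors = [Temperature(), Unmapped(), TopK()]
priority = ['temperature', 'top_k']
print([p.__class__.__name__ for p in sort_by_priority(processors, priority)])
late = move_to_end(priority, ['temperature'])
print([p.__class__.__name__ for p in sort_by_priority(processors, late)])
```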
|
||||
|
||||
|
||||
def get_logits_processor_patch(self, **kwargs):
|
||||
generation_config = kwargs['generation_config']
|
||||
|
||||
do_rep_pen_hijack = (generation_config.repetition_penalty > 1) or (generation_config.presence_penalty != 0) or (generation_config.frequency_penalty != 0)
|
||||
if do_rep_pen_hijack:
|
||||
generation_config.repetition_penalty = 1.1 # Set to value > 1 to ensure RepetitionPenaltyLogitsProcessor is created
|
||||
|
||||
result = self._get_logits_processor_old(**kwargs)
|
||||
|
||||
if do_rep_pen_hijack:
|
||||
for i in range(len(result)):
|
||||
if result[i].__class__.__name__ == 'RepetitionPenaltyLogitsProcessor':
|
||||
result[i] = RepetitionPenaltyLogitsProcessorWithRange(
|
||||
generation_config.repetition_penalty,
|
||||
generation_config.presence_penalty,
|
||||
generation_config.frequency_penalty,
|
||||
generation_config.repetition_penalty_range
|
||||
)
|
||||
|
||||
if generation_config.dry_multiplier is not None and generation_config.dry_multiplier > 0.0:
|
||||
dry_sequence_breakers = generation_config.dry_sequence_breakers
|
||||
|
||||
# Support both JSON array notation and comma-separated strings.
|
||||
if not dry_sequence_breakers.startswith("["):
|
||||
dry_sequence_breakers = "[" + dry_sequence_breakers + "]"
|
||||
|
||||
sequence_breaker_strings = json.loads(dry_sequence_breakers)
|
||||
# Prefix with 'a' to get the correct encoding of the token at the end of a text.
|
||||
sequence_breakers = {shared.tokenizer.encode(f'a{s}')[-1] for s in sequence_breaker_strings}
|
||||
|
||||
result.append(
|
||||
DRYLogitsProcessor(
|
||||
multiplier=generation_config.dry_multiplier,
|
||||
base=generation_config.dry_base,
|
||||
allowed_length=generation_config.dry_allowed_length,
|
||||
sequence_breakers=sequence_breakers,
|
||||
_range=generation_config.repetition_penalty_range,
|
||||
)
|
||||
)
|
||||
|
||||
return result
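
Reviewer note: the `dry_sequence_breakers` handling accepts either a JSON array or a bare comma-separated list of quoted strings, by wrapping the latter in brackets before parsing. Here is a small standalone sketch of just that normalization. `parse_sequence_breakers` is a hypothetical helper name, and the later step that maps each string to a token id through the tokenizer is not covered.

```python
import json


def parse_sequence_breakers(raw):
    """Parse '"a", "b"' or '["a", "b"]' into a list of strings."""
    # A bare comma-separated list becomes valid JSON once bracketed.
    if not raw.startswith("["):
        raw = "[" + raw + "]"
    return json.loads(raw)


# The default value from generation_config_init_patch.
default = '"\\n", ":", "\\"", "*"'
print(parse_sequence_breakers(default))
print(parse_sequence_breakers('["a", "b"]'))
```

Input that is not valid JSON after bracketing, such as unquoted words, raises `json.JSONDecodeError`.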
|
||||
|
||||
|
||||
def generation_config_init_patch(self, **kwargs):
|
||||
self.__init___old(**kwargs)
|
||||
self.min_p = kwargs.pop("min_p", 0.0)
|
||||
@ -546,14 +643,13 @@ def generation_config_init_patch(self, **kwargs):
|
||||
self.dry_base = kwargs.pop("dry_base", 1.75)
|
||||
self.dry_allowed_length = kwargs.pop("dry_allowed_length", 2)
|
||||
self.dry_sequence_breakers = kwargs.pop("dry_sequence_breakers", '"\\n", ":", "\\"", "*"')
|
||||
self.xtc_threshold = kwargs.pop("xtc_threshold", 0.1)
|
||||
self.xtc_probability = kwargs.pop("xtc_probability", 0)
|
||||
self.temperature_last = kwargs.pop("temperature_last", False)
|
||||
self.sampler_priority = kwargs.pop("sampler_priority", ['temperature', 'dynamic_temperature', 'quadratic_sampling', 'top_k', 'top_p', 'typical_p', 'epsilon_cutoff', 'eta_cutoff', 'tfs', 'top_a', 'min_p', 'mirostat'])
|
||||
self.sampler_priority = kwargs.pop("sampler_priority", ['repetition_penalty', 'presence_penalty', 'frequency_penalty', 'dry', 'temperature', 'dynamic_temperature', 'quadratic_sampling', 'top_k', 'top_p', 'typical_p', 'epsilon_cutoff', 'eta_cutoff', 'tfs', 'top_a', 'min_p', 'mirostat', 'xtc', 'encoder_repetition_penalty', 'no_repeat_ngram'])
|
||||
|
||||
|
||||
def hijack_samplers():
|
||||
transformers.GenerationMixin._get_logits_warper_old = transformers.GenerationMixin._get_logits_warper
|
||||
transformers.GenerationMixin._get_logits_warper = get_logits_warper_patch
|
||||
|
||||
transformers.GenerationMixin._get_logits_processor_old = transformers.GenerationMixin._get_logits_processor
|
||||
transformers.GenerationMixin._get_logits_processor = get_logits_processor_patch
|
||||
|
||||
|
@ -146,6 +146,7 @@ group.add_argument('--no_sdpa', action='store_true', help='Force Torch SDPA to n
|
||||
group.add_argument('--cache_8bit', action='store_true', help='Use 8-bit cache to save VRAM.')
|
||||
group.add_argument('--cache_4bit', action='store_true', help='Use Q4 cache to save VRAM.')
|
||||
group.add_argument('--num_experts_per_token', type=int, default=2, help='Number of experts to use for generation. Applies to MoE models like Mixtral.')
|
||||
group.add_argument('--enable_tp', action='store_true', help='Enable Tensor Parallelism (TP) in ExLlamaV2.')
|
||||
|
||||
# AutoGPTQ
|
||||
group = parser.add_argument_group('AutoGPTQ')
|
||||
|
@ -274,10 +274,10 @@ def get_reply_from_output_ids(output_ids, state=None, starting_from=0):
|
||||
if (hasattr(shared.tokenizer, 'convert_ids_to_tokens') and len(output_ids) > starting_from) and not reply.startswith(' '):
|
||||
first_token = shared.tokenizer.convert_ids_to_tokens(int(output_ids[starting_from]))
|
||||
if isinstance(first_token, (bytes,)):
|
||||
#try to decode the bytes to a string
|
||||
# try to decode the bytes to a string
|
||||
# if it fails, which means it's not a string in this turn, just ignore it
|
||||
try:
|
||||
first_token = first_token.decode('utf8')
|
||||
#if it fails, which means it's not a string in this turn, just ignore it
|
||||
except UnicodeDecodeError:
|
||||
first_token = ''
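
Reviewer note: the fallback here matters because a byte-level token can hold only part of a multi-byte UTF-8 character. A minimal sketch of the same decode-or-ignore pattern follows, with `token_to_text` as a hypothetical helper name:

```python
def token_to_text(token):
    """Return the token as text, or '' if its bytes are not valid UTF-8 alone."""
    if isinstance(token, bytes):
        try:
            return token.decode('utf8')
        except UnicodeDecodeError:
            # Incomplete multi-byte sequence: nothing printable yet.
            return ''
    return token


print(repr(token_to_text(b'hi')))
print(repr(token_to_text(b'\xe4')))  # first byte of a 3-byte character
```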
|
||||
|
||||
@ -289,7 +289,7 @@ def get_reply_from_output_ids(output_ids, state=None, starting_from=0):
|
||||
|
||||
def generate_reply_HF(question, original_question, seed, state, stopping_strings=None, is_chat=False):
|
||||
generate_params = {}
|
||||
for k in ['max_new_tokens', 'temperature', 'temperature_last', 'dynamic_temperature', 'dynatemp_low', 'dynatemp_high', 'dynatemp_exponent', 'smoothing_factor', 'smoothing_curve', 'top_p', 'min_p', 'top_k', 'repetition_penalty', 'presence_penalty', 'frequency_penalty', 'repetition_penalty_range', 'typical_p', 'tfs', 'top_a', 'guidance_scale', 'penalty_alpha', 'mirostat_mode', 'mirostat_tau', 'mirostat_eta', 'do_sample', 'encoder_repetition_penalty', 'no_repeat_ngram_size', 'dry_multiplier', 'dry_base', 'dry_allowed_length', 'dry_sequence_breakers']:
|
||||
for k in ['max_new_tokens', 'temperature', 'temperature_last', 'dynamic_temperature', 'dynatemp_low', 'dynatemp_high', 'dynatemp_exponent', 'smoothing_factor', 'smoothing_curve', 'top_p', 'min_p', 'top_k', 'repetition_penalty', 'presence_penalty', 'frequency_penalty', 'repetition_penalty_range', 'typical_p', 'tfs', 'top_a', 'guidance_scale', 'penalty_alpha', 'mirostat_mode', 'mirostat_tau', 'mirostat_eta', 'do_sample', 'encoder_repetition_penalty', 'no_repeat_ngram_size', 'dry_multiplier', 'dry_base', 'dry_allowed_length', 'dry_sequence_breakers', 'xtc_threshold', 'xtc_probability']:
|
||||
if k in state:
|
||||
generate_params[k] = state[k]
|
||||
|
||||
|
@ -90,6 +90,7 @@ def list_model_elements():
|
||||
'cache_8bit',
|
||||
'cache_4bit',
|
||||
'autosplit',
|
||||
'enable_tp',
|
||||
'threads',
|
||||
'threads_batch',
|
||||
'n_batch',
|
||||
@ -158,6 +159,8 @@ def list_interface_input_elements():
|
||||
'dry_base',
|
||||
'dry_allowed_length',
|
||||
'dry_sequence_breakers',
|
||||
'xtc_threshold',
|
||||
'xtc_probability',
|
||||
'do_sample',
|
||||
'penalty_alpha',
|
||||
'mirostat_mode',
|
||||
|
@ -74,7 +74,7 @@ def handle_save_preset_confirm_click(filename, contents):
|
||||
try:
|
||||
utils.save_file(f"presets/{filename}.yaml", contents)
|
||||
available_presets = utils.get_available_presets()
|
||||
output = gr.update(choices=available_presets, value=filename),
|
||||
output = gr.update(choices=available_presets, value=filename)
|
||||
except Exception:
|
||||
output = gr.update()
|
||||
traceback.print_exc()
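
Reviewer note: the only change in this hunk is the removal of a trailing comma after `gr.update(...)`. In Python a trailing comma on the right-hand side of an assignment builds a one-element tuple, so the old line presumably handed a tuple to the caller instead of the update object. The effect can be reproduced without Gradio; the dict below is a stand-in for the real return value.

```python
def build_output(with_trailing_comma):
    update = {"choices": ["a", "b"], "value": "a"}  # stand-in for gr.update(...)
    if with_trailing_comma:
        output = update,  # trailing comma: this is a 1-tuple
    else:
        output = update
    return output


print(type(build_output(True)).__name__)
print(type(build_output(False)).__name__)
```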
|
||||
|
@ -136,6 +136,7 @@ def create_ui():
|
||||
shared.gradio['disk'] = gr.Checkbox(label="disk", value=shared.args.disk)
|
||||
shared.gradio['bf16'] = gr.Checkbox(label="bf16", value=shared.args.bf16)
|
||||
shared.gradio['autosplit'] = gr.Checkbox(label="autosplit", value=shared.args.autosplit, info='Automatically split the model tensors across the available GPUs.')
|
||||
shared.gradio['enable_tp'] = gr.Checkbox(label="enable_tp", value=shared.args.enable_tp, info='Enable Tensor Parallelism (TP).')
|
||||
shared.gradio['no_flash_attn'] = gr.Checkbox(label="no_flash_attn", value=shared.args.no_flash_attn)
|
||||
shared.gradio['no_xformers'] = gr.Checkbox(label="no_xformers", value=shared.args.no_xformers)
|
||||
shared.gradio['no_sdpa'] = gr.Checkbox(label="no_sdpa", value=shared.args.no_sdpa)
|
||||
|
@ -45,6 +45,10 @@ def create_ui(default_preset):
|
||||
shared.gradio['dry_base'] = gr.Slider(1, 4, value=generate_params['dry_base'], step=0.01, label='dry_base', info='Controls how fast the penalty grows with increasing sequence length.')
|
||||
shared.gradio['dry_sequence_breakers'] = gr.Textbox(value=generate_params['dry_sequence_breakers'], label='dry_sequence_breakers', info='Tokens across which sequence matching is not continued. Specified as a comma-separated list of quoted strings.')
|
||||
|
||||
with gr.Blocks():
|
||||
shared.gradio['xtc_threshold'] = gr.Slider(0, 0.5, value=generate_params['xtc_threshold'], step=0.01, label='xtc_threshold', info='If 2 or more tokens have probability above this threshold, consider removing all but the last one.')
|
||||
shared.gradio['xtc_probability'] = gr.Slider(0, 1, value=generate_params['xtc_probability'], step=0.01, label='xtc_probability', info='Probability that the removal will actually happen. 0 disables the sampler. 1 makes it always happen.')
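
Reviewer note: the two slider descriptions together define the sampler's behavior. The sketch below is one reading of those descriptions in pure Python, not the `XTCLogitsWarper` code from this commit. It works on a list of probabilities rather than logits, treats a probability equal to the threshold as "above" it, and interprets "the last one" as the least probable of the qualifying tokens; all three are assumptions. `xtc_filter` is a hypothetical name.

```python
import random


def xtc_filter(probs, threshold, probability, rng=random):
    """Return the token ids that survive one XTC step."""
    all_ids = list(range(len(probs)))
    # With chance (1 - probability) the sampler does nothing.
    if rng.random() >= probability:
        return all_ids
    above = [i for i in all_ids if probs[i] >= threshold]
    if len(above) < 2:
        return all_ids  # fewer than two candidates: nothing to remove
    # Keep only the least probable token among those above the threshold.
    keep = min(above, key=lambda i: probs[i])
    removed = set(above) - {keep}
    return [i for i in all_ids if i not in removed]


probs = [0.5, 0.3, 0.15, 0.05]
print(xtc_filter(probs, threshold=0.1, probability=1.0))
```

With `probability=1.0` the removal always happens, so tokens 0 and 1 are dropped and token 2 becomes the most likely remaining choice.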
|
||||
|
||||
gr.Markdown("[Learn more](https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab)")
|
||||
|
||||
with gr.Column():
|
||||
|
21
one_click.py
@ -16,9 +16,9 @@ import sys
|
||||
|
||||
|
||||
# Define the required PyTorch version
|
||||
TORCH_VERSION = "2.2.2"
|
||||
TORCHVISION_VERSION = "0.17.2"
|
||||
TORCHAUDIO_VERSION = "2.2.2"
|
||||
TORCH_VERSION = "2.4.1"
|
||||
TORCHVISION_VERSION = "0.19.1"
|
||||
TORCHAUDIO_VERSION = "2.4.1"
|
||||
|
||||
# Environment
|
||||
script_dir = os.getcwd()
|
||||
@ -117,7 +117,7 @@ def update_pytorch():
|
||||
elif is_cuda:
|
||||
install_pytorch += "--index-url https://download.pytorch.org/whl/cu121"
|
||||
elif is_rocm:
|
||||
install_pytorch += "--index-url https://download.pytorch.org/whl/rocm5.6"
|
||||
install_pytorch += "--index-url https://download.pytorch.org/whl/rocm6.1"
|
||||
elif is_cpu:
|
||||
install_pytorch += "--index-url https://download.pytorch.org/whl/cpu"
|
||||
elif is_intel:
|
||||
@ -189,8 +189,11 @@ def run_cmd(cmd, assert_success=False, environment=False, capture_output=False,
|
||||
conda_sh_path = os.path.join(script_dir, "installer_files", "conda", "etc", "profile.d", "conda.sh")
|
||||
cmd = f'. "{conda_sh_path}" && conda activate "{conda_env_path}" && {cmd}'
|
||||
|
||||
# Set executable to None for Windows, bash for everything else
|
||||
executable = None if is_windows() else 'bash'
|
||||
|
||||
# Run shell commands
|
||||
result = subprocess.run(cmd, shell=True, capture_output=capture_output, env=env)
|
||||
result = subprocess.run(cmd, shell=True, capture_output=capture_output, env=env, executable=executable)
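
Reviewer note: with `shell=True` on POSIX, `subprocess.run` uses `/bin/sh` unless `executable` names a replacement shell. The change above presumably exists because `/bin/sh` is not bash on every system and the conda activation line is written for bash; the diff itself does not state the reason. A minimal sketch of the pattern, using `platform.system()` in place of the script's own `is_windows()` helper (the function names here are hypothetical):

```python
import platform
import subprocess


def shell_executable(windows):
    """None keeps the platform default shell; 'bash' replaces /bin/sh on POSIX."""
    return None if windows else 'bash'


def run_shell(cmd):
    windows = platform.system() == "Windows"
    return subprocess.run(
        cmd,
        shell=True,
        capture_output=True,
        text=True,
        executable=shell_executable(windows),
    )
```

Calling `run_shell('echo hello')` requires `bash` to be on `PATH` on non-Windows systems.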
|
||||
|
||||
# Assert the command ran successfully
|
||||
if assert_success and result.returncode != 0:
|
||||
@ -239,7 +242,7 @@ def install_webui():
|
||||
"What is your GPU?",
|
||||
{
|
||||
'A': 'NVIDIA',
|
||||
'B': 'AMD (Linux/MacOS only. Requires ROCm SDK 5.6 on Linux)',
|
||||
'B': 'AMD (Linux/MacOS only. Requires ROCm SDK 6.1 on Linux)',
|
||||
'C': 'Apple M Series',
|
||||
'D': 'Intel Arc (IPEX)',
|
||||
'N': 'None (I want to run models in CPU mode)'
|
||||
@ -294,7 +297,7 @@ def install_webui():
|
||||
else:
|
||||
install_pytorch += "--index-url https://download.pytorch.org/whl/cu121"
|
||||
elif selected_gpu == "AMD":
|
||||
install_pytorch += "--index-url https://download.pytorch.org/whl/rocm5.6"
|
||||
install_pytorch += "--index-url https://download.pytorch.org/whl/rocm6.1"
|
||||
elif selected_gpu in ["APPLE", "NONE"]:
|
||||
install_pytorch += "--index-url https://download.pytorch.org/whl/cpu"
|
||||
elif selected_gpu == "INTEL":
|
||||
@ -310,7 +313,7 @@ def install_webui():
|
||||
if selected_gpu == "INTEL":
|
||||
# Install oneAPI dependencies via conda
|
||||
print_big_message("Installing Intel oneAPI runtime libraries.")
|
||||
run_cmd("conda install -y -c intel dpcpp-cpp-rt=2024.0 mkl-dpcpp=2024.0")
|
||||
run_cmd("conda install -y -c https://software.repos.intel.com/python/conda/ -c conda-forge dpcpp-cpp-rt=2024.0 mkl-dpcpp=2024.0")
|
||||
# Install libuv required by Intel-patched torch
|
||||
run_cmd("conda install -y libuv")
|
||||
|
||||
@ -326,7 +329,7 @@ def install_extensions_requirements():
|
||||
print_big_message("Installing extensions requirements.\nSome of these may fail on Windows.\nDon\'t worry if you see error messages, as they will not affect the main program.")
|
||||
extensions = get_extensions_names()
|
||||
for i, extension in enumerate(extensions):
|
||||
print(f"\n\n--- [{i+1}/{len(extensions)}]: {extension}\n\n")
|
||||
print(f"\n\n--- [{i + 1}/{len(extensions)}]: {extension}\n\n")
|
||||
extension_req_path = os.path.join("extensions", extension, "requirements.txt")
|
||||
run_cmd(f"python -m pip install -r {extension_req_path} --upgrade", assert_success=False, environment=True)
|
||||
|
||||
|
@ -1,13 +1,10 @@
|
||||
accelerate==0.33.*
|
||||
aqlm[gpu,cpu]==1.1.6; platform_system == "Linux"
|
||||
auto-gptq==0.7.1
|
||||
bitsandbytes==0.43.*
|
||||
bitsandbytes==0.44.*
|
||||
colorama
|
||||
datasets
|
||||
einops
|
||||
fastapi==0.112.4
|
||||
gradio==4.26.*
|
||||
hqq==0.1.7.post3
|
||||
jinja2==3.1.4
|
||||
lm_eval==0.3.0
|
||||
markdown
|
||||
@ -26,7 +23,7 @@ safetensors==0.4.*
|
||||
scipy
|
||||
sentencepiece
|
||||
tensorboard
|
||||
transformers==4.44.*
|
||||
transformers==4.45.*
|
||||
tqdm
|
||||
wandb
|
||||
|
||||
@ -39,38 +36,30 @@ soundfile
|
||||
openai-whisper
|
||||
|
||||
# llama-cpp-python (CPU only, AVX2)
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.90+cpuavx2-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.90+cpuavx2-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.90+cpuavx2-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.90+cpuavx2-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.1+cpuavx2-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.1+cpuavx2-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.1+cpuavx2-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.1+cpuavx2-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
|
||||
# llama-cpp-python (CUDA, no tensor cores)
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.90+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.90+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.90+cu121-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.90+cu121-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.1+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.1+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.1+cu121-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.1+cu121-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
|
||||
# llama-cpp-python (CUDA, tensor cores)
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.90+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.90+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.90+cu121-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.90+cu121-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.1+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.1+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.1+cu121-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.1+cu121-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
|
||||
# CUDA wheels
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.0/exllamav2-0.2.0+cu121.torch2.2.2-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.0/exllamav2-0.2.0+cu121.torch2.2.2-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.0/exllamav2-0.2.0+cu121.torch2.2.2-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.0/exllamav2-0.2.0+cu121.torch2.2.2-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.0/exllamav2-0.2.0-py3-none-any.whl; platform_system == "Linux" and platform_machine != "x86_64"
|
||||
https://github.com/oobabooga/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu122torch2.2.2cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu122torch2.2.2cxx11abiFALSE-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu123torch2.2cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu123torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/AutoAWQ/releases/download/0.2.6/autoawq-0.2.6-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/AutoAWQ/releases/download/0.2.6/autoawq-0.2.6-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
https://github.com/oobabooga/AutoAWQ/releases/download/0.2.6/autoawq-0.2.6-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/AutoAWQ/releases/download/0.2.6/autoawq-0.2.6-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/AutoAWQ_kernels/releases/download/0.0.7/autoawq_kernels-0.0.7-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/AutoAWQ_kernels/releases/download/0.0.7/autoawq_kernels-0.0.7-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
https://github.com/oobabooga/AutoAWQ_kernels/releases/download/0.0.7/autoawq_kernels-0.0.7-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/AutoAWQ_kernels/releases/download/0.0.7/autoawq_kernels-0.0.7-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.3/exllamav2-0.2.3+cu121.torch2.4.1-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.3/exllamav2-0.2.3+cu121.torch2.4.1-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.3/exllamav2-0.2.3+cu121.torch2.4.1-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.3/exllamav2-0.2.3+cu121.torch2.4.1-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.3/exllamav2-0.2.3-py3-none-any.whl; platform_system == "Linux" and platform_machine != "x86_64"
|
||||
https://github.com/oobabooga/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu122torch2.4.1cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu122torch2.4.1cxx11abiFALSE-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu123torch2.4cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu123torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
|
@ -4,7 +4,6 @@ datasets
|
||||
einops
|
||||
fastapi==0.112.4
|
||||
gradio==4.26.*
|
||||
hqq==0.1.7.post3
|
||||
jinja2==3.1.4
|
||||
lm_eval==0.3.0
|
||||
markdown
|
||||
@ -23,7 +22,7 @@ safetensors==0.4.*
|
||||
scipy
|
||||
sentencepiece
|
||||
tensorboard
|
||||
transformers==4.44.*
|
||||
transformers==4.45.*
|
||||
tqdm
|
||||
wandb
|
||||
|
||||
@ -34,18 +33,14 @@ sse-starlette==1.6.5
|
||||
tiktoken
|
||||
|
||||
# llama-cpp-python (CPU only, AVX2)
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.90+cpuavx2-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.90+cpuavx2-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.90+cpuavx2-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.90+cpuavx2-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.1+cpuavx2-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.1+cpuavx2-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.1+cpuavx2-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.1+cpuavx2-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
|
||||
# AMD wheels
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/rocm/llama_cpp_python_cuda-0.2.90+rocm5.6.1-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/rocm/llama_cpp_python_cuda-0.2.90+rocm5.6.1-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.0/exllamav2-0.2.0+rocm5.6.torch2.2.2-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.0/exllamav2-0.2.0+rocm5.6.torch2.2.2-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.0/exllamav2-0.2.0-py3-none-any.whl; platform_system != "Darwin" and platform_machine != "x86_64"
|
||||
https://github.com/oobabooga/AutoAWQ/releases/download/0.2.6/autoawq-0.2.6-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/AutoAWQ/releases/download/0.2.6/autoawq-0.2.6-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/AutoAWQ_kernels/releases/download/0.0.7/autoawq_kernels-0.0.7+rocm561-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/AutoAWQ_kernels/releases/download/0.0.7/autoawq_kernels-0.0.7+rocm561-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/rocm/llama_cpp_python_cuda-0.3.1+rocm6.1.2-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/rocm/llama_cpp_python_cuda-0.3.1+rocm6.1.2-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.3/exllamav2-0.2.3+rocm6.1.torch2.4.1-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.3/exllamav2-0.2.3+rocm6.1.torch2.4.1-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.3/exllamav2-0.2.3-py3-none-any.whl; platform_system != "Darwin" and platform_machine != "x86_64"
|
||||
|
@ -4,7 +4,6 @@ datasets
|
||||
einops
|
||||
fastapi==0.112.4
|
||||
gradio==4.26.*
|
||||
hqq==0.1.7.post3
|
||||
jinja2==3.1.4
|
||||
lm_eval==0.3.0
|
||||
markdown
|
||||
@ -23,7 +22,7 @@ safetensors==0.4.*
|
||||
scipy
|
||||
sentencepiece
|
||||
tensorboard
|
||||
transformers==4.44.*
|
||||
transformers==4.45.*
|
||||
tqdm
|
||||
wandb
|
||||
|
||||
@ -34,16 +33,12 @@ sse-starlette==1.6.5
|
||||
tiktoken
|
||||
|
||||
# llama-cpp-python (CPU only, no AVX2)
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.90+cpuavx-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.90+cpuavx-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.90+cpuavx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.90+cpuavx-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.1+cpuavx-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.1+cpuavx-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.1+cpuavx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.1+cpuavx-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
|
||||
# AMD wheels
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.0/exllamav2-0.2.0+rocm5.6.torch2.2.2-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.0/exllamav2-0.2.0+rocm5.6.torch2.2.2-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.0/exllamav2-0.2.0-py3-none-any.whl; platform_system != "Darwin" and platform_machine != "x86_64"
|
||||
https://github.com/oobabooga/AutoAWQ/releases/download/0.2.6/autoawq-0.2.6-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/AutoAWQ/releases/download/0.2.6/autoawq-0.2.6-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/AutoAWQ_kernels/releases/download/0.0.7/autoawq_kernels-0.0.7+rocm561-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/AutoAWQ_kernels/releases/download/0.0.7/autoawq_kernels-0.0.7+rocm561-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.3/exllamav2-0.2.3+rocm6.1.torch2.4.1-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.3/exllamav2-0.2.3+rocm6.1.torch2.4.1-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.3/exllamav2-0.2.3-py3-none-any.whl; platform_system != "Darwin" and platform_machine != "x86_64"
|
||||
|
@ -4,7 +4,6 @@ datasets
|
||||
einops
|
||||
fastapi==0.112.4
|
||||
gradio==4.26.*
|
||||
hqq==0.1.7.post3
|
||||
jinja2==3.1.4
|
||||
lm_eval==0.3.0
|
||||
markdown
|
||||
@ -23,7 +22,7 @@ safetensors==0.4.*
|
||||
scipy
|
||||
sentencepiece
|
||||
tensorboard
|
||||
transformers==4.44.*
|
||||
transformers==4.45.*
|
||||
tqdm
|
||||
wandb
|
||||
|
||||
@ -34,8 +33,8 @@ sse-starlette==1.6.5
|
||||
tiktoken
|
||||
|
||||
# Mac wheels
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.90-cp311-cp311-macosx_12_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "21.0.0" and platform_release < "22.0.0" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.90-cp310-cp310-macosx_12_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "21.0.0" and platform_release < "22.0.0" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.90-cp311-cp311-macosx_14_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.90-cp310-cp310-macosx_14_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.10"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.0/exllamav2-0.2.0-py3-none-any.whl
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.1-cp311-cp311-macosx_12_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "21.0.0" and platform_release < "22.0.0" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.1-cp310-cp310-macosx_12_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "21.0.0" and platform_release < "22.0.0" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.1-cp311-cp311-macosx_14_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.1-cp310-cp310-macosx_14_0_x86_64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.10"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.3/exllamav2-0.2.3-py3-none-any.whl
|
||||
|
@ -4,7 +4,6 @@ datasets
|
||||
einops
|
||||
fastapi==0.112.4
|
||||
gradio==4.26.*
|
||||
hqq==0.1.7.post3
|
||||
jinja2==3.1.4
|
||||
lm_eval==0.3.0
|
||||
markdown
|
||||
@ -23,7 +22,7 @@ safetensors==0.4.*
|
||||
scipy
|
||||
sentencepiece
|
||||
tensorboard
|
||||
transformers==4.44.*
|
||||
transformers==4.45.*
|
||||
tqdm
|
||||
wandb
|
||||
|
||||
@ -34,10 +33,10 @@ sse-starlette==1.6.5
|
||||
tiktoken
|
||||
|
||||
# Mac wheels
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.90-cp311-cp311-macosx_12_0_arm64.whl; platform_system == "Darwin" and platform_release >= "21.0.0" and platform_release < "22.0.0" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.90-cp310-cp310-macosx_12_0_arm64.whl; platform_system == "Darwin" and platform_release >= "21.0.0" and platform_release < "22.0.0" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.90-cp311-cp311-macosx_13_0_arm64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.90-cp310-cp310-macosx_13_0_arm64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.90-cp311-cp311-macosx_14_0_arm64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.2.90-cp310-cp310-macosx_14_0_arm64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.10"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.0/exllamav2-0.2.0-py3-none-any.whl
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.1-cp311-cp311-macosx_12_0_arm64.whl; platform_system == "Darwin" and platform_release >= "21.0.0" and platform_release < "22.0.0" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.1-cp310-cp310-macosx_12_0_arm64.whl; platform_system == "Darwin" and platform_release >= "21.0.0" and platform_release < "22.0.0" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.1-cp311-cp311-macosx_13_0_arm64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.1-cp310-cp310-macosx_13_0_arm64.whl; platform_system == "Darwin" and platform_release >= "22.0.0" and platform_release < "23.0.0" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.1-cp311-cp311-macosx_14_0_arm64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/metal/llama_cpp_python-0.3.1-cp310-cp310-macosx_14_0_arm64.whl; platform_system == "Darwin" and platform_release >= "23.0.0" and platform_release < "24.0.0" and python_version == "3.10"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.3/exllamav2-0.2.3-py3-none-any.whl
|
||||
|
@ -4,7 +4,6 @@ datasets
|
||||
einops
|
||||
fastapi==0.112.4
|
||||
gradio==4.26.*
|
||||
hqq==0.1.7.post3
|
||||
jinja2==3.1.4
|
||||
lm_eval==0.3.0
|
||||
markdown
|
||||
@ -23,7 +22,7 @@ safetensors==0.4.*
|
||||
scipy
|
||||
sentencepiece
|
||||
tensorboard
|
||||
transformers==4.44.*
|
||||
transformers==4.45.*
|
||||
tqdm
|
||||
wandb
|
||||
|
||||
@ -34,7 +33,7 @@ sse-starlette==1.6.5
|
||||
tiktoken
|
||||
|
||||
# llama-cpp-python (CPU only, AVX2)
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.90+cpuavx2-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.90+cpuavx2-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.90+cpuavx2-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.90+cpuavx2-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.1+cpuavx2-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.1+cpuavx2-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.1+cpuavx2-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.1+cpuavx2-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
|
@ -4,7 +4,6 @@ datasets
|
||||
einops
|
||||
fastapi==0.112.4
|
||||
gradio==4.26.*
|
||||
hqq==0.1.7.post3
|
||||
jinja2==3.1.4
|
||||
lm_eval==0.3.0
|
||||
markdown
|
||||
@ -23,7 +22,7 @@ safetensors==0.4.*
|
||||
scipy
|
||||
sentencepiece
|
||||
tensorboard
|
||||
transformers==4.44.*
|
||||
transformers==4.45.*
|
||||
tqdm
|
||||
wandb
|
||||
|
||||
@ -34,7 +33,7 @@ sse-starlette==1.6.5
|
||||
tiktoken
|
||||
|
||||
# llama-cpp-python (CPU only, no AVX2)
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.90+cpuavx-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.90+cpuavx-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.90+cpuavx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.90+cpuavx-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.1+cpuavx-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.1+cpuavx-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.1+cpuavx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.1+cpuavx-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
|
@ -1,13 +1,10 @@
|
||||
accelerate==0.33.*
|
||||
aqlm[gpu,cpu]==1.1.6; platform_system == "Linux"
|
||||
auto-gptq==0.7.1
|
||||
bitsandbytes==0.43.*
|
||||
bitsandbytes==0.44.*
|
||||
colorama
|
||||
datasets
|
||||
einops
|
||||
fastapi==0.112.4
|
||||
gradio==4.26.*
|
||||
hqq==0.1.7.post3
|
||||
jinja2==3.1.4
|
||||
lm_eval==0.3.0
|
||||
markdown
|
||||
@ -26,7 +23,7 @@ safetensors==0.4.*
|
||||
scipy
|
||||
sentencepiece
|
||||
tensorboard
|
||||
transformers==4.44.*
|
||||
transformers==4.45.*
|
||||
tqdm
|
||||
wandb
|
||||
|
||||
@ -37,38 +34,30 @@ sse-starlette==1.6.5
|
||||
tiktoken
|
||||
|
||||
# llama-cpp-python (CPU only, no AVX2)
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.90+cpuavx-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.90+cpuavx-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.90+cpuavx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.90+cpuavx-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.1+cpuavx-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.1+cpuavx-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.1+cpuavx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.3.1+cpuavx-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
|
||||
# llama-cpp-python (CUDA, no tensor cores)
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.90+cu121avx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.90+cu121avx-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.90+cu121avx-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.90+cu121avx-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.1+cu121avx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.1+cu121avx-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.1+cu121avx-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.3.1+cu121avx-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
|
||||
# llama-cpp-python (CUDA, tensor cores)
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.90+cu121avx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.90+cu121avx-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.90+cu121avx-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.90+cu121avx-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.1+cu121avx-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.1+cu121avx-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.1+cu121avx-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.3.1+cu121avx-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
|
||||
# CUDA wheels
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.0/exllamav2-0.2.0+cu121.torch2.2.2-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.0/exllamav2-0.2.0+cu121.torch2.2.2-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.0/exllamav2-0.2.0+cu121.torch2.2.2-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.0/exllamav2-0.2.0+cu121.torch2.2.2-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.0/exllamav2-0.2.0-py3-none-any.whl; platform_system == "Linux" and platform_machine != "x86_64"
|
||||
https://github.com/oobabooga/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu122torch2.2.2cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu122torch2.2.2cxx11abiFALSE-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu123torch2.2cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu123torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/AutoAWQ/releases/download/0.2.6/autoawq-0.2.6-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/AutoAWQ/releases/download/0.2.6/autoawq-0.2.6-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
https://github.com/oobabooga/AutoAWQ/releases/download/0.2.6/autoawq-0.2.6-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/AutoAWQ/releases/download/0.2.6/autoawq-0.2.6-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/AutoAWQ_kernels/releases/download/0.0.7/autoawq_kernels-0.0.7-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/AutoAWQ_kernels/releases/download/0.0.7/autoawq_kernels-0.0.7-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
https://github.com/oobabooga/AutoAWQ_kernels/releases/download/0.0.7/autoawq_kernels-0.0.7-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/AutoAWQ_kernels/releases/download/0.0.7/autoawq_kernels-0.0.7-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.3/exllamav2-0.2.3+cu121.torch2.4.1-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.3/exllamav2-0.2.3+cu121.torch2.4.1-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.3/exllamav2-0.2.3+cu121.torch2.4.1-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.3/exllamav2-0.2.3+cu121.torch2.4.1-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
https://github.com/oobabooga/exllamav2/releases/download/v0.2.3/exllamav2-0.2.3-py3-none-any.whl; platform_system == "Linux" and platform_machine != "x86_64"
|
||||
https://github.com/oobabooga/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu122torch2.4.1cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
|
||||
https://github.com/oobabooga/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu122torch2.4.1cxx11abiFALSE-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
|
||||
https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu123torch2.4cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
|
||||
https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu123torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
|
||||
|
@ -4,7 +4,6 @@ datasets
|
||||
einops
|
||||
fastapi==0.112.4
|
||||
gradio==4.26.*
|
||||
hqq==0.1.7.post3
|
||||
jinja2==3.1.4
|
||||
lm_eval==0.3.0
|
||||
markdown
|
||||
@ -23,7 +22,7 @@ safetensors==0.4.*
|
||||
scipy
|
||||
sentencepiece
|
||||
tensorboard
|
||||
transformers==4.44.*
|
||||
transformers==4.45.*
|
||||
tqdm
|
||||
wandb
|
||||
|
||||
|