Remove exllamav1 loaders (#5128)

2024-11-25 01:09:22 +01:00 · 2023-12-31 01:57:06 -03:00 · 0e54a09bcb
commit 0e54a09bcb
parent 8e397915c9
18 changed files with 28 additions and 635 deletions
--- a/README.md
+++ b/README.md
@@ -11,7 +11,7 @@ Its goal is to become the [AUTOMATIC1111/stable-diffusion-webui](https://github.
 ## Features

 * 3 interface modes: default (two columns), notebook, and chat.
-* Multiple model backends: [Transformers](https://github.com/huggingface/transformers), [llama.cpp](https://github.com/ggerganov/llama.cpp) (through [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)), [ExLlama](https://github.com/turboderp/exllama), [ExLlamaV2](https://github.com/turboderp/exllamav2), [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa), [CTransformers](https://github.com/marella/ctransformers), [QuIP#](https://github.com/Cornell-RelaxML/quip-sharp).
+* Multiple model backends: [Transformers](https://github.com/huggingface/transformers), [llama.cpp](https://github.com/ggerganov/llama.cpp) (through [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)), [ExLlamaV2](https://github.com/turboderp/exllamav2), [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa), [CTransformers](https://github.com/marella/ctransformers), [QuIP#](https://github.com/Cornell-RelaxML/quip-sharp).
 * Dropdown menu for quickly switching between different models.
 * Large number of extensions (built-in and user-contributed), including Coqui TTS for realistic voice outputs, Whisper STT for voice inputs, translation, [multimodal pipelines](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/multimodal), vector databases, Stable Diffusion integration, and a lot more. See [the wiki](https://github.com/oobabooga/text-generation-webui/wiki/07-%E2%80%90-Extensions) and [the extensions directory](https://github.com/oobabooga/text-generation-webui-extensions) for details.
 * [Chat with custom characters](https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#character).
@@ -140,13 +140,6 @@ Then browse to
 3) Manually install AutoGPTQ: [Installation](https://github.com/PanQiWei/AutoGPTQ#install-from-source).
    * Perform the from-source installation - there are no prebuilt ROCm packages for Windows.

-4) Manually install [ExLlama](https://github.com/turboderp/exllama) by simply cloning it into the `repositories` folder (it will be automatically compiled at runtime after that):
-
-```sh
-cd text-generation-webui
-git clone https://github.com/turboderp/exllama repositories/exllama
-```
-
 ##### Older NVIDIA GPUs

 1) For Kepler GPUs and older, you will need to install CUDA 11.8 instead of 12:
@@ -216,7 +209,7 @@ List of command-line flags

 | Flag                                       | Description |
 |--------------------------------------------|-------------|
-| `--loader LOADER`                          | Choose the model loader manually, otherwise, it will get autodetected. Valid options: Transformers, llama.cpp, llamacpp_HF, ExLlama_HF, ExLlamav2_HF, AutoGPTQ, AutoAWQ, GPTQ-for-LLaMa, ExLlama, ExLlamav2, ctransformers, QuIP#. |
+| `--loader LOADER`                          | Choose the model loader manually, otherwise, it will get autodetected. Valid options: Transformers, llama.cpp, llamacpp_HF, ExLlamav2_HF, ExLlamav2, AutoGPTQ, AutoAWQ, GPTQ-for-LLaMa, ctransformers, QuIP#. |

 #### Accelerate/transformers

@@ -265,13 +258,13 @@ List of command-line flags
 | `--no_offload_kqv` | Do not offload the K, Q, V to the GPU. This saves VRAM but reduces the performance. |
 | `--cache-capacity CACHE_CAPACITY`   | Maximum cache capacity (llama-cpp-python). Examples: 2000MiB, 2GiB. When provided without units, bytes will be assumed. |

-#### ExLlama
+#### ExLlamav2

 | Flag             | Description |
 |------------------|-------------|
 |`--gpu-split`     | Comma-separated list of VRAM (in GB) to use per GPU device for model layers. Example: 20,7,7. |
 |`--max_seq_len MAX_SEQ_LEN`           | Maximum sequence length. |
-|`--cfg-cache`                         | ExLlama_HF: Create an additional cache for CFG negative prompts. Necessary to use CFG with that loader, but not necessary for CFG with base ExLlama. |
+|`--cfg-cache`                         | ExLlamav2_HF: Create an additional cache for CFG negative prompts. Necessary to use CFG with that loader. |
 |`--no_flash_attn`                     | Force flash-attention to not be used. |
 |`--cache_8bit`                        | Use 8-bit cache to save VRAM. |
 |`--num_experts_per_token NUM_EXPERTS_PER_TOKEN` |  Number of experts to use for generation. Applies to MoE models like Mixtral. |
@@ -326,7 +319,7 @@ List of command-line flags
 | `--rwkv-strategy RWKV_STRATEGY` | RWKV: The strategy to use while loading the model. Examples: "cpu fp32", "cuda fp16", "cuda fp16i8". |
 | `--rwkv-cuda-on`                | RWKV: Compile the CUDA kernel for better performance. |

-#### RoPE (for llama.cpp, ExLlama, ExLlamaV2, and transformers)
+#### RoPE (for llama.cpp, ExLlamaV2, and transformers)

 | Flag             | Description |
 |------------------|-------------|
--- a/docs/04
+++ b/docs/04
@@ -32,13 +32,14 @@ Options:
 * **use_flash_attention_2**: Set use_flash_attention_2=True while loading the model. Possibly useful for training.
 * **disable_exllama**: Only applies when you are loading a GPTQ model through the transformers loader. It needs to be checked if you intend to train LoRAs with the model.

-### ExLlama_HF
+### ExLlamav2_HF

-Loads: GPTQ models. They usually have GPTQ in the model name, or alternatively something like "-4bit-128g" in the name.
+Loads: GPTQ and EXL2 models. EXL2 models usually have "EXL2" in the model name, while GPTQ models usually have GPTQ in the model name, or alternatively something like "-4bit-128g" in the name.

-Example: https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ
+Examples:

-ExLlama_HF is the v1 of ExLlama (https://github.com/turboderp/exllama) connected to the transformers library for sampling, tokenizing, and detokenizing. It is very fast and memory-efficient.
+* https://huggingface.co/turboderp/Llama2-70B-exl2
+* https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ

 * **gpu-split**: If you have multiple GPUs, the amount of memory to allocate per GPU should be set in this field. Make sure to set a lower value for the first GPU, as that's where the cache is allocated.
 * **max_seq_len**: The maximum sequence length for the model. In ExLlama, the cache is preallocated, so the higher this value, the higher the VRAM. It is automatically set to the maximum sequence length for the model based on its metadata, but you may need to lower this value be able to fit the model into your GPU. After loading the model, the "Truncate the prompt up to this length" parameter under "Parameters" > "Generation" is automatically set to your chosen "max_seq_len" so that you don't have to set the same thing twice.
@@ -46,18 +47,6 @@ ExLlama_HF is the v1 of ExLlama (https://github.com/turboderp/exllama) connected
 * **no_flash_attn**: Disables flash attention. Otherwise, it is automatically used as long as the library is installed.
 * **cache_8bit**: Create a 8-bit precision cache instead of a 16-bit one. This saves VRAM but increases perplexity (I don't know by how much).

-### ExLlamav2_HF
-
-Loads: GPTQ and EXL2 models. EXL2 models usually have "EXL2" in the model name.
-
-Example: https://huggingface.co/turboderp/Llama2-70B-exl2
-
-The parameters are the same as in ExLlama_HF.
-
-### ExLlama
-
-The same as ExLlama_HF but using the internal samplers of ExLlama instead of the ones in the Transformers library.
-
 ### ExLlamav2

 The same as ExLlamav2_HF but using the internal samplers of ExLlamav2 instead of the ones in the Transformers library.
--- a/docs/What
+++ b/docs/What
@@ -3,9 +3,7 @@
 | Loader         | Loading 1 LoRA | Loading 2 or more LoRAs | Training LoRAs | Multimodal extension | Perplexity evaluation |
 |----------------|----------------|-------------------------|----------------|----------------------|-----------------------|
 | Transformers   |       ✅       |           ✅***            |       ✅*       |          ✅          |           ✅          |
-| ExLlama_HF     |       ✅       |           ❌            |       ❌       |          ❌          |           ✅          |
 | ExLlamav2_HF   |       ✅       |           ✅            |       ❌       |          ❌          |           ✅          |
-| ExLlama        |       ✅       |           ❌            |       ❌       |          ❌          |           use ExLlama_HF      |
 | ExLlamav2      |       ✅       |           ✅            |       ❌       |          ❌          |           use ExLlamav2_HF    |
 | AutoGPTQ       |       ✅       |           ❌            |       ❌       |          ✅          |           ✅          |
 | GPTQ-for-LLaMa |       ✅**       |           ✅***            |       ✅       |          ✅          |           ✅          |
--- a/modules/LoRA.py
+++ b/modules/LoRA.py
@@ -12,8 +12,6 @@ from modules.models import reload_model
 def add_lora_to_model(lora_names):
    if 'GPTQForCausalLM' in shared.model.__class__.__name__ or shared.args.loader == 'AutoGPTQ':
        add_lora_autogptq(lora_names)
-    elif shared.model.__class__.__name__ in ['ExllamaModel', 'ExllamaHF'] or shared.args.loader == 'ExLlama':
-        add_lora_exllama(lora_names)
    elif shared.model.__class__.__name__ in ['Exllamav2Model', 'Exllamav2HF'] or shared.args.loader == ['ExLlamav2', 'ExLlamav2_HF']:
        add_lora_exllamav2(lora_names)
    else:
@@ -28,48 +26,6 @@ def get_lora_path(lora_name):
    return Path(f"{shared.args.lora_dir}/{lora_name}")


-def add_lora_exllama(lora_names):
-
-    try:
-        from exllama.lora import ExLlamaLora
-    except:
-        try:
-            from repositories.exllama.lora import ExLlamaLora
-        except:
-            logger.error("Could not find the file repositories/exllama/lora.py. Make sure that exllama is cloned inside repositories/ and is up to date.")
-            return
-
-    if len(lora_names) == 0:
-        if shared.model.__class__.__name__ == 'ExllamaModel':
-            shared.model.generator.lora = None
-        else:
-            shared.model.lora = None
-
-        shared.lora_names = []
-        return
-    else:
-        if len(lora_names) > 1:
-            logger.warning('ExLlama can only work with 1 LoRA at the moment. Only the first one in the list will be loaded.')
-
-        lora_path = get_lora_path(lora_names[0])
-        lora_config_path = lora_path / "adapter_config.json"
-        for file_name in ["adapter_model.safetensors", "adapter_model.bin"]:
-            file_path = lora_path / file_name
-            if file_path.is_file():
-                lora_adapter_path = file_path
-
-        logger.info("Applying the following LoRAs to {}: {}".format(shared.model_name, ', '.join([lora_names[0]])))
-        if shared.model.__class__.__name__ == 'ExllamaModel':
-            lora = ExLlamaLora(shared.model.model, str(lora_config_path), str(lora_adapter_path))
-            shared.model.generator.lora = lora
-        else:
-            lora = ExLlamaLora(shared.model.ex_model, str(lora_config_path), str(lora_adapter_path))
-            shared.model.lora = lora
-
-        shared.lora_names = [lora_names[0]]
-        return
-
-
 def add_lora_exllamav2(lora_names):

    from exllamav2 import ExLlamaV2Lora
--- a/modules/exllama.py
+++ b/modules/exllama.py
@@ -1,237 +0,0 @@
-from pathlib import Path
-
-import torch
-import torch.nn.functional as F
-from torch import version as torch_version
-
-from modules import shared
-from modules.logging_colors import logger
-from modules.models import clear_torch_cache
-from modules.text_generation import get_max_prompt_length
-
-try:
-    from exllama.generator import ExLlamaGenerator
-    from exllama.model import ExLlama, ExLlamaCache, ExLlamaConfig
-    from exllama.tokenizer import ExLlamaTokenizer
-except:
-    logger.warning('exllama module failed to import. Will attempt to import from repositories/.')
-    try:
-        from modules.relative_imports import RelativeImport
-
-        with RelativeImport("repositories/exllama"):
-            from generator import ExLlamaGenerator
-            from model import ExLlama, ExLlamaCache, ExLlamaConfig
-            from tokenizer import ExLlamaTokenizer
-    except:
-        logger.error(
-            "Could not find repositories/exllama. Please ensure that exllama"
-            " (https://github.com/turboderp/exllama) is cloned inside repositories/ and is up to date."
-        )
-        raise
-
-
-class ExllamaModel:
-    def __init__(self):
-        pass
-
-    @classmethod
-    def from_pretrained(self, path_to_model):
-
-        path_to_model = Path(f'{shared.args.model_dir}') / Path(path_to_model)
-        tokenizer_model_path = path_to_model / "tokenizer.model"
-        model_config_path = path_to_model / "config.json"
-
-        # Find the model checkpoint
-        model_path = None
-        for ext in ['.safetensors', '.pt', '.bin']:
-            found = list(path_to_model.glob(f"*{ext}"))
-            if len(found) > 0:
-                if len(found) > 1:
-                    logger.warning(f'More than one {ext} model has been found. The last one will be selected. It could be wrong.')
-
-                model_path = found[-1]
-                break
-
-        config = ExLlamaConfig(str(model_config_path))
-        config.model_path = str(model_path)
-        config.max_seq_len = shared.args.max_seq_len
-        config.compress_pos_emb = shared.args.compress_pos_emb
-        if shared.args.gpu_split:
-            config.set_auto_map(shared.args.gpu_split)
-            config.gpu_peer_fix = True
-
-        if shared.args.alpha_value > 1 and shared.args.rope_freq_base == 0:
-            config.alpha_value = shared.args.alpha_value
-            config.calculate_rotary_embedding_base()
-        elif shared.args.rope_freq_base > 0:
-            config.rotary_embedding_base = shared.args.rope_freq_base
-
-        if torch_version.hip:
-            config.rmsnorm_no_half2 = True
-            config.rope_no_half2 = True
-            config.matmul_no_half2 = True
-            config.silu_no_half2 = True
-
-        model = ExLlama(config)
-        tokenizer = ExLlamaTokenizer(str(tokenizer_model_path))
-        cache = ExLlamaCache(model)
-        generator = ExLlamaGenerator(model, tokenizer, cache)
-
-        result = self()
-        result.config = config
-        result.model = model
-        result.cache = cache
-        result.tokenizer = tokenizer
-        result.generator = generator
-        return result, result
-
-    def encode(self, string, **kwargs):
-        return self.tokenizer.encode(string, max_seq_len=self.model.config.max_seq_len, add_bos=True)
-
-    def decode(self, ids, **kwargs):
-        if isinstance(ids, list):
-            ids = torch.tensor([ids])
-        elif isinstance(ids, torch.Tensor) and ids.numel() == 1:
-            ids = ids.view(1, -1)
-
-        return self.tokenizer.decode(ids)[0]
-
-    def get_logits(self, token_ids, **kwargs):
-        self.cache.current_seq_len = 0
-        if token_ids.shape[-1] > 1:
-            self.model.forward(token_ids[:, :-1], self.cache, input_mask=None, preprocess_only=True)
-
-        return self.model.forward(token_ids[:, -1:], self.cache, **kwargs).float().cpu()
-
-    def generate_with_streaming(self, prompt, state):
-
-        # The cache batch size must be 2 for CFG and 1 otherwise
-        if state['guidance_scale'] == 1:
-            if self.cache.batch_size == 2:
-                del self.cache
-                clear_torch_cache()
-                self.cache = ExLlamaCache(self.model)
-                self.generator = ExLlamaGenerator(self.model, self.tokenizer, self.cache)
-        else:
-            if self.cache.batch_size == 1:
-                del self.cache
-                clear_torch_cache()
-                self.cache = ExLlamaCache(self.model, batch_size=2)
-                self.generator = ExLlamaGenerator(self.model, self.tokenizer, self.cache)
-
-        self.generator.settings.temperature = state['temperature']
-        self.generator.settings.top_p = state['top_p']
-        self.generator.settings.top_k = state['top_k']
-        self.generator.settings.typical = state['typical_p']
-        self.generator.settings.token_repetition_penalty_max = state['repetition_penalty']
-        self.generator.settings.token_repetition_penalty_sustain = -1 if state['repetition_penalty_range'] <= 0 else state['repetition_penalty_range']
-        if state['ban_eos_token']:
-            self.generator.disallow_tokens([self.tokenizer.eos_token_id])
-        else:
-            self.generator.disallow_tokens(None)
-
-        if state['custom_token_bans']:
-            to_ban = [int(x) for x in state['custom_token_bans'].split(',')]
-            if len(to_ban) > 0:
-                self.generator.disallow_tokens(to_ban)
-
-        # Case 1: no CFG
-        if state['guidance_scale'] == 1:
-            self.generator.end_beam_search()
-
-            # Tokenizing the input
-            ids = self.generator.tokenizer.encode(prompt, max_seq_len=self.model.config.max_seq_len)
-            if state['add_bos_token']:
-                ids = torch.cat(
-                    [torch.tensor([[self.tokenizer.bos_token_id]]).to(ids.device),
-                     ids], dim=1
-                ).to(torch.int64)
-            ids = ids[:, -get_max_prompt_length(state):]
-            if state['auto_max_new_tokens']:
-                max_new_tokens = state['truncation_length'] - ids.shape[-1]
-            else:
-                max_new_tokens = state['max_new_tokens']
-
-            self.generator.gen_begin_reuse(ids)
-            initial_len = self.generator.sequence[0].shape[0]
-            has_leading_space = False
-
-            for i in range(max_new_tokens):
-                token = self.generator.gen_single_token()
-                if i == 0 and self.generator.tokenizer.tokenizer.IdToPiece(int(token)).startswith('▁'):
-                    has_leading_space = True
-
-                decoded_text = self.generator.tokenizer.decode(self.generator.sequence[0][initial_len:])
-                if has_leading_space:
-                    decoded_text = ' ' + decoded_text
-
-                # Check the partial unicode character
-                if chr(0xfffd) in decoded_text:
-                    is_last = i == max_new_tokens - 1
-                    is_stopping = token.item() == self.generator.tokenizer.eos_token_id or shared.stop_everything
-                    # If we are not at the end of the generation, we skip this token
-                    if not (is_last or is_stopping):
-                        continue
-
-                if token.item() == self.generator.tokenizer.eos_token_id or shared.stop_everything:
-                    break
-
-                yield decoded_text
-
-        # Case 2: CFG
-        # Copied from https://github.com/turboderp/exllama/blob/master/example_cfg.py
-        else:
-            alpha = state['guidance_scale']
-            prompts = [prompt, state['negative_prompt'] or '']
-
-            ids, mask = self.tokenizer.encode(
-                prompts,
-                return_mask=True,
-                max_seq_len=self.model.config.max_seq_len,
-                add_bos=state['add_bos_token']
-            )
-            if state['auto_max_new_tokens']:
-                max_new_tokens = state['truncation_length'] - ids[0].shape[-1]
-            else:
-                max_new_tokens = state['max_new_tokens']
-
-            self.generator.gen_begin(ids, mask=mask)
-            initial_len = self.generator.sequence[0].shape[0]
-            has_leading_space = False
-
-            for i in range(max_new_tokens):
-                logits = self.model.forward(self.generator.sequence[:, -1:], self.cache, input_mask=mask)
-                self.generator.apply_rep_penalty(logits)
-
-                logits = F.log_softmax(logits, dim=-1)
-                logits_mixed = alpha * logits[0] + (1 - alpha) * logits[1]
-
-                token, _ = self.generator.sample_current(logits_mixed)
-                if i == 0 and self.generator.tokenizer.tokenizer.IdToPiece(int(token)).startswith('▁'):
-                    has_leading_space = True
-
-                decoded_text = self.generator.tokenizer.decode(self.generator.sequence[0][initial_len:])
-                if has_leading_space:
-                    decoded_text = ' ' + decoded_text
-
-                # Check the partial unicode character
-                if chr(0xfffd) in decoded_text:
-                    is_last = i == max_new_tokens - 1
-                    is_stopping = token.item() == self.tokenizer.eos_token_id or shared.stop_everything
-                    # If we are not at the end of the generation, we skip this token
-                    if not (is_last or is_stopping):
-                        continue
-
-                yield decoded_text
-                if token.item() == self.tokenizer.eos_token_id or shared.stop_everything:
-                    break
-
-                batch_token = token.repeat(2, 1)
-                self.generator.gen_accept_token(batch_token)
-
-    def generate(self, prompt, state):
-        output = ''
-        for output in self.generate_with_streaming(prompt, state):
-            pass
-
-        return output
--- a/modules/exllama_hf.py
+++ b/modules/exllama_hf.py
@@ -1,174 +0,0 @@
-import os
-from pathlib import Path
-from typing import Any, Dict, Optional, Union
-
-import torch
-from torch.nn import CrossEntropyLoss
-from transformers import GenerationConfig, PretrainedConfig, PreTrainedModel
-from transformers.modeling_outputs import CausalLMOutputWithPast
-
-from modules import shared
-from modules.logging_colors import logger
-
-try:
-    from exllama.model import ExLlama, ExLlamaCache, ExLlamaConfig
-except:
-    logger.warning('Exllama module failed to load. Will attempt to load from repositories.')
-    try:
-        from modules.relative_imports import RelativeImport
-
-        with RelativeImport("repositories/exllama"):
-            from model import ExLlama, ExLlamaCache, ExLlamaConfig
-    except:
-        logger.error("Could not find repositories/exllama/. Make sure that exllama is cloned inside repositories/ and is up to date.")
-        raise
-
-
-class ExllamaHF(PreTrainedModel):
-    def __init__(self, config: ExLlamaConfig):
-        super().__init__(PretrainedConfig())
-        self.ex_config = config
-        self.ex_model = ExLlama(self.ex_config)
-        self.generation_config = GenerationConfig()
-        self.lora = None
-
-        self.ex_cache = ExLlamaCache(self.ex_model)
-        self.past_seq = None
-
-        if shared.args.cfg_cache:
-            self.ex_cache_negative = ExLlamaCache(self.ex_model)
-            self.past_seq_negative = None
-
-    def _validate_model_class(self):
-        pass
-
-    def _validate_model_kwargs(self, model_kwargs: Dict[str, Any]):
-        pass
-
-    def prepare_inputs_for_generation(self, input_ids, **kwargs):
-        return {'input_ids': input_ids, **kwargs}
-
-    @property
-    def device(self) -> torch.device:
-        return torch.device(0)
-
-    def __call__(self, *args, **kwargs):
-        use_cache = kwargs.get('use_cache', True)
-        labels = kwargs.get('labels', None)
-        past_key_values = kwargs.get('past_key_values', None)
-
-        if len(args) > 0:
-            if not shared.args.cfg_cache:
-                logger.error("Please enable the cfg-cache option to use CFG with ExLlama_HF.")
-                return
-
-            input_ids = args[0]
-            is_negative = True
-            past_seq = self.past_seq_negative
-            ex_cache = self.ex_cache_negative
-        else:
-            input_ids = kwargs['input_ids']
-            is_negative = False
-            past_seq = self.past_seq
-            ex_cache = self.ex_cache
-
-        seq = input_ids[0].tolist()
-        if is_negative and past_key_values is not None:
-            seq = past_key_values + seq
-
-        seq_tensor = torch.tensor(seq)
-        reset = True
-
-        # Make the forward call
-        if labels is None:
-            if past_seq is not None:
-                min_length = min(past_seq.shape[0], seq_tensor.shape[0])
-                indices = torch.nonzero(~torch.eq(past_seq[:min_length], seq_tensor[:min_length]))
-                if len(indices) > 0:
-                    longest_prefix = indices[0].item()
-                else:
-                    longest_prefix = min_length
-
-                if longest_prefix > 0:
-                    reset = False
-                    ex_cache.current_seq_len = longest_prefix
-                    if len(seq_tensor) - longest_prefix > 1:
-                        self.ex_model.forward(seq_tensor[longest_prefix:-1].view(1, -1), ex_cache, preprocess_only=True, lora=self.lora)
-                    elif len(seq_tensor) == longest_prefix:
-                        # Very tricky: if the prefix we are reusing *is* the input_ids, then we have to back up the cache pointer by one,
-                        # because we feed input_ids[-1] to forward() below, but that last token is already in the cache!
-                        ex_cache.current_seq_len -= 1
-
-            if reset:
-                ex_cache.current_seq_len = 0
-                if len(seq_tensor) > 1:
-                    self.ex_model.forward(seq_tensor[:-1].view(1, -1), ex_cache, preprocess_only=True, lora=self.lora)
-
-            logits = self.ex_model.forward(seq_tensor[-1:].view(1, -1), ex_cache, lora=self.lora).to(input_ids.device)
-        else:
-            ex_cache.current_seq_len = 0
-            logits = self.ex_model.forward(seq_tensor.view(1, -1), ex_cache, last_id_only=False, lora=self.lora)
-
-        if is_negative:
-            self.past_seq_negative = seq_tensor
-        else:
-            self.past_seq = seq_tensor
-
-        loss = None
-        if labels is not None:
-            # Shift so that tokens < n predict n
-            shift_logits = logits[..., :-1, :].contiguous()
-            shift_labels = labels[..., 1:].contiguous()
-            # Flatten the tokens
-            loss_fct = CrossEntropyLoss()
-            shift_logits = shift_logits.view(-1, logits.shape[-1])
-            shift_labels = shift_labels.view(-1)
-            # Enable model parallelism
-            shift_labels = shift_labels.to(shift_logits.device)
-            loss = loss_fct(shift_logits, shift_labels)
-
-        return CausalLMOutputWithPast(logits=logits, past_key_values=seq if use_cache else None, loss=loss)
-
-    @classmethod
-    def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.PathLike]], *model_args, **kwargs):
-        assert len(model_args) == 0 and len(kwargs) == 0, "extra args is currently not supported"
-        if isinstance(pretrained_model_name_or_path, str):
-            pretrained_model_name_or_path = Path(pretrained_model_name_or_path)
-
-        pretrained_model_name_or_path = Path(f'{shared.args.model_dir}') / Path(pretrained_model_name_or_path)
-        config = ExLlamaConfig(pretrained_model_name_or_path / 'config.json')
-
-        # from 'oobabooga/text-generation-webui/modules/exllama.py'
-        weight_path = None
-        for ext in ['.safetensors', '.pt', '.bin']:
-            found = list(pretrained_model_name_or_path.glob(f"*{ext}"))
-            if len(found) > 0:
-                weight_path = found[-1]
-                break
-        assert weight_path is not None, f'could not find weight in "{pretrained_model_name_or_path}"'
-
-        config.model_path = str(weight_path)
-        config.max_seq_len = shared.args.max_seq_len
-        config.compress_pos_emb = shared.args.compress_pos_emb
-        if shared.args.gpu_split:
-            config.set_auto_map(shared.args.gpu_split)
-            config.gpu_peer_fix = True
-
-        if shared.args.alpha_value > 1 and shared.args.rope_freq_base == 0:
-            config.alpha_value = shared.args.alpha_value
-            config.calculate_rotary_embedding_base()
-        elif shared.args.rope_freq_base > 0:
-            config.rotary_embedding_base = shared.args.rope_freq_base
-
-        if torch.version.hip:
-            config.rmsnorm_no_half2 = True
-            config.rope_no_half2 = True
-            config.matmul_no_half2 = True
-            config.silu_no_half2 = True
-
-        # This slowes down a bit but align better with autogptq generation.
-        # TODO: Should give user choice to tune the exllama config
-        # config.fused_attn = False
-        # config.fused_mlp_thd = 0
-
-        return ExllamaHF(config)
--- a/modules/loaders.py
+++ b/modules/loaders.py
@@ -81,15 +81,15 @@ loaders_and_params = OrderedDict({
        'trust_remote_code',
        'no_use_fast',
    ],
-    'ExLlama_HF': [
+    'ExLlamav2': [
        'gpu_split',
        'max_seq_len',
+        'no_flash_attn',
+        'num_experts_per_token',
+        'cache_8bit',
        'alpha_value',
-        'rope_freq_base',
        'compress_pos_emb',
-        'cfg_cache',
-        'trust_remote_code',
-        'no_use_fast',
+        'exllamav2_info',
    ],
    'AutoGPTQ': [
        'triton',
@@ -128,24 +128,6 @@ loaders_and_params = OrderedDict({
        'no_use_fast',
        'gptq_for_llama_info',
    ],
-    'ExLlamav2': [
-        'gpu_split',
-        'max_seq_len',
-        'no_flash_attn',
-        'num_experts_per_token',
-        'cache_8bit',
-        'alpha_value',
-        'compress_pos_emb',
-        'exllamav2_info',
-    ],
-    'ExLlama': [
-        'gpu_split',
-        'max_seq_len',
-        'alpha_value',
-        'rope_freq_base',
-        'compress_pos_emb',
-        'exllama_info',
-    ],
    'ctransformers': [
        'n_ctx',
        'n_gpu_layers',
@@ -216,54 +198,6 @@ loaders_samplers = {
    'AutoAWQ': transformers_samplers(),
    'QuIP#': transformers_samplers(),
    'HQQ': transformers_samplers(),
-    'ExLlama_HF': {
-        'temperature',
-        'temperature_last',
-        'top_p',
-        'min_p',
-        'top_k',
-        'typical_p',
-        'epsilon_cutoff',
-        'eta_cutoff',
-        'tfs',
-        'top_a',
-        'repetition_penalty',
-        'presence_penalty',
-        'frequency_penalty',
-        'repetition_penalty_range',
-        'encoder_repetition_penalty',
-        'no_repeat_ngram_size',
-        'min_length',
-        'seed',
-        'do_sample',
-        'mirostat_mode',
-        'mirostat_tau',
-        'mirostat_eta',
-        'grammar_file_row',
-        'grammar_string',
-        'guidance_scale',
-        'negative_prompt',
-        'ban_eos_token',
-        'custom_token_bans',
-        'add_bos_token',
-        'skip_special_tokens',
-        'auto_max_new_tokens',
-    },
-    'ExLlama': {
-        'temperature',
-        'top_p',
-        'top_k',
-        'typical_p',
-        'repetition_penalty',
-        'repetition_penalty_range',
-        'seed',
-        'guidance_scale',
-        'negative_prompt',
-        'ban_eos_token',
-        'add_bos_token',
-        'custom_token_bans',
-        'auto_max_new_tokens',
-    },
    'ExLlamav2': {
        'temperature',
        'top_p',
--- a/modules/logits.py
+++ b/modules/logits.py
@@ -14,11 +14,10 @@ def get_next_logits(prompt, state, use_samplers, previous, top_logits=50, return
        return 'Error: No model is loaded1 Select one in the Model tab.', previous

    is_non_hf_exllamav2 = shared.model.__class__.__name__ == 'Exllamav2Model'
-    is_non_hf_exllamav1 = shared.model.__class__.__name__ == 'ExllamaModel'
    is_non_hf_llamacpp = shared.model.__class__.__name__ == 'LlamaCppModel'

    if use_samplers:
-        if any([is_non_hf_exllamav2, is_non_hf_exllamav1, is_non_hf_llamacpp]):
+        if any([is_non_hf_exllamav2, is_non_hf_llamacpp]):
            logger.error("Sampler hijacking is not supported for non-Huggingface loaders.")
            # sampling is all done in c for exllama, so it is really hard to hijack
            # it should be possible to hijack llamacpp sampler by hijacking all their sampling methods,
@ -32,7 +31,7 @@ def get_next_logits(prompt, state, use_samplers, previous, top_logits=50, return

        scores = sampler_hijack.global_scores[-1]
    else:
-        if is_non_hf_exllamav2 or is_non_hf_exllamav1:
+        if is_non_hf_exllamav2:
            if is_torch_xpu_available():
                tokens = shared.tokenizer.encode(prompt).to("xpu:0")
            else:
@ -51,7 +50,7 @@ def get_next_logits(prompt, state, use_samplers, previous, top_logits=50, return

    probs = torch.softmax(scores, dim=-1, dtype=torch.float)
    topk_values, topk_indices = torch.topk(probs, k=top_logits, largest=True, sorted=True)
-    if is_non_hf_exllamav1 or is_non_hf_llamacpp:
+    if is_non_hf_llamacpp:
        topk_indices = [i.expand((1, 1)) for i in topk_indices]

    if hasattr(shared.tokenizer, 'convert_ids_to_tokens'):
--- a/modules/models.py
+++ b/modules/models.py
@ -66,8 +66,6 @@ def load_model(model_name, loader=None):
        'llama.cpp': llamacpp_loader,
        'llamacpp_HF': llamacpp_HF_loader,
        'RWKV': RWKV_loader,
-        'ExLlama': ExLlama_loader,
-        'ExLlama_HF': ExLlama_HF_loader,
        'ExLlamav2': ExLlamav2_loader,
        'ExLlamav2_HF': ExLlamav2_HF_loader,
        'ctransformers': ctransformers_loader,
@ -382,19 +380,6 @@ def AutoGPTQ_loader(model_name):
    return modules.AutoGPTQ_loader.load_quantized(model_name)


-def ExLlama_loader(model_name):
-    from modules.exllama import ExllamaModel
-
-    model, tokenizer = ExllamaModel.from_pretrained(model_name)
-    return model, tokenizer
-
-
-def ExLlama_HF_loader(model_name):
-    from modules.exllama_hf import ExllamaHF
-
-    return ExllamaHF.from_pretrained(model_name)
-
-
 def ExLlamav2_loader(model_name):
    from modules.exllamav2 import Exllamav2Model

--- a/modules/models_settings.py
+++ b/modules/models_settings.py
@ -41,13 +41,11 @@ def get_model_metadata(model):

    if 'loader' not in model_settings:
        if hf_metadata is not None and 'quip_params' in hf_metadata:
-            model_settings['loader'] = 'QuIP#'
+            loader = 'QuIP#'
        else:
            loader = infer_loader(model, model_settings)
-            if 'wbits' in model_settings and type(model_settings['wbits']) is int and model_settings['wbits'] > 0:
-                loader = 'AutoGPTQ'

-            model_settings['loader'] = loader
+        model_settings['loader'] = loader

    # GGUF metadata
    if model_settings['loader'] in ['llama.cpp', 'llamacpp_HF', 'ctransformers']:
@ -152,7 +150,7 @@ def infer_loader(model_name, model_settings):
    if not path_to_model.exists():
        loader = None
    elif (path_to_model / 'quantize_config.json').exists() or ('wbits' in model_settings and type(model_settings['wbits']) is int and model_settings['wbits'] > 0):
-        loader = 'ExLlama_HF'
+        loader = 'ExLlamav2_HF'
    elif (path_to_model / 'quant_config.json').exists() or re.match(r'.*-awq', model_name.lower()):
        loader = 'AutoAWQ'
    elif len(list(path_to_model.glob('*.gguf'))) > 0:
@ -229,7 +227,7 @@ def apply_model_settings_to_state(model, state):
        loader = model_settings.pop('loader')

        # If the user is using an alternative loader for the same model type, let them keep using it
-        if not (loader == 'AutoGPTQ' and state['loader'] in ['GPTQ-for-LLaMa', 'ExLlama', 'ExLlama_HF', 'ExLlamav2', 'ExLlamav2_HF']) and not (loader == 'llama.cpp' and state['loader'] in ['llamacpp_HF', 'ctransformers']):
+        if not (loader == 'ExLlamav2_HF' and state['loader'] in ['GPTQ-for-LLaMa', 'ExLlamav2', 'AutoGPTQ']) and not (loader == 'llama.cpp' and state['loader'] in ['llamacpp_HF', 'ctransformers']):
            state['loader'] = loader

    for k in model_settings:
--- a/modules/shared.py
+++ b/modules/shared.py
@ -85,7 +85,7 @@ group.add_argument('--chat-buttons', action='store_true', help='Show buttons on

 # Model loader
 group = parser.add_argument_group('Model loader')
-group.add_argument('--loader', type=str, help='Choose the model loader manually, otherwise, it will get autodetected. Valid options: Transformers, llama.cpp, llamacpp_HF, ExLlama_HF, ExLlamav2_HF, AutoGPTQ, AutoAWQ, GPTQ-for-LLaMa, ExLlama, ExLlamav2, ctransformers, QuIP#.')
+group.add_argument('--loader', type=str, help='Choose the model loader manually, otherwise, it will get autodetected. Valid options: Transformers, llama.cpp, llamacpp_HF, ExLlamav2_HF, ExLlamav2, AutoGPTQ, AutoAWQ, GPTQ-for-LLaMa, ctransformers, QuIP#.')

 # Transformers/Accelerate
 group = parser.add_argument_group('Transformers/Accelerate')
@ -131,7 +131,7 @@ group.add_argument('--cache-capacity', type=str, help='Maximum cache capacity (l
 group = parser.add_argument_group('ExLlama')
 group.add_argument('--gpu-split', type=str, help='Comma-separated list of VRAM (in GB) to use per GPU device for model layers. Example: 20,7,7.')
 group.add_argument('--max_seq_len', type=int, default=2048, help='Maximum sequence length.')
-group.add_argument('--cfg-cache', action='store_true', help='ExLlama_HF: Create an additional cache for CFG negative prompts. Necessary to use CFG with that loader, but not necessary for CFG with base ExLlama.')
+group.add_argument('--cfg-cache', action='store_true', help='ExLlamav2_HF: Create an additional cache for CFG negative prompts. Necessary to use CFG with that loader.')
 group.add_argument('--no_flash_attn', action='store_true', help='Force flash-attention to not be used.')
 group.add_argument('--cache_8bit', action='store_true', help='Use 8-bit cache to save VRAM.')
 group.add_argument('--num_experts_per_token', type=int, default=2, help='Number of experts to use for generation. Applies to MoE models like Mixtral.')
@ -260,8 +260,6 @@ def fix_loader_name(name):
        return 'GPTQ-for-LLaMa'
    elif name in ['exllama', 'ex-llama', 'ex_llama', 'exlama']:
        return 'ExLlama'
-    elif name in ['exllama-hf', 'exllama_hf', 'exllama hf', 'ex-llama-hf', 'ex_llama_hf']:
-        return 'ExLlama_HF'
    elif name in ['exllamav2', 'exllama-v2', 'ex_llama-v2', 'exlamav2', 'exlama-v2', 'exllama2', 'exllama-2']:
        return 'ExLlamav2'
    elif name in ['exllamav2-hf', 'exllamav2_hf', 'exllama-v2-hf', 'exllama_v2_hf', 'exllama-v2_hf', 'exllama2-hf', 'exllama2_hf', 'exllama-2-hf', 'exllama_2_hf', 'exllama-2_hf']:
--- a/modules/text_generation.py
+++ b/modules/text_generation.py
@ -44,7 +44,7 @@ def _generate_reply(question, state, stopping_strings=None, is_chat=False, escap
            yield ''
            return

-        if shared.model.__class__.__name__ in ['LlamaCppModel', 'RWKVModel', 'ExllamaModel', 'Exllamav2Model', 'CtransformersModel']:
+        if shared.model.__class__.__name__ in ['LlamaCppModel', 'RWKVModel', 'Exllamav2Model', 'CtransformersModel']:
            generate_func = generate_reply_custom
        else:
            generate_func = generate_reply_HF
@ -132,7 +132,7 @@ def encode(prompt, add_special_tokens=True, add_bos_token=True, truncation_lengt
    if truncation_length is not None:
        input_ids = input_ids[:, -truncation_length:]

-    if shared.model.__class__.__name__ in ['LlamaCppModel', 'RWKVModel', 'ExllamaModel', 'Exllamav2Model', 'CtransformersModel'] or shared.args.cpu:
+    if shared.model.__class__.__name__ in ['LlamaCppModel', 'RWKVModel', 'Exllamav2Model', 'CtransformersModel'] or shared.args.cpu:
        return input_ids
    elif shared.args.deepspeed:
        return input_ids.to(device=local_rank)
--- a/modules/ui_model_menu.py
+++ b/modules/ui_model_menu.py
@ -96,7 +96,7 @@ def create_ui():
                            shared.gradio['groupsize'] = gr.Dropdown(label="groupsize", choices=["None", 32, 64, 128, 1024], value=shared.args.groupsize if shared.args.groupsize > 0 else "None")
                            shared.gradio['model_type'] = gr.Dropdown(label="model_type", choices=["None"], value=shared.args.model_type or "None")
                            shared.gradio['pre_layer'] = gr.Slider(label="pre_layer", minimum=0, maximum=100, value=shared.args.pre_layer[0] if shared.args.pre_layer is not None else 0)
-                            shared.gradio['autogptq_info'] = gr.Markdown('* ExLlama_HF is recommended over AutoGPTQ for models derived from Llama.')
+                            shared.gradio['autogptq_info'] = gr.Markdown('* ExLlamav2_HF is recommended over AutoGPTQ for models derived from Llama.')
                            shared.gradio['gpu_split'] = gr.Textbox(label='gpu-split', info='Comma-separated list of VRAM (in GB) to use per GPU. Example: 20,7,7')
                            shared.gradio['max_seq_len'] = gr.Slider(label='max_seq_len', minimum=0, maximum=shared.settings['truncation_length_max'], step=256, info='Context length. Try lowering this if you run out of memory while loading the model.', value=shared.args.max_seq_len)
                            shared.gradio['alpha_value'] = gr.Slider(label='alpha_value', minimum=1, maximum=8, step=0.05, info='Positional embeddings alpha factor for NTK RoPE scaling. Recommended values (NTKv1): 1.75 for 1.5x context, 2.5 for 2x context. Use either this or compress_pos_emb, not both.', value=shared.args.alpha_value)
@ -134,8 +134,7 @@ def create_ui():
                            shared.gradio['cache_8bit'] = gr.Checkbox(label="cache_8bit", value=shared.args.cache_8bit, info='Use 8-bit cache to save VRAM.')
                            shared.gradio['no_use_fast'] = gr.Checkbox(label="no_use_fast", value=shared.args.no_use_fast, info='Set use_fast=False while loading the tokenizer.')
                            shared.gradio['num_experts_per_token'] = gr.Number(label="Number of experts per token", value=shared.args.num_experts_per_token, info='Only applies to MoE models like Mixtral.')
-                            shared.gradio['gptq_for_llama_info'] = gr.Markdown('Legacy loader for compatibility with older GPUs. ExLlama_HF or AutoGPTQ are preferred for GPTQ models when supported.')
-                            shared.gradio['exllama_info'] = gr.Markdown("ExLlama_HF is recommended over ExLlama for better integration with extensions and more consistent sampling behavior across loaders.")
+                            shared.gradio['gptq_for_llama_info'] = gr.Markdown('Legacy loader for compatibility with older GPUs. ExLlamav2_HF or AutoGPTQ are preferred for GPTQ models when supported.')
                            shared.gradio['exllamav2_info'] = gr.Markdown("ExLlamav2_HF is recommended over ExLlamav2 for better integration with extensions and more consistent sampling behavior across loaders.")
                            shared.gradio['llamacpp_HF_info'] = gr.Markdown('llamacpp_HF loads llama.cpp as a Transformers model. To use it, you need to download a tokenizer.\n\nOption 1 (recommended): place your .gguf in a subfolder of models/ along with these 4 files: special_tokens_map.json, tokenizer_config.json, tokenizer.json, tokenizer.model.\n\nOption 2: download `oobabooga/llama-tokenizer` under "Download model or LoRA". That\'s a default Llama tokenizer that will work for some (but not all) models.')

--- a/one_click.py
+++ b/one_click.py
@ -343,27 +343,6 @@ def update_requirements(initial_installation=False):
    if not os.path.exists("repositories/"):
        os.mkdir("repositories")

-    os.chdir("repositories")
-
-    # Install or update ExLlama as needed
-    if not os.path.exists("exllama/"):
-        run_cmd("git clone https://github.com/turboderp/exllama.git", environment=True)
-    else:
-        os.chdir("exllama")
-        run_cmd("git pull", environment=True)
-        os.chdir("..")
-
-    if is_linux():
-        # Fix JIT compile issue with ExLlama in Linux/WSL
-        if not os.path.exists(f"{conda_env_path}/lib64"):
-            run_cmd(f'ln -s "{conda_env_path}/lib" "{conda_env_path}/lib64"', environment=True)
-
-        # On some Linux distributions, g++ may not exist or be the wrong version to compile GPTQ-for-LLaMa
-        gxx_output = run_cmd("g++ -dumpfullversion -dumpversion", environment=True, capture_output=True)
-        if gxx_output.returncode != 0 or int(gxx_output.stdout.strip().split(b".")[0]) > 11:
-            # Install the correct version of g++
-            run_cmd("conda install -y -k conda-forge::gxx_linux-64=11.2.0", environment=True)
-
    clear_cache()


--- a/requirements.txt
+++ b/requirements.txt
@ -66,14 +66,6 @@ https://github.com/jllllll/AutoGPTQ/releases/download/v0.6.0/auto_gptq-0.6.0+cu1
 https://github.com/jllllll/AutoGPTQ/releases/download/v0.6.0/auto_gptq-0.6.0+cu121-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
 https://github.com/jllllll/AutoGPTQ/releases/download/v0.6.0/auto_gptq-0.6.0+cu121-cp39-cp39-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
 https://github.com/jllllll/AutoGPTQ/releases/download/v0.6.0/auto_gptq-0.6.0+cu121-cp38-cp38-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.8"
-https://github.com/jllllll/exllama/releases/download/0.0.18/exllama-0.0.18+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/jllllll/exllama/releases/download/0.0.18/exllama-0.0.18+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
-https://github.com/jllllll/exllama/releases/download/0.0.18/exllama-0.0.18+cu121-cp39-cp39-win_amd64.whl; platform_system == "Windows" and python_version == "3.9"
-https://github.com/jllllll/exllama/releases/download/0.0.18/exllama-0.0.18+cu121-cp38-cp38-win_amd64.whl; platform_system == "Windows" and python_version == "3.8"
-https://github.com/jllllll/exllama/releases/download/0.0.18/exllama-0.0.18+cu121-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/jllllll/exllama/releases/download/0.0.18/exllama-0.0.18+cu121-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
-https://github.com/jllllll/exllama/releases/download/0.0.18/exllama-0.0.18+cu121-cp39-cp39-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
-https://github.com/jllllll/exllama/releases/download/0.0.18/exllama-0.0.18+cu121-cp38-cp38-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.8"
 https://github.com/turboderp/exllamav2/releases/download/v0.0.11/exllamav2-0.0.11+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
 https://github.com/turboderp/exllamav2/releases/download/v0.0.11/exllamav2-0.0.11+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
 https://github.com/turboderp/exllamav2/releases/download/v0.0.11/exllamav2-0.0.11+cu121-cp39-cp39-win_amd64.whl; platform_system == "Windows" and python_version == "3.9"
--- a/requirements_amd.txt
+++ b/requirements_amd.txt
@ -46,10 +46,6 @@ https://github.com/jllllll/AutoGPTQ/releases/download/v0.6.0/auto_gptq-0.6.0+roc
 https://github.com/jllllll/AutoGPTQ/releases/download/v0.6.0/auto_gptq-0.6.0+rocm5.6-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
 https://github.com/jllllll/AutoGPTQ/releases/download/v0.6.0/auto_gptq-0.6.0+rocm5.6-cp39-cp39-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
 https://github.com/jllllll/AutoGPTQ/releases/download/v0.6.0/auto_gptq-0.6.0+rocm5.6-cp38-cp38-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.8"
-https://github.com/jllllll/exllama/releases/download/0.0.18/exllama-0.0.18+rocm5.6-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/jllllll/exllama/releases/download/0.0.18/exllama-0.0.18+rocm5.6-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
-https://github.com/jllllll/exllama/releases/download/0.0.18/exllama-0.0.18+rocm5.6-cp39-cp39-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
-https://github.com/jllllll/exllama/releases/download/0.0.18/exllama-0.0.18+rocm5.6-cp38-cp38-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.8"
 https://github.com/turboderp/exllamav2/releases/download/v0.0.11/exllamav2-0.0.11+rocm5.6-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
 https://github.com/turboderp/exllamav2/releases/download/v0.0.11/exllamav2-0.0.11+rocm5.6-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
 https://github.com/jllllll/GPTQ-for-LLaMa-CUDA/releases/download/0.1.1/gptq_for_llama-0.1.1+rocm5.6-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
--- a/requirements_amd_noavx2.txt
+++ b/requirements_amd_noavx2.txt
@ -42,10 +42,6 @@ https://github.com/jllllll/AutoGPTQ/releases/download/v0.6.0/auto_gptq-0.6.0+roc
 https://github.com/jllllll/AutoGPTQ/releases/download/v0.6.0/auto_gptq-0.6.0+rocm5.6-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
 https://github.com/jllllll/AutoGPTQ/releases/download/v0.6.0/auto_gptq-0.6.0+rocm5.6-cp39-cp39-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
 https://github.com/jllllll/AutoGPTQ/releases/download/v0.6.0/auto_gptq-0.6.0+rocm5.6-cp38-cp38-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.8"
-https://github.com/jllllll/exllama/releases/download/0.0.18/exllama-0.0.18+rocm5.6-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/jllllll/exllama/releases/download/0.0.18/exllama-0.0.18+rocm5.6-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
-https://github.com/jllllll/exllama/releases/download/0.0.18/exllama-0.0.18+rocm5.6-cp39-cp39-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
-https://github.com/jllllll/exllama/releases/download/0.0.18/exllama-0.0.18+rocm5.6-cp38-cp38-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.8"
 https://github.com/turboderp/exllamav2/releases/download/v0.0.11/exllamav2-0.0.11+rocm5.6-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
 https://github.com/turboderp/exllamav2/releases/download/v0.0.11/exllamav2-0.0.11+rocm5.6-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
 https://github.com/jllllll/GPTQ-for-LLaMa-CUDA/releases/download/0.1.1/gptq_for_llama-0.1.1+rocm5.6-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
--- a/requirements_noavx2.txt
+++ b/requirements_noavx2.txt
@ -66,14 +66,6 @@ https://github.com/jllllll/AutoGPTQ/releases/download/v0.6.0/auto_gptq-0.6.0+cu1
 https://github.com/jllllll/AutoGPTQ/releases/download/v0.6.0/auto_gptq-0.6.0+cu121-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
 https://github.com/jllllll/AutoGPTQ/releases/download/v0.6.0/auto_gptq-0.6.0+cu121-cp39-cp39-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
 https://github.com/jllllll/AutoGPTQ/releases/download/v0.6.0/auto_gptq-0.6.0+cu121-cp38-cp38-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.8"
-https://github.com/jllllll/exllama/releases/download/0.0.18/exllama-0.0.18+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
-https://github.com/jllllll/exllama/releases/download/0.0.18/exllama-0.0.18+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
-https://github.com/jllllll/exllama/releases/download/0.0.18/exllama-0.0.18+cu121-cp39-cp39-win_amd64.whl; platform_system == "Windows" and python_version == "3.9"
-https://github.com/jllllll/exllama/releases/download/0.0.18/exllama-0.0.18+cu121-cp38-cp38-win_amd64.whl; platform_system == "Windows" and python_version == "3.8"
-https://github.com/jllllll/exllama/releases/download/0.0.18/exllama-0.0.18+cu121-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
-https://github.com/jllllll/exllama/releases/download/0.0.18/exllama-0.0.18+cu121-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
-https://github.com/jllllll/exllama/releases/download/0.0.18/exllama-0.0.18+cu121-cp39-cp39-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.9"
-https://github.com/jllllll/exllama/releases/download/0.0.18/exllama-0.0.18+cu121-cp38-cp38-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.8"
 https://github.com/turboderp/exllamav2/releases/download/v0.0.11/exllamav2-0.0.11+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
 https://github.com/turboderp/exllamav2/releases/download/v0.0.11/exllamav2-0.0.11+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
 https://github.com/turboderp/exllamav2/releases/download/v0.0.11/exllamav2-0.0.11+cu121-cp39-cp39-win_amd64.whl; platform_system == "Windows" and python_version == "3.9"