Update README

oobabooga 2023-06-23 12:24:43 -03:00
parent 3ae9af01aa
commit 0f9088f730


@@ -258,6 +258,7 @@ Optionally, you can use the following command-line flags:
| `--triton` | Use triton. |
| `--no_inject_fused_attention` | Disable the use of fused attention, which will use less VRAM at the cost of slower inference. |
| `--no_inject_fused_mlp` | Triton mode only: disable the use of fused MLP, which will use less VRAM at the cost of slower inference. |
| `--no_use_cuda_fp16` | This can make models faster on some systems. |
| `--desc_act` | For models that don't have a quantize_config.json, this parameter is used to define whether to set desc_act or not in BaseQuantizeConfig. |
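
These AutoGPTQ options are passed on the command line when starting the web UI. A minimal sketch, assuming the standard `python server.py` entry point; the model folder name below is a placeholder:

```
# Hypothetical launch combining flags from the table above.
# "some-model-GPTQ" is a placeholder for a model folder in the models/ directory.
python server.py --model some-model-GPTQ --triton --no_use_cuda_fp16
```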
#### ExLlama