mirror of https://github.com/oobabooga/text-generation-webui.git (synced 2024-10-30 06:00:15 +01:00)
Update Low-VRAM-guide.md
This commit is contained in:
parent ee99a87330
commit aa83fc21d4
@@ -1,4 +1,4 @@
-If your GPU is not large enough to fit a model, try these in the following order:
+If your GPU is not large enough to fit a 16-bit model, try these in the following order:

### Load the model in 8-bit mode

@@ -6,7 +6,11 @@ If your GPU is not large enough to fit a model, try these in the following order:

```
python server.py --load-in-8bit
```

This reduces the memory usage by half with no noticeable loss in quality. Only newer GPUs support 8-bit mode.
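
For reference, a full invocation might look like the sketch below; the model name is only a placeholder for a folder inside your `models` directory, and the `--model` flag is assumed to be available in your version of the web UI:

```
python server.py --model llama-7b --load-in-8bit
```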

### Load the model in 4-bit mode

```
python server.py --load-in-4bit
```

### Split the model across your GPU and CPU

@@ -34,8 +38,6 @@ python server.py --auto-devices --gpu-memory 3500MiB
...
```

Additionally, you can set the `--no-cache` flag to reduce GPU usage while generating text, at a performance cost. This may allow you to set a higher value for `--gpu-memory`, resulting in a net performance gain.
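
For example, a sketch combining `--no-cache` with the flags above (the memory value is illustrative, not a recommendation):

```
python server.py --auto-devices --gpu-memory 3500MiB --no-cache
```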

### Send layers to a disk cache

As a desperate last measure, you can split the model across your GPU, CPU, and disk:
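
A sketch of such an invocation, assuming `--disk` is the flag that enables the disk cache in your version of the web UI:

```
python server.py --auto-devices --disk
```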