>FlexGen is a high-throughput generation engine for running large language models with limited GPU memory (e.g., a 16GB T4 GPU or a 24GB RTX3090 gaming card!).
https://github.com/FMInference/FlexGen
## Installation
No additional installation steps are necessary. FlexGen is already listed in this project's `requirements.txt` file.
## Converting a model
FlexGen only works with OPT models, and a model must be converted to NumPy format before you start the web UI:
```
python convert-to-flexgen.py models/opt-1.3b/
```
The output will be saved to `models/opt-1.3b-np/`.
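
To give a rough idea of what "converted to numpy format" means, here is a minimal sketch that writes each named weight array to its own `.npy` file in a sibling folder with an `-np` suffix. This is not the code of `convert-to-flexgen.py`: the function names `numpy_output_dir` and `save_weights_as_numpy`, the one-file-per-parameter layout, and the plain dictionary standing in for the model weights are all assumptions made for illustration.

```python
from pathlib import Path

import numpy as np


def numpy_output_dir(model_dir):
    """Return the sibling folder with an "-np" suffix, e.g. models/opt-1.3b-np."""
    model_dir = Path(model_dir)
    return model_dir.with_name(model_dir.name + "-np")


def save_weights_as_numpy(weights, model_dir):
    """Write each named array to its own .npy file and return the output folder.

    `weights` is a hypothetical {parameter_name: array} mapping that stands in
    for the weights of a loaded model.
    """
    out_dir = numpy_output_dir(model_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, array in weights.items():
        np.save(out_dir / f"{name}.npy", np.asarray(array))
    return out_dir
```

Keeping the output next to the original folder means the unconverted model stays untouched and can be converted again if needed.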
You should typically only change the first two numbers. If their sum is less than 100, the remaining layers are offloaded to disk, by default into the `text-generation-webui/cache` folder.
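
The leftover arithmetic can be sketched in a few lines. The helper `disk_share` and its parameter names are hypothetical and are not part of FlexGen or the web UI. The only thing taken from the text above is that whatever the first two numbers leave out of 100 goes to disk.

```python
def disk_share(first, second):
    """Return the percentage left for disk after the first two numbers."""
    if first < 0 or second < 0 or first + second > 100:
        raise ValueError("the two percentages must be non-negative and sum to at most 100")
    return 100 - (first + second)


print(disk_share(20, 50))  # 30
print(disk_share(0, 100))  # 0
```

So with values of 20 and 50, the remaining 30 percent would end up in the cache folder on disk.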
## Performance
In my experiments with OPT-30B using an RTX 3090 on Linux, I obtained these results: