Nam D. Tran f6793491b5
llama : add AWQ for llama, llama2, mpt, and mistral models (#4593)

Co-authored-by: Trần Đức Nam <v.namtd12@vinai.io>
Co-authored-by: Le Hoang Anh <v.anhlh33@vinai.io>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-27 17:39:45 +02:00

AWQ: Activation-aware Weight Quantization for LLMs, applied to llama.cpp

[Paper][Original Repo][Easy-to-use Repo]

Supported models:

  • LLaMA
  • LLaMA 2
  • MPT
  • Mistral AI v0.1
  • Bloom (not yet supported; see TODO)
  • Mixtral MoE (not yet supported; see TODO)

TODO:

  • Update this version to work with both MPT and MPT-AWQ models
  • Add OPT model
  • Add Bloom model
  • Add Mixtral MoE
  • Support w3, w2

Contents

  • Install
  • Convert
  • Quantize
  • Test
  • Benchmark
  • Results

Install

Install requirements

pip install -r requirements.txt

Get the pre-computed AWQ search results for multiple model families, including LLaMA, LLaMA2, MPT, and OPT.

git clone https://huggingface.co/datasets/mit-han-lab/awq-model-zoo awq_cache

Convert

Examples for llama, llama2, mistral, and mpt models

# For llama and llama2 models
python convert.py models/llama-7b/ --awq-path awq_cache/llama-7b-w4-g128.pt --outfile models/llama_7b_fp16.gguf
# For mistral and mpt models (use the AWQ cache file that matches the model)
python convert-hf-to-gguf.py models/mpt-7b/ --awq-path awq_cache/mpt-7b-w4-g128.pt --outfile models/mpt_7b_fp16.gguf
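The choice of converter script depends on the model family, as the comments in the commands above indicate. A minimal sketch of that selection in Python follows. The helper name `build_convert_command` and the mapping `CONVERTER_BY_FAMILY` are hypothetical and not part of llama.cpp; only the script names and the `--awq-path` and `--outfile` flags come from the commands above.

```python
import shlex

# Mapping taken from the comments in the commands above (hypothetical table name):
# llama / llama2 use convert.py, mistral / mpt use convert-hf-to-gguf.py.
CONVERTER_BY_FAMILY = {
    "llama": "convert.py",
    "llama2": "convert.py",
    "mistral": "convert-hf-to-gguf.py",
    "mpt": "convert-hf-to-gguf.py",
}


def build_convert_command(family, model_dir, awq_path, outfile):
    """Return the argv list for converting a model with an AWQ cache file.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    try:
        script = CONVERTER_BY_FAMILY[family.lower()]
    except KeyError:
        raise ValueError(f"unsupported model family: {family!r}") from None
    return ["python", script, model_dir, "--awq-path", awq_path, "--outfile", outfile]


cmd = build_convert_command(
    "llama",
    "models/llama-7b/",
    "awq_cache/llama-7b-w4-g128.pt",
    "models/llama_7b_fp16.gguf",
)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root if the conversion should actually be executed.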

Quantize

# We only benchmark and confirm the results on q4_0, q4_1, and q2_k types.
./quantize models/llama_7b_fp16.gguf models/llama_7b_q4_0.gguf q4_0

Test

# For all models.
./build/bin/main -m models/llama_7b_q4_0.gguf -n 128 --prompt "Once upon a time"

Benchmark

The perplexity measurements in the tables below are taken on the wikitext2 test dataset (https://paperswithcode.com/dataset/wikitext-2), with a context length of 512.

# For llama, llama2, and mistral models.
./perplexity -m models/llama_7b_q4_0.gguf -f datasets/wikitext-2-raw/wiki.test.raw
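For readers unfamiliar with the metric, perplexity is the exponential of the mean negative log-likelihood per token, so lower is better. The sketch below shows only that standard definition. It does not reproduce how the `perplexity` tool splits the text into 512-token chunks, and the function name and sample log-probabilities are made up for illustration.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token).

    token_log_probs: natural-log probabilities the model assigned to each
    reference token (hypothetical input format for this sketch).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical log-probabilities for four tokens.
example = [math.log(0.5), math.log(0.25), math.log(0.25), math.log(0.125)]
print(perplexity(example))
```

With these sample values the mean negative log-likelihood is 2·ln 2, so the result is approximately 4: the model is on average as uncertain as if it were choosing uniformly among four tokens.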

Results

Results are run on OpenBLAS (CPU) and CuBLAS (GPU) for a fair comparison. We test our version with three llama.cpp quantization types: q4_0, q4_1, and q2_k.

Llama 7B (Build with OpenBLAS)

Model           Measure      F16      Q4_0     Q4_1     Q2_K
Llama 7B        perplexity   5.9066   6.1214   6.0643   6.5808
Llama 7B        file size    12.9G    3.5G     3.9G     2.7G
Llama 7B        bits/weight  16.0     4.5      5.0      2.6
AWQ-LLama 7B    perplexity   5.9175   6.0252   5.9987   6.3692
AWQ-LLama 7B    file size    12.9G    3.5G     3.9G     2.7G
AWQ-LLama 7B    bits/weight  16.0     4.5      5.0      2.6
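
The bits/weight values for Q4_0 and Q4_1 in the table above can be reproduced with a small calculation. This sketch assumes the ggml block layouts in which 32 weights share one block: Q4_0 stores a 4-bit value per weight plus one fp16 scale, and Q4_1 additionally stores an fp16 minimum. The helper name is hypothetical, and Q2_K is left out because its super-block layout is more involved.

```python
def bits_per_weight(block_size, quant_bits, header_bits):
    """Average storage cost per weight for a block-quantized format."""
    return (block_size * quant_bits + header_bits) / block_size


# Q4_0: 32 weights x 4 bits, plus one fp16 scale per block (assumed layout).
q4_0 = bits_per_weight(32, 4, 16)
# Q4_1: 32 weights x 4 bits, plus an fp16 scale and an fp16 minimum per block.
q4_1 = bits_per_weight(32, 4, 32)
print(q4_0, q4_1)  # 4.5 5.0
```

AWQ changes the weight values before quantization, not the storage format, which is consistent with the AWQ and baseline rows listing identical file sizes and bits/weight.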

Llama2 7B (Build with CuBLAS)

Model           Measure      F16      Q4_0     Q4_1     Q2_K
Llama2 7B       perplexity   5.8664   6.0260   6.0656   6.4496
Llama2 7B       file size    12.9G    3.5G     3.9G     2.7G
Llama2 7B       bits/weight  16.0     4.5      5.0      2.6
AWQ-LLama2 7B   perplexity   5.8801   6.0054   5.9849   6.3650
AWQ-LLama2 7B   file size    12.9G    3.5G     3.9G     2.7G
AWQ-LLama2 7B   bits/weight  16.0     4.5      5.0      2.6

Mistral 7B v0.1 (Build with CuBLAS)

Model           Measure      F16      Q4_0     Q4_1     Q2_K
Mistral 7B      perplexity   5.6931   5.8202   5.8268   6.1645
Mistral 7B      file size    14.5G    4.1G     4.5G     3.1G
Mistral 7B      bits/weight  16.0     4.5      5.0      2.6
AWQ-Mistral 7B  perplexity   5.6934   5.8020   5.7691   6.0426
AWQ-Mistral 7B  file size    14.5G    4.1G     4.5G     3.1G
AWQ-Mistral 7B  bits/weight  16.0     4.5      5.0      2.6

MPT 7B (Build with OpenBLAS)

Model           Measure      F16      Q4_0     Q4_1     Q2_K
MPT 7B          perplexity   8.4369   8.7956   8.6265   11.4913
MPT 7B          file size    13.7G    3.9G     4.3G     2.8G
MPT 7B          bits/weight  16.0     4.5      5.0      2.6
AWQ-MPT 7B      perplexity   8.4944   8.7053   8.6750   10.2873
AWQ-MPT 7B      file size    13.7G    3.9G     4.3G     2.8G
AWQ-MPT 7B      bits/weight  16.0     4.5      5.0      2.6
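
To make the four tables above easier to compare, the following sketch computes the change in perplexity that AWQ brings for each model and quantization type. The perplexity values are copied from the tables; the names `PERPLEXITY` and `awq_deltas` are hypothetical and exist only for this illustration.

```python
QUANTS = ("F16", "Q4_0", "Q4_1", "Q2_K")

# Perplexity rows copied from the tables above: (baseline, AWQ).
PERPLEXITY = {
    "Llama 7B": ((5.9066, 6.1214, 6.0643, 6.5808), (5.9175, 6.0252, 5.9987, 6.3692)),
    "Llama2 7B": ((5.8664, 6.0260, 6.0656, 6.4496), (5.8801, 6.0054, 5.9849, 6.3650)),
    "Mistral 7B": ((5.6931, 5.8202, 5.8268, 6.1645), (5.6934, 5.8020, 5.7691, 6.0426)),
    "MPT 7B": ((8.4369, 8.7956, 8.6265, 11.4913), (8.4944, 8.7053, 8.6750, 10.2873)),
}


def awq_deltas(table=PERPLEXITY):
    """Return {(model, quant): (absolute delta, percent delta)} for AWQ vs. baseline.

    Negative values mean AWQ lowered (improved) perplexity.
    """
    deltas = {}
    for model, (base, awq) in table.items():
        for quant, b, a in zip(QUANTS, base, awq):
            deltas[(model, quant)] = (a - b, 100.0 * (a - b) / b)
    return deltas


for (model, quant), (abs_d, pct) in awq_deltas().items():
    print(f"{model:<11} {quant:<5} {abs_d:+.4f} ({pct:+.2f}%)")
```

Read this way, the tables show AWQ lowering perplexity in every quantized column except MPT 7B at Q4_1, with the largest gains at Q2_K, while the F16 perplexity is slightly higher with AWQ for all four models.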