1. Background
The plan is to serve a large language model or large code model through FastChat; 7B-parameter models are no problem.
This time I try to load quantized models at the 13B or 33B scale.
FastChat supports two kinds of quantized models, AWQ (llm-awq) and GPTQ; this post starts with AWQ (llm-awq).
https://github.com/lm-sys/FastChat/blob/main/docs/awq.md
There is another implementation of AWQ quantization, AutoAWQ, which is already integrated into transformers, so that version of AWQ is the recommended one.
Reference: transformers/src/transformers/integrations/awq.py at main · huggingface/transformers (github.com)
This post also covers the AutoAWQ quantization workflow.
2. Loading models
Qwen1.5
llm-awq does not support the qwen2 model class (which the Qwen1.5 checkpoints actually use):
python3 -m fastchat.serve.cli \
--model-path /data/shuzhang/models/qwen/Qwen1.5-14B-Chat-AWQ \
--awq-wbits 4 \
--awq-groupsize 128
File "/home/jinxiao/code/llm-deploy/llm-awq/awq/quantize/quantizer.py", line 132, in real_quantize_model_weight
layers = get_blocks(model)
File "/home/jinxiao/code/llm-deploy/llm-awq/awq/quantize/pre_quant.py", line 43, in get_blocks
raise NotImplementedError(type(model))
NotImplementedError: <class 'transformers.models.qwen2.modeling_qwen2.Qwen2ForCausalLM'>
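The NotImplementedError comes from get_blocks() in llm-awq, which only maps a fixed list of architectures to their decoder-layer stacks. A possible workaround, purely as an untested sketch, would be to add a branch for Qwen2ForCausalLM, whose decoder layers sit at model.model.layers just like LLaMA's (the function below is heavily simplified; the upstream version handles more architectures):

# Untested sketch of a patch to llm-awq/awq/quantize/pre_quant.py.
# Qwen2ForCausalLM stores its decoder layers at model.model.layers,
# the same layout llm-awq already assumes for LLaMA-family models.
def get_blocks(model):
    if model.__class__.__name__ in ("LlamaForCausalLM", "Qwen2ForCausalLM"):
        return model.model.layers
    raise NotImplementedError(type(model))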
With AutoAWQ the model can be started directly (script below) and goes through the normal transformers loading path, provided that you first
pip install autoawq
and that you do not run it from inside the llm-awq directory; otherwise it fails with ModuleNotFoundError: No module named 'awq.modules'.
In fact, if transformers can load the AWQ model directly without llm-awq, it means the checkpoint was quantized with AutoAWQ in the first place.
Note: the startup script below does not pass --awq-wbits 4 --awq-groupsize 128; without those options FastChat falls back to loading the model with the transformers library.
python3 -m fastchat.serve.cli \
--model-path /data/shuzhang/models/qwen/Qwen1.5-14B-Chat-AWQ
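FastChat hands this path straight to transformers, so the same checkpoint can also be sanity-checked outside FastChat with a few lines of plain transformers code (a minimal sketch; it assumes autoawq is installed and uses the local path from above):

# Minimal sketch: load the AutoAWQ checkpoint through plain transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/data/shuzhang/models/qwen/Qwen1.5-14B-Chat-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

inputs = tokenizer("Hello, who are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))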
DeepSeek
The DeepSeek models use the LLaMA architecture, so llm-awq supports them.
However, loading fails with an odd error, which looks like a problem with the quantized checkpoint itself.
Quantizing the model myself runs out of GPU memory: a single 24GB 3090 OOMs,
and llm-awq does not support splitting the work across two GPUs either, which is a letdown.
$ python3 -m fastchat.serve.cli \
> --model-path /data/shuzhang/models/deepseek/deepseek-coder-33B-instruct-AWQ \
> --awq-wbits 4 \
> --awq-groupsize 128
Loading AWQ quantized model...
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
real weight quantization...(init only): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 62/62 [00:02<00:00, 28.97it/s]
[Warning] The awq quantized checkpoint seems to be in v1 format.
If the model cannot be loaded successfully, please use the latest awq library to re-quantized the model, or repack the current checkpoint with tinychat/offline-weight-repacker.py
Loading checkpoint: 0%| | 0/1 [00:11<?, ?it/s]
Traceback (most recent call last):
File "/home/jinxiao/miniconda3/envs/llm_new/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/jinxiao/miniconda3/envs/llm_new/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/jinxiao/miniconda3/envs/llm_new/lib/python3.10/site-packages/fastchat/serve/cli.py", line 304, in <module>
main(args)
File "/home/jinxiao/miniconda3/envs/llm_new/lib/python3.10/site-packages/fastchat/serve/cli.py", line 227, in main
chat_loop(
File "/home/jinxiao/miniconda3/envs/llm_new/lib/python3.10/site-packages/fastchat/serve/inference.py", line 361, in chat_loop
model, tokenizer = load_model(
File "/home/jinxiao/miniconda3/envs/llm_new/lib/python3.10/site-packages/fastchat/model/model_adapter.py", line 294, in load_model
model, tokenizer = load_awq_quantized(model_path, awq_config, device)
File "/home/jinxiao/miniconda3/envs/llm_new/lib/python3.10/site-packages/fastchat/modules/awq.py", line 65, in load_awq_quantized
model = load_quant.load_awq_model(
File "/home/jinxiao/code/llm-deploy/llm-awq/tinychat/utils/load_quant.py", line 82, in load_awq_model
model = load_checkpoint_and_dispatch(
File "/home/jinxiao/miniconda3/envs/llm_new/lib/python3.10/site-packages/accelerate/big_modeling.py", line 589, in load_checkpoint_and_dispatch
load_checkpoint_in_model(
File "/home/jinxiao/miniconda3/envs/llm_new/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 1645, in load_checkpoint_in_model
model.load_state_dict(checkpoint, strict=False)
File "/home/jinxiao/miniconda3/envs/llm_new/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2153, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
size mismatch for model.layers.34.mlp.up_proj.qweight: copying a param with shape torch.Size([7168, 2400]) from checkpoint, the shape in current model is torch.Size([4800, 7168]).
size mismatch for model.layers.34.mlp.down_proj.qweight: copying a param with shape torch.Size([19200, 896]) from checkpoint, the shape in current model is torch.Size([1792, 19200]).
size mismatch for model.layers.34.mlp.down_proj.scales: copying a param with shape torch.Size([150, 7168]) from checkpoint, the shape in current model is torch.Size([152, 7168]).
...
3. llm-awq quantization workflow
The current release of llm-awq supports:
- AWQ search for accurate quantization.
- Pre-computed AWQ model zoo for LLMs (LLaMA, Llama2, OPT, CodeLlama, StarCoder, Vicuna, VILA, LLaVA; load to generate quantized weights).
- Memory-efficient 4-bit Linear in PyTorch.
- Efficient CUDA kernel implementation for fast inference (support context and decoding stage).
- Examples on 4-bit inference of an instruction-tuned model (Vicuna) and multi-modal LM (VILA).
Below are the llm-awq quantization steps, recorded so they can be retried later with more GPU memory. (This attempt did not succeed because of insufficient GPU memory.)
Environment setup
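Roughly, following the llm-awq README (the exact upstream instructions may have changed since, so treat this as a sketch):

conda create -n awq python=3.10 -y
conda activate awq
git clone https://github.com/mit-han-lab/llm-awq
cd llm-awq
pip install -e .
# build the 4-bit CUDA kernels used for real-quantized inference
cd awq/kernels
python setup.py install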
Quantization steps
llm-awq/scripts/llama2_example.sh at main · mit-han-lab/llm-awq (github.com)
MODEL_NAME=deepseek-coder-6.7b-instruct
MODEL_PATH=/home/shuzhang/ai/deepseek/$MODEL_NAME
CACHE_PATH=/data/models/llm-awq
AWQ_CACHE=$CACHE_PATH/awq_cache
QUANT_CACHE=$CACHE_PATH/quant_cache
# run AWQ search (optional; we provided the pre-computed results)
python -m awq.entry --model_path $MODEL_PATH \
--w_bit 4 --q_group_size 128 \
--run_awq --dump_awq $AWQ_CACHE/$MODEL_NAME-w4-g128.pt
# evaluate the AWQ quantize model (simulated pseudo quantization)
python -m awq.entry --model_path $MODEL_PATH \
--tasks wikitext \
--w_bit 4 --q_group_size 128 \
--load_awq $AWQ_CACHE/$MODEL_NAME-w4-g128.pt \
--q_backend fake
# generate real quantized weights (w4)
python -m awq.entry --model_path $MODEL_PATH \
--w_bit 4 --q_group_size 128 \
--load_awq $AWQ_CACHE/$MODEL_NAME-w4-g128.pt \
--q_backend real --dump_quant $QUANT_CACHE/$MODEL_NAME-w4-g128-awq.pt
# load and evaluate the real quantized model (smaller gpu memory usage)
python -m awq.entry --model_path $MODEL_PATH \
--tasks wikitext \
--w_bit 4 --q_group_size 128 \
--load_quant $QUANT_CACHE/$MODEL_NAME-w4-g128-awq.pt
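If the real-quantized checkpoint had been produced, FastChat should be able to serve it by combining the original model path with the dumped weights. The invocation below is untested here (quantization never finished) and assumes FastChat's --awq-ckpt option:

python3 -m fastchat.serve.cli \
    --model-path $MODEL_PATH \
    --awq-ckpt $QUANT_CACHE/$MODEL_NAME-w4-g128-awq.pt \
    --awq-wbits 4 \
    --awq-groupsize 128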
Problems encountered
Problem 1
Traceback (most recent call last):
File "/home/jinxiao/miniconda3/envs/llm_new/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/jinxiao/miniconda3/envs/llm_new/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/jinxiao/code/llm-deploy/llm-awq/awq/entry.py", line 15, in <module>
from awq.quantize.pre_quant import run_awq, apply_awq
ModuleNotFoundError: No module named 'awq.quantize.pre_quant'
Solution
- Create the file /home/jinxiao/code/llm-deploy/llm-awq/awq/__init__.py
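For example (the path is the one from this setup):

touch /home/jinxiao/code/llm-deploy/llm-awq/awq/__init__.py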
Problem 2
File "/home/jinxiao/code/llm-deploy/llm-awq/awq/utils/calib_data.py", line 7, in get_calib_dataset
dataset = load_dataset("mit-han-lab/pile-val-backup", split="validation")
Solution
- Manually download the dataset: https://huggingface.co/datasets/mit-han-lab/pile-val-backup
- Local path: /data/models/llm-awq/datasets/mit-han-lab/pile-val-backup
- Modify llm-awq/awq/utils/calib_data.py to load the dataset from the local path (see the sketch below)
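A minimal sketch of that edit, assuming the downloaded repo contains the original val.jsonl.zst file (the exact file name is an assumption; use whatever was actually downloaded):

# In llm-awq/awq/utils/calib_data.py: load the calibration set locally instead of from the Hub.
# Assumes the downloaded pile-val-backup repo contains val.jsonl.zst
# (reading .zst files may require the zstandard package).
from datasets import load_dataset

LOCAL_PILE_VAL = "/data/models/llm-awq/datasets/mit-han-lab/pile-val-backup/val.jsonl.zst"

# was: dataset = load_dataset("mit-han-lab/pile-val-backup", split="validation")
dataset = load_dataset("json", data_files=LOCAL_PILE_VAL, split="train")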
4. AutoAWQ quantization workflow
https://github.com/casper-hansen/AutoAWQ?tab=readme-ov-file#examples
- The quantization script is below; it takes roughly 20 minutes (two 24GB 3090 GPUs).
- After quantization, the model also loads and runs fine through FastChat, with lower GPU memory usage and faster inference.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = '/data/shuzhang/models/deepseek/deepseek-coder-6.7b-instruct'
quant_path = 'deepseek-coder-6.7b-instruct-AWQ'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Quantize
model.quantize(tokenizer, quant_config=quant_config)
# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
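After saving, the quantized directory can be sanity-checked with AutoAWQ itself before handing it to FastChat; a minimal sketch (prompt and generation settings are arbitrary):

# Minimal sketch: reload the AutoAWQ checkpoint and run one generation.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = 'deepseek-coder-6.7b-instruct-AWQ'

model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

tokens = tokenizer("write a quicksort function in python", return_tensors="pt").input_ids.cuda()
output = model.generate(tokens, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))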
Problem 1
- Loading the calibration dataset fails because of network issues; with a working proxy this should not happen:
dataset = load_dataset("mit-han-lab/pile-val-backup", split="validation")
Solution
- Manually download the dataset: https://huggingface.co/datasets/mit-han-lab/pile-val-backup
- Local path: /data/models/llm-awq/datasets/mit-han-lab/pile-val-backup
- Modify the Python file in the traceback to load the dataset from the local path (see the sketch after this list)
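Another option that avoids patching library code: recent AutoAWQ versions let quantize() take the calibration samples directly through its calib_data argument (a list of strings), so the local file can be read once and passed in. A sketch under the same file-name assumption as above:

# Sketch: feed local calibration text to AutoAWQ instead of downloading it.
# Assumes the downloaded pile-val-backup repo contains val.jsonl.zst with a "text" column.
from datasets import load_dataset

data_files = "/data/models/llm-awq/datasets/mit-han-lab/pile-val-backup/val.jsonl.zst"
calib = load_dataset("json", data_files=data_files, split="train")
calib_texts = [t for t in calib["text"] if t.strip()][:512]  # a few hundred samples suffice

model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_texts)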
Problem 2
File "/home/jinxiao/miniconda3/envs/llm_new/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1067, in _update_causal_mask
if hasattr(self.layers[0].self_attn, "past_key_value"): # static cache
File "/home/jinxiao/miniconda3/envs/llm_new/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1688, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'Catcher' object has no attribute 'self_attn'
Solution
- File: .../miniconda3/envs/llm_new/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py
if hasattr(self.layers[0].self_attn, "past_key_value"):  # static cache
=> change to
if False:
5. Summary
Since I neither tested an llm-awq-quantized model nor managed to quantize one with llm-awq,
I cannot say how llm-awq-quantized models behave in terms of inference speed and GPU memory usage.
If their resource usage and inference speed turn out to be similar to AutoAWQ's, AutoAWQ is the more convenient choice and the one I recommend.