We integrated AWQ into FastChat to provide efficient and accurate 4-bit LLM inference.
Set up the environment (please refer to this link for more details):
```bash
conda create -n fastchat-awq python=3.10 -y
conda activate fastchat-awq

# cd /path/to/FastChat
pip install --upgrade pip    # enable PEP 660 support
pip install -e .             # install fastchat

git clone https://github.com/mit-han-lab/llm-awq repositories/llm-awq
cd repositories/llm-awq
pip install -e .             # install awq package

cd awq/kernels
python setup.py install     # install awq CUDA kernels
```
```bash
# Download a quantized model from Hugging Face
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/mit-han-lab/vicuna-7b-v1.3-4bit-g128-awq
```
```bash
# You can specify which quantized model to use by setting --awq-ckpt
python3 -m fastchat.serve.cli \
    --model-path models/vicuna-7b-v1.3-4bit-g128-awq \
    --awq-wbits 4 \
    --awq-groupsize 128
```
Through 4-bit weight quantization, AWQ fits larger language models within device memory limits and significantly accelerates token generation. All benchmarks below use group_size 128.
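To make "4-bit weight quantization with group_size 128" concrete, here is a minimal NumPy sketch of plain group-wise asymmetric 4-bit quantization: every group of 128 consecutive weights shares one fp scale and zero point, and each weight is stored as an integer in 0..15. Note this is only an illustration of the storage scheme, not the AWQ algorithm itself; AWQ additionally applies activation-aware per-channel scaling before quantizing to protect salient weights.

```python
import numpy as np

def quantize_groupwise_int4(w, group_size=128):
    """Illustrative group-wise asymmetric 4-bit quantization.

    Each group of `group_size` weights gets its own scale and zero point,
    and weights are mapped to integers in [0, 15]. This is NOT the AWQ
    algorithm (which adds activation-aware scaling first)."""
    w = w.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0          # 4 bits -> 16 levels (0..15)
    zero = np.round(-w_min / scale)         # per-group zero point
    q = np.clip(np.round(w / scale) + zero, 0, 15).astype(np.uint8)
    return q, scale, zero

def dequantize(q, scale, zero):
    """Recover approximate fp32 weights from 4-bit codes."""
    return (q.astype(np.float32) - zero) * scale

# Quantize a random weight matrix and check the reconstruction error.
rng = np.random.default_rng(0)
w = rng.standard_normal(4096 * 128).astype(np.float32)
q, scale, zero = quantize_groupwise_int4(w)
w_hat = dequantize(q, scale, zero).reshape(-1)
err = np.abs(w - w_hat).max()
print(f"max abs reconstruction error: {err:.3f}")
```

Stored this way, each weight costs 4 bits plus a small per-group overhead for the scale and zero point, which is where the roughly 2.4x memory reduction in the tables below comes from.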
Benchmark on NVIDIA RTX A6000:
Model | Bits | Max Memory (MiB) | Speed (ms/token) | AWQ Speedup |
---|---|---|---|---|
vicuna-7b | 16 | 13543 | 26.06 | / |
vicuna-7b | 4 | 5547 | 12.43 | 2.1x |
llama2-7b-chat | 16 | 13543 | 27.14 | / |
llama2-7b-chat | 4 | 5547 | 12.44 | 2.2x |
vicuna-13b | 16 | 25647 | 44.91 | / |
vicuna-13b | 4 | 9355 | 17.30 | 2.6x |
llama2-13b-chat | 16 | 25647 | 47.28 | / |
llama2-13b-chat | 4 | 9355 | 20.28 | 2.3x |
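The speedup column above is simply the fp16 latency divided by the AWQ 4-bit latency, rounded to one decimal. A quick check against the A6000 numbers:

```python
# Recompute the A6000 speedups from the per-token latencies in the table.
fp16_ms = {"vicuna-7b": 26.06, "llama2-7b-chat": 27.14,
           "vicuna-13b": 44.91, "llama2-13b-chat": 47.28}
awq4_ms = {"vicuna-7b": 12.43, "llama2-7b-chat": 12.44,
           "vicuna-13b": 17.30, "llama2-13b-chat": 20.28}
speedups = {name: fp16_ms[name] / awq4_ms[name] for name in fp16_ms}
for name, s in speedups.items():
    print(f"{name}: {s:.1f}x")   # matches the AWQ Speedup column
```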
Benchmark on NVIDIA RTX 4090:
Model | AWQ 4bit Speed (ms/token) | FP16 Speed (ms/token) | AWQ Speedup |
---|---|---|---|
vicuna-7b | 8.61 | 19.09 | 2.2x |
llama2-7b-chat | 8.66 | 19.97 | 2.3x |
vicuna-13b | 12.17 | OOM | / |
llama2-13b-chat | 13.54 | OOM | / |
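The fp16 OOM entries for the 13B models are consistent with a back-of-envelope estimate: at 2 bytes per parameter, the weights of a 13B-parameter model alone take about 24 GiB, which already exhausts the RTX 4090's 24 GB of VRAM before activations and the KV cache are counted. A rough sketch (the parameter count is approximate, and the 4-bit figure ignores the per-group scales and zero points):

```python
# Back-of-envelope memory estimate for a ~13B-parameter model.
params = 13e9
fp16_gib = params * 2 / 2**30    # 2 bytes per weight in fp16
int4_gib = params * 0.5 / 2**30  # 0.5 bytes per weight at 4 bits (excl. scales/zeros)
print(f"fp16 weights: {fp16_gib:.1f} GiB, 4-bit weights: {int4_gib:.1f} GiB")
```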
Benchmark on NVIDIA Jetson Orin:
Model | AWQ 4bit Speed (ms/token) | FP16 Speed (ms/token) | AWQ Speedup |
---|---|---|---|
vicuna-7b | 65.34 | 93.12 | 1.4x |
llama2-7b-chat | 75.11 | 104.71 | 1.4x |
vicuna-13b | 115.40 | OOM | / |
llama2-13b-chat | 136.81 | OOM | / |