Original article: https://alphahinex.github.io/2024/12/22/vllm-multi-node-inference/
description: "This post documents deploying the Qwen2.5-32B-Instruct-GPTQ-Int4 model with vLLM on two machines, each with a single Tesla T4 GPU, and the problems encountered along the way, as a reference for multi-node, multi-GPU inference with vLLM in similar environments."
date: 2024.12.22 10:26
categories:
- AI
tags: [AI, Python, vLLM]
keywords: vllm, gptq, gptq_marlin, tensor-parallel-size, Qwen2.5-32B-Instruct-GPTQ-Int4, multi-node inference, docker, nvidia container toolkit, max-model-len, gpu-memory-utilization, tesla t4
This post documents deploying the Qwen2.5-32B-Instruct-GPTQ-Int4 model with vLLM on two machines, each with a single Tesla T4 GPU, and the problems encountered along the way, as a reference for multi-node, multi-GPU inference with vLLM in similar environments.
Deployment checklist
- Qwen2.5-32B-Instruct-GPTQ-Int4, vLLM
- docker v27.4.0, nvidia-container-toolkit v1.17.3
- Tesla T4 GPU driver v550.127.08, CUDA 12.4
Prepare the deployment packages
# qwen
$ git clone https://www.modelscope.cn/Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4.git
# vllm image
$ docker pull vllm/vllm-openai:v0.6.4.post1
# export
$ docker save vllm/vllm-openai:v0.6.4.post1 | gzip > images.tar.gz
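After copying images.tar.gz to both machines, the image can be imported there; a minimal sketch (docker load reads the gzip-compressed archive directly):
# import and verify the image on each node
$ docker load -i images.tar.gz
$ docker images | grep vllm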
Update the GPU driver
The driver needs to support CUDA >= 12.4 in order to run the vLLM container.
# uninstall the previously installed driver first
$ sh ./NVIDIA-Linux-x86_64-550.127.08.run --uninstall
# then install the new driver
$ sh ./NVIDIA-Linux-x86_64-550.127.08.run
# verify the driver
$ nvidia-smi
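It is also worth recording the GPU's compute capability, since one of the workarounds discussed later requires 8.0 or higher; the compute_cap query field below assumes a recent driver such as the v550 used here (the Tesla T4 reports 7.5):
$ nvidia-smi --query-gpu=name,driver_version,compute_cap --format=csv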
Docker
Docker Engine
$ tar -xzf docker-27.4.0.tgz
$ cp docker/* /usr/local/bin/
$ docker -v
Save the contents of https://github.com/containerd/containerd/blob/main/containerd.service to /usr/lib/systemd/system/containerd.service:
# Copyright The containerd Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target local-fs.target dbus.service
[Service]
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/containerd
Type=notify
Delegate=yes
KillMode=process
Restart=always
RestartSec=5
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity
# Comment TasksMax if your systemd version does not supports it.
# Only systemd 226 and above support this version.
TasksMax=infinity
OOMScoreAdjust=-999
[Install]
WantedBy=multi-user.target
$ systemctl enable --now containerd
$ systemctl status containerd
Save the following to /usr/lib/systemd/system/docker.service:
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network-online.target firewalld.service
Wants=network-online.target
[Service]
Type=notify
ExecStart=/usr/local/bin/dockerd
ExecReload=/bin/kill -s HUP $MAINPID
TimeoutStartSec=0
RestartSec=2
Restart=always
StartLimitBurst=3
StartLimitInterval=60s
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
Delegate=yes
KillMode=process
OOMScoreAdjust=-500
[Install]
WantedBy=multi-user.target
$ systemctl enable --now docker
$ systemctl status docker
Nvidia Container Toolkit
$ tar -xzf nvidia-container-toolkit_1.17.3_rpm_x86_64.tar.gz
$ cd release-v1.17.3-stable/packages/centos7/x86_64
$ rpm -i libnvidia-container1-1.17.3-1.x86_64.rpm
$ rpm -i libnvidia-container-tools-1.17.3-1.x86_64.rpm
$ rpm -i nvidia-container-toolkit-base-1.17.3-1.x86_64.rpm
$ rpm -i nvidia-container-toolkit-1.17.3-1.x86_64.rpm
# verify the installation
$ nvidia-ctk -h
# configure the NVIDIA container runtime for docker
$ nvidia-ctk runtime configure --runtime=docker
# check the generated configuration
$ cat /etc/docker/daemon.json
# restart docker
$ systemctl restart docker
# after the restart, confirm the runtime is registered:
$ docker info | grep Runtimes
Runtimes: io.containerd.runc.v2 nvidia runc
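As an optional smoke test (a sketch assuming the vLLM image from above has already been loaded), run nvidia-smi inside a container; with --gpus all, the NVIDIA runtime makes the tool available in the container:
$ docker run --rm --gpus all --entrypoint nvidia-smi vllm/vllm-openai:v0.6.4.post1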
Qwen
1. Verify the model files
The expected SHA-256 checksums of the safetensors shards are:
942d93a82fb6d0cb27c940329db971c1e55da78aed959b7a9ac23944363e8f47 model-00001-of-00005.safetensors
19139f34508cb30b78868db0f19ed23dbc9f248f1c5688e29000ed19b29a7eef model-00002-of-00005.safetensors
d0f829efe1693dddaa4c6e42e867603f19d9cc71806df6e12b56cc3567927169 model-00003-of-00005.safetensors
3a5a428f449bc9eaf210f8c250bc48f3edeae027c4ef8ae48dd4f80e744dd19e model-00004-of-00005.safetensors
c22a1d1079136e40e1d445dda1de9e3fe5bd5d3b08357c2eb052c5b71bf871fe model-00005-of-00005.safetensors
$ cd /root/model/Qwen2.5-32B-Instruct-GPTQ-Int4
$ sha256sum *.safetensors > sum.txt
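To compare the files on a node against the expected values, the list above can be saved as sum.txt (or copied from the other node) and checked in one step; a minimal sketch:
$ sha256sum -c sum.txt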
2. Configure the cluster
After the vllm/vllm-openai:v0.6.4.post1 image is available on both machines, save https://github.com/vllm-project/vllm/blob/main/examples/run_cluster.sh to /root/model/:
#!/bin/bash

# Check for minimum number of required arguments
if [ $# -lt 4 ]; then
    echo "Usage: $0 docker_image head_node_address --head|--worker path_to_hf_home [additional_args...]"
    exit 1
fi

# Assign the first three arguments and shift them away
DOCKER_IMAGE="$1"
HEAD_NODE_ADDRESS="$2"
NODE_TYPE="$3" # Should be --head or --worker
PATH_TO_HF_HOME="$4"
shift 4

# Additional arguments are passed directly to the Docker command
ADDITIONAL_ARGS=("$@")

# Validate node type
if [ "${NODE_TYPE}" != "--head" ] && [ "${NODE_TYPE}" != "--worker" ]; then
    echo "Error: Node type must be --head or --worker"
    exit 1
fi

# Define a function to cleanup on EXIT signal
cleanup() {
    docker stop node
    docker rm node
}
trap cleanup EXIT

# Command setup for head or worker node
RAY_START_CMD="ray start --block"
if [ "${NODE_TYPE}" == "--head" ]; then
    RAY_START_CMD+=" --head --port=6379"
else
    RAY_START_CMD+=" --address=${HEAD_NODE_ADDRESS}:6379"
fi

# Run the docker command with the user specified parameters and additional arguments
docker run \
    --entrypoint /bin/bash \
    --network host \
    --name node \
    --shm-size 10.24g \
    --gpus all \
    -v "${PATH_TO_HF_HOME}:/root/.cache/huggingface" \
    "${ADDITIONAL_ARGS[@]}" \
    "${DOCKER_IMAGE}" -c "${RAY_START_CMD}"
Use node 1 as the head node and node 2 as the worker node.
On node 1, run:
nohup bash run_cluster.sh \
vllm/vllm-openai:v0.6.4.post1 \
IP_OF_HEAD_NODE \
--head \
/root/model > nohup.log 2>&1 &
On node 2, run:
nohup bash run_cluster.sh \
vllm/vllm-openai:v0.6.4.post1 \
IP_OF_HEAD_NODE \
--worker \
/root/model > nohup.log 2>&1 &
Note: both nodes pass the head node's IP to the script.
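Since run_cluster.sh fixes the container name to node, startup progress on either machine can also be followed from the container logs in addition to nohup.log:
$ docker logs -f node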
On either node, enter the container with docker exec -ti node bash:
# check the cluster status
$ ray status
3. Start the vLLM service
Start the service inside the container on node 1. With the current GPUs and 90% GPU memory utilization, the model's original 32k context length has to be reduced to 4k:
# with 2 nodes and 1 GPU per node, set the total tensor-parallel-size to 2
$ nohup vllm serve /root/.cache/huggingface/Qwen2.5-32B-Instruct-GPTQ-Int4 \
--served-model-name Qwen2.5-32B-Instruct-GPTQ-Int4 \
--tensor-parallel-size 2 --max-model-len 4096 \
> vllm_serve_qwen_nohup.log 2>&1 &
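The engine takes a while to load the model across both nodes; progress is written to vllm_serve_qwen_nohup.log. Once it is up, listing the served models is a quick health check (assuming the default port 8000):
$ curl http://IP_OF_HEAD_NODE:8000/v1/models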
Parameter tuning process
With the default gpu-memory-utilization (0.9), the log reported # GPU blocks: 0 and the engine failed with:
No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine.
This was addressed by adding --gpu-memory-utilization 0.95.
With gpu-memory-utilization at 0.95, the log showed # GPU blocks: 271; 271 * 16 = 4336 tokens, which is exactly the KV cache capacity mentioned in the next error:
The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (4336). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
This was addressed by adding --max-model-len 4096, after which the log reported # GPU blocks: 1548.
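The block count for each attempt can be read back from the service log; a minimal check against the log file used above:
$ grep "GPU blocks" vllm_serve_qwen_nohup.log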
4. Verify the chat completions endpoint
curl --request POST \
-H "Content-Type: application/json" \
--url http://IP_OF_HEAD_NODE:8000/v1/chat/completions \
--data '{"messages":[{"role":"user","content":"我希望你充當(dāng) IT 專家牍疏。我會(huì)向您提供有關(guān)我的技術(shù)問題所需的所有信息敞临,而您的職責(zé)是解決我的問題。你應(yīng)該使用你的計(jì)算機(jī)科學(xué)麸澜、網(wǎng)絡(luò)基礎(chǔ)設(shè)施和 IT 安全知識(shí)來解決我的問題。在您的回答中使用適合所有級(jí)別的人的智能奏黑、簡(jiǎn)單和易于理解的語(yǔ)言將很有幫助炊邦。用要點(diǎn)逐步解釋您的解決方案很有幫助。盡量避免過多的技術(shù)細(xì)節(jié)熟史,但在必要時(shí)使用它們馁害。我希望您回復(fù)解決方案,而不是寫任何解釋蹂匹。我的第一個(gè)問題是“我的筆記本電腦出現(xiàn)藍(lán)屏錯(cuò)誤”碘菜。"}],"stream":true,"model":"Qwen2.5-32B-Instruct-GPTQ-Int4"}'
The Content-Type header must be set; otherwise the request fails with a 500 error: [Bug]: Missing Content Type returns 500 Internal Server Error instead of 415 Unsupported Media Type
However, every response consisted of nothing but exclamation marks (!).
we currently find two workarounds
- use gptq_marlin, which is available for Ampere and later cards.
- change the number on this line from 50 to 0 and install from the modified source code. it may affect speed on short sequences though.
—— https://github.com/QwenLM/Qwen2.5/issues/1103#issuecomment-2507022590
Similar issues have been reported to the developers of both the Qwen and vLLM projects; jklj077 has offered two temporary workarounds:
- Modify config.json in the model directory, changing "quant_method": "gptq" to "quant_method": "gptq_marlin"; this requires a GPU with compute capability 8.0 or higher (see the sketch after this list).
- Modify the vLLM source code and install vLLM from the modified source.
Since the Tesla T4's compute capability is only 7.5, just the second workaround applies to this environment.
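A minimal sketch of the first workaround (not usable on the T4, shown for completeness), editing config.json of the downloaded model in place, assuming the model path used earlier:
# switch the quantization method declared in the model config to gptq_marlin
$ cd /root/model/Qwen2.5-32B-Instruct-GPTQ-Int4
$ sed -i 's/"quant_method": "gptq"/"quant_method": "gptq_marlin"/' config.json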
5. Verify the completions endpoint
curl --request POST \
-H "Content-Type: application/json" \
--url http://IP_OF_HEAD_NODE:8000/v1/completions \
--data '{"prompt":"who r u?","model":"Qwen2.5-32B-Instruct-GPTQ-Int4"}'
References
- nvidia顯卡驅(qū)動(dòng)安裝
- Centos7.9離線安裝Docker24(無坑版) - CSDN博客
- 用 PaddleNLP 結(jié)合 CodeGen 實(shí)現(xiàn)離線 GitHub Copilot - Alpha Hinex's Blog
- [Usage]: vllm infer with 2 * Nvidia-L20, output repeat !!!!
- [Bug]: Qwen2.5-32B-GPTQ-Int4 inference !!!!!
- Distributed Inference and Serving
- vLLM - Multi-Node Inference and Serving
- 大模型推理:vllm多機(jī)多卡分布式本地部署 - CSDN博客
- vLLM分布式多GPU Docker部署踩坑記 | LittleFish’Blog