Original article: https://alphahinex.github.io/2024/12/22/vllm-multi-node-inference/
description: "This post documents deploying the Qwen2.5-32B-Instruct-GPTQ-Int4 model with vLLM on two machines, each with a single Tesla T4 GPU, and the problems encountered along the way, as a reference for multi-node, multi-GPU inference with vLLM in similar environments."
date: 2024.12.22 10:26
categories:
- AI
tags: [AI, Python, vLLM]
keywords: vllm, gptq, gptq_marlin, tensor-parallel-size, Qwen2.5-32B-Instruct-GPTQ-Int4, multi-node inference, docker, nvidia container toolkit, max-model-len, gpu-memory-utilization, tesla t4
This post documents deploying the Qwen2.5-32B-Instruct-GPTQ-Int4 model with vLLM on two machines, each with a single Tesla T4 GPU, and the problems encountered along the way, as a reference for multi-node, multi-GPU inference with vLLM in similar environments.
Deployment checklist
- Qwen2.5-32B-Instruct-GPTQ-Int4, vLLM
- docker v27.4.0, nvidia-container-toolkit v1.17.3
- Tesla T4 GPU driver v550.127.08, CUDA 12.4
Prepare the deployment packages
# qwen
$ git clone https://www.modelscope.cn/Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4.git
# vllm image
$ docker pull vllm/vllm-openai:v0.6.4.post1
# export
$ docker save vllm/vllm-openai:v0.6.4.post1 | gzip > images.tar.gz
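After copying images.tar.gz to both machines, the image can be imported there; a minimal sketch (docker load reads the gzip-compressed archive directly):
# import and verify the image on each node
$ docker load -i images.tar.gz
$ docker images | grep vllm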
Update the GPU driver
The driver needs to support CUDA >= 12.4 in order to run the vLLM container.
# uninstall the previously installed driver first
$ sh ./NVIDIA-Linux-x86_64-550.127.08.run --uninstall
# then install the new driver
$ sh ./NVIDIA-Linux-x86_64-550.127.08.run
# verify the driver
$ nvidia-smi
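It is also worth recording the GPU's compute capability, since one of the workarounds discussed later requires 8.0 or higher; the compute_cap query field below assumes a recent driver such as the v550 used here (the Tesla T4 reports 7.5):
$ nvidia-smi --query-gpu=name,driver_version,compute_cap --format=csv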
Docker
Docker Engine
$ tar -xzf docker-27.4.0.tgz
$ cp docker/* /usr/local/bin/
$ docker -v
Save the contents of https://github.com/containerd/containerd/blob/main/containerd.service to /usr/lib/systemd/system/containerd.service:
# Copyright The containerd Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target local-fs.target dbus.service
[Service]
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/containerd
Type=notify
Delegate=yes
KillMode=process
Restart=always
RestartSec=5
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity
# Comment TasksMax if your systemd version does not supports it.
# Only systemd 226 and above support this version.
TasksMax=infinity
OOMScoreAdjust=-999
[Install]
WantedBy=multi-user.target
$ systemctl enable --now containerd
$ systemctl status containerd
Save the following to /usr/lib/systemd/system/docker.service:
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network-online.target firewalld.service
Wants=network-online.target
[Service]
Type=notify
ExecStart=/usr/local/bin/dockerd
ExecReload=/bin/kill -s HUP $MAINPID
TimeoutStartSec=0
RestartSec=2
Restart=always
StartLimitBurst=3
StartLimitInterval=60s
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
Delegate=yes
KillMode=process
OOMScoreAdjust=-500
[Install]
WantedBy=multi-user.target
$ systemctl enable --now docker
$ systemctl status docker
Nvidia Container Toolkit
$ tar -xzf nvidia-container-toolkit_1.17.3_rpm_x86_64.tar.gz
$ cd release-v1.17.3-stable/packages/centos7/x86_64
$ rpm -i libnvidia-container1-1.17.3-1.x86_64.rpm
$ rpm -i libnvidia-container-tools-1.17.3-1.x86_64.rpm
$ rpm -i nvidia-container-toolkit-base-1.17.3-1.x86_64.rpm
$ rpm -i nvidia-container-toolkit-1.17.3-1.x86_64.rpm
# verify the installation
$ nvidia-ctk -h
# configure the NVIDIA container runtime for docker
$ nvidia-ctk runtime configure --runtime=docker
# check the generated configuration
$ cat /etc/docker/daemon.json
# restart docker
$ systemctl restart docker
# after the restart, confirm the runtime is registered:
$ docker info | grep Runtimes
Runtimes: io.containerd.runc.v2 nvidia runc
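As an optional smoke test (a sketch assuming the vLLM image from above has already been loaded), run nvidia-smi inside a container; with --gpus all, the NVIDIA runtime makes the tool available in the container:
$ docker run --rm --gpus all --entrypoint nvidia-smi vllm/vllm-openai:v0.6.4.post1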
Qwen
1. Verify the model files
The expected SHA-256 checksums of the safetensors shards are:
942d93a82fb6d0cb27c940329db971c1e55da78aed959b7a9ac23944363e8f47 model-00001-of-00005.safetensors
19139f34508cb30b78868db0f19ed23dbc9f248f1c5688e29000ed19b29a7eef model-00002-of-00005.safetensors
d0f829efe1693dddaa4c6e42e867603f19d9cc71806df6e12b56cc3567927169 model-00003-of-00005.safetensors
3a5a428f449bc9eaf210f8c250bc48f3edeae027c4ef8ae48dd4f80e744dd19e model-00004-of-00005.safetensors
c22a1d1079136e40e1d445dda1de9e3fe5bd5d3b08357c2eb052c5b71bf871fe model-00005-of-00005.safetensors
$ cd /root/model/Qwen2.5-32B-Instruct-GPTQ-Int4
$ sha256sum *.safetensors > sum.txt
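To compare the files on a node against the expected values, the list above can be saved as sum.txt (or copied from the other node) and checked in one step; a minimal sketch:
$ sha256sum -c sum.txt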
2. Configure the cluster
After the vllm/vllm-openai:v0.6.4.post1 image is available on both machines, save https://github.com/vllm-project/vllm/blob/main/examples/run_cluster.sh to /root/model/:
#!/bin/bash

# Check for minimum number of required arguments
if [ $# -lt 4 ]; then
    echo "Usage: $0 docker_image head_node_address --head|--worker path_to_hf_home [additional_args...]"
    exit 1
fi

# Assign the first three arguments and shift them away
DOCKER_IMAGE="$1"
HEAD_NODE_ADDRESS="$2"
NODE_TYPE="$3" # Should be --head or --worker
PATH_TO_HF_HOME="$4"
shift 4

# Additional arguments are passed directly to the Docker command
ADDITIONAL_ARGS=("$@")

# Validate node type
if [ "${NODE_TYPE}" != "--head" ] && [ "${NODE_TYPE}" != "--worker" ]; then
    echo "Error: Node type must be --head or --worker"
    exit 1
fi

# Define a function to cleanup on EXIT signal
cleanup() {
    docker stop node
    docker rm node
}
trap cleanup EXIT

# Command setup for head or worker node
RAY_START_CMD="ray start --block"
if [ "${NODE_TYPE}" == "--head" ]; then
    RAY_START_CMD+=" --head --port=6379"
else
    RAY_START_CMD+=" --address=${HEAD_NODE_ADDRESS}:6379"
fi

# Run the docker command with the user specified parameters and additional arguments
docker run \
    --entrypoint /bin/bash \
    --network host \
    --name node \
    --shm-size 10.24g \
    --gpus all \
    -v "${PATH_TO_HF_HOME}:/root/.cache/huggingface" \
    "${ADDITIONAL_ARGS[@]}" \
    "${DOCKER_IMAGE}" -c "${RAY_START_CMD}"
Use node 1 as the head node and node 2 as the worker node.
On node 1, run:
nohup bash run_cluster.sh \
vllm/vllm-openai:v0.6.4.post1 \
IP_OF_HEAD_NODE \
--head \
/root/model > nohup.log 2>&1 &
On node 2, run:
nohup bash run_cluster.sh \
vllm/vllm-openai:v0.6.4.post1 \
IP_OF_HEAD_NODE \
--worker \
/root/model > nohup.log 2>&1 &
Note: both nodes pass the head node's IP to the script.
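Since run_cluster.sh fixes the container name to node, startup progress on either machine can also be followed from the container logs in addition to nohup.log:
$ docker logs -f node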
On either node, enter the container with docker exec -ti node bash:
# check the cluster status
$ ray status
3. Start the vLLM service
Start the service inside the container on node 1. With the current GPUs and 90% GPU memory utilization, the model's original 32k context length has to be reduced to 4k:
# with 2 nodes and 1 GPU per node, set the total tensor-parallel-size to 2
$ nohup vllm serve /root/.cache/huggingface/Qwen2.5-32B-Instruct-GPTQ-Int4 \
--served-model-name Qwen2.5-32B-Instruct-GPTQ-Int4 \
--tensor-parallel-size 2 --max-model-len 4096 \
> vllm_serve_qwen_nohup.log 2>&1 &
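The engine takes a while to load the model across both nodes; progress is written to vllm_serve_qwen_nohup.log. Once it is up, listing the served models is a quick health check (assuming the default port 8000):
$ curl http://IP_OF_HEAD_NODE:8000/v1/models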
Parameter tuning process
With the default gpu-memory-utilization (0.9), the log reported # GPU blocks: 0 and the engine failed with:
No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine.
This was addressed by adding --gpu-memory-utilization 0.95.
With gpu-memory-utilization at 0.95, the log showed # GPU blocks: 271; 271 * 16 = 4336 tokens, which is exactly the KV cache capacity mentioned in the next error:
The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (4336). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
This was addressed by adding --max-model-len 4096, after which the log reported # GPU blocks: 1548.
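The block count for each attempt can be read back from the service log; a minimal check against the log file used above:
$ grep "GPU blocks" vllm_serve_qwen_nohup.log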
4. Verify the chat completions endpoint
curl --request POST \
-H "Content-Type: application/json" \
--url http://IP_OF_HEAD_NODE:8000/v1/chat/completions \
--data '{"messages":[{"role":"user","content":"我希望你充當(dāng) IT 專家牍疏。我會(huì)向您提供有關(guān)我的技術(shù)問題所需的所有信息敞临,而您的職責(zé)是解決我的問題。你應(yīng)該使用你的計(jì)算機(jī)科學(xué)麸澜、網(wǎng)絡(luò)基礎(chǔ)設(shè)施和 IT 安全知識(shí)來解決我的問題。在您的回答中使用適合所有級(jí)別的人的智能奏黑、簡(jiǎn)單和易于理解的語(yǔ)言將很有幫助炊邦。用要點(diǎn)逐步解釋您的解決方案很有幫助。盡量避免過多的技術(shù)細(xì)節(jié)熟史,但在必要時(shí)使用它們馁害。我希望您回復(fù)解決方案,而不是寫任何解釋蹂匹。我的第一個(gè)問題是“我的筆記本電腦出現(xiàn)藍(lán)屏錯(cuò)誤”碘菜。"}],"stream":true,"model":"Qwen2.5-32B-Instruct-GPTQ-Int4"}'
The Content-Type header must be set; otherwise the request fails with a 500 error: [Bug]: Missing Content Type returns 500 Internal Server Error instead of 415 Unsupported Media Type
However, every response consisted of nothing but exclamation marks (!).
we currently find two workarounds
- use gptq_marlin, which is available for Ampere and later cards.
- change the number on this line from 50 to 0 and install from the modified source code. it may affect speed on short sequences though.
—— https://github.com/QwenLM/Qwen2.5/issues/1103#issuecomment-2507022590
Similar issues have been reported to the developers of both the Qwen and vLLM projects; jklj077 has offered two temporary workarounds:
- Modify config.json in the model directory, changing "quant_method": "gptq" to "quant_method": "gptq_marlin"; this requires a GPU with compute capability 8.0 or higher (see the sketch after this list).
- Modify the vLLM source code and install vLLM from the modified source.
Since the Tesla T4's compute capability is only 7.5, just the second workaround applies to this environment.
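A minimal sketch of the first workaround (not usable on the T4, shown for completeness), editing config.json of the downloaded model in place, assuming the model path used earlier:
# switch the quantization method declared in the model config to gptq_marlin
$ cd /root/model/Qwen2.5-32B-Instruct-GPTQ-Int4
$ sed -i 's/"quant_method": "gptq"/"quant_method": "gptq_marlin"/' config.json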
5. Verify the completions endpoint
curl --request POST \
-H "Content-Type: application/json" \
--url http://IP_OF_HEAD_NODE:8000/v1/completions \
--data '{"prompt":"who r u?","model":"Qwen2.5-32B-Instruct-GPTQ-Int4"}'
References
- nvidia顯卡驅(qū)動(dòng)安裝
- Centos7.9離線安裝Docker24(無坑版) - CSDN博客
- 用 PaddleNLP 結(jié)合 CodeGen 實(shí)現(xiàn)離線 GitHub Copilot - Alpha Hinex's Blog
- [Usage]: vllm infer with 2 * Nvidia-L20, output repeat !!!!
- [Bug]: Qwen2.5-32B-GPTQ-Int4 inference !!!!!
- Distributed Inference and Serving
- vLLM - Multi-Node Inference and Serving
- 大模型推理:vllm多機(jī)多卡分布式本地部署 - CSDN博客
- vLLM分布式多GPU Docker部署踩坑記 | LittleFish’Blog