Experiment Overview
Why use the HuggingFace software stack?
- As AI infrastructure software and frameworks have evolved, the training process has become quite complex: a training run often combines many different options, such as mixed-precision training, low-bit optimizers, quantized gradients, and so on. Wiring all of this up directly in raw PyTorch is tedious and error-prone.
- For this reason HuggingFace, Microsoft, and others have released higher-level, user-friendly training frameworks. These frameworks track the research frontier and continuously fold the latest results into their libraries, strengthening their competitiveness and influence.
Main content: use the HuggingFace Transformers library, enable different training parameters/options, and understand what these parameters and optimizations mean and how they work.
(I am learning by doing here, so some of my understanding may be imperfect.)
Hardware and Software Environment
- Ubuntu 22.04
- 1 * RTX 3080 10GB GPU
- PyTorch 2.0
- CUDA-12.0
- HuggingFace libraries: transformers, datasets, accelerate
Basic Training Process (Baseline)
The following is the most basic training setup, without any of the optimizations.
- Training data: randomly generated dummy data.
- To monitor GPU memory usage during training, the pynvml library API is used to print how much GPU memory is in use.
- Model choice: BERT-base; since training runs on a single GPU, a smaller model makes the effects easier to observe.
- Training API: mainly the Trainer API from the HuggingFace Transformers library, which already wraps the training loop.
- Training results: observe GPU memory usage and training throughput.
import numpy as np
from datasets import Dataset
from pynvml import *
import torch
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer, logging

logging.set_verbosity_error()

# Dummy training data: random token ids and labels
seq_len, dataset_size = 512, 512
dummy_data = {
    'input_ids': np.random.randint(100, 30000, (dataset_size, seq_len)),
    'labels': np.random.randint(0, 1, (dataset_size)),
}
ds = Dataset.from_dict(dummy_data)
ds.set_format('pt')

def print_gpu_utilization():
    # Report the current GPU memory usage via NVML
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f'GPU memory occupied: {info.used // 1024**2} MB')

def print_summary(result):
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    print_gpu_utilization()

print_gpu_utilization()

# Load BERT-base and move it to the GPU
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased').to('cuda')

default_args = {
    "output_dir": "tmp",
    "evaluation_strategy": "steps",
    "num_train_epochs": 1,
    "log_level": "error",
    "report_to": "none",
}

training_args = TrainingArguments(per_device_train_batch_size=4, **default_args)
trainer = Trainer(model=model, args=training_args, train_dataset=ds)
result = trainer.train()
print_summary(result)
Output:
{'train_runtime': 16.0498, 'train_samples_per_second': 31.901, 'train_steps_per_second': 7.975, 'train_loss': 0.013442776165902615, 'epoch': 1.0}
Time: 16.05
Samples/second: 31.90
GPU memory occupied: 5790 MB
Optimization 1: + Gradient Accumulation
Gradient accumulation trades time for space (the space being GPU memory): it allows training with a larger effective batch size under a limited memory budget. In conventional training, gradients are computed and the weights are updated after every batch. With gradient accumulation, gradients are still computed for every batch, but they are summed up over several batches and the weight update is performed only once per group of batches.
Comparison:
- Without gradient accumulation:
for idx, batch in enumerate(dataloader):
    # Forward
    loss = model(batch).loss
    # Backward
    loss.backward()
    ...
    # Optimizer update after every batch
    optimizer.step()
    optimizer.zero_grad()
    ...
- With gradient accumulation:
- You might wonder: where is the explicit gradient-accumulation code? It is implicit in how PyTorch works: if optimizer.zero_grad() is not called after backward(), the gradients of the current batch are, by default, added onto the gradients already accumulated from the previous batches (a short stand-alone demonstration follows the snippet below).
- Parameter: gradient_accumulation_steps is the number of batches between two optimizer updates, so the effective batch size is training_batch_size = per_device_train_batch_size * gradient_accumulation_steps.
for idx, batch in enumerate(dataloader):
    # Forward
    loss = model(batch).loss
    # Scale the loss so the accumulated gradient matches a large-batch gradient
    loss = loss / training_args.gradient_accumulation_steps
    # Backward: gradients accumulate into .grad
    loss.backward()
    ...
    if (idx + 1) % training_args.gradient_accumulation_steps == 0:
        # Optimizer update once every gradient_accumulation_steps batches
        optimizer.step()
        optimizer.zero_grad()
    ...
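To make the accumulation behaviour concrete, here is a minimal stand-alone sketch (not part of the original experiment): every backward() call adds into .grad until the gradient is explicitly zeroed.
import torch

# Two parameters with gradient tracking enabled
w = torch.tensor([1.0, 2.0], requires_grad=True)

(w * 3).sum().backward()
print(w.grad)      # tensor([3., 3.])

(w * 3).sum().backward()
print(w.grad)      # tensor([6., 6.]) -- the second backward() added onto the first

w.grad.zero_()     # this is what optimizer.zero_grad() does for every parameter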
Test code:
training_args = TrainingArguments(per_device_train_batch_size=1, gradient_accumulation_steps=4, **default_args)
trainer = Trainer(model=model, args=training_args, train_dataset=ds)
result = trainer.train()
print_summary(result)
Keeping the effective training batch size unchanged, the output shows that GPU memory usage drops noticeably (5790 MB --> 4169 MB) while training throughput drops slightly.
per_device_train_batch_size=1, gradient_accumulation_steps=4
{'train_runtime': 19.7445, 'train_samples_per_second': 25.931, 'train_steps_per_second': 6.483, 'train_loss': 0.01618509739637375, 'epoch': 1.0}
Time: 19.74
Samples/second: 25.93
GPU memory occupied: 4169 MB
Optimization 2: + Gradient Checkpointing
Why? During the backward pass, computing the gradient of a layer's weights requires the activations that layer produced during the forward pass. Normally every layer's forward activations are therefore kept in GPU memory until backward, which significantly increases memory usage.
How gradient checkpointing works: only the activations of selected layers (the checkpoint nodes) are stored; during the backward pass, any other activation that is needed is recomputed (recomputation) from the nearest preceding checkpoint.
Pros vs. cons:
- Pro: only part of the activations are stored, so GPU memory usage drops.
- Con: recomputation adds extra compute, so training throughput drops.
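To illustrate the recomputation idea, PyTorch exposes it directly through torch.utils.checkpoint (the mechanism the gradient_checkpointing=True flag builds on). This is only a minimal sketch with a toy two-layer segment, not the BERT setup used in the experiments.
import torch
from torch.utils.checkpoint import checkpoint

# Toy segment: with checkpointing, its intermediate activations are not stored
# during the forward pass; they are recomputed when backward() needs them.
segment = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
)

x = torch.randn(8, 1024, requires_grad=True)

y = checkpoint(segment, x, use_reentrant=False)  # forward without saving intermediates
y.sum().backward()                               # re-runs the segment's forward pass internally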
Code:
training_args = TrainingArguments(
per_device_train_batch_size=1, gradient_accumulation_steps=4, gradient_checkpointing=True, **default_args
)
trainer = Trainer(model=model, args=training_args, train_dataset=ds)
result = trainer.train()
print_summary(result)
Output: GPU memory drops further (4169 MB --> 3706 MB), while throughput drops from 25.93 to 20.40 samples/second.
{'train_runtime': 25.1014, 'train_samples_per_second': 20.397, 'train_steps_per_second': 5.099, 'train_loss': 0.015386142767965794, 'epoch': 1.0}
Time: 25.10
Samples/second: 20.40
GPU memory occupied: 3706 MB
Optimization 3: + Mixed-Precision (Low-Precision) Training
Core idea: store the weights, activations, and gradients in low-precision numeric formats, and perform the computation in low precision as well.
Pros vs. cons:
- Pro: low precision reduces the memory footprint and the compute cost, improving training speed and throughput.
- Con: used carelessly, it can cause numeric overflow/underflow and make training diverge.
AI training normally uses floating-point formats for storage and compute. The low-bit floating-point formats currently supported on NVIDIA GPUs are: TF32 --> FP16 --> BF16 --> FP8.
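Under the hood, the fp16=True path is built on PyTorch automatic mixed precision: operations run in FP16 where it is safe, the master weights stay in FP32, and a loss scaler prevents FP16 gradient underflow. A minimal hand-written sketch of the same idea (with a toy linear model and random data, not the BERT setup above):
import torch
from torch import nn

# Toy model, optimizer, and data used only for this sketch
model = nn.Linear(512, 2).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()   # loss scaling for FP16 gradients

for step in range(4):
    x = torch.randn(8, 512, device='cuda')
    y = torch.randint(0, 2, (8,), device='cuda')

    optimizer.zero_grad()

    # Ops inside autocast run in FP16 where safe, FP32 otherwise
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = nn.functional.cross_entropy(model(x), y)

    # Scale the loss before backward to avoid underflow, then unscale,
    # step the FP32 master weights, and update the scale factor
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()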
Code: pass fp16=True or bf16=True to enable mixed precision with the corresponding data type.
training_args = TrainingArguments(per_device_train_batch_size=4, fp16=True, **default_args)
trainer = Trainer(model=model, args=training_args, train_dataset=ds)
result = trainer.train()
print_summary(result)
Output: throughput improves (20.40 --> 25.91 samples/second), but GPU memory usage is slightly higher than in the previous run (3706 MB --> 3829 MB), because mixed precision keeps an FP32 master copy of the weights alongside the low-precision working copies (note that this run goes back to per_device_train_batch_size=4 with no gradient accumulation or checkpointing).
{'train_runtime': 19.76, 'train_samples_per_second': 25.911, 'train_steps_per_second': 6.478, 'train_loss': 0.010953620076179504, 'epoch': 1.0}
Time: 19.76
Samples/second: 25.91
GPU memory occupied: 3829 MB
Optimization 4: Low-Precision Optimizer (8-bit Adam)
Adam keeps two additional state tensors (first and second moments) for every parameter; storing these states in 8-bit, as bitsandbytes' Adam8bit does, instead of FP32 sharply reduces the optimizer's share of GPU memory (a rough estimate follows the results below).
# 8bit Adam
import numpy as np
from datasets import Dataset
from pynvml import *
import torch
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer, logging
# 8bit Adam
import bitsandbytes as bnb
from torch import nn
from transformers.trainer_pt_utils import get_parameter_names
# https://huggingface.co/docs/transformers/perf_train_gpu_one
logging.set_verbosity_error()
seq_len, dataset_size = 512, 512
dummy_data = {
    'input_ids': np.random.randint(100, 30000, (dataset_size, seq_len)),
    'labels': np.random.randint(0, 1, (dataset_size)),
}
ds = Dataset.from_dict(dummy_data)
ds.set_format('pt')
def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f'GPU memory occupied: {info.used // 1024**2} MB')

def print_summary(result):
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    print_gpu_utilization()
print_gpu_utilization()
torch.ones((1, 1)).to("cuda")
print_gpu_utilization()
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased').to('cuda')
print_gpu_utilization()
default_args = {
    "output_dir": "tmp",
    "evaluation_strategy": "steps",
    "num_train_epochs": 1,
    "log_level": "error",
    "report_to": "none",
}
# First, group the model's parameters into two groups: one that gets weight decay
# and one that does not. Usually, biases and layer-norm parameters are not weight-decayed.
# Then do some argument housekeeping to reuse the same hyperparameters as the
# previously used AdamW optimizer.
decay_parameters = get_parameter_names(model, forbidden_layer_types=[nn.LayerNorm])
decay_parameters = [name for name in decay_parameters if 'bias' not in name]

training_args = TrainingArguments(per_device_train_batch_size=1,
                                  gradient_accumulation_steps=4,
                                  gradient_checkpointing=True,
                                  fp16=True,
                                  **default_args)
optimizer_grouped_parameters = [
    {
        'params': [p for n, p in model.named_parameters() if n in decay_parameters],
        'weight_decay': training_args.weight_decay,
    },
    {
        'params': [p for n, p in model.named_parameters() if n not in decay_parameters],
        'weight_decay': 0.0,
    },
]

optimizer_kwargs = {
    'betas': (training_args.adam_beta1, training_args.adam_beta2),
    'eps': training_args.adam_epsilon,
    'lr': training_args.learning_rate,
}

# bitsandbytes Adam with 8-bit optimizer states
adam_bnb_optim = bnb.optim.Adam8bit(optimizer_grouped_parameters, **optimizer_kwargs)
trainer = Trainer(model=model, args=training_args, train_dataset=ds, optimizers=(adam_bnb_optim, None))
result = trainer.train()
print_summary(result)
Output: with gradient accumulation, gradient checkpointing, fp16, and the 8-bit optimizer combined, GPU memory drops to 3161 MB at 29.18 samples/second.
{'train_runtime': 17.5487, 'train_samples_per_second': 29.176, 'train_steps_per_second': 7.294, 'train_loss': 0.015325695276260376, 'epoch': 1.0}
Time: 17.55
Samples/second: 29.18
GPU memory occupied: 3161 MB
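As a rough sanity check on why quantizing the optimizer helps (back-of-the-envelope figures, not measurements from the run above): BERT-base has roughly 110M parameters, and Adam keeps two state tensors per parameter, so the states alone cost roughly 840 MB in FP32 versus roughly 210 MB in 8-bit.
# Back-of-the-envelope estimate of Adam optimizer-state memory for BERT-base
# (~110M parameters is an approximate figure; the exact count depends on the head)
num_params = 110_000_000
states_per_param = 2                                            # first and second moments

fp32_states_mb = num_params * states_per_param * 4 / 1024**2    # 4 bytes per FP32 value
int8_states_mb = num_params * states_per_param * 1 / 1024**2    # 1 byte per 8-bit value

print(f'FP32 Adam states:  ~{fp32_states_mb:.0f} MB')           # ~839 MB
print(f'8-bit Adam states: ~{int8_states_mb:.0f} MB')           # ~210 MB
Note that more recent transformers releases also expose bitsandbytes' 8-bit AdamW directly via TrainingArguments(optim="adamw_bnb_8bit", ...), which avoids building the optimizer by hand; whether that option is available depends on the installed version.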
References
- https://huggingface.co/docs/transformers/perf_train_gpu_one
- NVIDIA GPU White Paper