Advanced Optimization Algorithms
This section introduces more advanced optimization algorithms.
Momentum
%matplotlib inline
import sys
sys.path.append("/home/kesci/input")
import d2lzh1981 as d2l
import numpy as np
import torch
eta = 0.4
def f_2d(x1, x2):
return 0.1 * x1 ** 2 + 2 * x2 ** 2
def gd_2d(x1, x2, s1, s2):
return (x1 - eta * 0.2 * x1, x2 - eta * 4 * x2, 0, 0)
d2l.show_trace_2d(f_2d, d2l.train_2d(gd_2d))
epoch 20, x1 -0.943467, x2 -0.000073
We can see that at the same position, the slope of the objective function has a larger absolute value in the vertical direction (the x2 axis) than in the horizontal direction (the x1 axis). Therefore, for a given learning rate, gradient descent moves the variable farther in the vertical direction than in the horizontal direction at each iteration. We thus need a fairly small learning rate to keep the variable from overshooting the optimum in the vertical direction; however, this makes the variable move only slowly toward the optimum in the horizontal direction.
Below we try a slightly larger learning rate. Now the variable keeps overshooting the optimum in the vertical direction and gradually diverges.
eta = 0.6
d2l.show_trace_2d(f_2d, d2l.train_2d(gd_2d))
epoch 20, x1 -0.387814, x2 -1673.365109
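The momentum method addresses this problem. The function momentum_2d below maintains a velocity variable for each independent variable and performs the updates

v_t = beta * v_(t-1) + eta * g_t
x_t = x_(t-1) - v_t

where g_t is the gradient at x_(t-1), eta is the learning rate, and beta in [0, 1) is the momentum hyperparameter (beta = 0 recovers plain gradient descent). Because the velocity is an exponentially weighted sum of past gradients, the oscillating components along the x2 axis largely cancel out, while the consistent components along the x1 axis accumulate.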
def momentum_2d(x1, x2, v1, v2):
v1 = beta * v1 + eta * 0.2 * x1
v2 = beta * v2 + eta * 4 * x2
return x1 - v1, x2 - v2, v1, v2
eta, beta = 0.4, 0.5
d2l.show_trace_2d(f_2d, d2l.train_2d(momentum_2d))
epoch 20, x1 -0.062843, x2 0.001202
eta = 0.6
d2l.show_trace_2d(f_2d, d2l.train_2d(momentum_2d))
epoch 20, x1 0.007188, x2 0.002553
def get_data_ch7():
data = np.genfromtxt('/home/kesci/input/airfoil4755/airfoil_self_noise.dat', delimiter='\t')
data = (data - data.mean(axis=0)) / data.std(axis=0)
return torch.tensor(data[:1500, :-1], dtype=torch.float32), \
torch.tensor(data[:1500, -1], dtype=torch.float32)
features, labels = get_data_ch7()
def init_momentum_states():
v_w = torch.zeros((features.shape[1], 1), dtype=torch.float32)
v_b = torch.zeros(1, dtype=torch.float32)
return (v_w, v_b)
def sgd_momentum(params, states, hyperparams):
for p, v in zip(params, states):
v.data = hyperparams['momentum'] * v.data + hyperparams['lr'] * p.grad.data
p.data -= v.data
We first set the momentum hyperparameter momentum to 0.5.
d2l.train_ch7(sgd_momentum, init_momentum_states(),
{'lr': 0.02, 'momentum': 0.5}, features, labels)
loss: 0.243297, 0.057950 sec per epoch
Next, we increase the momentum hyperparameter momentum to 0.9.
d2l.train_ch7(sgd_momentum, init_momentum_states(),
{'lr': 0.02, 'momentum': 0.9}, features, labels)
loss: 0.260418, 0.059441 sec per epoch
We can see that the objective function value no longer changes smoothly in the later iterations. Intuitively, momentum averages roughly the last 1/(1 - beta) mini-batch gradients, so beta = 0.9 aggregates about 10 of them versus about 2 for beta = 0.5; since 10 times the mini-batch gradient is 5 times larger than 2 times, we can try reducing the learning rate to 1/5 of its previous value. The objective function value then changes much more smoothly after an initial period of decrease.
d2l.train_ch7(sgd_momentum, init_momentum_states(),
{'lr': 0.004, 'momentum': 0.9}, features, labels)
loss: 0.243650, 0.063532 sec per epoch
PyTorch Class
In PyTorch, torch.optim.SGD already implements momentum via its momentum argument.
d2l.train_pytorch_ch7(torch.optim.SGD, {'lr': 0.004, 'momentum': 0.9},
features, labels)
loss: 0.243692, 0.048604 sec per epoch
AdaGrad
%matplotlib inline
import math
import numpy as np
import torch
import sys
sys.path.append("/home/kesci/input")
import d2lzh1981 as d2l
def adagrad_2d(x1, x2, s1, s2):
    g1, g2, eps = 0.2 * x1, 4 * x2, 1e-6  # the first two terms are the gradients w.r.t. x1 and x2
s1 += g1 ** 2
s2 += g2 ** 2
x1 -= eta / math.sqrt(s1 + eps) * g1
x2 -= eta / math.sqrt(s2 + eps) * g2
return x1, x2, s1, s2
def f_2d(x1, x2):
return 0.1 * x1 ** 2 + 2 * x2 ** 2
eta = 0.4
d2l.show_trace_2d(f_2d, d2l.train_2d(adagrad_2d))
epoch 20, x1 -2.382563, x2 -0.158591
Below we increase the learning rate to 2. We can see that the variables approach the optimum much more quickly.
eta = 2
d2l.show_trace_2d(f_2d, d2l.train_2d(adagrad_2d))
epoch 20, x1 -0.002295, x2 -0.000000
Implementation from Scratch
Like the momentum method, AdaGrad maintains a state variable of the same shape as each independent variable. We implement the algorithm according to the AdaGrad update formulas, restated below.
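For reference, the per-element updates are

s_t = s_(t-1) + g_t ⊙ g_t
x_t = x_(t-1) - (eta / sqrt(s_t + eps)) ⊙ g_t

where g_t is the gradient, ⊙ denotes elementwise multiplication, eta is the learning rate, and eps is a small constant (1e-6 below) added for numerical stability. Coordinates that keep receiving large gradients therefore have their effective learning rate shrunk more quickly.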
def get_data_ch7():
data = np.genfromtxt('/home/kesci/input/airfoil4755/airfoil_self_noise.dat', delimiter='\t')
data = (data - data.mean(axis=0)) / data.std(axis=0)
return torch.tensor(data[:1500, :-1], dtype=torch.float32), \
torch.tensor(data[:1500, -1], dtype=torch.float32)
features, labels = get_data_ch7()
def init_adagrad_states():
s_w = torch.zeros((features.shape[1], 1), dtype=torch.float32)
s_b = torch.zeros(1, dtype=torch.float32)
return (s_w, s_b)
def adagrad(params, states, hyperparams):
eps = 1e-6
for p, s in zip(params, states):
s.data += (p.grad.data**2)
p.data -= hyperparams['lr'] * p.grad.data / torch.sqrt(s + eps)
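As a usage sketch (the learning rate below is an illustrative choice), the training helpers from the momentum section can be reused:

d2l.train_ch7(adagrad, init_adagrad_states(), {'lr': 0.1}, features, labels)

For the built-in version, PyTorch's torch.optim.Adagrad can be passed to the PyTorch training helper:

d2l.train_pytorch_ch7(torch.optim.Adagrad, {'lr': 0.1}, features, labels)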
word2vec
This section covers the PTB dataset, the Skip-Gram model, the negative sampling approximation, and model training.
Word Embedding Basics
In the section on implementing recurrent neural networks from scratch, we used one-hot vectors to represent words. Although they are easy to construct, they are usually not a good choice. A major reason is that one-hot word vectors cannot accurately express similarity between different words, such as the cosine similarity we commonly use.
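As a quick illustration of this point, the cosine similarity between the one-hot vectors of any two distinct words is always 0, so it carries no information about how similar the words actually are:

import torch
a, b = torch.zeros(5), torch.zeros(5)
a[0], b[1] = 1, 1  # one-hot vectors for two different words in a 5-word vocabulary
print(torch.dot(a, b) / (a.norm() * b.norm()))  # tensor(0.)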
The Word2Vec embedding tool was proposed precisely to solve this problem. It represents each word as a fixed-length vector, and pretraining on a corpus makes these vectors capture the similarity and analogy relationships between words, thereby encoding some semantic information. Based on two different probabilistic modelling assumptions, two Word2Vec models can be defined: the Skip-Gram model and the continuous bag-of-words (CBOW) model. This section focuses on the Skip-Gram model.
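In the Skip-Gram model, every word w in the vocabulary V is assigned a center-word vector v_w and a context-word vector u_w, and the probability of generating a context word w_o given a center word w_c is modelled as a softmax over inner products:

P(w_o | w_c) = exp(u_(w_o) · v_(w_c)) / Σ_(i ∈ V) exp(u_i · v_(w_c))

The inner products u · v are exactly what the forward computation of the Skip-Gram model later in this section produces; the negative sampling approximation will be used to avoid normalizing over the entire vocabulary.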
import collections
import math
import random
import sys
import time
import os
import numpy as np
import torch
from torch import nn
import torch.utils.data as Data
The PTB Dataset
Simply put, Word2Vec learns from a corpus how to map discrete words to vectors in a continuous space while preserving their semantic similarity. To train a Word2Vec model we therefore need a natural-language corpus from which the model can learn the relationships between words. Here we use the classic PTB corpus. PTB (Penn Tree Bank) is a commonly used small corpus sampled from Wall Street Journal articles; it includes a training set, a validation set, and a test set. We will train the word embedding model on the PTB training set.
Loading the Dataset
A sample of the training file ptb.train.txt:
aer banknote berlitz calloway centrust cluett fromstein gitano guterman ...
pierre N years old will join the board as a nonexecutive director nov. N
mr. is chairman of n.v. the dutch publishing group
...
with open('/home/kesci/input/ptb_train1020/ptb.train.txt', 'r') as f:
    lines = f.readlines()  # in this dataset, sentences are separated by newline characters
    raw_dataset = [st.split() for st in lines]  # st is short for sentence; tokens are separated by spaces
print('# sentences: %d' % len(raw_dataset))
# for the first 3 sentences of the dataset, print the number of tokens and the first 5 tokens of each
# the end-of-sentence token is '<eos>', rare words are all represented by '<unk>', and numbers have been replaced with 'N'
for st in raw_dataset[:3]:
print('# tokens:', len(st), st[:5])
# sentences: 42068
# tokens: 24 ['aer', 'banknote', 'berlitz', 'calloway', 'centrust']
# tokens: 15 ['pierre', '<unk>', 'N', 'years', 'old']
# tokens: 11 ['mr.', '<unk>', 'is', 'chairman', 'of']
Building the Token Index
counter = collections.Counter([tk for st in raw_dataset for tk in st])  # tk is short for token
counter = dict(filter(lambda x: x[1] >= 5, counter.items()))  # keep only tokens that appear at least 5 times in the dataset
idx_to_token = [tk for tk, _ in counter.items()]
token_to_idx = {tk: idx for idx, tk in enumerate(idx_to_token)}
dataset = [[token_to_idx[tk] for tk in st if tk in token_to_idx]
           for st in raw_dataset]  # here the words in raw_dataset are converted to their corresponding indices
num_tokens = sum([len(st) for st in dataset])
'# tokens: %d' % num_tokens
out:
'# tokens: 887100'
Subsampling
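In text data, high-frequency words such as 'the' carry little information, so we subsample the dataset: each occurrence of an indexed word w is independently discarded with probability

P(discard | w) = max(1 - sqrt(t / f(w)), 0)

where f(w) is the relative frequency of w in the dataset and t is a threshold (1e-4 in the code below). The more frequent a word is, the more likely each of its occurrences is to be dropped; words with frequency below t are never dropped.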
def discard(idx):
'''
@params:
        idx: index of the token
    @return: True/False, whether to discard this token
'''
return random.uniform(0, 1) < 1 - math.sqrt(
1e-4 / counter[idx_to_token[idx]] * num_tokens)
subsampled_dataset = [[tk for tk in st if not discard(tk)] for st in dataset]
print('# tokens: %d' % sum([len(st) for st in subsampled_dataset]))
def compare_counts(token):
return '# %s: before=%d, after=%d' % (token, sum(
[st.count(token_to_idx[token]) for st in dataset]), sum(
[st.count(token_to_idx[token]) for st in subsampled_dataset]))
print(compare_counts('the'))
print(compare_counts('join'))
# tokens: 375995
# the: before=50770, after=2161
# join: before=45, after=45
Extracting Center Words and Context Words
def get_centers_and_contexts(dataset, max_window_size):
'''
@params:
        dataset: a list of sentences, each of which is a list of tokens already converted to integer indices
        max_window_size: maximum size of the context window
    @return:
        centers: list of center words
        contexts: list of context windows, one per center word; each window is a list of context words
'''
centers, contexts = [], []
for st in dataset:
        if len(st) < 2:  # a sentence needs at least 2 tokens to form a 'center word - context word' pair
continue
centers += st
for center_i in range(len(st)):
            window_size = random.randint(1, max_window_size)  # randomly sample a context window size
indices = list(range(max(0, center_i - window_size),
min(len(st), center_i + 1 + window_size)))
            indices.remove(center_i)  # exclude the center word from its own context words
contexts.append([st[idx] for idx in indices])
return centers, contexts
all_centers, all_contexts = get_centers_and_contexts(subsampled_dataset, 5)
tiny_dataset = [list(range(7)), list(range(7, 10))]
print('dataset', tiny_dataset)
for center, context in zip(*get_centers_and_contexts(tiny_dataset, 2)):
print('center', center, 'has contexts', context)
dataset [[0, 1, 2, 3, 4, 5, 6], [7, 8, 9]]
center 0 has contexts [1, 2]
center 1 has contexts [0, 2, 3]
center 2 has contexts [0, 1, 3, 4]
center 3 has contexts [2, 4]
center 4 has contexts [3, 5]
center 5 has contexts [4, 6]
center 6 has contexts [5]
center 7 has contexts [8]
center 8 has contexts [7, 9]
center 9 has contexts [7, 8]
Note: batch loading of the data relies on the implementation of negative sampling, so it is covered after the negative sampling section.
Skip-Gram Model
PyTorch's Built-in Embedding Layer
embed = nn.Embedding(num_embeddings=10, embedding_dim=4)
print(embed.weight)
x = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.long)
print(embed(x))
Parameter containing:
tensor([[-0.7417, -1.9469, -0.5745, 1.4267],
[ 1.1483, 1.4781, 0.3064, -0.2893],
[ 0.6840, 2.4566, -0.1872, -2.2061],
[ 0.3386, 1.3820, -0.3142, 0.2427],
[ 0.4802, -0.6375, -0.4730, 1.2114],
[ 0.7130, -0.9774, 0.5321, 1.4228],
[-0.6726, -0.5829, -0.4888, -0.3290],
[ 0.3152, -0.6827, 0.9950, -0.3326],
[-1.4651, 1.2344, 1.9976, -1.5962],
[ 0.0872, 0.0130, -2.1396, -0.6361]], requires_grad=True)
tensor([[[ 1.1483, 1.4781, 0.3064, -0.2893],
[ 0.6840, 2.4566, -0.1872, -2.2061],
[ 0.3386, 1.3820, -0.3142, 0.2427]],
[[ 0.4802, -0.6375, -0.4730, 1.2114],
[ 0.7130, -0.9774, 0.5321, 1.4228],
[-0.6726, -0.5829, -0.4888, -0.3290]]], grad_fn=<EmbeddingBackward>)
PyTorch's Built-in Batch Matrix Multiplication
X = torch.ones((2, 1, 4))
Y = torch.ones((2, 4, 6))
print(torch.bmm(X, Y).shape)
torch.Size([2, 1, 6])
Forward Computation of the Skip-Gram Model
def skip_gram(center, contexts_and_negatives, embed_v, embed_u):
'''
@params:
        center: indices of the center words, an integer tensor of shape (n, 1)
        contexts_and_negatives: indices of the context and noise words, an integer tensor of shape (n, m)
        embed_v: embedding layer for the center words
        embed_u: embedding layer for the context words
    @return:
        pred: inner products of the center words with the context (or noise) words, later used to compute the probability p(w_o|w_c)
'''
v = embed_v(center) # shape of (n, 1, d)
u = embed_u(contexts_and_negatives) # shape of (n, m, d)
pred = torch.bmm(v, u.permute(0, 2, 1)) # bmm((n, 1, d), (n, d, m)) => shape of (n, 1, m)
return pred
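A quick shape check with small, randomly initialized embedding layers (the sizes here are illustrative):

embed_v = nn.Embedding(num_embeddings=20, embedding_dim=4)
embed_u = nn.Embedding(num_embeddings=20, embedding_dim=4)
center = torch.randint(0, 20, (2, 1))                  # n = 2 center words
contexts_and_negatives = torch.randint(0, 20, (2, 6))  # m = 6 context/noise words each
print(skip_gram(center, contexts_and_negatives, embed_v, embed_u).shape)  # torch.Size([2, 1, 6])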
Negative Sampling Approximation
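Computing the softmax normalization over the whole vocabulary for every training pair is too expensive. Under the negative sampling approximation, each center-context pair (w_c, w_o) is treated as a positive example, K noise words w_1, ..., w_K are drawn from a noise distribution, and the training objective becomes maximizing

log σ(u_(w_o) · v_(w_c)) + Σ_(k=1..K) log σ(-u_(w_k) · v_(w_c))

where σ is the sigmoid function. Following the word2vec paper, noise words are sampled with probability proportional to their frequency raised to the power 0.75, which is what sampling_weights encodes below.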
def get_negatives(all_contexts, sampling_weights, K):
'''
@params:
all_contexts: [[w_o1, w_o2, ...], [...], ... ]
        sampling_weights: noise-word sampling weight of each token
        K: number of noise words to sample per context word
@return:
all_negatives: [[w_n1, w_n2, ...], [...], ...]
'''
all_negatives, neg_candidates, i = [], [], 0
population = list(range(len(sampling_weights)))
for contexts in all_contexts:
negatives = []
while len(negatives) < len(contexts) * K:
if i == len(neg_candidates):
                # draw k word indices as noise-word candidates according to each word's weight (sampling_weights);
                # choosing a fairly large k makes this more efficient
i, neg_candidates = 0, random.choices(
population, sampling_weights, k=int(1e5))
neg, i = neg_candidates[i], i + 1
            # a noise word must not be a context word
if neg not in set(contexts):
negatives.append(neg)
all_negatives.append(negatives)
return all_negatives
sampling_weights = [counter[w]**0.75 for w in idx_to_token]
all_negatives = get_negatives(all_contexts, sampling_weights, 5)
Note: besides negative sampling, hierarchical softmax can also be used to reduce the computational cost; see Section 10.2.2 of the original book.
Batch Loading of the Data
class MyDataset(torch.utils.data.Dataset):
def __init__(self, centers, contexts, negatives):
assert len(centers) == len(contexts) == len(negatives)
self.centers = centers
self.contexts = contexts
self.negatives = negatives
def __getitem__(self, index):
return (self.centers[index], self.contexts[index], self.negatives[index])
def __len__(self):
return len(self.centers)
def batchify(data):
'''
    used as the collate_fn argument of DataLoader
    @params:
        data: a list of length batch_size, where each element is a result returned by __getitem__
    @outputs:
        batch: the batched tuple (centers, contexts_negatives, masks, labels)
            centers: center-word indices, an integer tensor of shape (n, 1)
            contexts_negatives: context- and noise-word indices, an integer tensor of shape (n, m)
            masks: padding masks, a 0/1 integer tensor of shape (n, m)
            labels: labels marking the true context words (1 for context words, 0 for noise words and padding), a 0/1 integer tensor of shape (n, m)
'''
max_len = max(len(c) + len(n) for _, c, n in data)
centers, contexts_negatives, masks, labels = [], [], [], []
for center, context, negative in data:
cur_len = len(context) + len(negative)
centers += [center]
contexts_negatives += [context + negative + [0] * (max_len - cur_len)]
        masks += [[1] * cur_len + [0] * (max_len - cur_len)]  # the mask keeps padded entries from affecting the loss computation
labels += [[1] * len(context) + [0] * (max_len - len(context))]
batch = (torch.tensor(centers).view(-1, 1), torch.tensor(contexts_negatives),
torch.tensor(masks), torch.tensor(labels))
return batch
batch_size = 512
num_workers = 0 if sys.platform.startswith('win32') else 4
dataset = MyDataset(all_centers, all_contexts, all_negatives)
data_iter = Data.DataLoader(dataset, batch_size, shuffle=True,
collate_fn=batchify,
num_workers=num_workers)
for batch in data_iter:
for name, data in zip(['centers', 'contexts_negatives', 'masks',
'labels'], batch):
print(name, 'shape:', data.shape)
break
centers shape: torch.Size([512, 1])
contexts_negatives shape: torch.Size([512, 60])
masks shape: torch.Size([512, 60])
labels shape: torch.Size([512, 60])
Training the Model
Loss Function
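Each entry produced by skip_gram is a binary classification logit (is this a real context word or a noise word?), so a binary cross-entropy loss with a padding mask fits naturally. Below is a minimal sketch using PyTorch's binary_cross_entropy_with_logits, followed by an equally minimal training loop; the embedding size, learning rate, and number of epochs are illustrative choices.

class SigmoidBinaryCrossEntropyLoss(nn.Module):
    def __init__(self):
        super(SigmoidBinaryCrossEntropyLoss, self).__init__()
    def forward(self, inputs, targets, mask):
        '''
        @params:
            inputs: prediction logits, a float tensor of shape (n, m)
            targets: 0/1 labels, shape (n, m)
            mask: 0/1 padding mask, shape (n, m)
        @return: per-example loss, averaged over the non-padded entries
        '''
        inputs, targets, mask = inputs.float(), targets.float(), mask.float()
        res = nn.functional.binary_cross_entropy_with_logits(
            inputs, targets, reduction="none", weight=mask)
        return res.sum(dim=1) / mask.sum(dim=1)

embed_size = 100  # illustrative embedding dimension
net = nn.Sequential(
    nn.Embedding(num_embeddings=len(idx_to_token), embedding_dim=embed_size),
    nn.Embedding(num_embeddings=len(idx_to_token), embedding_dim=embed_size))
loss = SigmoidBinaryCrossEntropyLoss()

def train(net, lr, num_epochs):
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    for epoch in range(num_epochs):
        start, l_sum, n = time.time(), 0.0, 0
        for center, context_negative, mask, label in data_iter:
            pred = skip_gram(center, context_negative, net[0], net[1])
            l = loss(pred.view(label.shape), label, mask).mean()  # average over the batch
            optimizer.zero_grad()
            l.backward()
            optimizer.step()
            l_sum += l.item()
            n += 1
        print('epoch %d, loss %.2f, %.2f sec per epoch'
              % (epoch + 1, l_sum / n, time.time() - start))

train(net, 0.01, 5)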
Advanced Word Embeddings
This section introduces the GloVe model and how to use pretrained GloVe vectors to find synonyms and analogies.
import torch
import torchtext.vocab as vocab
print([key for key in vocab.pretrained_aliases.keys() if "glove" in key])
cache_dir = "/home/kesci/input/GloVe6B5429"
glove = vocab.GloVe(name='6B', dim=50, cache=cache_dir)
print("一共包含%d個(gè)詞辱魁。" % len(glove.stoi))
print(glove.stoi['beautiful'], glove.itos[3366])
['glove.42B.300d', 'glove.840B.300d', 'glove.twitter.27B.25d', 'glove.twitter.27B.50d', 'glove.twitter.27B.100d', 'glove.twitter.27B.200d', 'glove.6B.50d', 'glove.6B.100d', 'glove.6B.200d', 'glove.6B.300d']
Contains 400000 tokens in total.
3366 beautiful
Finding Synonyms and Analogies
Finding Synonyms
Since cosine similarity in the word-vector space measures how similar words are in meaning (why?), we can find a word's synonyms by searching for its k nearest neighbors in that space.
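For reference, the cosine similarity between vectors x and y is

cos(x, y) = (x · y) / (‖x‖ ‖y‖)

which is exactly what the knn function below computes (with a small 1e-9 term added under the square root for numerical stability).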
def knn(W, x, k):
'''
@params:
        W: the matrix containing all the word vectors
        x: the query vector
        k: the number of nearest neighbors to return
    @outputs:
        topk: indices of the k vectors with the largest cosine similarity
        [...]: the corresponding cosine similarity values
'''
cos = torch.matmul(W, x.view((-1,))) / (
(torch.sum(W * W, dim=1) + 1e-9).sqrt() * torch.sum(x * x).sqrt())
_, topk = torch.topk(cos, k=k)
topk = topk.cpu().numpy()
return topk, [cos[i].item() for i in topk]
def get_similar_tokens(query_token, k, embed):
'''
@params:
        query_token: the query word
        k: the number of synonyms to retrieve
        embed: the pretrained word vectors
'''
topk, cos = knn(embed.vectors,
embed.vectors[embed.stoi[query_token]], k+1)
    for i, c in zip(topk[1:], cos[1:]):  # skip the query word itself
print('cosine sim=%.3f: %s' % (c, (embed.itos[i])))
get_similar_tokens('chip', 3, glove)
cosine sim=0.856: chips
cosine sim=0.749: intel
cosine sim=0.749: electronics
get_similar_tokens('baby', 3, glove)
cosine sim=0.839: babies
cosine sim=0.800: boy
cosine sim=0.792: girl
get_similar_tokens('beautiful', 3, glove)
cosine sim=0.921: lovely
cosine sim=0.893: gorgeous
cosine sim=0.830: wonderful
Finding Analogies
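For an analogy 'a is to b as c is to d', we are given the first three words and search for d: we form the vector vec(b) - vec(a) + vec(c) and return the word whose vector is most similar to it by cosine similarity, which is what get_analogy below implements.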
def get_analogy(token_a, token_b, token_c, embed):
'''
@params:
        token_a: word a
        token_b: word b
        token_c: word c
        embed: the pretrained word vectors
    @outputs:
        res: the analogy word d
'''
vecs = [embed.vectors[embed.stoi[t]]
for t in [token_a, token_b, token_c]]
x = vecs[1] - vecs[0] + vecs[2]
topk, cos = knn(embed.vectors, x, 1)
res = embed.itos[topk[0]]
return res
get_analogy('man', 'woman', 'son', glove)
out:
'daughter'
get_analogy('beijing', 'china', 'tokyo', glove)
out:
'japan'
get_analogy('bad', 'worst', 'big', glove)
out:
'biggest'
get_analogy('do', 'did', 'go', glove)
out:
'went'