Graph Attention Networks [ICLR, 2018]
- paper: Graph Attention Networks
- github: https://github.com/PetarV-/GAT (tensorflow)
- code: https://github.com/Diego999/pyGAT (pytorch)
This paper proposes GAT, which performs graph convolution with masked self-attention layers. The method assigns different weights to different neighboring nodes and can handle both transductive and inductive problems.
1. Introduction
- GCN [Kipf & Welling, ICLR 2017]
- attention mechanism: can handle inputs of varying size. Self-attention was introduced in "Attention Is All You Need".
2. Architecture
GAT combines the attention mechanism with graph convolutional networks: when aggregating node information, each neighboring node is assigned its own weight (also called an attention score). Like the self-attention in the Transformer, GAT also supports multi-head attention, where each head keeps its own parameters and the heads' results are either concatenated or averaged to obtain the final node representation.
In the paper's Figure 1, the left panel shows how the attention score of a neighboring node is obtained, and the right panel shows the multi-head attention update.
The concrete steps are as follows:
- Step 1: compute the unnormalized attention score. Along each edge, the linearly transformed representations of its two endpoint nodes are concatenated and passed through a single-layer MLP: $e_{ij} = \text{LeakyReLU}\left(\vec{a}^{\,T}\left[W\vec{h}_i \,\|\, W\vec{h}_j\right]\right)$
- Step 2: normalize the attention scores. $e_{ij}$ is normalized row-wise with a softmax over the neighborhood of each node: $\alpha_{ij} = \text{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}$
- Step 3: aggregate node information along the edges: $\vec{h}_i' = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W \vec{h}_j\right)$. The mechanism comes in single-head and multi-head variants, and the multi-head variant can be merged in two ways: the first multiplies each head's hidden vectors by its attention scores and concatenates the results, $\vec{h}_i' = \big\Vert_{k=1}^{K} \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k} W^{k} \vec{h}_j\right)$; the second averages the heads before applying a nonlinearity, $\vec{h}_i' = \sigma\left(\frac{1}{K}\sum_{k=1}^{K}\sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k} W^{k} \vec{h}_j\right)$ (used at the output layer). See the sketch after this list.
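To make the three steps concrete, here is a minimal, hypothetical PyTorch sketch for a single attention head on a dense adjacency matrix (all names such as `W`, `a`, and `adj` are illustrative assumptions, not the paper's code):

```python
import torch
import torch.nn.functional as F

N, F_in, F_out = 5, 16, 8                      # 5 nodes, toy dimensions
h = torch.randn(N, F_in)                       # input node features
adj = (torch.rand(N, N) > 0.5).float()         # toy adjacency (1 = edge)
adj.fill_diagonal_(1)                          # include self-loops

W = torch.randn(F_in, F_out)                   # shared linear transform
a = torch.randn(2 * F_out)                     # attention vector

# Step 1: unnormalized scores e_ij = LeakyReLU(a^T [W h_i || W h_j])
z = h @ W                                      # (N, F_out)
pairs = torch.cat([z.unsqueeze(1).expand(N, N, F_out),
                   z.unsqueeze(0).expand(N, N, F_out)], dim=-1)
e = F.leaky_relu(pairs @ a, negative_slope=0.2)   # (N, N)

# Step 2: masked row-wise softmax over each node's neighbors
e = e.masked_fill(adj == 0, float('-inf'))
alpha = torch.softmax(e, dim=1)                # (N, N)

# Step 3: aggregate neighbor features, then apply a nonlinearity
h_new = F.elu(alpha @ z)                       # (N, F_out)
print(h_new.shape)                             # torch.Size([5, 8])
```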
3. Contributions
- Computationally efficient: can be parallelized
- Assigns different importance to different neighbors, which makes the model more interpretable
- Applicable to directed graphs and to inductive learning settings
- GraphSAGE samples a fixed-size neighborhood for each node, and its LSTM aggregator requires a random ordering of the neighbors; GAT, in contrast, attends over all neighbors and needs no ordering
4. Experiment
4.1 Transductive learning
Node classification datasets:
- citation graphs: Cora, Citeseer, Pubmed
set up details:
- 2-layer GAT
- First layer: K = 8 attention heads, each computing F' = 8 features (64 features in total), followed by an ELU nonlinearity
- Second layer: a. Cora, Citeseer: a single attention head computing C features (C = number of classes), followed by softmax; b. Pubmed: 8 output attention heads (averaged), followed by softmax
- L2 regularization: a. Cora, Citeseer: λ = 0.0005; b. Pubmed: λ = 0.001
- Dropout with p = 0.6 applied to both layers' inputs and to the normalized attention coefficients (a configuration sketch follows below)
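For concreteness, the Cora/Citeseer setup maps onto the `GAT` wrapper defined later in Section 5.2 roughly as follows (a sketch only; `g` and `num_feats` are placeholders for the loaded graph and its feature dimension):

```python
import torch.nn.functional as F

# 2-layer transductive GAT: 8 heads x 8 features, single-head output layer
model = GAT(g,                      # DGL graph (placeholder)
            num_layers=1,           # one hidden GATConv layer + one output layer
            in_dim=num_feats,       # input feature dimension (placeholder)
            num_hidden=8,
            num_classes=7,          # e.g. Cora has 7 classes
            heads=[8, 1],
            activation=F.elu,
            feat_drop=0.6,
            attn_drop=0.6,
            negative_slope=0.2,
            residual=False)
```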
4.2 Inductive learning
The protein-protein interaction (PPI) dataset contains 24 graphs (20 for training, 2 for validation, 2 for testing). The layer parameters are learned on the training graphs and then used to compute node representations on the val/test graphs for a multi-label node classification task.
set up details:
- 3-layer GAT
- Layers 1 and 2: K = 4 attention heads, each computing F' = 256 features (1024 features in total), followed by an ELU nonlinearity
- Layer 3 (output): K = 6 attention heads computing 121 features each, averaged and followed by a logistic sigmoid
- Skip connections across the intermediate attentional layer
- batch size = 2 graphs (a configuration sketch follows below)
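By analogy with the sketch above, the inductive PPI setup would map onto the same wrapper roughly as follows (again an assumption-laden sketch; `g` is a placeholder for a batched training graph):

```python
import torch.nn.functional as F

# 3-layer inductive GAT: 4 + 4 heads of 256 features, 6 output heads, 121 labels
model = GAT(g,                      # batched PPI training graph (placeholder)
            num_layers=2,           # two hidden GATConv layers + one output layer
            in_dim=50,              # PPI node features are 50-dimensional
            num_hidden=256,
            num_classes=121,        # multi-label classification over 121 labels
            heads=[4, 4, 6],
            activation=F.elu,
            feat_drop=0.0,
            attn_drop=0.0,
            negative_slope=0.2,
            residual=True)          # skip connections in the intermediate layers
```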
Supplement: the ELU activation function, $\text{ELU}(x) = x$ for $x > 0$ and $\alpha(e^{x} - 1)$ otherwise (with $\alpha = 1$ by default).
5. Code
This section walks through the GAT implementation. The first part shows how GATLayer is implemented, using the DGL framework to sketch the overall structure of the code (see the reference source for the complete code); the second part shows how to build a GAT model on top of GATLayer.
See DGL's detailed tutorial on GAT and the GAT example code in DGL.
5.1 GATLayer
==Steps==:
a. Fully connected layer: $z_i = W\vec{h}_i$, projecting the high-dimensional input features to a lower dimension
b. message: compute the un-normalized attention score $e_{ij} = \text{LeakyReLU}\left(\vec{a}^{\,T}\left[z_i \,\|\, z_j\right]\right)$; this score can be viewed as an edge feature
c. reduce
- normalize: compute the attention score $\alpha_{ij} = \text{softmax}_j(e_{ij})$
- aggregate: $\vec{h}_i' = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij} z_j\right)$
`from dgl.nn.pytorch import GATConv`: see the GATConv source code.
```python
# Key parts of the GATConv layer source (points to note):
# mainly the constructor arguments and the residual-connection branch
def __init__(self,
             in_feats,
             out_feats,
             num_heads,
             feat_drop=0.,            # dropout on input features
             attn_drop=0.,            # dropout on attention coefficients
             negative_slope=0.2,      # LeakyReLU negative slope
             residual=False,          # whether to add a residual connection
             activation=None,
             allow_zero_in_degree=False):
    # ...
    if residual:
        if self._in_dst_feats != out_feats:
            self.res_fc = nn.Linear(
                self._in_dst_feats, num_heads * out_feats, bias=False)
        else:
            self.res_fc = Identity()

def forward(self, graph, ...):
    # ...
    # residual connection
    if self.res_fc is not None:
        resval = self.res_fc(h_dst).view(h_dst.shape[0], -1, self._out_feats)
        # h'_(l+1) = h_(l+1) + W h_(l)
        rst = rst + resval
```
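A quick usage sketch of `GATConv` on a toy graph (the graph and feature sizes are made up for illustration; note that by default GATConv raises an error for zero-in-degree nodes, hence the added self-loops):

```python
import dgl
import torch
from dgl.nn.pytorch import GATConv

# toy directed graph with 4 nodes and a few edges, plus self-loops
g = dgl.graph(([0, 1, 2, 3], [1, 2, 3, 0]))
g = dgl.add_self_loop(g)

feat = torch.randn(4, 10)                    # 4 nodes, 10 input features
conv = GATConv(in_feats=10, out_feats=8, num_heads=3)

out = conv(g, feat)                          # shape: (num_nodes, num_heads, out_feats)
print(out.shape)                             # torch.Size([4, 3, 8])
```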
DGL's detailed tutorial on GAT includes a simplified implementation of GATLayer:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GATLayer(nn.Module):
    def __init__(self, g, in_dim, out_dim):
        super(GATLayer, self).__init__()
        self.g = g
        # equation (1)
        self.fc = nn.Linear(in_dim, out_dim, bias=False)
        # equation (2)
        self.attn_fc = nn.Linear(2 * out_dim, 1, bias=False)
        self.reset_parameters()

    def reset_parameters(self):
        """Reinitialize learnable parameters."""
        gain = nn.init.calculate_gain('relu')
        nn.init.xavier_normal_(self.fc.weight, gain=gain)
        nn.init.xavier_normal_(self.attn_fc.weight, gain=gain)

    def edge_attention(self, edges):
        # edge UDF for equation (2)
        z2 = torch.cat([edges.src['z'], edges.dst['z']], dim=1)
        a = self.attn_fc(z2)
        return {'e': F.leaky_relu(a)}

    def message_func(self, edges):
        # message UDF for equation (3) & (4)
        return {'z': edges.src['z'], 'e': edges.data['e']}

    def reduce_func(self, nodes):
        # reduce UDF for equation (3) & (4)
        # equation (3): softmax over each node's incoming edges
        alpha = F.softmax(nodes.mailbox['e'], dim=1)
        # equation (4): weighted sum of neighbor features
        h = torch.sum(alpha * nodes.mailbox['z'], dim=1)
        return {'h': h}

    def forward(self, h):
        # equation (1)
        z = self.fc(h)
        self.g.ndata['z'] = z
        # equation (2)
        self.g.apply_edges(self.edge_attention)
        # equation (3) & (4)
        self.g.update_all(self.message_func, self.reduce_func)
        return self.g.ndata.pop('h')


# multi-head attention is implemented by stacking several GATLayers
class MultiHeadGATLayer(nn.Module):
    def __init__(self, g, in_dim, out_dim, num_heads, merge='cat'):
        super(MultiHeadGATLayer, self).__init__()
        self.heads = nn.ModuleList()
        for i in range(num_heads):
            self.heads.append(GATLayer(g, in_dim, out_dim))
        self.merge = merge

    def forward(self, h):
        head_outs = [attn_head(h) for attn_head in self.heads]
        if self.merge == 'cat':
            # concat on the output feature dimension (dim=1)
            return torch.cat(head_outs, dim=1)
        else:
            # merge by averaging over the heads (dim=0 is the head dimension)
            return torch.mean(torch.stack(head_outs), dim=0)
```
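A small, hypothetical shape check showing how the 'cat' merge multiplies the output dimension by the number of heads:

```python
import dgl
import torch

# toy graph with self-loops so every node has at least one incoming edge
g = dgl.add_self_loop(dgl.graph(([0, 1, 2], [1, 2, 0])))
h = torch.randn(3, 16)

layer = MultiHeadGATLayer(g, in_dim=16, out_dim=8, num_heads=4, merge='cat')
print(layer(h).shape)    # torch.Size([3, 32]) -> 4 heads x 8 features concatenated
```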
5.2 GAT model
tips:
- the first hidden layer (the input projection) has no residual connection
- the output layer has no activation, and its multi-head results are merged by averaging
```python
class GAT(nn.Module):
    def __init__(self,
                 g,
                 num_layers,
                 in_dim,
                 num_hidden,
                 num_classes,
                 heads,
                 activation,
                 feat_drop,
                 attn_drop,
                 negative_slope,
                 residual):
        super(GAT, self).__init__()
        self.g = g
        self.num_layers = num_layers
        self.gat_layers = nn.ModuleList()
        self.activation = activation
        # input projection (no residual)
        self.gat_layers.append(GATConv(
            in_dim, num_hidden, heads[0],
            feat_drop, attn_drop, negative_slope, False, self.activation))
        # hidden layers
        for l in range(1, num_layers):
            # due to multi-head, the in_dim = num_hidden * num_heads
            self.gat_layers.append(GATConv(
                num_hidden * heads[l-1], num_hidden, heads[l],
                feat_drop, attn_drop, negative_slope, residual, self.activation))
        # output projection
        self.gat_layers.append(GATConv(
            num_hidden * heads[-2], num_classes, heads[-1],
            feat_drop, attn_drop, negative_slope, residual, None))

    def forward(self, inputs):
        h = inputs
        for l in range(self.num_layers):
            h = self.gat_layers[l](self.g, h).flatten(1)
        # output projection
        logits = self.gat_layers[-1](self.g, h).mean(1)  # average over the heads
        return logits
```
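To tie things together, here is a hedged training sketch for the transductive Cora setup (the dataset loading via `dgl.data.CoraGraphDataset`, the mask names, and the optimizer settings are assumptions based on common DGL usage, not the official training script):

```python
import dgl
import torch
import torch.nn.functional as F
from dgl.data import CoraGraphDataset

data = CoraGraphDataset()
g = dgl.add_self_loop(data[0])               # add self-loops for GATConv
feats, labels = g.ndata['feat'], g.ndata['label']
train_mask = g.ndata['train_mask']

model = GAT(g, num_layers=1, in_dim=feats.shape[1], num_hidden=8,
            num_classes=data.num_classes, heads=[8, 1], activation=F.elu,
            feat_drop=0.6, attn_drop=0.6, negative_slope=0.2, residual=False)
# weight_decay corresponds to the L2 regularization λ = 0.0005 above
optimizer = torch.optim.Adam(model.parameters(), lr=5e-3, weight_decay=5e-4)

for epoch in range(200):
    model.train()
    logits = model(feats)
    loss = F.cross_entropy(logits[train_mask], labels[train_mask])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```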
--end--
If anything above is unclear, questions and suggestions are welcome~