Working with Heterogeneous Graphs in DGL

In this tutorial, you learn about:

  • Examples of heterogeneous graph data and typical applications.
  • Creating and manipulating a heterogeneous graph in DGL.
  • Implementing Relational-GCN, a popular GNN model, for heterogeneous graph input.
  • Training a model to solve a node classification task.

Heterogeneous graphs, or heterographs for short, are graphs that contain different types of nodes and edges. Different node and edge types tend to have distinct attributes, designed to capture the characteristics of each type. Within the context of graph neural networks, depending on their complexity, certain node and edge types might need to be modeled with representations of different dimensionality.

DGL supports graph neural network computations on such heterogeneous graphs through the heterograph class and its associated API.

Examples of heterographs

Many graph datasets represent relationships among various types of entities. This section provides an overview of several graph use cases that exhibit such relationships and can have their data represented as heterographs.

1. Citation graph

The Association for Computing Machinery publishes an ACM dataset that contains two million papers, their authors, publication venues, and the other papers that were cited. This information can be represented as a heterogeneous graph.

The following diagram shows several entities in the ACM dataset and the relationships among them (taken from Shi et al., 2015).

[Figure: acm-example.png — entities in the ACM dataset and the relations among them]

This graph has three types of entities that correspond to papers, authors, and publication venues. It also contains three types of edges that connect the following:

  • Authors with papers corresponding to written-by relationships
  • Papers with publication venues corresponding to published-in relationships
  • Papers with other papers corresponding to cited-by relationships

2. Recommender systems

The datasets used in recommender systems often contain interactions between users and items. For example, the data could include the ratings that users have provided to movies. Such interactions can be modeled as heterographs.

The nodes in these heterographs will have two types, users and movies. The edges will correspond to the user-movie interactions. Furthermore, if an interaction is marked with a rating, then each rating value could correspond to a different edge type. The following diagram shows an example of user-item interactions as a heterograph.

[Figure: recsys-example.png — user-item interactions as a heterograph]

3. Knowledge graph

Knowledge graphs are inherently heterogeneous. For example, in Wikidata, Barack Obama (item Q76) is an instance of a human, which could be viewed as the entity class; his spouse (item P26) is Michelle Obama (item Q13133) and his occupation (item P106) is politician (item Q82955). The relationships are shown in the following diagram.

[Figure: image.png — Wikidata relations around Barack Obama]
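
As a preview of the dgl.heterograph() API introduced in the next section, the following is a minimal, hypothetical sketch of how these Wikidata facts could be encoded as a heterograph. The node type names, relation names, and integer IDs are chosen purely for illustration and are not part of Wikidata's data model.

import dgl

# Hypothetical integer IDs for illustration:
# 'person' 0 = Barack Obama (Q76), 'person' 1 = Michelle Obama (Q13133)
# 'occupation' 0 = politician (Q82955)
kg = dgl.heterograph({
    ('person', 'spouse', 'person') : [(0, 1)],          # P26
    ('person', 'occupation', 'occupation') : [(0, 0)],  # P106
})
print(kg)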

Creating a heterograph in DGL

You can create a heterograph in DGL using the dgl.heterograph() API. The argument to dgl.heterograph() is a dictionary. The keys are tuples in the form of (srctype, edgetype, dsttype) specifying the relation name and the two entity types it connects. Such tuples are called canonical edge types. The values are data to initialize the graph structures, that is, which nodes the edges actually connect.

For instance, the following code creates the user-item interactions heterograph shown earlier.

# Each value of the dictionary is a list of edge tuples.
# Nodes are integer IDs starting from zero. Node IDs of different types
# are counted separately.
import dgl

ratings = dgl.heterograph(
    {('user', '+1', 'movie') : [(0, 0), (0, 1), (1, 0)],
     ('user', '-1', 'movie') : [(2, 1)]})
ratings

# Results:
Graph(num_nodes={'user': 3, 'movie': 2},
      num_edges={('user', '+1', 'movie'): 3, ('user', '-1', 'movie'): 1},
      metagraph=[('user', 'movie'), ('user', 'movie')])
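
Node and edge features are stored separately for each type, so, as noted in the introduction, different types can carry features of different dimensionality. The following is a small sketch with arbitrary feature names and sizes, assuming PyTorch tensors as the feature data.

import torch

# Per-type node features of different sizes, plus a per-edge-type feature.
ratings.nodes['user'].data['feat'] = torch.randn(3, 8)    # 3 users, 8-dim features
ratings.nodes['movie'].data['feat'] = torch.randn(2, 4)   # 2 movies, 4-dim features
ratings.edges['+1'].data['weight'] = torch.ones(3, 1)     # one weight per '+1' edge
print(ratings.nodes['user'].data['feat'].shape)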

DGL supports creating a graph from a variety of data sources. The following code creates the same graph as the above.

# Creating from a SciPy sparse matrix
import scipy.sparse as sp
plus1 = sp.coo_matrix(([1, 1, 1], ([0, 0, 1], [0, 1, 0])), shape=(3, 2))
minus1 = sp.coo_matrix(([1], ([2], [1])), shape=(3, 2))
ratings = dgl.heterograph(
    {('user', '+1', 'movie') : plus1,
     ('user', '-1', 'movie') : minus1})

# Creating from networkx graph
import networkx as nx
plus1 = nx.DiGraph()
plus1.add_nodes_from(['u0', 'u1', 'u2'], bipartite=0)
plus1.add_nodes_from(['m0', 'm1'], bipartite=1)
plus1.add_edges_from([('u0', 'm0'), ('u0', 'm1'), ('u1', 'm0')])
# To simplify the example, reuse the minus1 object.
# This also means that you could use different sources of graph data
# for different relationships.
ratings = dgl.heterograph(
    {('user', '+1', 'movie') : plus1,
     ('user', '-1', 'movie') : minus1})

# Creating from edge indices
ratings = dgl.heterograph(
    {('user', '+1', 'movie') : ([0, 0, 1], [0, 1, 0]),
     ('user', '-1', 'movie') : ([2], [1])})
ratings

# Results:
Graph(num_nodes={'user': 3, 'movie': 2},
      num_edges={('user', '+1', 'movie'): 3, ('user', '-1', 'movie'): 1},
      metagraph=[('user', 'movie'), ('user', 'movie')])

Manipulating a heterograph

You can create a more realistic heterograph using the ACM dataset. To do this, first download the dataset as follows:

import scipy.io
import urllib.request

data_url = 'https://s3.us-east-2.amazonaws.com/dgl.ai/dataset/ACM.mat'
data_file_path = '/tmp/ACM.mat'

urllib.request.urlretrieve(data_url, data_file_path)
data = scipy.io.loadmat(data_file_path)
print(list(data.keys()))

# Results:
['__header__', '__version__', '__globals__', 'TvsP', 'PvsA', 'PvsV', 'AvsF', 'VvsC', 'PvsL', 'PvsC', 'A', 'C', 'F', 'L', 'P', 'T', 'V', 'PvsT', 'CNormPvsA', 'RNormPvsA', 'CNormPvsC', 'RNormPvsC', 'CNormPvsT', 'RNormPvsT', 'CNormPvsV', 'RNormPvsV', 'CNormVvsC', 'RNormVvsC', 'CNormAvsF', 'RNormAvsF', 'CNormPvsL', 'RNormPvsL', 'stopwords', 'nPvsT', 'nT', 'CNormnPvsT', 'RNormnPvsT', 'nnPvsT', 'nnT', 'CNormnnPvsT', 'RNormnnPvsT', 'PvsP', 'CNormPvsP', 'RNormPvsP']

The dataset stores node information by their types: P for paper, A for author, C for conference, L for subject code, and so on. The relationships are stored as SciPy sparse matrices under keys of the form XvsY, where X and Y can be any of the node type codes.

The following code prints out some statistics about the paper-author relationships.

print(type(data['PvsA']))
print('#Papers:', data['PvsA'].shape[0])
print('#Authors:', data['PvsA'].shape[1])
print('#Links:', data['PvsA'].nnz)

# Results:
<class 'scipy.sparse.csc.csc_matrix'>
#Papers: 12499
#Authors: 17431
#Links: 37055

Converting this SciPy matrix to a heterograph in DGL is straightforward.

pa_g = dgl.heterograph({('paper', 'written-by', 'author') : data['PvsA']})
# equivalent (shorter) API for creating heterograph with two node types:
pa_g = dgl.bipartite(data['PvsA'], 'paper', 'written-by', 'author')

You can easily print out the type names and other structural information.

print('Node types:', pa_g.ntypes)
print('Edge types:', pa_g.etypes)
print('Canonical edge types:', pa_g.canonical_etypes)

# Nodes and edges are assigned integer IDs starting from zero, and each type has its own ID space.
# To distinguish the nodes and edges of different types, specify the type name as the argument.
print(pa_g.number_of_nodes('paper'))
# Canonical edge type name can be shortened to only one edge type name if it is
# uniquely distinguishable.
print(pa_g.number_of_edges(('paper', 'written-by', 'author')))
print(pa_g.number_of_edges('written-by'))
print(pa_g.successors(1, etype='written-by'))  # get the authors that write paper #1

# Type name argument could be omitted whenever the behavior is unambiguous.
print(pa_g.number_of_edges())  # Only one edge type, the edge type argument could be omitted

# Results:
Node types: ['paper', 'author']
Edge types: ['written-by']
Canonical edge types: [('paper', 'written-by', 'author')]
12499
37055
37055
tensor([3532, 6421, 8516, 8560])
37055

A homogeneous graph is just a special case of a heterograph with only one type of node and edge. In this case, all the APIs are exactly the same as in DGLGraph.

# Paper-citing-paper graph is a homogeneous graph
pp_g = dgl.heterograph({('paper', 'citing', 'paper') : data['PvsP']})
# equivalent (shorter) API for creating homogeneous graph
pp_g = dgl.graph(data['PvsP'], 'paper', 'cite')

# All the ntype and etype arguments could be omitted because the behavior is unambiguous.
print(pp_g.number_of_nodes())
print(pp_g.number_of_edges())
print(pp_g.successors(3))

# Results:
12499
30789
tensor([1361, 2624, 8670, 9845])

Create a subset of the ACM graph using the paper-author, paper-paper, and paper-subject relationships. Also add the reverse relationships to prepare for the later sections.

G = dgl.heterograph({
        ('paper', 'written-by', 'author') : data['PvsA'],
        ('author', 'writing', 'paper') : data['PvsA'].transpose(),
        ('paper', 'citing', 'paper') : data['PvsP'],
        ('paper', 'cited', 'paper') : data['PvsP'].transpose(),
        ('paper', 'is-about', 'subject') : data['PvsL'],
        ('subject', 'has', 'paper') : data['PvsL'].transpose(),
    })

print(G)

# Results:
Graph(num_nodes={'paper': 12499, 'author': 17431, 'subject': 73},
      num_edges={('paper', 'written-by', 'author'): 37055, ('author', 'writing', 'paper'): 37055, ('paper', 'citing', 'paper'): 30789, ('paper', 'cited', 'paper'): 30789, ('paper', 'is-about', 'subject'): 12499, ('subject', 'has', 'paper'): 12499},
      metagraph=[('paper', 'author'), ('paper', 'paper'), ('paper', 'paper'), ('paper', 'subject'), ('author', 'paper'), ('subject', 'paper')])

The metagraph (or network schema) is a useful summary of a heterograph. Serving as a template for a heterograph, it tells which types of objects exist in the network and where the possible links are.

DGL provides easy access to the metagraph, which can be visualized using external tools.

# Draw the metagraph using graphviz.
import pygraphviz as pgv
def plot_graph(nxg):
    ag = pgv.AGraph(strict=False, directed=True)
    for u, v, k in nxg.edges(keys=True):
        ag.add_edge(u, v, label=k)
    ag.layout('dot')
    ag.draw('graph.png')

plot_graph(G.metagraph)
[Figure: graph.png — metagraph of the ACM subset rendered by graphviz]
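
If pygraphviz is not available, the metagraph can also be inspected textually. The plot_graph function above already treats it as a networkx MultiDiGraph, so a quick sketch of a direct inspection could look like this.

# Print the node types and the typed relations of the metagraph.
meta = G.metagraph
print(list(meta.nodes()))
print(list(meta.edges(keys=True)))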

Learning tasks associated with heterographs

Some of the typical learning tasks that involve heterographs include:

  • Node classification and regression to predict the class of each node or estimate a value associated with it.
  • Link prediction to predict if there is an edge of a certain type between a pair of nodes, or predict which other nodes a particular node is connected with (and optionally the edge types of such connections).
  • Graph classification/regression to assign an entire heterograph into one of the target classes or to estimate a numerical value associated with it.

This tutorial presents a simple example for the first task.

A semi-supervised node classification example

Our goal is to predict the publishing conference of a paper using the ACM academic graph we just created. To further simplify the task, we focus only on papers published in three conferences: KDD, ICML, and VLDB. All the other papers are not labeled, making it a semi-supervised setting.

The following code extracts those papers from the raw dataset and prepares the training, validation, testing split.

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

pvc = data['PvsC'].tocsr()
# find all papers published in KDD, ICML, VLDB
c_selected = [0, 11, 13]  # KDD, ICML, VLDB
p_selected = pvc[:, c_selected].tocoo()
# generate labels
labels = pvc.indices
labels[labels == 11] = 1
labels[labels == 13] = 2
labels = torch.tensor(labels).long()

# generate train/val/test split
pid = p_selected.row
shuffle = np.random.permutation(pid)
train_idx = torch.tensor(shuffle[0:800]).long()
val_idx = torch.tensor(shuffle[800:900]).long()
test_idx = torch.tensor(shuffle[900:]).long()
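
As a quick sanity check, not part of the original pipeline, one could print the number of labeled papers and the class balance of the training split.

# Hypothetical sanity check of the split produced above.
print('#Labeled papers:', len(pid))
print('Train class counts:', torch.bincount(labels[train_idx]))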

Relational-GCN on heterograph

We use Relational-GCN to learn the representation of nodes in the graph. Its message-passing equation is as follows:

h_i^{(l+1)}=\sigma\left(\sum_{r\in R}\sum_{j\in N_r(i)}W_r^{(l)}h_j^{(l)}\right)\tag{1}

Breaking down the equation, you see that there are two parts in the computation.

  1. Message computation and aggregation within each relation r
  2. Reduction that merges the results from multiple relationships

Following this intuition, perform message passing on a heterograph in two steps.

  1. Per-edge-type message passing
  2. Type-wise reduction

import dgl.function as fn

class HeteroRGCNLayer(nn.Module):
    def __init__(self, in_size, out_size, etypes):
        super(HeteroRGCNLayer, self).__init__()
        # W_r for each relation
        self.weight = nn.ModuleDict({
                name : nn.Linear(in_size, out_size) for name in etypes
            })

    def forward(self, G, feat_dict):
        # The input is a dictionary of node features for each type
        funcs = {}
        for srctype, etype, dsttype in G.canonical_etypes:
            # Compute W_r * h
            Wh = self.weight[etype](feat_dict[srctype])
            # Save it in graph for message passing
            G.nodes[srctype].data['Wh_%s' % etype] = Wh
            # Specify per-relation message passing functions: (message_func, reduce_func).
            # Note that the results are saved to the same destination feature 'h', which
            # hints the type-wise reducer for aggregation.
            funcs[etype] = (fn.copy_u('Wh_%s' % etype, 'm'), fn.mean('m', 'h'))
        # Trigger message passing of multiple types.
        # The first argument is the message passing functions for each relation.
        # The second one is the type-wise reducer; it can be "sum", "max",
        # "min", "mean", or "stack".
        G.multi_update_all(funcs, 'sum')
        # return the updated node feature dictionary
        return {ntype : G.nodes[ntype].data['h'] for ntype in G.ntypes}
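
Before stacking layers, a single HeteroRGCNLayer can be smoke-tested on its own. The following sketch assumes the ACM heterograph G built earlier and an arbitrary input size of 10; the feature dictionary is filled with random tensors just to exercise the forward pass.

# One forward pass through a single layer with random per-type input features.
layer = HeteroRGCNLayer(10, 5, G.etypes)
feat_dict = {ntype : torch.randn(G.number_of_nodes(ntype), 10) for ntype in G.ntypes}
out_dict = layer(G, feat_dict)
print({ntype : h.shape for ntype, h in out_dict.items()})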

Create a simple GNN by stacking two HeteroRGCNLayer modules. Since the nodes do not have input features, make their embeddings trainable.

class HeteroRGCN(nn.Module):
    def __init__(self, G, in_size, hidden_size, out_size):
        super(HeteroRGCN, self).__init__()
        # Use trainable node embeddings as featureless inputs.
        embed_dict = {ntype : nn.Parameter(torch.Tensor(G.number_of_nodes(ntype), in_size))
                      for ntype in G.ntypes}
        for key, embed in embed_dict.items():
            nn.init.xavier_uniform_(embed)
        self.embed = nn.ParameterDict(embed_dict)
        # create layers
        self.layer1 = HeteroRGCNLayer(in_size, hidden_size, G.etypes)
        self.layer2 = HeteroRGCNLayer(hidden_size, out_size, G.etypes)

    def forward(self, G):
        h_dict = self.layer1(G, self.embed)
        h_dict = {k : F.leaky_relu(h) for k, h in h_dict.items()}
        h_dict = self.layer2(G, h_dict)
        # get paper logits
        return h_dict['paper']

Train and evaluate

Train and evaluate this network.

# Create the model. The output has three logits for three classes.
model = HeteroRGCN(G, 10, 10, 3)

opt = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

best_val_acc = 0
best_test_acc = 0

for epoch in range(100):
    logits = model(G)
    # The loss is computed only for labeled nodes.
    loss = F.cross_entropy(logits[train_idx], labels[train_idx])

    pred = logits.argmax(1)
    train_acc = (pred[train_idx] == labels[train_idx]).float().mean()
    val_acc = (pred[val_idx] == labels[val_idx]).float().mean()
    test_acc = (pred[test_idx] == labels[test_idx]).float().mean()

    if best_val_acc < val_acc:
        best_val_acc = val_acc
        best_test_acc = test_acc

    opt.zero_grad()
    loss.backward()
    opt.step()

    if epoch % 5 == 0:
        print('Loss %.4f, Train Acc %.4f, Val Acc %.4f (Best %.4f), Test Acc %.4f (Best %.4f)' % (
            loss.item(),
            train_acc.item(),
            val_acc.item(),
            best_val_acc.item(),
            test_acc.item(),
            best_test_acc.item(),
        ))

# Results:
Loss 1.0100, Train Acc 0.4963, Val Acc 0.5800 (Best 0.5800), Test Acc 0.5075 (Best 0.5075)
Loss 0.9120, Train Acc 0.5150, Val Acc 0.6200 (Best 0.6200), Test Acc 0.5134 (Best 0.5134)
Loss 0.7628, Train Acc 0.7188, Val Acc 0.7100 (Best 0.7100), Test Acc 0.5771 (Best 0.5771)
Loss 0.5589, Train Acc 0.8075, Val Acc 0.8000 (Best 0.8000), Test Acc 0.6851 (Best 0.6851)
Loss 0.3738, Train Acc 0.8788, Val Acc 0.8000 (Best 0.8000), Test Acc 0.7337 (Best 0.6851)
Loss 0.2392, Train Acc 0.9400, Val Acc 0.7800 (Best 0.8000), Test Acc 0.7454 (Best 0.6851)
Loss 0.1458, Train Acc 0.9762, Val Acc 0.7900 (Best 0.8000), Test Acc 0.7504 (Best 0.6851)
Loss 0.0901, Train Acc 0.9912, Val Acc 0.7800 (Best 0.8000), Test Acc 0.7621 (Best 0.6851)
Loss 0.0602, Train Acc 1.0000, Val Acc 0.7800 (Best 0.8000), Test Acc 0.7663 (Best 0.6851)
Loss 0.0445, Train Acc 1.0000, Val Acc 0.7800 (Best 0.8000), Test Acc 0.7647 (Best 0.6851)
Loss 0.0359, Train Acc 1.0000, Val Acc 0.7700 (Best 0.8000), Test Acc 0.7647 (Best 0.6851)
Loss 0.0304, Train Acc 1.0000, Val Acc 0.7700 (Best 0.8000), Test Acc 0.7663 (Best 0.6851)
Loss 0.0264, Train Acc 1.0000, Val Acc 0.7700 (Best 0.8000), Test Acc 0.7630 (Best 0.6851)
Loss 0.0234, Train Acc 1.0000, Val Acc 0.7900 (Best 0.8000), Test Acc 0.7680 (Best 0.6851)
Loss 0.0213, Train Acc 1.0000, Val Acc 0.7900 (Best 0.8000), Test Acc 0.7680 (Best 0.6851)
Loss 0.0194, Train Acc 1.0000, Val Acc 0.8000 (Best 0.8000), Test Acc 0.7680 (Best 0.6851)
Loss 0.0179, Train Acc 1.0000, Val Acc 0.8000 (Best 0.8000), Test Acc 0.7688 (Best 0.6851)
Loss 0.0166, Train Acc 1.0000, Val Acc 0.7800 (Best 0.8000), Test Acc 0.7705 (Best 0.6851)
Loss 0.0156, Train Acc 1.0000, Val Acc 0.7800 (Best 0.8000), Test Acc 0.7697 (Best 0.6851)
Loss 0.0147, Train Acc 1.0000, Val Acc 0.7800 (Best 0.8000), Test Acc 0.7705 (Best 0.6851)
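
After training, the same forward pass can be reused to inspect individual predictions. The following is a small follow-up sketch, not part of the original tutorial, comparing predictions with ground-truth labels for a few test papers.

# Compare predicted and true classes on the first ten test papers.
model.eval()
with torch.no_grad():
    logits = model(G)
pred = logits.argmax(1)
print('Predicted:   ', pred[test_idx[:10]].tolist())
print('Ground truth:', labels[test_idx[:10]].tolist())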

API: dgl.DGLHeteroGraph.multi_update_all

import dgl
import dgl.function as fn
import torch

g1 = dgl.graph([(0, 1), (1, 1)], 'user', 'follows')
g2 = dgl.bipartite([(0, 1)], 'game', 'attracts', 'user')
g = dgl.hetero_from_relations([g1, g2])
g.nodes['user'].data['h'] = torch.tensor([[100.], [2.]])
g.nodes['game'].data['h'] = torch.tensor([[5.]])

g.multi_update_all(
    {'follows': (fn.copy_src('h', 'm'), fn.sum('m', 'h')),
     'attracts': (fn.copy_src('h', 'm'), fn.sum('m', 'h'))},
    'sum')
g.nodes['user'].data['h'], g.nodes['game'].data['h']

# Results:
(tensor([[  0.],
         [107.]]),
 tensor([[5.]]))
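
To trace the numbers: along 'follows', user 1 aggregates 100 (from user 0) and 2 (from its self-loop edge) into 102, while user 0 has no incoming edges and its feature becomes 0; along 'attracts', user 1 additionally receives 5 from game 0. The cross-type 'sum' reducer then combines the per-relation results, giving 102 + 5 = 107 for user 1. Game nodes are not the destination of any relation here, so their feature is left unchanged.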

Original link: https://docs.dgl.ai/tutorials/hetero/1_basics.html
