安裝
1. Install PyTorch
2. Clone the OpenKE-PyTorch branch:
$ git clone -b OpenKE-PyTorch [https://github.com/thunlp/OpenKE](https://github.com/thunlp/OpenKE)
$ cd OpenKE
3. Compile C++ files
$ bash make.sh
訓(xùn)練
需要三個文件
-
train2id.txt
訓(xùn)練文件卸奉,第一行是三元組的數(shù)量钝诚,接下來的數(shù)據(jù)格式是(e1,e2,rel)
,需要注意的是,e1,和e2是實(shí)體的編號榄棵,rel是關(guān)系的編號凝颇,其對應(yīng)關(guān)系存放在文件entity2id.txt
和relation2id.txt
。 -
entity2id.txt
第一行是實(shí)體的數(shù)目疹鳄,接下來的每一行是實(shí)體對應(yīng)相關(guān)的id拧略。 -
relation2id.txt
第一行是關(guān)系的數(shù)目,接下來的每一行是關(guān)系對應(yīng)的id瘪弓。
測試
需要5個文件垫蛆,除過上面的三個,還需要
-
test2id.txt
第一行是測試三元組的數(shù)目腺怯,接下來的是(e1,e2,rel)
. -
valid2id.txt
驗(yàn)證數(shù)據(jù)集袱饭,第一行是驗(yàn)證三元組的個數(shù),接下來的行是驗(yàn)證數(shù)據(jù)(e1,e2,rel)
-
type_constrain.txt
類型限制文件呛占,表示關(guān)系只能和特定類型的頭實(shí)體和尾實(shí)體結(jié)合虑乖,文件第一行是關(guān)系的數(shù)目,接下來的行是每一種關(guān)系的類型限制晾虑,例如: 某個relation的id是1200 疹味,他的頭實(shí)體(head entities)有四種類型 3123 1034,58,5733 這個relation同時又有4種類型的尾實(shí)體12123,4388,11087,11088 帜篇,這種n對n的關(guān)系可以通過through n-n.py in folder benchmarks/FB15K來查看糙捺。
Quick Start
import config
import models
import json
import numpy as np
con = config.Config()
#Input training files from benchmarks/FB15K/ folder.
con.set_in_path("./benchmarks/FB15K/")
con.set_work_threads(4)
con.set_train_times(500)
con.set_nbatches(100)
con.set_alpha(0.001)
con.set_margin(1.0)
con.set_bern(0)
con.set_dimension(50)
con.set_ent_neg_rate(1)
con.set_rel_neg_rate(0)
con.set_opt_method("SGD")
#Models will be exported via tf.Saver() automatically.
con.set_export_files("./res/model.vec.tf", 0)
#Model parameters will be exported to json files automatically.
con.set_out_files("./res/embedding.vec.json")
#Initialize experimental settings.
con.init()
#Set the knowledge embedding model
con.set_model(models.TransE)
#Train the model.
con.run()
步驟1 加載數(shù)據(jù)
這個文件夾下面有三個文件train2id.txt
,entity2id.txt
,relation2id.txt
。
con.set_in_path("benchmarks/FB15K/")
可以分配幾個threads進(jìn)行采樣sample positive and negative cases.
con.set_work_threads(8)
步驟2, 設(shè)置訓(xùn)練參數(shù)
最大訓(xùn)練輪數(shù)笙隙,batchSeize继找,實(shí)體和關(guān)系的維數(shù),
con.set_train_times(500)
con.set_nbatches(100)
con.set_alpha(0.5)
con.set_dimension(200)
con.set_margin(1)
對于負(fù)采樣逃沿,我們可以把正常實(shí)體和關(guān)系拆分來構(gòu)造negative triples婴渡, set_bern(0)
是傳統(tǒng)的采樣方法,set_bern(1)
表示使用 (Wang et al. 2014) denoted as "bern"提出的構(gòu)造方法凯亮,set_ent_neg_rate
是設(shè)置實(shí)體的負(fù)采樣率边臼,set_rel_neg_rate
是設(shè)置關(guān)系的負(fù)采樣率。
con.set_bern(0)
con.set_ent_neg_rate(1)
con.set_rel_neg_rate(0)
設(shè)置優(yōu)化方法
con.set_optimizer("SGD")
步驟3假消,輸出結(jié)果
模型參數(shù)每隔幾輪就會使用torch.save()
自動的保存下來柠并,同時最終的結(jié)果會保存成json 文件的形式。
con.set_export_files("./res/model.vec.pt")
con.set_out_files("./res/embedding.vec.json")
步驟4 訓(xùn)練模型
con.init()
con.set_model(models.TransE)
con.run()
步驟5 測試
測試任務(wù)
link prediction任務(wù):用于預(yù)測三元組中缺失的關(guān)系或者尾實(shí)體富拗,對于測試的三元組臼予,我們replace掉了head/tail 實(shí)體,并以降序的順序給出預(yù)測出實(shí)體的得分啃沪。平均的指標(biāo)有:
- MR:正確實(shí)體的平均rank粘拾。
- MRR: the average of the reciprocal ranks of correct entities。
- Hit@N:正確實(shí)體在top-N的比率
三元組分類任務(wù):判斷一個三元組(h,r,t)
是否正確创千,這是一個二分類問題缰雇。
預(yù)測頭實(shí)體任務(wù): 預(yù)測topk個可能的頭實(shí)體,所有的頭實(shí)體用id表示
def predict_head_entity(self, t, r, k):
r'''This mothod predicts the top k head entities given tail entity and relation.
Args:
t (int): tail entity id
r (int): relation id
k (int): top k head entities
Returns:
list: k possible head entity ids
'''
self.init_link_prediction()
if self.importName != None:
self.restore_pytorch()
test_h = np.array(range(self.entTotal))
test_r = np.array([r] * self.entTotal)
test_t = np.array([t] * self.entTotal)
res = self.trainModel.predict(test_h, test_t, test_r).data.numpy().reshape(-1).argsort()[:k]
print(res)
return res
預(yù)測尾實(shí)體:與預(yù)測頭實(shí)體相似追驴。
預(yù)測關(guān)系:
def predict_relation(self, h, t, k):
r'''This methods predict the relation id given head entity and tail entity.
Args:
h (int): head entity id
t (int): tail entity id
k (int): top k relations
Returns:
list: k possible relation ids
'''
self.init_link_prediction()
if self.importName != None:
self.restore_pytorch()
test_h = np.array([h] * self.relTotal)
test_r = np.array(range(self.relTotal))
test_t = np.array([t] * self.relTotal)
res = self.trainModel.predict(test_h, test_t, test_r).data.numpy().reshape(-1).argsort()[:k]
print(res)
return res
預(yù)測三元組:給一個三元組械哟,這個函數(shù)告訴我們是否這個三元組是否正確,如果threshold沒有給出殿雪,那么函數(shù)從驗(yàn)證集中計算出這個關(guān)系的threshold暇咆。
def predict_triple(self, h, t, r, thresh = None):
r'''This method tells you whether the given triple (h, t, r) is correct of wrong
Args:
h (int): head entity id
t (int): tail entity id
r (int): relation id
thresh (fload): threshold for the triple
'''
self.init_triple_classification()
if self.importName != None:
self.restore_pytorch()
res = self.trainModel.predict(np.array([h]), np.array([t]), np.array([r])).data.numpy()
if thresh != None:
if res < thresh:
print("triple (%d,%d,%d) is correct" % (h, t, r))
else:
print("triple (%d,%d,%d) is wrong" % (h, t, r))
return
self.lib.getValidBatch(self.valid_pos_h_addr, self.valid_pos_t_addr, self.valid_pos_r_addr, self.valid_neg_h_addr, self.valid_neg_t_addr, self.valid_neg_r_addr)
res_pos = self.trainModel.predict(self.valid_pos_h, self.valid_pos_t, self.valid_pos_r)
res_neg = self.trainModel.predict(self.valid_neg_h, self.valid_neg_t, self.valid_neg_r)
self.lib.getBestThreshold(self.relThresh_addr, res_pos.data.numpy().__array_interface__['data'][0], res_neg.data.numpy().__array_interface__['data'][0])
if res < self.relThresh[r]:
print("triple (%d,%d,%d) is correct" % (h, t, r))
else:
print("triple (%d,%d,%d) is wrong" % (h, t, r))
具體實(shí)施
第一步是導(dǎo)入數(shù)據(jù)集,然后配置參數(shù)丙曙,然后設(shè)置模型參數(shù)和測試的模型爸业,例如我們需要測試TransE:有三種方法來測試模型。
- 設(shè)置導(dǎo)入文件河泳,OpenKE-PyTorch會自動的通過torch.load()來加載模型沃呢。
import config
import models
import numpy as np
import json
con = config.Config()
con.set_in_path("./benchmarks/FB15K/")
con.test_link_prediction(True)
con.test_triple_classification(True)
con.set_work_threads(4)
con.set_dimension(100)
con.set_import_files("./res/model.vec.pt")
con.init()
con.set_model(models.TransE)
con.test()
- 從json文件中讀取模型參數(shù),手動的加載參數(shù)。
import config
import models
import numpy as np
import json
con = config.Config()
con.set_in_path("./benchmarks/FB15K/")
con.test_link_prediction(True)
con.test_triple_classification(True)
con.set_work_threads(4)
con.set_dimension(100)
con.init()
con.set_model(models.TransE)
f = open("./res/embedding.vec.json", "r")
content = json.loads(f.read())
f.close()
con.set_parameters(content)
con.test()
- 使用torch.load()手動的加載模型拆挥。
import config
import models
import numpy as np
import json
con = config.Config()
con.set_in_path("./benchmarks/FB15K/")
con.test_link_prediction(True)
con.test_triple_classification(True)
con.set_work_threads(4)
con.set_dimension(100)
con.init()
con.set_model(models.TransE)
con.import_variables("./res/model.vec.pt")
con.test()
獲取embedding 矩陣
有四種方式來獲取embedding矩陣
- 設(shè)置import 文件那么OpenKE 會自動的使用torch.load()加載模型.
使用con.get_parameters()
函數(shù)來獲得list類型的embedding矩陣薄霜,可以通過con.get_parameters("numpy")
獲得numpy
類型的參數(shù)。
import json
import numpy as py
import config
import models
con = config.Config()
con.set_in_path("./benchmarks/FB15K/")
con.test_link_prediction(True)
con.test_triple_classification(True)
con.set_work_threads(4)
con.set_dimension(100)
con.set_import_files("./res/model.vec.pt")
con.init()
con.set_model(models.TransE)
# Get the embeddings (numpy.array)
embeddings = con.get_parameters("numpy")
# Get the embeddings (python list)
embeddings = con.get_parameters()
- 從json文件中獲取
import json
import numpy as py
import config
import models
con = config.Config()
con.set_in_path("./benchmarks/FB15K/")
con.test_link_prediction(True)
con.test_triple_classification(True)
con.set_work_threads(4)
con.set_dimension(100)
con.init()
con.set_model(models.TransE)
f = open("./res/embedding.vec.json", "r")
embeddings = json.loads(f.read())
f.close()
- 手動的使用torch.load()加載模型纸兔,但是獲取embedding的方式還是一樣的惰瓜。
con = config.Config()
con.set_in_path("./benchmarks/FB15K/")
con.test_link_prediction(True)
con.test_triple_classification(True)
con.set_work_threads(4)
con.set_dimension(100)
con.init()
con.set_model(models.TransE)
con.import_variables("./res/model.vec.pt")
# Get the embeddings (numpy.array)
embeddings = con.get_parameters("numpy")
# Get the embeddings (python list)
embeddings = con.get_parameters()
- 從一個訓(xùn)練好的模型中立即拿到embedding。
#Models will be exported via tf.Saver() automatically.
con.set_export_files("./res/model.vec.pt")
#Model parameters will be exported to json files automatically.
con.set_out_files("./res/embedding.vec.json")
#Initialize experimental settings.
con.init()
#Set the knowledge embedding model
con.set_model(models.TransE)
#Train the model.
con.run()
#Get the embeddings (numpy.array)
embeddings = con.get_parameters("numpy")
#Get the embeddings (python list)
embeddings = con.get_parameters()
接口
Config的接口
-
def set_alpha(alpha = 0.001)
設(shè)置學(xué)習(xí)率 -
def set_lmbda(lmbda = 0.0)
汉矿,設(shè)置正則化前面的系數(shù)崎坊。
To set the degree of the regularization on the parameters
-
set_train_times(self, times)
相當(dāng)于設(shè)置epoch -
def sampling()
從正樣本和負(fù)樣本中采樣一個batch -
def set_in_path(self, path)
讀取數(shù)據(jù) -
def set_out_files(self, path)
當(dāng)訓(xùn)練結(jié)束的時候?qū)⒛P偷膮?shù)變成json文件存儲下來。 -
def set_import_files(self, path)
模型所有的參數(shù)都可以用這個文件夾里面恢復(fù)洲拇。 -
def set_export_steps(self, steps)
每隔多少步存儲一次文件 -
def save_pytorch(self)
使用torch.save來保存模型 -
def import_variables(self, path = None)
恢復(fù) tensorflow模型奈揍,相當(dāng)于restore_tensorflow()
-
def set_parameters(self, lists)
從jaon 文件中加載parameters -
def get_parameters(self, mode = "numpy")
獲取模型參數(shù)也就是每個實(shí)體的embedding -
def set_model(model)
表示使用什么model進(jìn)行knowledge embedding -
def set_log_on(flag = 1)
如果設(shè)置為1那么表示會打印loss函數(shù)
The framework will print loss values during training if flag = 1
class Config(object):
#To set the learning rate
def set_alpha(alpha = 0.001)
#To set the degree of the regularization on the parameters
def set_lmbda(lmbda = 0.0)
#To set the gradient descent optimization algorithm (SGD, Adagrad, Adadelta, Adam)
def set_optimizer(optimizer = "SGD")
#To set the data traversing rounds
def set_train_times(self, times)
#To split the training triples into several batches, nbatches is the number of batches
def set_nbatches(nbatches = 100)
#To set the margin for the loss function
def set_margin(margin = 1.0)
#To set the dimensions of the entities and relations at the same time
def set_dimension(dim)
#To set the dimensions of the entities
def set_ent_dimension(self, dim)
#To set the dimensions of the relations
def set_rel_dimension(self, dim)
#To allocate threads for each batch sampling
def set_work_threads(threads = 1)
#To set negative sampling algorithms, unif (bern = 0) or bern (bern = 1)
def set_bern(bern = 1)
#For each positive triple, we construct rate negative triples by corrupt the entity
def set_ent_neg_rate(rate = 1)
#For each positive triple, we construct rate negative triples by corrupt the relation
def set_rel_neg_rate(rate = 0)
#To sample a batch of training triples, including positive and negative ones.
def sampling()
#To import dataset from the benchmark folder
def set_in_path(self, path)
#To export model parameters to json files when training completed
def set_out_files(self, path)
#To set the import files, all parameters can be restored from the import files
def set_import_files(self, path)
#To set the export file of model paramters, and export results every few rounds
def set_export_files(self, path, steps = 0)
#To export results every few rounds
def set_export_steps(self, steps)
#To save model via torch.save()
def save_pytorch(self)
#To restore model via torch.load()
def restore_pytorch(self)
#To export model paramters, when path is none, equivalent to save_tensorflow()
def export_variables(self, path = None)
#To import model paramters, when path is none, equivalent to restore_tensorflow()
def import_variables(self, path = None)
#To export model paramters to designated path
def save_parameters(self, path = None)
#To manually load parameters which are read from json files
def set_parameters(self, lists)
#To get model paramters, if using mode "numpy", you can get np.array , else you can get python lists
def get_parameters(self, mode = "numpy")
#To set the knowledge embedding model
def set_model(model)
#The framework will print loss values during training if flag = 1
def set_log_on(flag = 1)
#This is essential when testing
def test_link_prediction(True)
def test_triple_classification(True)
模型的接口
class Model(object)
# in_batch = True, return [positive_head, positive_tail, positive_relation]
# The shape of positive_head is [batch_size, 1]
# in_batch = False, return [positive_head, positive_tail, positive_relation]
# The shape of positive_head is [batch_size]
get_positive_instance(in_batch = True)
# in_batch = True, return [negative_head, negative_tail, negative_relation]
# The shape of positive_head is [batch_size, negative_ent_rate + negative_rel_rate]
# in_batch = False, return [negative_head, negative_tail, negative_relation]
# The shape of positive_head is [(negative_ent_rate + negative_rel_rate) * batch_size]
get_negative_instance(in_batch = True)
# in_batch = True, return all training instances with the shape [batch_size, (1 + negative_ent_rate + negative_rel_rate)]
# in_batch = False, return all training instances with the shape [(negative_ent_rate + negative_rel_rate + 1) * batch_size]
def get_all_instance(in_batch = False)
# in_batch = True, return all training labels with the shape [batch_size, (1 + negative_ent_rate + negative_rel_rate)]
# in_batch = False, return all training labels with the shape [(negative_ent_rate + negative_rel_rate + 1) * batch_size]
# The positive triples are labeled as 1, and the negative triples are labeled as -1
def get_all_labels(in_batch = False)
#To calulate the loss
def forward(self)
# To define loss functions for knowledge embedding models
def loss_func()
# To define the prediction functions for knowledge embedding models
def predict(self)
def __init__(config)
#The implementation for TransE
class TransE(Model)
#The implementation for TransH
class TransH(Model)
#The implementation for TransR
class TransR(Model)
#The implementation for TransD
class TransD(Model)
#The implementation for RESCAL
class RESCAL(Model)
#The implementation for DistMult
class DistMult(Model)
#The implementation for ComplEx
class ComplEx(Model)