Principle
Automatic summarization means extracting the key sentences from a document automatically. What counts as a key sentence? To a human, it is a sentence that captures the central idea of the article; a machine can only approximate that judgment by defining a weighting scheme, scoring every sentence against it, and returning the few top-ranked sentences.
TextRank's scoring is again derived from PageRank's iterative idea, as the following formula shows:

    WS(V_i) = (1 - d) + d * Σ_{j≠i} [ w_ji / Σ_{k≠j} w_jk ] * WS(V_j)

The left-hand side is the weight of a sentence (WS is short for weight_sum); the sum on the right expresses how much each neighboring sentence contributes to this one. Unlike keyword extraction, all sentences are generally treated as adjacent to one another, so no co-occurrence window is used.
In the numerator, w_ji is the similarity between the two sentences; the BM25 algorithm is recommended for computing it. The denominator is again a weight_sum, and WS(V_j) is sentence j's weight from the previous iteration. The whole formula is applied iteratively until the weights converge.
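As a toy illustration of this iteration, the update rule can be sketched as follows. A hand-made similarity matrix stands in for the BM25 scores here, and all names are illustrative, not part of the original code:

```python
def text_rank(sim, d=0.85, max_iter=200, min_diff=1e-3):
    """Iterate WS(Vi) = (1-d) + d * sum_j sim[j][i] / sum_k sim[j][k] * WS(Vj)."""
    n = len(sim)
    ws = [1.0] * n  # every sentence starts with weight 1.0
    # denominator: each sentence's total similarity to the others (self excluded)
    out_sum = [sum(row) - row[j] for j, row in enumerate(sim)]
    for _ in range(max_iter):
        new_ws = []
        for i in range(n):
            score = 1 - d
            for j in range(n):
                if j == i or out_sum[j] == 0:
                    continue
                score += d * sim[j][i] / out_sum[j] * ws[j]
            new_ws.append(score)
        converged = max(abs(a - b) for a, b in zip(new_ws, ws)) <= min_diff
        ws = new_ws
        if converged:
            break
    return ws

# Toy 3-sentence similarity matrix: sentence 2 is strongly similar to both
# of the others, so it should end up with the highest weight.
sim = [[0.0, 0.2, 0.8],
       [0.2, 0.0, 0.9],
       [0.8, 0.9, 0.0]]
weights = text_rank(sim)
ranked = sorted(range(3), key=lambda i: weights[i], reverse=True)
print(ranked)  # expect sentence 2 to rank first
```

Ranking the indices by their converged weights then yields the summary order.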
Code Implementation
text = '''
自然語言處理是計算機科學領域與人工智能領域中的一個重要方向。
它研究能實現人與計算機之間用自然語言進行有效通信的各種理論和方法。
自然語言處理是一門融語言學、計算機科學、數學于一體的科學。
因此，這一領域的研究將涉及自然語言，即人們日常使用的語言，
所以它與語言學的研究有著密切的聯系，但又有重要的區別。
自然語言處理并不是一般地研究自然語言，
而在于研制能有效地實現自然語言通信的計算機系統，
特別是其中的軟件系統。因而它是計算機科學的一部分。
'''
import jieba
from utils import utils                 # the author's helper module (see the repo link below)
from snownlp import seg
from snownlp.sim.bm25 import BM25       # BM25 similarity; this import was missing from the original listing


class TextRank(object):

    def __init__(self, docs):
        self.docs = docs
        self.bm25 = BM25(docs)          # BM25 scorer over all sentences
        self.D = len(docs)              # number of sentences
        self.d = 0.85                   # damping factor
        self.weight = []                # weight[i][j]: similarity of sentence i to sentence j
        self.weight_sum = []            # each sentence's total similarity to the others
        self.vertex = []                # WS(Vi): the weight of each sentence
        self.max_iter = 200
        self.min_diff = 0.001           # convergence threshold
        self.top = []

    def text_rank(self):
        for cnt, doc in enumerate(self.docs):
            scores = self.bm25.simall(doc)
            self.weight.append(scores)
            # exclude the sentence's similarity to itself from the denominator
            self.weight_sum.append(sum(scores) - scores[cnt])
            self.vertex.append(1.0)
        for _ in range(self.max_iter):
            m = []
            max_diff = 0
            for i in range(self.D):
                m.append(1 - self.d)
                for j in range(self.D):
                    if j == i or self.weight_sum[j] == 0:
                        continue
                    # the TextRank formula
                    m[-1] += (self.d * self.weight[j][i]
                              / self.weight_sum[j] * self.vertex[j])
                if abs(m[-1] - self.vertex[i]) > max_diff:
                    max_diff = abs(m[-1] - self.vertex[i])
            self.vertex = m
            if max_diff <= self.min_diff:   # stop once the weights have converged
                break
        self.top = list(enumerate(self.vertex))
        self.top = sorted(self.top, key=lambda x: x[1], reverse=True)

    def top_index(self, limit):
        return list(map(lambda x: x[0], self.top))[:limit]

    def top_sentences(self, limit):
        # renamed from top(): the original method name collided with the
        # self.top attribute, and the limit argument was never applied
        return list(map(lambda x: self.docs[x[0]], self.top))[:limit]


if __name__ == '__main__':
    sents = utils.get_sentences(text)
    doc = []
    for sent in sents:
        words = seg.seg(sent)
        # words = list(jieba.cut(sent))
        words = utils.filter_stop(words)
        doc.append(words)
    print(doc)
    rank = TextRank(doc)
    rank.text_rank()
    for index in rank.top_index(3):
        print(sents[index])
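The class above delegates all similarity computation to BM25's simall(), which scores one segmented sentence against every sentence in the corpus. As a rough illustration of what such a scorer computes, here is a simplified BM25 sketch with the usual default constants k1 and b; it mirrors the simall() interface but is not snownlp's actual implementation:

```python
import math

class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = docs
        self.k1, self.b = k1, b
        self.avgdl = sum(len(d) for d in docs) / len(docs)  # average doc length
        df = {}                                             # document frequency
        for doc in docs:
            for word in set(doc):
                df[word] = df.get(word, 0) + 1
        # idf can go negative for words that appear in most documents,
        # which explains the negative similarities in the output below
        self.idf = {w: math.log(len(docs) - f + 0.5) - math.log(f + 0.5)
                    for w, f in df.items()}

    def sim(self, query, index):
        doc = self.docs[index]
        score = 0.0
        for word in query:
            if word not in doc:
                continue
            tf = doc.count(word)
            score += self.idf[word] * tf * (self.k1 + 1) / (
                tf + self.k1 * (1 - self.b + self.b * len(doc) / self.avgdl))
        return score

    def simall(self, query):
        # score the query sentence against every document/sentence
        return [self.sim(query, i) for i in range(len(self.docs))]

docs = [['natural', 'language', 'processing'],
        ['computer', 'vision'],
        ['speech', 'audio']]
scores = BM25(docs).simall(['language'])
print(scores)
```

A sentence scores positively only against sentences sharing rare words with it, which is exactly what lets TextRank spread weight toward topically central sentences.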
self.weight
Each sentence's similarity to every other sentence:
[[10.011936342719583, 0.0, 0.9413276860939246, 0, 2.5208967587765487, -0.42128772462816594, 0, 0, -0.41117681923708993, 0.0, 0, 1.7776032315706807], [0.0, 7.362470286473312, -0.1203528298812426, 0, 0.3208550531092889, -0.42128772462816594, 0.8638723295025715, 0, 0.16249889568919007, 1.1941879837414735, 0, 0.42128772462816594], [1.071635908557153, -0.2234656626288532, 7.174478670010185, 0, -0.6108314377461014, -0.8425754492563319, 0, 0, -0.8223536384741799, -0.258091618961588, 0, 3.1339187385131955], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1.256853150903099, 0.23476226867251077, -0.32977059288034255, 0, 3.7397693866902904, -0.42128772462816594, 0.8638723295025715, 0, 0.16249889568919007, -0.258091618961588, 0, 0], [-0.197031608341979, -0.2234656626288532, -0.32977059288034255, 0, -0.3054157188730507, 2.3454371414406934, 0, 0, -0.41117681923708993, -0.258091618961588, 0, 0], [0, 0.45822793130136397, 0, 0, 0.6262707719823396, 0, 3.630597195571431, 0, 0.57367571492628, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 3.1672694847490694, 0, 0, 0, 0], [-0.394063216683958, 0.011296606043657564, -0.6595411857606851, 0, 0.015439334236238222, -0.8425754492563319, 0.8638723295025715, 0, 1.5886339237482168, -0.516183237923176, 0, 0], [0.0, 1.0339739436867168, -0.1203528298812426, 0, -0.3054157188730507, -0.42128772462816594, 0, 0, -0.41117681923708993, 5.778308586740918, 1.730455352739724, 0.42128772462816594], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1.1941879837414735, 6.642686196030381, 0], [0.8313653667915449, 0.2234656626288532, 1.271098278974267, 0, 0, 0, 0, 0, 0, 0.258091618961588, 0, 1.7776032315706807]]
self.weight_sum
The sum of each sentence's similarities to all the other sentences:
[4.407363132575895, 2.421061432161282, 1.4482368400032932, 0, 1.5088367082972747, -1.7249520209229035, 1.658174418209983, 0.0, -1.5217548198416835, 1.9274839284350573, 1.1941879837414735, 2.584020927356253]
(j, m)
Weight changes of the first two sentences during the first iteration:
1 [0.15000000000000002]
2 [0.7789651644764877]
4 [1.4870107550912874]
5 [1.584101494462158]
6 [1.584101494462158]
8 [1.8042116789767775]
9 [1.8042116789767775]
10 [1.8042116789767775]
11 [2.0776849137681874]
0 [2.0776849137681874, 0.15000000000000002]
2 [2.0776849137681874, 0.018843404622897686]
4 [2.0776849137681874, 0.15109623706944147]
5 [2.0776849137681874, 0.261212814765845]
6 [2.0776849137681874, 0.4961059221065203]
8 [2.0776849137681874, 0.4897960257868833]
9 [2.0776849137681874, 0.9457675849621023]
10 [2.0776849137681874, 0.9457675849621023]
11 [2.0776849137681874, 1.0192754312893615]
How m (the weight of every sentence) changes across iterations.
Over 200 iterations, the first five and the last five values of m; by the end, m has clearly converged:
[2.0776849137681874, 1.0192754312893615, 0.999457826141497, 0.15000000000000002, 0.7185396874241888, -0.5261633807600671, 0.4574244460937142, 0.15000000000000002, 0.05200189320790127, 1.6227868709805937, 0.9131124846903355, 2.66587982716429]
[1.9767903479098448, 1.0990295797831187, 1.3128224919934568, 0.15000000000000002, 0.7652761963157931, -1.111371191008174, 0.7837318722726239, 0.15000000000000002, -0.6395683714253901, 1.2720237049753234, 1.3883689212368555, 3.1528964479465524]
[2.131123478696624, 0.9565423086380485, 1.1548328753945554, 0.15000000000000002, 0.6827525917271398, -1.5413388479058974, 1.1643685871601586, 0.15000000000000002, -0.7329978403690465, 1.4226914336015335, 1.1206971700887252, 3.6413282429681653]
[2.0445854067537668, 1.1136053809668183, 1.2961406802982383, 0.15000000000000002, 0.8363765234805878, -1.507050641192699, 1.12607464161547, 0.15000000000000002, -0.6871676552565422, 1.1312077269369323, 1.2356735948433215, 3.4105543415541146]
[2.192542113515565, 0.9600086901987991, 1.1866885268412732, 0.15000000000000002, 0.7930661765454192, -1.553868352553225, 1.2263591164249343, 0.15000000000000002, -0.6769452402058755, 1.2490292917375383, 1.0132387392037487, 3.6098809382918335]
...
[3.0660780434944765, 0.4978862574608699, 1.8170234457076675, 0.15000000000000002, 1.2100915598373658, -1.7210905907446725, 1.273125426875697, 0.15000000000000002, -0.7943466908131793, -0.05874055540868156, 0.10506476396899506, 4.604908339621487]
[3.0670178684613028, 0.49797632644810413, 1.8169978091525545, 0.15000000000000002, 1.2097806282899073, -1.7214471894836905, 1.2732051380734033, 0.15000000000000002, -0.7946293748838906, -0.06010081416007357, 0.10517434880999071, 4.606025259292418]
[3.0669901938072566, 0.4973816702146632, 1.81759950114689, 0.15000000000000002, 1.2104144540568738, -1.721330726750902, 1.2732175406130832, 0.15000000000000002, -0.7945170165565797, -0.05995284501299358, 0.10413631837439419, 4.60606091010734]
[3.0678632028596278, 0.4974716987107603, 1.817569225028541, 0.15000000000000002, 1.2101189112709647, -1.721663112850048, 1.2732914272902789, 0.15000000000000002, -0.7947807217322551, -0.06121755331865397, 0.1042492354778799, 4.607097687262931]
[3.067828117464969, 0.49691854169441674, 1.8181282756563857, 0.15000000000000002, 1.2107106526021962, -1.7215513928316113, 1.2733021487789122, 0.15000000000000002, -0.7946735518644014, -0.06106655593120924, 0.1032841207803389, 4.607119643650029]
Summary (top three sentences)
Result when segmenting with jieba:
因而它是計算機科學的一部分
自然語言處理是計算機科學領域與人工智能領域中的一個重要方向
自然語言處理是一門融語言學、計算機科學、數學于一體的科學
Full code
https://github.com/jllan/jannlp/blob/master/summary/textrank.py