A Sentence Similarity Algorithm Based on Semantic Nets and Corpus Statistics

I recently came across an interesting paper, "Sentence Similarity Based on Semantic Nets and Corpus Statistics", and happened to have a similar need at the time, so I implemented the algorithm it describes.
My implementation uses Python 3 and the Natural Language Toolkit (NLTK), since NLTK ships with both WordNet and the Brown Corpus, which the algorithm relies on. Here is the code:
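For reference, the paper combines two components into the final score: a semantic similarity S_s between two semantic vectors and a word-order similarity S_r between two word-order vectors, mixed by a weighting factor δ (0.8 in the code below, via `semantic_and_word_order_factor`):

```latex
S(T_1, T_2) = \delta\, S_s + (1 - \delta)\, S_r,
\qquad
S_s = \frac{\mathbf{s}_1 \cdot \mathbf{s}_2}{\lVert \mathbf{s}_1 \rVert\, \lVert \mathbf{s}_2 \rVert},
\qquad
S_r = 1 - \frac{\lVert \mathbf{r}_1 - \mathbf{r}_2 \rVert}{\lVert \mathbf{r}_1 + \mathbf{r}_2 \rVert}
```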

from math import e,log,sqrt

import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import brown

corpus = []  # all tokens from the Brown corpus
for i in brown.categories():
    corpus.extend(brown.words(categories=i))

word_buff = {}  # cache: word -> its WordNet synsets

threshold = 0.25                        # minimum word-similarity threshold
semantic_and_word_order_factor = 0.8    # weight of semantic vs. word-order similarity


def get_min_path_distance_and_subsumer_between_two_words(word1, word2):
    """
    Return the shortest WordNet path length between two words, and the
    depth of the subsumer (lowest common hypernym) of the closest synset pair.
    """
    if word1 in word_buff:
        word1_synsets = word_buff[word1]
    else:
        word1_synsets = wn.synsets(word1)
        word_buff[word1] = word1_synsets
    if word2 in word_buff:
        word2_synsets = word_buff[word2]
    else:
        word2_synsets = wn.synsets(word2)
        word_buff[word2] = word2_synsets
    if not word1_synsets or not word2_synsets:
        return None, 0
    min_distance = float('inf')
    min_pairs = None
    for word1_synset in word1_synsets:
        for word2_synset in word2_synsets:
            distance = word1_synset.shortest_path_distance(word2_synset)
            # distance may be None (no path) or 0 (same synset)
            if distance is not None and distance < min_distance:
                min_distance = distance
                min_pairs = (word1_synset, word2_synset)
    subsumer_depth = 0
    if min_pairs:
        # lowest common hypernym of the closest pair (both synsets, not the
        # first one twice); there may be several, any of them works here
        subsumer = min_pairs[0].lowest_common_hypernyms(min_pairs[1])
        if subsumer:
            subsumer_depth = subsumer[0].min_depth()
    else:
        min_distance = None
    return min_distance, subsumer_depth


def similarity_between_two_words(word1, word2, length_factor=0.2, depth_factor=0.45):
    # word similarity: path-length term times depth term
    length, subsumer_depth = get_min_path_distance_and_subsumer_between_two_words(word1, word2)
    if length is None:
        return 0
    function_length = e ** -(length_factor * length)
    temp1 = e ** (depth_factor * subsumer_depth)
    temp2 = e ** -(depth_factor * subsumer_depth)
    function_depth = (temp1 - temp2) / (temp1 + temp2)
    return function_length * function_depth
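The function above computes the paper's word-similarity measure with the defaults α = 0.2 and β = 0.45 (`length_factor` and `depth_factor`):

```latex
s(w_1, w_2) = e^{-\alpha l} \cdot \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}}
```

where l is the shortest path length between the two words in WordNet and h is the depth of their lowest common hypernym; the second factor is simply tanh(βh), so deeper (more specific) subsumers push the similarity toward the path-length term alone.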


corpus_counts = {}  # word -> frequency, cached so we don't re-scan the corpus per call

def get_information_content(word, corpus):
    # information content: I(w) = 1 - log(n + 1) / log(N + 1)
    if not corpus_counts:
        for w in corpus:
            corpus_counts[w] = corpus_counts.get(w, 0) + 1
    n = corpus_counts.get(word, 0)
    N = len(corpus)
    return 1 - (log(n + 1) / log(N + 1))
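A quick sanity check of the information-content formula, using a made-up toy corpus size rather than Brown counts: frequent words get low information content, rare words get high information content.

```python
from math import log

def information_content(n, N):
    # I(w) = 1 - log(n + 1) / log(N + 1)
    return 1 - log(n + 1) / log(N + 1)

# toy corpus of N = 99 tokens: a frequent word (n = 20) vs. a rare one (n = 1)
print(round(information_content(20, 99), 3))  # → 0.339
print(round(information_content(1, 99), 3))   # → 0.849
```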


def word_order_vector(word_vector, joint_words):
    # Word-order vector: for each word in the joint word set, the position of
    # that word in the sentence; otherwise the position of the most similar
    # word in the sentence (if its similarity exceeds the threshold); else 0.
    res = []
    for word in joint_words:
        if word in word_vector:
            res.append(word_vector.index(word) + 1)
        else:
            max_similarity_word = None
            max_similarity = threshold
            for t_word in word_vector:
                current_similarity = similarity_between_two_words(word, t_word)
                if current_similarity > max_similarity:
                    max_similarity = current_similarity
                    max_similarity_word = t_word
            if max_similarity_word is not None:
                res.append(word_vector.index(max_similarity_word) + 1)
            else:
                res.append(0)
    return res
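The word-order vectors are later compared with S_r = 1 − ‖r1 − r2‖ / ‖r1 + r2‖. A minimal standalone sketch with hand-made vectors (not produced by the functions above) shows how swapping two words lowers the score:

```python
from math import sqrt

def word_order_similarity(r1, r2):
    # S_r = 1 - ||r1 - r2|| / ||r1 + r2||
    diff = sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    total = sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    return 1 - diff / total

print(word_order_similarity([1, 2, 3], [1, 2, 3]))            # identical order → 1.0
print(round(word_order_similarity([1, 2, 3], [1, 3, 2]), 3))  # two words swapped → 0.808
```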


def semantic_vector(word_vector,joint_words):
    res = []
    for word in joint_words:
        i_w1 = get_information_content(word, corpus)
        if word in word_vector:
            res.append(i_w1 * i_w1)
        else:
            # search word_vector for the word most similar to `word`
            max_similarity_word = None
            max_similarity = -1
            for t1_word in word_vector:
                current_similarity = similarity_between_two_words(word, t1_word)
                if current_similarity > threshold and current_similarity > max_similarity:
                    max_similarity = current_similarity
                    max_similarity_word = t1_word
            if max_similarity != -1:
                i_w2 = get_information_content(max_similarity_word, corpus)
                res.append(max_similarity * i_w1 * i_w2)
            else:
                res.append(0)
    return res
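The two semantic vectors are then compared with cosine similarity. A self-contained sketch with illustrative values (made up, not real information-content weights):

```python
from math import sqrt

def cosine_similarity(v1, v2):
    # cosine similarity: dot(v1, v2) / (||v1|| * ||v2||)
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = sqrt(sum(a * a for a in v1)) * sqrt(sum(b * b for b in v2))
    return dot / norm

print(round(cosine_similarity([1.0, 0.5, 0.0], [1.0, 0.0, 0.5]), 3))  # → 0.8
```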


def sentence_similarity(sentence1, sentence2):
    words_1 = nltk.word_tokenize(sentence1)
    words_2 = nltk.word_tokenize(sentence2)
    if not words_1 or not words_2:
        return 0
    joint_words = list(set(words_1 + words_2))
    semantic_vector1, semantic_vector2 = semantic_vector(words_1, joint_words), semantic_vector(words_2, joint_words)
    word_order1, word_order2 = word_order_vector(words_1, joint_words), word_order_vector(words_2, joint_words)
    # semantic similarity: cosine of the two semantic vectors
    s_s = sum(map(lambda x: x[0] * x[1], zip(semantic_vector1, semantic_vector2))) / sqrt(
        sum(map(lambda x: x ** 2, semantic_vector1)) * sum(map(lambda x: x ** 2, semantic_vector2)))
    # word-order similarity: S_r = 1 - ||r1 - r2|| / ||r1 + r2||
    s_r = 1 - sqrt(sum(map(lambda x: (x[0] - x[1]) ** 2, zip(word_order1, word_order2)))) / sqrt(
        sum(map(lambda x: (x[0] + x[1]) ** 2, zip(word_order1, word_order2))))
    similarity = semantic_and_word_order_factor * s_s + (1 - semantic_and_word_order_factor) * s_r
    print(sentence1, '|', sentence2, ':', similarity)
    return similarity

一些測(cè)試:

What is the step by step guide to invest in share market in india?  |  What is the step by step guide to invest in share market? : 0.6834055667921426
What is the story of Kohinoor (Koh-i-Noor) Diamond?  |  What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back? : 0.7238159709057276
How can I increase the speed of my internet connection while using a VPN?  |  How can Internet speed be increased by hacking through DNS? : 0.3474180327786902
Why am I mentally very lonely? How can I solve it?  |  Find the remainder when [math]23^{24}[/math] is divided by 24,23? : 0.24185376358110777
Which one dissolve in water quikly sugar, salt, methane and carbon di oxide?  |  Which fish would survive in salt water? : 0.5557426453712866
Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?  |  I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me? : 0.5619685362853818
Should I buy tiago?  |  What keeps childern active and far from phone and video games? : 0.273650666926712
How can I be a good geologist?  |  What should I do to be a great geologist? : 0.7444940225200597
When do you use シ instead of し?  |  When do you use "&" instead of "and"? : 0.33368722311749527
Motorola (company): Can I hack my Charter Motorolla DCX3400?  |  How do I hack Motorola DCX3400 for free internet? : 0.679325702169737
Method to find separation of slits using fresnel biprism?  |  What are some of the things technicians can tell about the durability and reliability of Laptops and its components? : 0.42371839556731794
How do I read and find my YouTube comments?  |  How can I see all my Youtube comments? : 0.39666438912838764
What can make Physics easy to learn?  |  How can you make physics easy to learn? : 0.7470727852312119
What was your first sexual experience like?  |  What was your first sexual experience? : 0.7939444688772478
What are the laws to change your status from a student visa to a green card in the US, how do they compare to the immigration laws in Canada?  |  What are the laws to change your status from a student visa to a green card in the US? How do they compare to the immigration laws in Japan? : 0.7893963850595556
What would a Trump presidency mean for current international master’s students on an F1 visa?  |  How will a Trump presidency affect the students presently in US or planning to study in US? : 0.4490581992952136
What does manipulation mean?  |  What does manipulation means? : 0.8021629585217567
Why do girls want to be friends with the guy they reject?  |  How do guys feel after rejecting a girl? : 0.6173692627635123
Why are so many Quora users posting questions that are readily answered on Google?  |  Why do people ask Quora questions which can be answered easily by Google? : 0.6794045129534761
Which is the best digital marketing institution in banglore?  |  Which is the best digital marketing institute in Pune? : 0.5332225611879753
Why do rockets look white?  |  Why are rockets and boosters painted white? : 0.7624609655280314