1. Load the vocabulary
# Vocabulary: the set of known correct words
vocab = set([line.rstrip() for line in open('vocab.txt')])
print(vocab)
Output:
{'Rousseau', 'capsules', 'penetrated', 'predicting', 'unmeshed', 'Epstein', 'Eduardo', 'timetables', 'mahogany', 'catalog', 'Sodium', 'distortion', 'gilded', 'urinals', 'gagwriters', 'Fires', 'against', 'banner', 'Summerspace', 'apartment', 'conjure', '72.6', 'morticians', ...}
2. Generate the candidate set
Candidate set: the erroneous inputs a correct word may give rise to (equivalently, the correct words a given misspelled input may correspond to).
Edit distance: the number of single-character insertions, deletions, and substitutions it takes to turn a string (the erroneous input) into the corresponding correct word.
Note: the code below builds the model with edit distance 1, i.e. a single operation turns the input into another word.
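For reference, edit distance itself can be computed with the standard dynamic-programming recurrence. This is a minimal illustrative sketch; the model below never calls it, since it generates the distance-1 neighbourhood directly:
def edit_distance(s, t):
    # Levenshtein distance between s and t (insert / delete / replace)
    dp = list(range(len(t) + 1))  # dp[j] = distance from '' to t[:j]
    for i in range(1, len(s) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(t) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # delete s[i-1]
                        dp[j - 1] + 1,                  # insert t[j-1]
                        prev + (s[i - 1] != t[j - 1]))  # replace
            prev = cur
    return dp[-1]

edit_distance('appl', 'apple')  # 1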
# Generate the full candidate set for an input
def generate_candidates(word):
    """
    word: the given input (a misspelled word)
    Returns all valid candidates (those that exist in the vocabulary).
    """
    # Generate every string at edit distance 1:
    # 1. insert  2. delete  3. replace
    # appl -> replace: bppl, cppl, aapl, abpl...
    #         insert:  bappl, cappl, abppl, acppl...
    #         delete:  ppl, apl, app
    # Assume an alphabet of the 26 lowercase letters
    letters = 'abcdefghijklmnopqrstuvwxyz'
    # splits enumerates every cut point, so insertion, deletion and replacement
    # can be applied at each position:
    # word[:i] is the prefix (characters 0..i-1), word[i:] the suffix (from i on)
    splits = [(word[:i], word[i:]) for i in range(len(word)+1)]
    # insert: L = left part, c = inserted character (each of a-z), R = right part
    inserts = [L+c+R for L, R in splits for c in letters]
    # delete: L plus R without its first character (skip splits where R is empty)
    deletes = [L+R[1:] for L, R in splits if R]
    # replace: L, then c (each letter a-z), then R from its second character on
    replaces = [L+c+R[1:] for L, R in splits if R for c in letters]
    candidates = set(inserts + deletes + replaces)
    # Filter out everything that is not in the vocabulary
    return [word for word in candidates if word in vocab]
generate_candidates("apple")
Output:
['apple', 'ample', 'apply', 'apples']
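The TODO in step 6 suggests widening the search when no distance-1 candidate survives the vocabulary filter. A hypothetical extension (not part of the original code) applies the same three operations twice and filters at the end; note that the distance-2 set is roughly the square of the distance-1 set, so in practice it is generated only on demand:
def generate_candidates_dist2(word):
    # Sketch: vocabulary words within edit distance 2 of the input
    letters = 'abcdefghijklmnopqrstuvwxyz'
    def edits1(w):
        splits = [(w[:i], w[i:]) for i in range(len(w) + 1)]
        inserts = [L + c + R for L, R in splits for c in letters]
        deletes = [L + R[1:] for L, R in splits if R]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        return set(inserts + deletes + replaces)
    dist1 = edits1(word)
    dist2 = set(e2 for e1 in dist1 for e2 in edits1(e1))
    return [w for w in dist1 | dist2 if w in vocab]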
3. Load the corpus
from nltk.corpus import reuters
# Load the Reuters corpus
categories = reuters.categories()
corpus = reuters.sents(categories=categories)
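If the Reuters data has not been downloaded before, nltk raises a LookupError on first access; it can be fetched once with the standard downloader:
import nltk
nltk.download('reuters')  # one-time download of the corpus data
Each element of corpus is one tokenized sentence (a list of word strings), which is exactly the shape the counting loop in step 4 expects.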
Format of spell-errors.txt (parsed in step 5):
raining: rainning, raning
writings: writtings
disparagingly: disparingly
yellow: yello
4. Build the language model
Note: the language model below is a bigram model; it only considers the relation between the i-th word and the (i+1)-th word.
# Build the language model: bigram
term_count = {}
bigram_count = {}
for doc in corpus:
    # Prepend '<s>' so the first word also has a left neighbour and every
    # position is handled uniformly; a trigram model would prepend two '<s>'.
    doc = ['<s>'] + doc
    for i in range(0, len(doc) - 1):  # walk over the words of the sentence
        # bigram: [i, i+1]
        term = doc[i]           # the i-th word
        bigram = doc[i:i+2]     # the i-th and (i+1)-th words
        # Update the counts
        if term in term_count:        # seen before
            term_count[term] += 1
        else:
            term_count[term] = 1      # first occurrence
        bigram = ' '.join(bigram)
        if bigram in bigram_count:    # how often the two words appear together
            bigram_count[bigram] += 1
        else:
            bigram_count[bigram] = 1
# sklearn also ships ready-made tools for this kind of n-gram counting
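The same counts can be collected more compactly with collections.Counter. The sketch below reproduces the loop above exactly, including the detail that a sentence-final word is never added to term_count, so term_count[w] is the number of times w starts a bigram, which is precisely the denominator the smoothing in step 6 needs. (The sklearn remark above refers to ready-made n-gram counters such as CountVectorizer with ngram_range=(2, 2).)
from collections import Counter

term_count2 = Counter()
bigram_count2 = Counter()
for doc in corpus:
    doc = ['<s>'] + list(doc)
    term_count2.update(doc[:-1])  # every word that starts a bigram
    bigram_count2.update(' '.join(doc[i:i+2]) for i in range(len(doc) - 1))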
5. Estimate the probability of each misspelling from the data
# Statistics of user typing errors - channel probability
channel_prob = {}
# Read each correct word and its observed misspellings
for line in open('spell-errors.txt'):
    items = line.split(":")
    correct = items[0].strip()
    mistakes = [item.strip() for item in items[1].strip().split(",")]
    channel_prob[correct] = {}
    for mis in mistakes:
        # distribute the probability mass uniformly over the misspellings
        channel_prob[correct][mis] = 1.0 / len(mistakes)
print(channel_prob)
Output:
{'raining': {'rainning': 0.5, 'raning': 0.5}, 'writings': {'writtings': 1.0}, 'disparagingly': {'disparingly': 1.0}, 'yellow': {'yello': 1.0}, ...}
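A small usage sketch (the helper name is hypothetical): p(mistake | correct) is read from the table when the pair was observed, and otherwise falls back to the same small constant 0.0001 used in step 6:
def channel(correct, mistake):
    # p(mistake | correct), with the step-6 fallback for unseen pairs
    return channel_prob.get(correct, {}).get(mistake, 0.0001)

channel('raining', 'raning')   # 0.5
channel('raining', 'ranning')  # 0.0001 (pair never observed)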
6. Load the test data and run the spelling correction
Data format:
1 1 They told Reuter correspondents in Asian capitals a U.S. Move against Japan might boost protectionst sentiment in the U.S. And lead to curbs on American imports of their products.
Implementation:
import numpy as np

V = len(term_count.keys())
file = open("testdata.txt", 'r')
for line in file:
    # Three tab-separated fields per line:
    # items[0] sentence id, items[1] number of misspelled words, items[2] the sentence
    items = line.rstrip().split('\t')
    line = items[2].split()
    # line = ["I", "like", "playing"]
    for idx, word in enumerate(line):
        if word not in vocab:
            # word needs to be replaced by a correct one
            # Step 1: generate all (valid) candidates
            candidates = generate_candidates(word)
            # One option when candidates == []: generate more candidates,
            # e.g. everything within edit distance <= 2
            # TODO: generate a larger candidate set in that case
            # (e.g. the distance-2 sketch in step 2)
            if len(candidates) < 1:
                continue  # not recommended (simply skipping is not correct)
            probs = []
            # For each candidate, compute its score:
            # score = p(correct) * p(mistake|correct)
            #       = log p(correct) + log p(mistake|correct)
            # and return the candidate with the highest score
            for candi in candidates:
                prob = 0
                # a. channel probability
                if candi in channel_prob and word in channel_prob[candi]:
                    # the probability estimated in step 5
                    prob += np.log(channel_prob[candi][word])
                else:
                    prob += np.log(0.0001)  # small non-zero value for unseen pairs
                # b. language-model probability of the bigram [previous word, candidate]
                prev_word = '<s>' if idx == 0 else line[idx - 1]
                bigram = prev_word + ' ' + candi
                # Does this bigram exist in the language model?
                if bigram in bigram_count:
                    # add-one smoothing: (count(prev, candi) + 1) / (count(prev) + V)
                    prob += np.log((bigram_count[bigram] + 1.0) / (term_count[prev_word] + V))
                    # TODO: also score the following bigram [candidate, next word]
                    # prob += np.log(that bigram's probability)
                else:
                    # no: assign a small non-zero probability
                    prob += np.log(1.0 / V)
                probs.append(prob)
            max_idx = probs.index(max(probs))
            print(word, candidates[max_idx])
Output: since punctuation is not stripped from the sentences, tokens carrying punctuation are also flagged as spelling errors.
protectionst protectionist
products. products
long-run, long-run
gain. gain
17, 17e
retaiation retaliation
cost. cost
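A possible fix for the punctuation issue noted above (an assumption, not part of the original code): strip surrounding punctuation before the vocabulary lookup, so that 'products.' is checked as 'products':
import string

def clean_token(token):
    # remove leading/trailing punctuation only; inner characters such as the
    # hyphen in 'long-run' survive
    return token.strip(string.punctuation)

clean_token('products.')  # 'products'
clean_token('long-run,')  # 'long-run'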