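Generating text usually starts from an existing training set, or seed, that tells the program how words are distributed and how they are typically used. Below is a very basic text generation function, built on nltk's conditional frequency distribution: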
import nltk

def generate_model(cfd, word, num=15):
    for i in range(num):
        print(word)  # output the current word
        word = cfd[word].max()  # the word most likely to follow it; it replaces the current word in the next iteration

text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)
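To see what the generation loop does without downloading a corpus, here is a minimal sketch on a toy sentence. It uses only the standard library; the helper names `build_successors` and `generate_greedy` are made up for this illustration and are not part of nltk, and the function returns the words as a list instead of printing them.

```python
from collections import Counter, defaultdict

def build_successors(words):
    """Map each word to a Counter of the words that directly follow it."""
    successors = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        successors[current][following] += 1
    return successors

def generate_greedy(successors, word, num=15):
    """Follow the most frequent successor at each step, similar to cfd[word].max()."""
    output = []
    for _ in range(num):
        output.append(word)
        if not successors[word]:  # dead end: nothing ever followed this word
            break
        word = successors[word].most_common(1)[0][0]
    return output

toy = "the cat sat on the mat and the cat slept".split()
successors = build_successors(toy)
print(generate_greedy(successors, "on", num=3))  # ['on', 'the', 'cat']
```

Unlike the nltk version, this sketch stops early when a word has no recorded successor.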
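I want to look at nltk.ConditionalFreqDist in some detail here, because I find it remarkably useful. It is a collection of frequency distributions. For example, if we want the frequency distribution of words in news text and in romance fiction, then "news" and "romance" are the two conditions, and the words are the events we observe. A ConditionalFreqDist takes a collection of (condition, event) pairs as its argument, for example:
cfd = nltk.ConditionalFreqDist(pairs)  # pairs is an iterable of (condition, event) tuples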
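As a rough mental model (an assumption for illustration, not nltk's actual implementation), a conditional frequency distribution behaves like a dictionary that maps each condition to its own counter of events. The sketch below builds one from (condition, event) pairs with the standard library; `conditional_counts` is a hypothetical helper name and the pairs are toy data.

```python
from collections import Counter, defaultdict

def conditional_counts(pairs):
    """Group (condition, event) pairs into one Counter of events per condition."""
    table = defaultdict(Counter)
    for condition, event in pairs:
        table[condition][event] += 1
    return table

pairs = [
    ("news", "four"), ("news", "the"), ("news", "four"),
    ("romance", "love"), ("romance", "the"),
]
table = conditional_counts(pairs)
print(table["news"]["four"])     # 2
print(table["romance"]["four"])  # 0: a Counter returns 0 for unseen events
```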
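As a concrete example, suppose we want the word frequency distributions of the news and romance genres in the brown corpus. We can use the following code:
import nltk
from nltk.corpus import brown

# First build the collection of (condition, event) pairs:
genre_word = [(genre, word) for genre in ['news', 'romance'] for word in brown.words(categories=genre)]
# Build the conditional frequency distribution:
cfdist = nltk.ConditionalFreqDist(genre_word)
# The result is effectively a defaultdict of FreqDist counters keyed by the conditions "news" and "romance". Part of the output: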
#defaultdict(nltk.probability.FreqDist,
{'news': Counter({u'sunbonnet': 1,
u'Elevated': 1,
u'narcotic': 2,
u'four': 73,
u'woods': 4,
u'railing': 1,
u'Until': 5,
# We can slice this result further:
news = cfdist['news']
<FreqDist with 14394 samples and 100554 outcomes>
news_four = cfdist['news']['four']  # cfdist[condition][event]
Out[39]: 73
Beyond that, we can also present cfdist in other forms, for example with tabulate or plot:
In[44]: cfdist.tabulate(conditions = ['news'],samples = ['four'])
four
news 73
In[45]: cfdist.tabulate(samples = ['four'])
four
news 73
romance 8
In[46]: cfdist.tabulate(samples = ['I','love','you'])
           I love  you
   news  179    3   55
romance  951   32  456
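# We can also have it show proportions instead of counts.
# Note: a plain `cfdist_copy = cfdist` only creates a second name for the same object,
# so we fill a new ConditionalFreqDist and leave the counts in cfdist untouched.
cfdist_copy = nltk.ConditionalFreqDist()
total_news = cfdist['news'].N()
total_romance = cfdist['romance'].N()
for i in cfdist['news']:
    cfdist_copy['news'][i] = float(cfdist['news'][i]) / total_news
for j in cfdist['romance']:
    cfdist_copy['romance'][j] = float(cfdist['romance'][j]) / total_romance
print(cfdist_copy['romance']['I'])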
Out[78]: 0.013581445831310159
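The same normalization can be written so that it never mutates the counts. Here is a minimal sketch with made-up numbers, assuming Python 3 and using only the standard library (`to_proportions` is a hypothetical helper). With nltk itself, `FreqDist.freq(sample)` returns the relative frequency of a single sample directly.

```python
from collections import Counter

def to_proportions(counts):
    """Return a new dict of relative frequencies; the input Counter is left unchanged."""
    total = sum(counts.values())
    return {event: count / total for event, count in counts.items()}

romance_counts = Counter({"I": 3, "love": 1, "you": 4})  # toy numbers, not Brown corpus counts
proportions = to_proportions(romance_counts)
print(proportions["I"])     # 0.375
print(romance_counts["I"])  # 3, the original counts are untouched
```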
We can also plot the results to make them easier to grasp:
cfdist.plot(samples=['I', 'love', 'you'])
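Next, we can use a CFD for something more fun, such as generating text automatically, as in the example at the start of this post. Here we use the romance category to build a more entertaining "computer-written romance novel":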
import nltk
from nltk.corpus import brown

def generate_romance(rcfdist, word, num=100):
    for i in range(num):
        print(word)
        word = rcfdist[word].max()

# keep only purely alphabetic tokens, dropping punctuation and numbers
refined = [w for w in brown.words(categories='romance') if w.isalpha()]
bigrams = nltk.bigrams(refined)
rcfdist = nltk.ConditionalFreqDist(bigrams)
generate_romance(rcfdist, 'love')
output:
love
you
have
to
the
same
time
to
the
same
time
to
the
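As you can see, this program has a serious flaw: once the most frequent bigrams form a fixed cycle (here "to the same time"), the program keeps going around that cycle without end. Even so, it is an instructive example of putting conditional frequency distributions to work.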
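One possible way to break such loops, offered as a sketch and not as the method used above, is to sample the next word at random in proportion to how often it followed the current one, instead of always taking the most frequent successor. The example uses a toy word list and the standard library only; `generate_sampled` and its `seed` parameter are hypothetical names introduced here.

```python
import random
from collections import Counter, defaultdict

def generate_sampled(words, start, num=10, seed=0):
    """Generate up to `num` words, sampling each successor by its bigram frequency."""
    successors = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        successors[current][following] += 1

    rng = random.Random(seed)  # fixed seed so repeated runs give the same text
    output, word = [], start
    for _ in range(num):
        output.append(word)
        options = successors.get(word)
        if not options:  # no known successor: stop instead of failing
            break
        candidates = list(options.keys())
        weights = list(options.values())
        word = rng.choices(candidates, weights=weights, k=1)[0]
    return output

toy = "to the same time to the end of the day to the same place".split()
print(generate_sampled(toy, "to", num=8))
```

Sampling does not guarantee cycle-free output, but it makes it much less likely that the text gets stuck repeating the same few words.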