These notes were written in IPython Notebook.
1. The basic skeleton: open a file for reading, open a file for writing
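```python
# in_file and out_file are path strings defined beforehand
for line in open(in_file):
    continue  # process each line here
out = open(out_file, 'w')
out.write('some text\n')  # write() needs a string argument
out.close()
```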
2. A rough template for simple word-frequency counting
```python
def count(in_file, out_file):
    # Read the file and count word frequencies
    word_count = {}  # dictionary mapping word -> frequency
    for line in open(in_file):
        words = line.strip().split(" ")
        for word in words:
            if word in word_count:
                word_count[word] += 1
            else:
                word_count[word] = 1
    out = open(out_file, 'w')  # open the output file
    for word in word_count:
        print('%s %d' % (word, word_count[word]))  # print each key and its value
        out.write(word + "--" + str(word_count[word]) + "\n")  # write to the file
    out.close()

# in_file and out_file are path strings defined beforehand
count(in_file, out_file)
```
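To see concretely what splitting on a single space produces, here is a small check on one sentence. The sample sentence and the names `sample`, `tokens`, and `naive_counts` are illustrative assumptions, not part of the template.

```python
# Illustrative only: how split(" ") tokenizes a sentence with punctuation.
sample = '"I will endeavor," said he, "and he will too."'

tokens = sample.strip().split(" ")

# Punctuation stays glued to the words, so 'he,' and 'he' end up as
# two different dictionary keys, and '"I' is counted instead of 'I'.
naive_counts = {}
for token in tokens:
    naive_counts[token] = naive_counts.get(token, 0) + 1

for token in tokens:
    print(repr(token))
```

Printing with `repr` makes the attached quotes and commas easy to spot.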
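For a long English text, this code separates words with split(" "), that is, on a single space, which is clearly not good enough. For example, in `"I will endeavor," said he,` the tokens `"I` and `he,` are each treated as a word, punctuation included. The code above is only meant to show the basic idea of counting word frequencies. Now consider the following exercise.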
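1. Copy a passage of English text from the web (the longer the better), paste it into input.txt, count the frequency (number of occurrences) of each word, and write the results to out.txt in order of frequency, with each line in the form "word:count".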
Using the template:
```python
# Count word frequencies and write them to a file in order of frequency
in_file = 'input_word.txt'
out_file = 'output_word.txt'

def count_word(in_file, out_file):
    word_count = {}  # dictionary mapping word -> frequency
    for line in open(in_file):
        words = line.strip().split(" ")
        for word in words:
            if word in word_count:
                word_count[word] += 1
            else:
                word_count[word] = 1
    out = open(out_file, 'w')
    # Visit the entries in order of frequency, most frequent first
    # (sorting the keys alone would give alphabetical order instead)
    for word, n in sorted(word_count.items(), key=lambda kv: kv[1], reverse=True):
        print('%s %d' % (word, n))
        out.write('%s:%d' % (word, n))
        out.write('\n')
    out.close()

count_word(in_file, out_file)
```
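The exercise asks for output in frequency order, which means sorting the dictionary by its values rather than its keys. Below is a minimal sketch of that step on its own. It assumes descending order is wanted (the exercise does not say which direction) and uses a small made-up dictionary in place of real counts.

```python
# Hypothetical counts, standing in for the word_count dictionary.
word_count = {'the': 5, 'cat': 2, 'sat': 2, 'mat': 1}

# Sort (word, count) pairs by count, largest first; ties are broken
# alphabetically so the result does not depend on dictionary order.
by_freq = sorted(word_count.items(), key=lambda kv: (-kv[1], kv[0]))

# Format each pair as "word:count", the line format the exercise asks for.
lines = ['%s:%d' % (word, n) for word, n in by_freq]
for line in lines:
    print(line)
```

Negating the count in the sort key is one way to get descending counts while keeping the tie-break ascending; `reverse=True` would flip both.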
The regular-expression approach:
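```python
import re

f = open('input_word.txt')
words = {}
rc = re.compile(r'\w+')  # a word is a run of letters, digits or underscores
for l in f:
    w_l = rc.findall(l)
    for w in w_l:
        if w in words:  # "in" replaces dict.has_key(), which Python 3 removed
            words[w] += 1
        else:
            words[w] = 1
f.close()

f = open('out.txt', 'w')
# Write the words in order of frequency, most frequent first
for k, n in sorted(words.items(), key=lambda kv: kv[1], reverse=True):
    print('%s %d' % (k, n))
    f.write('%s:%d' % (k, n))
    f.write('\n')
f.close()
```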
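For comparison, the standard library's `collections.Counter` can handle both the counting and the frequency sort. This is only a sketch of an alternative, not the method used above. The function name `count_words_in_text`, the lowercasing step, and the sample sentence are my own assumptions.

```python
import re
from collections import Counter

def count_words_in_text(text):
    """Return (word, count) pairs, most frequent first."""
    # \w+ matches runs of letters, digits and underscores, so the
    # surrounding punctuation is dropped. Lowercasing (an assumption,
    # not in the original code) merges "He" and "he".
    words = re.findall(r'\w+', text.lower())
    return Counter(words).most_common()

sample = '"I will endeavor," said he, "and he will too."'
pairs = count_words_in_text(sample)
for word, n in pairs:
    print('%s:%d' % (word, n))
```

To apply it to a file, one could pass in the result of reading the file's contents; whether lowercasing is appropriate depends on how the exercise defines a "word".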