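Problem 0006: You have a directory holding a month of your diary entries, all as txt files. To avoid word-segmentation issues, assume the content is entirely in English. For each entry, find the words you consider most important.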
Approach: use some of the methods from the collections library, which I covered in another article I just wrote, such as Counter() and most_common().
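As a quick illustration of those two calls, here is a minimal sketch. The word list is made up for the example and does not come from any diary file.

```python
from collections import Counter

# Made-up sample words, purely for illustration.
words = ["tea", "rain", "tea", "walk", "rain", "tea"]

# Counter tallies how many times each distinct item appears.
counts = Counter(words)

# most_common(n) returns the n most frequent (item, count) pairs,
# ordered from most to least frequent.
top_two = counts.most_common(2)
print(top_two)  # [('tea', 3), ('rain', 2)]
```

Calling most_common() with no argument returns every item in the same order.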
The code is as follows:
#!/usr/bin/env python
# coding=utf-8
import os
import re
from collections import Counter


def get_filepaths(directory):
    """Return the path of every file under directory, including subdirectories."""
    file_paths = []
    for root, directories, files in os.walk(directory):
        for filename in files:
            filepath = os.path.join(root, filename)
            file_paths.append(filepath)
    return file_paths


def counter_more_words(li):
    """Return the ten most frequent words in li, most frequent first."""
    word_dict = Counter(li)
    return [i[0] for i in word_dict.most_common(10)]


if __name__ == '__main__':
    for file in get_filepaths(r'C:\diaries'):
        with open(file, 'r') as f:
            # Raw string so \w is not treated as a string escape.
            word_li = re.findall(r"\w+", f.read())
        # Parenthesized so the call works on both Python 2 and Python 3.
        print(" ".join(counter_more_words(word_li)))
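One caveat: raw frequency tends to rank words like "the" and "I" at the top, and it counts "Rain" and "rain" separately. A possible refinement, sketched below, lowercases the text and skips common filler words before counting. The STOP_WORDS set and the important_words function are hypothetical names introduced for this example, and the stop-word list is deliberately tiny; this is one option, not part of the original solution.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch);
# a practical list would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "i", "it", "is", "was"}


def important_words(text, n=3):
    """Return the n most frequent words in text, ignoring case and stop words."""
    # Lowercase first so "Rain" and "rain" are counted together.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]


sample = "The rain was heavy. I walked in the rain and the rain stopped."
print(important_words(sample, 1))  # ['rain']
```

To use it with the script above, pass f.read() to important_words in place of the re.findall and counter_more_words calls.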