1. Problem
Problem 0004: Given any plain-text file in English, count how many times each word occurs in it.
2. Result
#------1.txt-----------
There are moments in life when you miss only
one life and one chance to do
you want to do.is
isn't don't word_d common
#------Output------------
do: 2
word_d: 1
want: 1
to: 2
is: 1
you: 2
isn't: 1
don't: 1
...
- Treat all words as lowercase
- Tokens such as isn't and word_d should each count as a single word
3. Implementation
# -*- coding:utf-8 -*-
import re


def get_word_dict(file_path=None):
    """Count how often each word occurs in the text file at file_path."""
    if file_path is None:
        print("Error")
        return
    word_dict = {}
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            # Lowercase the line, then match runs of letters, apostrophes,
            # underscores and hyphens that end at a word boundary.
            words = re.findall(r"[a-z\'_-]+\b", line.lower())
            for word in words:
                if word not in word_dict:
                    word_dict[word] = 1
                else:
                    word_dict[word] += 1
    # print() already appends a newline, so no "\n" is needed here.
    for word, count in word_dict.items():
        print("%s: %d" % (word, count))
    return word_dict


if __name__ == "__main__":
    get_word_dict("1.txt")
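The counting loop can also be written with collections.Counter from the standard library. The following is only a sketch of that alternative, not the approach used above: the name count_words and the inline sample string (copied from 1.txt) are assumptions made for illustration, while the regular expression is the one from the implementation.

```python
import re
from collections import Counter

# Same pattern as in the implementation above.
WORD_PATTERN = re.compile(r"[a-z'_-]+\b")


def count_words(text):
    """Return a Counter mapping each lowercase word in text to its frequency.

    count_words is a hypothetical helper name used only for this sketch.
    """
    return Counter(WORD_PATTERN.findall(text.lower()))


# Contents of 1.txt, inlined so the sketch runs without the file.
sample = (
    "There are moments in life when you miss only\n"
    "one life and one chance to do\n"
    "you want to do.is\n"
    "isn't don't word_d common\n"
)

counts = count_words(sample)
for word, count in counts.items():
    print("%s: %d" % (word, count))
```

Counter does the same bookkeeping as the if/else branch, so the trade-off is mostly brevity against having the counting logic spelled out.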
4. Problems solved
<i>I. Words such as isn't were not recognized</i>
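The regular expression needs a \b added to mark the word boundary. Because the character class also contains the apostrophe, isn't is matched as one token, and the trailing \b makes every match end on a letter or underscore.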
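To see what the character class and the trailing \b each contribute, the pattern can be tried on a short string. This is an illustrative sketch: the test string and the narrower comparison pattern are assumptions, not something taken from the original attempt.

```python
import re

line = "isn't well- word_d"

# Hypothetical narrower pattern for comparison: no apostrophe or hyphen
# in the character class, so "isn't" is split in two.
narrow = re.findall(r"[a-z_]+", line)

# The class from the implementation, first without and then with \b.
no_boundary = re.findall(r"[a-z'_-]+", line)
with_boundary = re.findall(r"[a-z'_-]+\b", line)

print(narrow)         # ['isn', 't', 'well', 'word_d']
print(no_boundary)    # ["isn't", 'well-', 'word_d']
print(with_boundary)  # ["isn't", 'well', 'word_d']
```

Without \b the dangling hyphen stays attached to well. With \b the match has to end right after a word character, so the regex engine backtracks and drops the hyphen.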
<i>II. Encoding error when reading the file</i>
Pass the encoding parameter to the open() function.
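When encoding is omitted, open() in text mode generally falls back to a platform-dependent default, which may not be UTF-8 on every system. A minimal sketch of reading with an explicit encoding follows. The temporary file and its contents are made up for the example, and it assumes the input file is actually stored as UTF-8.

```python
import os
import tempfile

# Write a small UTF-8 file to a temporary location (hypothetical sample data).
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write("caf\u00e9 word_d\n")
    path = tmp.name

try:
    # Naming the encoding makes the read independent of the platform default.
    with open(path, "r", encoding="utf-8") as file:
        text = file.read()
finally:
    os.remove(path)

# ascii() escapes non-ASCII characters, so printing works on any console.
print(ascii(text))
```

If the file is stored in a different encoding, that encoding's name would go in place of "utf-8".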