Download the data:
http://www.gutenberg.org/cache/epub/5200/pg5200.txt
Strip the Project Gutenberg header and footer so that the text begins with:
One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.
and ends with:
And, as if in confirmation of their new dreams and good intentions, as soon as they reached their destination Grete was the first to get up and stretch out her young body.
Save the result as: metamorphosis_clean.txt
Load the data:
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
1. Split on whitespace:
words = text.split()
print(words[:100])
# ['One', 'morning,', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', ...]
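A minimal, file-free illustration of this behaviour, using a shortened version of the opening line: whitespace splitting keeps punctuation attached to the neighbouring word.

```python
# str.split() with no argument splits on any run of whitespace,
# but punctuation stays glued to the adjacent word.
sample = 'One morning, when Gregor Samsa woke from troubled dreams.'
words = sample.split()
print(words)
# ['One', 'morning,', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams.']
```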
2. Split into words with re:
The difference from the previous method is that 'armour-like' is split into two words, 'armour' and 'like', and '"What's' becomes 'What', 's'.
import re
words = re.split(r'\W+', text)
print(words[:100])
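The same difference shows up on a short sample string. Note that re.split on \W+ also produces empty strings at the boundaries when the text starts or ends with a non-word character:

```python
import re

sample = '"What\'s happened?" It was armour-like.'
# \W+ splits on runs of non-word characters, so hyphens and
# apostrophes break words apart; boundary punctuation leaves ''.
words = re.split(r'\W+', sample)
print(words)
# ['', 'What', 's', 'happened', 'It', 'was', 'armour', 'like', '']
```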
3. Split on whitespace and strip punctuation:
string.punctuation in the string module lists which characters count as punctuation.
str.maketrans('', '', string.punctuation) builds a translation table whose third argument, string.punctuation, is the set of characters to delete,
and translate() applies that table to a string.
So 'armour-like' becomes 'armourlike' and '"What's' becomes 'Whats'.
words = text.split()
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in words]
print(stripped[:100])
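On a small sample, the table simply deletes every punctuation character from each whitespace-separated token:

```python
import string

# Empty first two arguments: nothing is remapped; the third
# argument lists characters to delete.
table = str.maketrans('', '', string.punctuation)
words = '"What\'s that?" It was armour-like.'.split()
stripped = [w.translate(table) for w in words]
print(stripped)
# ['Whats', 'that', 'It', 'was', 'armourlike']
```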
4. Convert everything to lowercase:
(For uppercase, use word.upper() instead.)
words = [word.lower() for word in words]
print(words[:100])
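A quick illustration of why this matters: lowercasing makes 'The' and 'the' count as the same word.

```python
words = ['One', 'morning', 'Gregor', 'Samsa', 'THE', 'the']
print([w.lower() for w in words])
# ['one', 'morning', 'gregor', 'samsa', 'the', 'the']
```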
Install NLTK:
After installing the package, run nltk.download(); in the dialog that opens, select 'all' and click Download.
import nltk
nltk.download()
5. Split into sentences:
This uses sent_tokenize().
from nltk import sent_tokenize
sentences = sent_tokenize(text)
print(sentences[0])
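sent_tokenize relies on the downloaded Punkt model. A crude stdlib approximation, splitting after sentence-ending punctuation followed by whitespace, shows the basic idea; the real tokenizer also handles abbreviations, quotes, and other edge cases:

```python
import re

text = 'One morning, Gregor woke. He found himself transformed! What had happened?'
# Lookbehind: split at whitespace that follows ., !, or ?
sentences = re.split(r'(?<=[.!?])\s+', text)
print(sentences)
# ['One morning, Gregor woke.', 'He found himself transformed!', 'What had happened?']
```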
6. Split into words:
This uses word_tokenize().
This time 'armour-like' stays 'armour-like', while '"What's' becomes 'What', "'s".
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
print(tokens[:100])
7. Filter out punctuation:
Keep only alphabetic tokens and drop everything else.
Note that this also removes 'armour-like' and "'s".
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
words = [word for word in tokens if word.isalpha()]
print(words[:100])
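isalpha() is strict: any token containing a digit, hyphen, or apostrophe is dropped. A small illustration with hand-written tokens mimicking word_tokenize output:

```python
tokens = ['What', "'s", 'happened', '?', 'It', 'was', 'armour-like', '.']
# Keep only tokens made entirely of letters.
words = [t for t in tokens if t.isalpha()]
print(words)
# ['What', 'happened', 'It', 'was']
```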
8. Filter out stop words (common words that carry little meaning):
stopwords.words('english') shows which words are on that list.
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]
print(words[:100])
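The same filter with a tiny hand-written stop set (a toy subset; in practice use stopwords.words('english'), which is much longer and all lowercase, so lowercase your words first):

```python
# Toy stop-word set for illustration only.
stop_words = {'the', 'a', 'he', 'from', 'when', 'in'}
words = ['one', 'morning', 'when', 'gregor', 'woke', 'from', 'troubled', 'dreams']
filtered = [w for w in words if w not in stop_words]
print(filtered)
# ['one', 'morning', 'gregor', 'woke', 'troubled', 'dreams']
```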
9. Reduce words to their stems:
After porter.stem(word), each word is reduced to its root form; for example, 'fishing', 'fished', and 'fisher' all become 'fish'.
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in tokens]
print(stemmed[:100])
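A toy suffix-stripper, only to illustrate what stemming does; the real PorterStemmer applies several stages of carefully ordered rewrite rules (and also lowercases its output), so do not use this sketch in practice:

```python
def crude_stem(word):
    # Strip one common suffix if enough of the word remains.
    # This is a crude approximation, not the Porter algorithm.
    for suffix in ('ing', 'ed', 'er', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

print([crude_stem(w) for w in ['fishing', 'fished', 'fisher', 'fish']])
# ['fish', 'fish', 'fish', 'fish']
```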
Learning resources:
http://blog.csdn.net/lanxu_yy/article/details/29002543
https://machinelearningmastery.com/clean-text-machine-learning-python/