This section covers:
- Reading in the text
- Tokenization
- Building a vocabulary that maps each token to an index
- Converting the text from a sequence of words to a sequence of indices
Reading in the text
import collections
import re

def read_time_machine():
    # Read the corpus line by line; lowercase each line and replace every
    # run of non-letter characters with a single space.
    with open('/home/kesci/input/timemachine7163/timemachine.txt', 'r') as f:
        lines = [re.sub('[^a-z]+', ' ', line.strip().lower()) for line in f]
    return lines

lines = read_time_machine()
print('# sentences %d' % len(lines))
The regular expression replaces every character that is not a lowercase letter with a space, so all punctuation is removed; each line of the file becomes one "sentence" in the returned list.
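To make the cleaning step concrete, here is a minimal sketch; the sample string is chosen only for illustration:

import re

sample = "The Time Traveller (for so it will be convenient to speak of him)..."
print(re.sub('[^a-z]+', ' ', sample.strip().lower()))
# -> 'the time traveller for so it will be convenient to speak of him '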
Tokenization
Each sentence is split on spaces, yielding a list of words.
def tokenize(sentences, token='word'):
    """Split sentences into word or char tokens."""
    if token == 'word':
        return [sentence.split(' ') for sentence in sentences]
    elif token == 'char':
        return [list(sentence) for sentence in sentences]
    else:
        print('ERROR: unknown token type ' + token)
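A quick usage sketch, assuming the `lines` produced by read_time_machine above:

tokens = tokenize(lines)        # word-level tokens, one list per sentence
print(tokens[0:2])              # first two tokenized sentences
char_tokens = tokenize(lines, token='char')
print(char_tokens[0][:10])      # first ten characters of the first line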
Building a vocabulary
- Deduplicate the tokens in the corpus and count word frequencies.
- Add special tokens. Because the corpus is a list of sentences of varying lengths, short sentences must be padded so that every sentence in a batch has the same length; besides padding (pad) there are begin of sentence (bos), end of sentence (eos), and unknown (unk). bos and eos mark the start and end of a sentence, and unk stands for tokens never seen before. A small padding sketch follows this list.
- Walk through the deduplicated frequency dictionary in order, appending each token to idx_to_token while building token_to_idx at the same time.
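As a minimal illustration of why padding is needed, here is a sketch; the pad_to helper and the '<pad>' marker are assumptions for this example, not part of the Vocab class below:

def pad_to(tokens, length, pad='<pad>'):
    # Append pad markers until the sentence reaches the target length.
    return tokens + [pad] * (length - len(tokens))

batch = [['the', 'time', 'machine'], ['i', 'agree']]
max_len = max(len(s) for s in batch)
print([pad_to(s, max_len) for s in batch])
# [['the', 'time', 'machine'], ['i', 'agree', '<pad>']]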
class Vocab(object):
    def __init__(self, tokens, min_freq=0, use_special_tokens=False):
        counter = count_corpus(tokens)  # count the frequency of each token
        self.token_freqs = list(counter.items())
        self.idx_to_token = []  # list mapping index -> token
        if use_special_tokens:
            # padding, begin of sentence, end of sentence, unknown
            self.pad, self.bos, self.eos, self.unk = (0, 1, 2, 3)
            self.idx_to_token += ['<pad>', '<bos>', '<eos>', '<unk>']
        else:
            # only the unknown token is kept
            self.unk = 0
            self.idx_to_token += ['<unk>']
        # keep tokens whose frequency is at least min_freq
        self.idx_to_token += [token for token, freq in self.token_freqs
                              if freq >= min_freq and token not in self.idx_to_token]
        self.token_to_idx = dict()
        for idx, token in enumerate(self.idx_to_token):
            self.token_to_idx[token] = idx

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        # map a token (or a list/tuple of tokens) to its index/indices
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

    def to_tokens(self, indices):
        # map an index (or a list/tuple of indices) back to the token(s)
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[index] for index in indices]

def count_corpus(sentences):
    # flatten the list of sentences and count how often each token occurs
    tokens = [tk for st in sentences for tk in st]
    return collections.Counter(tokens)  # a Counter mapping each token to its count
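A brief usage sketch, reusing the `tokens` from the tokenization step; the printed contents depend on the corpus, so no output is shown:

vocab = Vocab(tokens)
print(len(vocab))                              # vocabulary size
print(list(vocab.token_to_idx.items())[:10])   # first few (token, index) pairs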
Converting text from a sequence of words to a sequence of indices
With the vocabulary in hand, each token can simply be looked up to obtain its index, as sketched below.
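A minimal sketch, reusing `tokens` and `vocab` from above; the exact indices depend on the corpus:

for i in range(8, 10):
    print('words:', tokens[i])
    print('indices:', vocab[tokens[i]])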
However, the tokenization above is quite crude; libraries such as spaCy and NLTK handle cases like punctuation and contractions better.
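The snippets below assume a sample sentence stored in `text`; the sentence here is only an illustration:

text = "Mr. Chen doesn't agree with my suggestion."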
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
print([token.text for token in doc])   # tokens produced by spaCy

from nltk.tokenize import word_tokenize
from nltk import data
data.path.append('/home/kesci/input/nltk_data3784/nltk_data')
print(word_tokenize(text))             # tokens produced by NLTK