Email Classification with Word Embeddings and a Neural Network

1 Problem Description

Problem: email classification

Task: classify each message into one of two classes (spam or ham)

Dataset: https://www.kaggle.com/uciml/sms-spam-collection-dataset#spam.csv

2 Data Preprocessing

import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from textblob import Word
import re
from sklearn.model_selection import train_test_split
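
Note: the stop-word and lemmatization steps below rely on NLTK corpora. If they have not been downloaded yet, the code will raise a LookupError; a one-time download (a minimal sketch) fixes this:

import nltk

# One-time downloads: the English stop-word list, plus WordNet,
# which TextBlob's lemmatizer uses under the hood
nltk.download('stopwords')
nltk.download('wordnet')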

Read the data

# Read the data
data = pd.read_csv('spam.csv', encoding = "ISO-8859-1")
data.columns
Index(['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], dtype='object')
# Show the first 5 rows
data.head()

Drop the useless columns

# Drop the useless columns: the last 3 columns carry no useful data
data = data[['v1', 'v2']]
data.head()

Rename the columns

# Rename the columns
data = data.rename(columns={"v1":"label","v2":"text"})
data.head()

Remove punctuation and extra spaces

# Remove punctuation and collapse runs of two or more spaces
data['text'] = data['text'].apply(lambda x: re.sub('[!@#$:).;,?&]', ' ', x.lower()))
data['text'] = data['text'].apply(lambda x: re.sub(' +', ' ', x))
data['text'][0]
'go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat '
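
The regex above only strips a hand-picked set of punctuation characters. As a hypothetical alternative, string.punctuation covers all ASCII punctuation; the helper below (clean_text is an illustrative name, not used elsewhere in this post) substitutes all of it and collapses whitespace in one pass:

import re
import string

# Replace every ASCII punctuation character with a space,
# then collapse runs of whitespace into a single space
punct_pattern = re.compile('[%s]' % re.escape(string.punctuation))

def clean_text(text):
    text = punct_pattern.sub(' ', text.lower())
    return re.sub(r'\s+', ' ', text).strip()

# data['text'] = data['text'].apply(clean_text)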

Convert words to lowercase

# Convert words to lowercase
data['text'] = data['text'].apply(lambda x: " ".join(word.lower() for word in x.split()))
# or equivalently
# data['text'] = data['text'].apply(lambda x: x.lower())
data['text'][0]
'go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat'

Remove stop words

# Remove stop words such as a, an, the, common prepositions, conjunctions, and pronouns
stop = stopwords.words('english')
data['text'] = data['text'].apply(lambda x: " ".join(word for word in x.split() if word not in stop))
data['text'][0]
'go jurong point crazy available bugis n great world la e buffet cine got amore wat'

Stemming and lemmatization

# Stemming and lemmatization: reduce English words to their base forms
st = PorterStemmer()
data['text'] = data['text'].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))
data['text'] = data['text'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
data['text'][0]
'go jurong point crazi avail bugi n great world la e buffet cine got amor wat'
data.head()
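
The output above shows what each step does: the Porter stemmer chops suffixes and often produces non-words ('crazi', 'avail'), while WordNet lemmatization returns dictionary forms. A quick comparison (sketch):

from nltk.stem import PorterStemmer
from textblob import Word

st = PorterStemmer()
print(st.stem('crazy'), st.stem('available'))  # crazi avail
print(Word('wolves').lemmatize())              # wolf
print(Word('crazy').lemmatize())               # crazy (already a base form)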

3 Feature Extraction

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
Using TensorFlow backend.

Split into training and test sets

# Split into training and test sets at an 8:2 ratio
train, test = train_test_split(data, test_size=0.2)
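
This split is random on every run. For reproducible results that also preserve the spam/ham ratio in both subsets, one might pass random_state and stratify (a sketch; random_state=42 is an arbitrary choice):

# Reproducible, class-balanced variant of the same split
train, test = train_test_split(data, test_size=0.2,
                               random_state=42, stratify=data['label'])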

Set the parameters

# Maximum length of each sequence: longer sequences are truncated, shorter ones are zero-padded
max_sequence_length = 300

# Keep only the 20,000 most frequent words
num_words = 20000

# Dimensionality of the embeddings
embedding_dim = 100

Build the tokenizer

# Build a tokenizer that keeps the most frequent words
tokenizer = Tokenizer(num_words=num_words)
tokenizer.fit_on_texts(train.text)
train_sequences = tokenizer.texts_to_sequences(train.text)
test_sequences = tokenizer.texts_to_sequences(test.text)

# dictionary containing words and their index
word_index = tokenizer.word_index


# print(tokenizer.word_index)
# total words in the corpus
print('Found %s unique tokens.' % len(word_index))
# pad/truncate the training sequences to max_sequence_length

train_x = pad_sequences(train_sequences, maxlen=max_sequence_length)
# pad/truncate the test sequences to max_sequence_length
test_x = pad_sequences(test_sequences, maxlen=max_sequence_length)

print(train_x.shape)
print(test_x.shape)
Found 6702 unique tokens.
(4457, 300)
(1115, 300)
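
Each row of train_x is now a zero-padded vector of word indices. As a sanity check, the mapping can be inverted to decode a sequence back into (preprocessed) text; this sketch assumes index 0 is the padding value, which Tokenizer reserves:

# Invert word_index (word -> index) to decode integer sequences
index_word = {i: w for w, i in word_index.items()}
decoded = ' '.join(index_word.get(i, '?') for i in train_x[0] if i != 0)
print(decoded)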

Vectorize the labels

# Vectorize the labels
# [1,0]: ham; [0,1]: spam
import numpy as np

def label_vectorize(labels):
    label_vec = np.zeros([len(labels), 2])
    for i, label in enumerate(labels):
        if str(label) == 'ham':
            label_vec[i][0] = 1
        else:
            label_vec[i][1] = 1
    return label_vec

train_y = label_vectorize(train['label'])
test_y = label_vectorize(test['label'])


# Alternatively:
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical

# Convert the label strings to a numeric array; assigns an integer level to each unique label
train_labels = train['label']
test_labels = test['label']

le = LabelEncoder()
le.fit(train_labels)
train_labels = le.transform(train_labels)
test_labels = le.transform(test_labels)

# one-hot encode the integer labels
labels_train = to_categorical(np.asarray(train_labels))
labels_test = to_categorical(np.asarray(test_labels))
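
Both encodings agree: LabelEncoder assigns integers alphabetically, so 'ham' becomes 0 and to_categorical turns it into [1, 0], exactly what label_vectorize produces. A quick check (assuming the label column contains only 'ham' and 'spam'):

# Sanity check: the two encodings should produce identical one-hot matrices
assert np.array_equal(train_y, labels_train)
assert np.array_equal(test_y, labels_test)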

4 Build and Train the Model

# Import libraries (only the ones the model below actually uses)
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, BatchNormalization
from keras.layers import Conv1D, MaxPooling1D, Embedding

model = Sequential()
# Trainable word embeddings: one embedding_dim-sized vector per word index
model.add(Embedding(num_words,
                    embedding_dim,
                    input_length=max_sequence_length))
model.add(Dropout(0.5))
# First 1-D convolution + pooling block
model.add(Conv1D(128, 5, activation='relu'))
model.add(MaxPooling1D(5))
model.add(Dropout(0.5))

model.add(BatchNormalization())
# Second 1-D convolution + pooling block
model.add(Conv1D(128, 5, activation='relu'))
model.add(MaxPooling1D(5))
model.add(Dropout(0.5))

model.add(BatchNormalization())
# Classification head: 2-way softmax over (ham, spam)
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

model.fit(train_x, train_y,
            batch_size=64,
            epochs=5,
            validation_split=0.2)
Train on 3565 samples, validate on 892 samples
Epoch 1/5
3565/3565 [==============================] - 25s 7ms/step - loss: 0.3923 - acc: 0.8480 - val_loss: 0.1514 - val_acc: 0.9451
Epoch 2/5
3565/3565 [==============================] - 23s 7ms/step - loss: 0.1729 - acc: 0.9372 - val_loss: 0.0789 - val_acc: 0.9753
Epoch 3/5
3565/3565 [==============================] - 25s 7ms/step - loss: 0.0940 - acc: 0.9731 - val_loss: 0.2079 - val_acc: 0.9787
Epoch 4/5
3565/3565 [==============================] - 23s 7ms/step - loss: 0.0590 - acc: 0.9857 - val_loss: 0.3246 - val_acc: 0.9843
Epoch 5/5
3565/3565 [==============================] - 23s 7ms/step - loss: 0.0493 - acc: 0.9882 - val_loss: 0.3150 - val_acc: 0.9877
<keras.callbacks.History at 0x1cac6187940>
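
Note that val_loss starts rising after epoch 2 even though val_acc stays high, an early sign of overfitting. A common remedy is to stop training once the validation loss stops improving; below is a sketch using Keras's EarlyStopping callback (restore_best_weights requires a reasonably recent Keras; patience=2 and epochs=20 are arbitrary choices):

from keras.callbacks import EarlyStopping

# Stop once val_loss fails to improve for 2 consecutive epochs,
# then roll back to the best weights seen during training
early_stop = EarlyStopping(monitor='val_loss', patience=2,
                           restore_best_weights=True)
model.fit(train_x, train_y,
          batch_size=64,
          epochs=20,
          validation_split=0.2,
          callbacks=[early_stop])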

5 Model Evaluation

# evaluate returns [test loss, test accuracy]
model.evaluate(test_x, test_y)
1115/1115 [==============================] - 2s 2ms/step
[0.32723046118903054, 0.97847533632287]
# prediction on test data
predicted=model.predict(test_x)
predicted
array([[0.71038646, 0.28961352],
       [0.71285075, 0.28714925],
       [0.7101978 , 0.28980213],
       ...,
       [0.7092874 , 0.29071262],
       [0.70976096, 0.290239  ],
       [0.70463425, 0.29536578]], dtype=float32)
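
Each row holds the predicted probabilities for (ham, spam). To turn them into hard class labels, take the argmax per row; with two columns summing to 1, this is equivalent to the predicted.round() used below:

# Convert probability rows into hard labels: 0 = ham, 1 = spam
pred_classes = predicted.argmax(axis=1)
true_classes = test_y.argmax(axis=1)
print(pred_classes[:10])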
# Model evaluation
import sklearn
from sklearn.metrics import precision_recall_fscore_support as score
precision, recall, fscore, support = score(test_y,predicted.round())
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
print("############################")
print(sklearn.metrics.classification_report(test_y,predicted.round()))
precision: [0.97961264 0.97014925]
recall: [0.99585492 0.86666667]
fscore: [0.98766701 0.91549296]
support: [965 150]
############################
             precision    recall  f1-score   support

          0       0.98      1.00      0.99       965
          1       0.97      0.87      0.92       150

avg / total       0.98      0.98      0.98      1115
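
A confusion matrix makes the per-class errors explicit; sklearn's confusion_matrix expects 1-D class labels rather than one-hot rows, hence the argmax (sketch):

from sklearn.metrics import confusion_matrix

# Rows = true class, columns = predicted class (0 = ham, 1 = spam)
print(confusion_matrix(test_y.argmax(axis=1), predicted.argmax(axis=1)))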

Source: https://foochane.cn/article/2019052202.html
