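Chinese Text Classification Compared (Classical Methods vs. CNN)

Background

My lab project happens to need text classification. As one of the most classic tasks in NLP, text classification has accumulated a large number of implementation techniques. Taking the use of deep learning as the dividing line, they fall roughly into two categories:

Text classification based on traditional machine learning
Text classification based on deep learning

fastText, open-sourced by Facebook a while ago, is a simplified member of the second category: word vectors are averaged and fed straight into a softmax layer. TextCNN, which is widely used in both industry and research, also belongs to the second category. There is a GitHub project that gathers these models in one place and runs some simple performance comparisons; if you want to dig further into these fancy models, see the following link: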

all kinds of text classification models and more with deep learning
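To make the "average the word vectors, then softmax" idea concrete, here is a minimal pure-Python sketch (Python 3). The toy embedding table, the weight matrix, and the function names are all made up for illustration; a trained model would learn these values, and this is not fastText's actual implementation.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def average_vectors(tokens, embeddings, dim):
    """Average the vectors of all known tokens; unknown tokens are skipped."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def predict_proba(tokens, embeddings, weights, biases):
    """Averaged document vector -> linear layer -> softmax."""
    doc_vec = average_vectors(tokens, embeddings, len(weights[0]))
    scores = [sum(w * x for w, x in zip(row, doc_vec)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)


# Hypothetical toy parameters: 2-dimensional embeddings, 2 classes.
EMBEDDINGS = {"goal": [1.0, 0.0], "match": [0.8, 0.2], "stock": [0.0, 1.0]}
WEIGHTS = [[2.0, -2.0], [-2.0, 2.0]]  # one row per class
BIASES = [0.0, 0.0]

probs = predict_proba(["goal", "match"], EMBEDDINGS, WEIGHTS, BIASES)
print(probs)  # class 0 gets the higher probability
```

Because the document vector is just an average, word order is ignored entirely, which is what makes this family of models so cheap to train.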

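The main purpose of this post is to record how I built my own text classification systems: one based on traditional machine learning and one based on deep learning, both tested on the same dataset.

For the classical machine learning route, I extract tf-idf text features and feed them into a logistic regression classifier and a random forest classifier, then compare the performance of the two.

For deep learning, I mainly use a CNN. Considering that RNN models perform about the same as CNN models while taking considerably longer to run, I did not run further experiments with them.

Along the way I picked up a few useful small tricks, including multi-process tokenization, fitting tf-idf on the full corpus, and handling Chinese encoding in Python 2, all of which are described in detail below.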

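Preparing the Ingredients

The dataset used here is a popular Chinese news dataset (I originally took it for the Sogou news dataset; see the correction below). It had already been preprocessed when I got it, which spared me a lot of preprocessing trouble. The download link is:

(Thanks to Zhang Kaiqiang for pointing out my mistake: the dataset is THUCNews, which Tsinghua University produced by filtering historical data from the Sina News RSS feeds from 2005 to 2011. Many thanks. The link below has been updated once; the blog post I originally followed is listed in the references, so if the link dies again you can look for the dataset there. Between my studies and an internship I am getting busier and cannot always reply promptly; thank you for your understanding.)

News text classification dataset download

Password: kxxa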

The dataset covers 10 news categories and contains 65,000 texts in total (6,500 per category): 50,000 for training, 10,000 for testing, and 5,000 for validation.

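Classical Machine Learning Methods

Tokenization and Stop-Word Removal

Using the tokenizer utility class from my earlier post on short-text classification, the training, test, and validation sets are tokenized in multiple processes to save time: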

import multiprocessing

# WordCut is the tokenizer utility class from the earlier short-text
# classification post; it is not defined in this snippet and must be importable.

tmp_catalog = '/home/zhouchengyu/haiNan/textClassifier/data/cnews/'
file_list = [tmp_catalog+'cnews.train.txt', tmp_catalog+'cnews.test.txt']
write_list = [tmp_catalog+'train_token.txt', tmp_catalog+'test_token.txt']

def tokenFile(file_path, write_path):
    """Tokenize a "label<TAB>text" file and write "label<TAB>tokens" lines."""
    word_divider = WordCut()
    with open(write_path, 'w') as w:
        with open(file_path, 'r') as f:
            for line in f:
                fields = line.decode('utf-8').strip().split('\t')
                token_sen = word_divider.seg_sentence(fields[1])
                w.write(fields[0].encode('utf-8') + '\t' + token_sen.encode('utf-8') + '\n')
    print file_path + ' has been tokenized; output file is ' + write_path

pool = multiprocessing.Pool(processes=4)
for file_path, write_path in zip(file_list, write_list):
    pool.apply_async(tokenFile, (file_path, write_path, ))
pool.close()
pool.join() # close() must be called before join()
print "Sub-process(es) done."
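One thing to be aware of with `apply_async`: an exception raised inside a worker is only re-raised when `.get()` is called on the returned `AsyncResult`, so a loop that discards the handles finishes silently even if a file failed to tokenize. Below is a Python 3 sketch of the same pattern that keeps the handles. `fake_tokenize` is a hypothetical stand-in for the `WordCut` class, and `tokenize_file` and `tokenize_all` are illustrative names, not part of the original code.

```python
import multiprocessing


def fake_tokenize(text):
    """Hypothetical stand-in for WordCut.seg_sentence: one token per character."""
    return " ".join(text)


def tokenize_file(src_path, dst_path):
    """Read "label<TAB>text" lines, write "label<TAB>tokens" lines, return the count."""
    count = 0
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip malformed lines instead of crashing
            dst.write(fields[0] + "\t" + fake_tokenize(fields[1]) + "\n")
            count += 1
    return count


def tokenize_all(jobs, processes=4):
    """Run tokenize_file over (src, dst) pairs in a process pool."""
    with multiprocessing.Pool(processes=processes) as pool:
        handles = [pool.apply_async(tokenize_file, job) for job in jobs]
        # .get() blocks until the task finishes and re-raises worker errors here.
        return [handle.get() for handle in handles]
```

On platforms that start workers by spawning a fresh interpreter, call `tokenize_all` from under an `if __name__ == "__main__":` guard.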

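Computing tf-idf

A few things to note here. First, tf-idf is computed over the full corpus, so all of train + val + test must be concatenated before computing it. Second, to keep the feature space from growing too large, low-frequency words need to be dropped. Because I wrote this in Jupyter, I first tested the code on the smallest set (val) and, once that worked, repeated the steps for test and train; I hope it doesn't leave an impression of redundant code... [sad face]. The implementation is as follows: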

def constructDataset(path):
    """
    path: file path
    rtype: label_list and corpus_list
    """
    label_list = []
    corpus_list = []
    with open(path, 'r') as p:
        for line in p.readlines():
            label_list.append(line.split('\t')[0])
            corpus_list.append(line.split('\t')[1])
    return label_list, corpus_list
    
tmp_catalog = '/home/zhouchengyu/haiNan/textClassifier/data/cnews/'
file_path = 'val_token.txt'
val_label, val_set = constructDataset(tmp_catalog+file_path)
print len(val_set)

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer


tmp_catalog = '/home/zhouchengyu/haiNan/textClassifier/data/cnews/'
write_list = [tmp_catalog+'train_token.txt', tmp_catalog+'test_token.txt']

tarin_label, train_set = constructDataset(write_list[0]) # 50000
test_label, test_set = constructDataset(write_list[1]) # 10000
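# compute tf-idf
corpus_set = train_set + val_set + test_set # tf-idf over the full corpus
print "length of corpus is: " + str(len(corpus_set))
# min_df is meant to drop low-frequency words. Note: a float min_df is a proportion
# of documents, and 1e-5 * 65000 = 0.65 < 1, so this setting filters nothing in practice.
vectorizer = CountVectorizer(min_df=1e-5)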
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus_set))
words = vectorizer.get_feature_names()
print "how many words: {0}".format(len(words))
print "tf-idf shape: ({0},{1})".format(tfidf.shape[0], tfidf.shape[1])

"""
length of corpus is: 65000
how many words: 379000
tf-idf shape: (65000,379000)
"""

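For readers who want to see what the two scikit-learn steps compute, here is a small pure-Python sketch (Python 3) of document-frequency filtering followed by tf-idf weighting. It uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 and L2 row normalization, which are scikit-learn's documented defaults; the function names and the three-document corpus are invented for illustration.

```python
import math
from collections import Counter


def min_doc_count(min_df, n_docs):
    """A float min_df is a proportion of documents; an int is an absolute count."""
    return min_df if isinstance(min_df, int) else min_df * n_docs


def tfidf(corpus, min_df=1):
    """Return (vocabulary, rows), where each row maps term -> tf-idf weight."""
    docs = [doc.split() for doc in corpus]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    threshold = min_doc_count(min_df, n_docs)
    vocab = sorted(term for term, count in df.items() if count >= threshold)
    kept = set(vocab)
    rows = []
    for doc in docs:
        tf = Counter(term for term in doc if term in kept)
        weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                   for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return vocab, rows


corpus = ["stock up stock", "team wins", "stock down"]
vocab_all, _ = tfidf(corpus, min_df=1e-5)  # 1e-5 * 3 docs < 1: nothing is dropped
vocab_2, rows_2 = tfidf(corpus, min_df=2)  # keep terms found in >= 2 documents
print(len(vocab_all), vocab_2)
```

The `min_df=1e-5` call mirrors the threshold used in the code above: with 65,000 documents it corresponds to 0.65 documents, so an integer count is needed to actually prune rare words.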
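Encoding Labels as Integers and Splitting the Data

Since the texts were already split into the three sets with some randomness, I skip the shuffle here and get a little lazy. If you can shuffle, though, it is still best to do so and stick to the proper way.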
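If you do shuffle, the labels and the texts have to be permuted with the same order so that each text keeps its label. A minimal Python 3 sketch, with `shuffle_together` as a hypothetical helper name and an arbitrary fixed seed for reproducibility:

```python
import random


def shuffle_together(labels, texts, seed=42):
    """Shuffle two parallel lists with one shared permutation."""
    if len(labels) != len(texts):
        raise ValueError("labels and texts must have the same length")
    order = list(range(len(labels)))
    random.Random(seed).shuffle(order)  # local RNG, so global state is untouched
    return [labels[i] for i in order], [texts[i] for i in order]


labels = ["sports", "finance", "tech"]
texts = ["text about sports", "text about finance", "text about tech"]
shuffled_labels, shuffled_texts = shuffle_together(labels, texts)
```

Shuffle each split on its own rather than the concatenated corpus, otherwise the fixed slice boundaries used for train, val, and test would no longer separate the three sets.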

from sklearn import preprocessing

# encode label
corpus_label = tarin_label + val_label + test_label
encoder = preprocessing.LabelEncoder()
corpus_encode_label = encoder.fit_transform(corpus_label)
train_label = corpus_encode_label[:50000]
val_label = corpus_encode_label[50000:55000]
test_label = corpus_encode_label[55000:]
# get tf-idf dataset
train_set = tfidf[:50000]
val_set = tfidf[50000:55000]
test_set = tfidf[55000:]
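`LabelEncoder` assigns each distinct label an integer according to the sorted order of the label values, which is why the reports further down list classes 0 to 9 rather than category names. The idea can be sketched in a few lines of plain Python (Python 3); the helper names and the English labels are illustrative only.

```python
def fit_label_encoder(labels):
    """Return the sorted class list and a label -> integer mapping."""
    classes = sorted(set(labels))
    return classes, {label: idx for idx, label in enumerate(classes)}


def encode(labels, mapping):
    """Map string labels to their integer ids."""
    return [mapping[label] for label in labels]


def decode(ids, classes):
    """Map integer ids back to the original string labels."""
    return [classes[i] for i in ids]


labels = ["sports", "finance", "sports", "tech"]
classes, mapping = fit_label_encoder(labels)
encoded = encode(labels, mapping)
print(encoded)  # [1, 0, 1, 2]
```

With the real encoder, `encoder.classes_` holds the same sorted list, and passing it as `target_names` to `classification_report` should print category names instead of integers.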

Feeding the Classifiers

Logistic Regression Classifier

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# from sklearn.metrics import confusion_matrix

# LogisticRegression classification model
lr_model = LogisticRegression()
lr_model.fit(train_set, train_label)
print "val mean accuracy: {0}".format(lr_model.score(val_set, val_label))
y_pred = lr_model.predict(test_set)
print classification_report(test_label, y_pred)

The classification report is as follows (precision, recall, and F1 score):

val mean accuracy: 0.9626
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      1000
          1       0.99      0.98      0.98      1000
          2       0.94      0.87      0.91      1000
          3       0.91      0.91      0.91      1000
          4       0.97      0.93      0.95      1000
          5       0.97      0.98      0.98      1000
          6       0.93      0.96      0.95      1000
          7       0.99      0.97      0.98      1000
          8       0.94      0.99      0.96      1000
          9       0.95      0.99      0.97      1000

avg / total       0.96      0.96      0.96     10000
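As a reminder of how the columns in such a report are derived, the sketch below computes per-class precision, recall, F1, and support from two label lists in plain Python 3. The function name and the six toy predictions are invented for illustration; they are not the experiment's data.

```python
def per_class_metrics(y_true, y_pred):
    """Compute precision, recall, F1 and support for every class."""
    report = {}
    for c in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # correctly predicted c
        fp = sum(1 for t, p in pairs if t != c and p == c)  # predicted c wrongly
        fn = sum(1 for t, p in pairs if t == c and p != c)  # missed a true c
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[c] = {"precision": precision, "recall": recall,
                     "f1": f1, "support": tp + fn}
    return report


y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]
report = per_class_metrics(y_true, y_pred)
for label, row in report.items():
    print(label, row)
```

Support is simply the number of true samples of a class, which is why every class in the report above shows 1000.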

Random Forest Classifier

# Random forest classifier
from sklearn.ensemble import RandomForestClassifier    


rf_model = RandomForestClassifier(n_estimators=200, random_state=1080)
rf_model.fit(train_set, train_label)
print "val mean accuracy: {0}".format(rf_model.score(val_set, val_label))
y_pred = rf_model.predict(test_set)
print classification_report(test_label, y_pred)

The classification report is as follows (precision, recall, and F1 score):

val mean accuracy: 0.9228
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      1000
          1       0.98      0.98      0.98      1000
          2       0.89      0.57      0.69      1000
          3       0.81      0.97      0.88      1000
          4       0.95      0.89      0.92      1000
          5       0.97      0.96      0.97      1000
          6       0.85      0.94      0.89      1000
          7       0.95      0.97      0.96      1000
          8       0.95      0.97      0.96      1000
          9       0.91      0.99      0.95      1000

avg / total       0.93      0.92      0.92     10000

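Analysis

Comparing the logistic regression classifier with the random forest classifier above:
Apart from a few classes where random forest does noticeably better on a single metric (for example, recall on class 3: 0.97 vs. 0.91), it is worse than logistic regression on most classes.
The 200-tree random forest also takes far too long to run; compared with logistic regression, it is much less efficient.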
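The comparison can be made concrete by lining up the per-class F1 columns of the two reports above. The numbers in this Python 3 sketch are copied from those reports; the variable names are illustrative.

```python
# Per-class F1 scores (classes 0-9) copied from the two reports above.
LR_F1 = [1.00, 0.98, 0.91, 0.91, 0.95, 0.98, 0.95, 0.98, 0.96, 0.97]
RF_F1 = [1.00, 0.98, 0.69, 0.88, 0.92, 0.97, 0.89, 0.96, 0.96, 0.95]

# Positive gap means logistic regression is ahead on that class.
gaps = [round(lr - rf, 2) for lr, rf in zip(LR_F1, RF_F1)]
worst_class = max(range(len(gaps)), key=lambda i: gaps[i])

macro_lr = sum(LR_F1) / len(LR_F1)
macro_rf = sum(RF_F1) / len(RF_F1)

print("F1 gap per class:", gaps)
print("largest gap: class", worst_class, "by", gaps[worst_class])
print("macro F1: LR", round(macro_lr, 3), "RF", round(macro_rf, 3))
```

In F1 terms random forest is never ahead of logistic regression on any class, and most of the overall difference comes from class 2, where its recall drops to 0.57.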

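CNN Text Classification

This part mainly follows a blog post from the TensorFlow community, so I won't repeat its content here. The post is excellent; here is the link to the original, go pay your respects: CNN-RNN Chinese text classification, based on TensorFlow

Character-Level Feature Extraction

The main difference from the previous part lies in feature extraction: the CNN model here uses character-level features. For example, in cnews_loader.py under the data directory: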

def read_file(filename):
    """Read file data (excerpt; open_file is a helper not shown here)"""
    contents, labels = [], []
    with open_file(filename) as f:
        for line in f:
            try:
                label, content = line.strip().split('\t')
                contents.append(list(content)) # character-level features
                labels.append(label)
            except:
                pass
    return contents, labels

def build_vocab(train_dir, vocab_dir, vocab_size=5000):
    """Build the vocabulary from the training set and store it"""
    data_train, _ = read_file(train_dir)

    all_data = []
    for content in data_train:
        all_data.extend(content)

    counter = Counter(all_data)
    count_pairs = counter.most_common(vocab_size - 1)
    words, _ = list(zip(*count_pairs))
    # add a <PAD> token so that all texts can be padded to the same length
    words = ['<PAD>'] + list(words)
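Once such a vocabulary exists, each text still has to be turned into a fixed-length list of ids before a CNN can consume it. The Python 3 sketch below shows one way to do that; `build_char_vocab` and `encode_and_pad` are hypothetical helpers, and truncating at the end and padding on the right with id 0 are choices made here for illustration, not necessarily what the referenced project does.

```python
from collections import Counter


def build_char_vocab(texts, vocab_size=5000):
    """Most frequent characters, with '<PAD>' reserved at index 0."""
    counter = Counter(ch for text in texts for ch in text)
    most_common = [ch for ch, _ in counter.most_common(vocab_size - 1)]
    return ["<PAD>"] + most_common


def encode_and_pad(text, word_to_id, max_length):
    """Map characters to ids, drop unknown ones, then truncate or pad."""
    ids = [word_to_id[ch] for ch in text if ch in word_to_id]
    ids = ids[:max_length]                       # truncate long texts
    return ids + [0] * (max_length - len(ids))   # pad short texts with <PAD>


vocab = build_char_vocab(["aab", "abc"])
word_to_id = {ch: idx for idx, ch in enumerate(vocab)}
print(vocab)                                   # ['<PAD>', 'a', 'b', 'c']
print(encode_and_pad("abz", word_to_id, 5))    # [1, 2, 0, 0, 0]
```

Characters outside the vocabulary are simply dropped in this sketch; mapping them to a dedicated unknown token would be an equally reasonable design.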

I ran a small test:

#! /bin/env python
# -*- coding: utf-8 -*-
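from collections import Counter

"""
Character-level processing.
For Chinese held in a Python 2 byte string, the resulting items are basically
not the original characters, but they can still serve as a statistical feature
that represents the text.
"""
content1 = "你好呀大家"
content2 = "你真的好嗎？"
# content = "abcdefg"
all_data = []
all_data.extend(list(content1))
all_data.extend(list(content2))
# print list(content) # character-level processing
# print "length: " + str(len(list(content)))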
counter = Counter(all_data)
count_pairs = counter.most_common(5)
words, _ = list(zip(*count_pairs))
words = ['<PAD>'] + list(words) #['<PAD>', '\xe5', '\xbd', '\xa0', '\xe4', '\xe7']

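Character-level features like these, which are basically not the original characters, can still represent the text in a statistical sense and therefore serve as features; this point needs to be understood. (The split into single bytes, as in the output above, happens because list() is applied to a Python 2 UTF-8 byte string; applied to a unicode string it yields whole characters.)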
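The difference between splitting by character and splitting by byte is easy to reproduce in Python 3, where the two cases have to be requested explicitly. A short sketch using the first test string from above:

```python
text = "你好呀大家"

chars = list(text)                      # one item per Chinese character
raw_bytes = list(text.encode("utf-8"))  # one item per UTF-8 byte (as integers)

print(len(chars))      # 5
print(len(raw_bytes))  # 15, since each of these characters takes 3 bytes in UTF-8
print([hex(b) for b in raw_bytes[:3]])  # ['0xe4', '0xbd', '0xa0'], the bytes of the first character
```

Either granularity yields countable units for a vocabulary, but only the character-level split keeps units that a reader can recognize.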

Porting to Python 2

The version on GitHub targets Python 3. Since I have always used Python 2, I ported the work above so that it also runs in the following environment:

Python 2.7
TensorFlow 1.3
numpy
scikit-learn

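Besides the obvious differences between Python 3 and Python 2 (class definitions, print, and division), there is also Chinese text encoding. The codecs module handles this nicely; since it is a minor detail, I won't go into it here.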
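The post does not show the ported code, so the following is only one plausible way to use `codecs` for this, not the actual change that was made. `codecs.open` with an explicit encoding returns decoded text in both Python 2 and Python 3, so iterating over a line's content yields whole characters; `read_labeled_lines` is a hypothetical helper name.

```python
import codecs


def read_labeled_lines(path):
    """Yield (label, content) pairs decoded from a UTF-8 "label<TAB>text" file."""
    with codecs.open(path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) == 2:  # silently skip malformed lines
                yield parts[0], parts[1]
```

With this reader, `list(content)` produces one item per character on either interpreter, so the vocabulary is built from characters rather than bytes.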

In the end, on the same dataset, the test report is:

Test Loss:   0.13, Test Acc:  96.06%
Precision, Recall and F1-Score...
               precision    recall  f1-score   support

       sports       0.99      0.99      0.99      1000
      finance       0.96      0.99      0.98      1000
        house       1.00      0.99      1.00      1000
       living       0.99      0.88      0.93      1000
    education       0.90      0.93      0.92      1000
         tech       0.92      0.99      0.95      1000
      fashion       0.95      0.97      0.96      1000
       policy       0.97      0.92      0.94      1000
         game       0.97      0.97      0.97      1000
entertainment       0.95      0.98      0.96      1000

  avg / total       0.96      0.96      0.96     10000

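Analysis

Compared with the traditional machine learning methods, the deep learning method does not seem to have much of an advantage here: the CNN reaches 96.06% test accuracy, and its averaged precision, recall, and F1 (0.96) match those of logistic regression. Still, considering that the dataset is not large and that the deep learning model is only a baseline that could be improved with further parameter tuning, deep learning still has plenty of room to improve text classification performance.

References

CNN-RNN Chinese text classification, based on TensorFlow

Detailed code is available on my GitHub: Chinese Text Classification Compared (Classical Methods vs. CNN)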

××××××××××××××××××××××××××××××××××××××××××

This article is an original work by the author (EdwardChou)

Please credit the source when reposting