I. Download the dataset and upload it to HDFS
Download and extract the 20news-bydate.tar.gz from the 20newsgroups dataset to the working directory.
1. Download the dataset
wget http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
2. Extract the dataset
tar -zxvf 20news-bydate.tar.gz
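3. Trim the dataset
Delete part of the data, since this walkthrough does not need such a large dataset. The commands below remove the rec, talk, sci and comp groups, which leaves alt.atheism, misc.forsale and soc.religion.christian. Skip this step if you want to run on all of the data.
1. Remove the unneeded categories from the train set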
cd 20news-bydate-train
rm -rf rec* talk* sci* comp*
2. Remove the unneeded categories from the test set
cd ../20news-bydate-test
rm -rf rec* talk* sci* comp*
cd ..
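To see what the rm globs above leave behind without touching real data, the same patterns can be replayed on empty mock directories. This is a minimal sketch: the scratch directory and the variable names are assumptions made for illustration, and the category list is the standard set of 20 newsgroup names.

```shell
# Sketch: replay the cleanup globs on empty mock directories (no real data is touched).

# The 20 newsgroup category names, as laid out in the 20news-bydate archive.
categories="alt.atheism comp.graphics comp.os.ms-windows.misc comp.sys.ibm.pc.hardware
comp.sys.mac.hardware comp.windows.x misc.forsale rec.autos rec.motorcycles
rec.sport.baseball rec.sport.hockey sci.crypt sci.electronics sci.med sci.space
soc.religion.christian talk.politics.guns talk.politics.mideast talk.politics.misc
talk.religion.misc"

mock_dir="$(mktemp -d)"   # hypothetical scratch directory
for c in $categories; do
  mkdir "$mock_dir/$c"
done

# Same globs as the rm command in this step, run inside the mock directory.
(cd "$mock_dir" && rm -rf rec* talk* sci* comp*)

# Collect the surviving directory names on one line.
remaining="$(ls "$mock_dir" | tr '\n' ' ' | sed 's/ $//')"
echo "remaining categories: $remaining"
rm -rf "$mock_dir"
```

Running the same check inside the real 20news-bydate-train and 20news-bydate-test directories (a plain ls) is a quick way to confirm the trim before uploading.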
4. Upload the dataset to HDFS
1. Create the directory on HDFS
hadoop fs -mkdir -p /input/mahout/20news_all
2. Upload the data to HDFS
hadoop fs -put -p ./20news-bydate-test/ /input/mahout/20news_all/
hadoop fs -put -p ./20news-bydate-train/ /input/mahout/20news_all/
5. Result on HDFS [screenshot]
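The upload can also be checked from the command line. The sketch below wraps two standard HDFS shell commands, hadoop fs -test and hadoop fs -count; the helper name hdfs_summary is hypothetical, and the paths are the ones used in this section.

```shell
# Sketch: summarize an HDFS directory after upload.
# hdfs_summary is a hypothetical helper name, not part of Hadoop.
hdfs_summary() {
  path="$1"
  # `hadoop fs -test -d` exits 0 only when the path exists and is a directory.
  if ! hadoop fs -test -d "$path"; then
    echo "missing: $path" >&2
    return 1
  fi
  # `hadoop fs -count` prints: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
  hadoop fs -count "$path" | awk '{ printf "%s: %s dirs, %s files\n", $4, $1, $2 }'
}

# Only query the cluster when the hadoop client is actually installed.
if command -v hadoop >/dev/null 2>&1; then
  hdfs_summary /input/mahout/20news_all/20news-bydate-train
  hdfs_summary /input/mahout/20news_all/20news-bydate-test
fi
```

If the trim step was applied, each directory should report one subdirectory per remaining category (the directory count printed by -count also includes the directory itself).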
II. Convert the dataset to sequence files
Convert the 20 newsgroups dataset (trimmed or full) into a <Text, Text> SequenceFile.
Run:
${MAHOUT_HOME}/bin/mahout seqdirectory \
-i /input/mahout/20news_all \
-o /input/mahout/20news_all_seq
[Screenshots: shell output, Hadoop YARN web UI, Hadoop HDFS web UI]
Inspect the contents of the sequence file:
${MAHOUT_HOME}/bin/mahout seqdumper -i /input/mahout/20news_all_seq/part-m-00000
III. Convert the sequence files to vectors
Convert and preprocess the dataset into a <Text, VectorWritable> SequenceFile containing term frequencies for each document.
Run:
${MAHOUT_HOME}/bin/mahout seq2sparse \
-i /input/mahout/20news_all_seq \
-o /input/mahout/20news_all_vec \
-wt tfidf \
-lnorm \
-nv
[Screenshots: shell output, Hadoop YARN web UI, Hadoop HDFS web UI]
IV. Split the vectors into training and test sets
Split the preprocessed dataset into training and testing sets.
Run:
${MAHOUT_HOME}/bin/mahout split \
-i /input/mahout/20news_all_vec/tfidf-vectors \
-tr /input/mahout/20news_all_rt/train-vectors \
-te /input/mahout/20news_all_rt/test-vectors \
-xm sequential \
-rp 20 \
-seq \
-ow
The parameters mean:
* -tr: output path for the training set
* -te: output path for the test set
* -rp: the percentage of the whole dataset to use as test data; the command above sets it to 20%.
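As a rough worked example of what -rp 20 implies, the sketch below computes the expected split sizes. The document count is a made-up illustrative number, not a measurement of this dataset, and because the test items are picked at random the real counts will only be approximately these.

```shell
# Sketch: expected split sizes for a given -rp value (illustrative numbers only).
total_docs=1000   # hypothetical document count, not measured from the dataset
test_pct=20       # the value passed to -rp

test_docs=$(( total_docs * test_pct / 100 ))
train_docs=$(( total_docs - test_docs ))
echo "test: ~${test_docs}, train: ~${train_docs}"
```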
[Screenshots: shell output, Hadoop HDFS web UI]
V. Train the classifier
Train the classifier.
1. Start training
This step goes through several iterations, much like practicing repeatedly to become familiar with a model. It takes a fairly long time, so some patience is needed.
Run:
${MAHOUT_HOME}/bin/mahout trainnb \
-i /input/mahout/20news_all_rt/train-vectors -el \
-o /input/mahout/20news_all_mi/nbmodel \
-li /input/mahout/20news_all_mi/labelindex \
-ow \
-c
[Screenshots: shell output, Hadoop YARN web UI, Hadoop HDFS web UI]
2. Check the training result
2.1. View the trained model:
hadoop fs -ls /input/mahout/20news_all_mi/nbmodel
2.2. View the generated label index:
a. Using the hadoop command
hadoop fs -text /input/mahout/20news_all_mi/labelindex
b. Using the mahout command
${MAHOUT_HOME}/bin/mahout seqdumper -i /input/mahout/20news_all_mi/labelindex
VI. Test the classifier
Test the classifier.
Run:
${MAHOUT_HOME}/bin/mahout testnb \
-i /input/mahout/20news_all_rt/test-vectors \
-m /input/mahout/20news_all_mi/nbmodel \
-l /input/mahout/20news_all_mi/labelindex \
-o /input/mahout/20news_all_testing \
-ow \
-c
[Screenshots: shell output, Hadoop YARN web UI, Hadoop HDFS web UI]
View the contents of the result file:
${MAHOUT_HOME}/bin/mahout seqdumper -i /input/mahout/20news_all_testing/part-m-00000
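The conversion, vectorization, split, training and testing commands above can be collected into one script. The following is a minimal sketch rather than a tested deployment script: the run and run_pipeline helpers, the DRY_RUN switch and the /opt/mahout fallback are assumptions introduced here, while the mahout subcommands, flags and HDFS paths are copied from the steps above. By default it only prints the commands; set DRY_RUN=0 to execute them.

```shell
# Sketch: chain the tutorial's Mahout steps. DRY_RUN=1 (the default) only prints them.
DRY_RUN="${DRY_RUN:-1}"
MAHOUT="${MAHOUT_HOME:-/opt/mahout}/bin/mahout"   # /opt/mahout is an assumed fallback
BASE=/input/mahout                                 # HDFS base path used in the tutorial

# Print the command; execute it only when DRY_RUN is not 1.
run() {
  echo "+ $*"
  [ "$DRY_RUN" = "1" ] || "$@"
}

# Each step runs only if the previous one succeeded.
run_pipeline() {
  run "$MAHOUT" seqdirectory -i "$BASE/20news_all" -o "$BASE/20news_all_seq" &&
  run "$MAHOUT" seq2sparse -i "$BASE/20news_all_seq" -o "$BASE/20news_all_vec" \
      -wt tfidf -lnorm -nv &&
  run "$MAHOUT" split -i "$BASE/20news_all_vec/tfidf-vectors" \
      -tr "$BASE/20news_all_rt/train-vectors" \
      -te "$BASE/20news_all_rt/test-vectors" \
      -xm sequential -rp 20 -seq -ow &&
  run "$MAHOUT" trainnb -i "$BASE/20news_all_rt/train-vectors" -el \
      -o "$BASE/20news_all_mi/nbmodel" \
      -li "$BASE/20news_all_mi/labelindex" -ow -c &&
  run "$MAHOUT" testnb -i "$BASE/20news_all_rt/test-vectors" \
      -m "$BASE/20news_all_mi/nbmodel" \
      -l "$BASE/20news_all_mi/labelindex" \
      -o "$BASE/20news_all_testing" -ow -c
}

run_pipeline
```

Printing each command before running it makes it easier to match a failure in the YARN or shell output to the step that caused it.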
VII. References
1. Naive Bayes algorithm reference
http://mahout.apache.org/users/classification/bayesian.html
2. Newsgroup classification reference
http://mahout.apache.org/users/classification/twenty-newsgroups.html