一、Read the article data
Use pandas to read the MySQL data into a DataFrame:
import pandas as pd
from sqlalchemy import create_engine
db_info = {
    'user': 'root',
    'password': '',
    'host': 'localhost',
    'database': 'article_spider',
}
engine = create_engine('mysql://%(user)s:%(password)s@%(host)s/%(database)s?charset=utf8' % db_info)
sql = 'select * from jobbole_article;'
df = pd.read_sql(sql, con=engine)
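The same `read_sql` flow can be tried without a MySQL server. Below is a minimal sketch that uses an in-memory SQLite database in place of MySQL; the table name matches the post, but the rows and columns are made-up illustration data:

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for MySQL; the rows are hypothetical.
con = sqlite3.connect(':memory:')
pd.DataFrame({
    'title': ['a', 'b'],
    'create_date': ['2017-01-01', '2016-05-02'],
    'tags': ['python,linux', 'c/c++'],
}).to_sql('jobbole_article', con, index=False)

# Same call as above, just with a SQLite connection instead of the engine.
df = pd.read_sql('select * from jobbole_article;', con=con)
print(df.shape)  # (2, 3)
```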
二、Data analysis
1. Inspect the data
df.info()
View a summary of the data (columns, dtypes, non-null counts).
df.isnull()
Check element-wise whether values are missing.
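In practice, `df.isnull().sum()` is the quickest way to see how many values are missing per column. A small sketch on a toy frame (the values are assumptions, mirroring the article table's columns):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing tag value.
df = pd.DataFrame({'title': ['a', 'b'], 'tags': ['python', np.nan]})
missing_per_column = df.isnull().sum()  # count of missing values per column
print(missing_per_column['tags'])  # 1
```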
2. Clean the data
Keep only the create_date, title, and tags columns:
df.loc[:,['create_date','title','tags']]
Sort by create_date, newest first:
df.sort_values(by='create_date', ascending=False)
將數(shù)據(jù)類(lèi)型轉(zhuǎn)換為日期類(lèi)型并設(shè)置為索引
df['create_date'] = pd.to_datetime(df['create_date']) #將數(shù)據(jù)類(lèi)型轉(zhuǎn)換為日期類(lèi)型
df = df.set_index('create_date') # 將dcreate_date設(shè)置為索引
Select the 2017 articles and extract the tags and title columns:
df = df.loc['2017']  # partial string indexing on the DatetimeIndex
tags = df['tags']
title = df['title']
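The datetime-index steps above can be sketched end to end on made-up rows; the year-based selection works because `.loc` supports partial string indexing on a `DatetimeIndex`:

```python
import pandas as pd

# Made-up rows spanning two years, for illustration only.
df = pd.DataFrame({
    'create_date': ['2017-03-01', '2016-11-20', '2017-07-15'],
    'title': ['a', 'b', 'c'],
    'tags': ['python', 'linux', 'mysql'],
})
df['create_date'] = pd.to_datetime(df['create_date'])
df = df.set_index('create_date')
df_2017 = df.loc['2017']  # keeps only the 2017 rows
print(len(df_2017))  # 2
```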
3. Type conversion
First convert the Series to an np.ndarray with np.array(), then turn that into a list with tolist():
import numpy as np

tags_data = np.array(tags)              # np.ndarray
tags_list = tags_data.tolist()          # list
tags_text = "".join(tags_list)          # join into one text string
tags_text = tags_text.replace(',', '')  # strip commas
tags_text = tags_text.replace('/', '')  # strip slashes
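On hypothetical tag values, the ndarray → list → join pipeline looks like this; note that `"".join()` concatenates adjacent tags with no separator:

```python
import numpy as np

# Hypothetical tags column values, for illustration only.
tags_data = np.array(['python,linux', 'c/c++', 'mysql'])
tags_list = tags_data.tolist()
tags_text = "".join(tags_list).replace(',', '').replace('/', '')
print(tags_text)  # pythonlinuxcc++mysql
```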
4. Chinese word segmentation
Use the jieba library to segment the Chinese text:
import jieba
import pandas as pd

jieba.add_word('C/C++')  # keep 'C/C++' as a single token
segment = jieba.lcut(tags_text)
words_df = pd.DataFrame({'segment': segment})
words_df.head()
Count the word frequencies:
words_stat = words_df.groupby('segment').size().reset_index(name='count')
words_stat = words_stat.sort_values(by='count', ascending=False)
words_stat.head()
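The frequency count can be checked on a hypothetical segment list; `groupby().size()` counts each distinct segment, and the sort puts the most frequent first:

```python
import pandas as pd

# Hypothetical segment list, for illustration only.
words_df = pd.DataFrame({'segment': ['python', 'linux', 'python', 'mysql']})
words_stat = (words_df.groupby('segment').size()
              .reset_index(name='count')
              .sort_values(by='count', ascending=False))
print(words_stat.iloc[0].tolist())  # ['python', 2]
```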
5. 詞云顯示數(shù)據(jù)
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib
matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)
from wordcloud import WordCloud  # word-cloud package

# display the frequencies as a word cloud
wordcloud = WordCloud(font_path="simhei.ttf", background_color="white", max_font_size=80)
word_frequence = {x[0]:x[1] for x in words_stat.head(1000).values}
wordcloud = wordcloud.fit_words(word_frequence)
plt.imshow(wordcloud)
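The `{word: frequency}` mapping that `fit_words()` consumes is built by the dict comprehension above; a small sketch with made-up counts shows which column lands where (`x[0]` is the segment because it is the first column of `words_stat`):

```python
import pandas as pd

# words_stat stand-in with made-up counts, for illustration only.
words_stat = pd.DataFrame({'segment': ['python', 'linux'], 'count': [5, 3]})
word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}
print(word_frequence)  # {'python': 5, 'linux': 3}
```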
得到關(guān)于伯樂(lè)在線(xiàn)2017年的文章的標(biāo)簽的使用程度如下
[圖片上傳失敗...(image-fef70e-1518698087842)]