Python Assignment 20170526: A Crawler for the Meigu (US Stocks) Forum on Eastmoney Guba

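## Post page analysis

- Post list navigation

> From this tag we can read the total number of posts (1706), the number of posts per page (80), and the current page number (page 1).
![Analysis of the Meigu forum post list page](http://upload-images.jianshu.io/upload_images/5298387-ca563fc7a0c2552e.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)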

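- Building the post list URL

http://guba.eastmoney.com/list,meigu_2.html

The post list URL can be expressed as:

    'http://guba.eastmoney.com/list,meigu_{}.html'.format(page_num)

Dividing the total number of posts by the number of posts per page and rounding up gives the number of list pages, and from that the list of post list URLs. In code:

    page_data = soup.find(name='span', class_='pagernums').get('data-pager').split('|')
    page_nums = math.ceil(int(page_data[1]) / int(page_data[2]))

**Note: use the `ceil` function from the `math` module to round up.**

- Loop over every list page to collect the post information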
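The two steps above can be sketched as follows. This is a minimal illustration, not the crawler itself: the sample `data-pager` string is an assumption (only the `|`-separated layout, with the total at index 1 and the page size at index 2, comes from the code above), and `build_list_urls` is a hypothetical helper name.

```python
import math

LIST_URL = 'http://guba.eastmoney.com/list,meigu_{}.html'


def build_list_urls(data_pager):
    """Return the URL of every post list page described by a data-pager value."""
    fields = data_pager.split('|')
    total_posts = int(fields[1])   # total number of posts
    per_page = int(fields[2])      # posts shown on each list page
    # Round up so the last, partially filled page is not dropped
    page_nums = math.ceil(total_posts / per_page)
    return [LIST_URL.format(page) for page in range(1, page_nums + 1)]


# Assumed sample value; the real attribute may contain different fields.
urls = build_list_urls('list,meigu_|1706|80|1')
print(len(urls))   # 22
print(urls[1])     # http://guba.eastmoney.com/list,meigu_2.html
```

With 1706 posts and 80 posts per page, plain integer division would give 21 and miss the final 26 posts, which is why the result is rounded up.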

## Comment page analysis
- Comment page navigation
> Inspecting the page's HTML and searching for 105 (the comment count) shows three places where this value can be read. Here a regular expression extracts it from the script.

    {var num=40030; }var pinglun_num=105;var xgti="";if(typeof (count) != "undefined"){xgti="<a href='list,meigu.html'>相關帖子"+count+"條</a>";}

![Comment page navigation information](http://upload-images.jianshu.io/upload_images/5298387-8e7eae194d70eb22.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

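- Building the comment page URL

http://guba.eastmoney.com/news,meigu,613304918_2.html

The comment page URL of a post can be expressed as:

    'http://guba.eastmoney.com/news,meigu,613304918_{}.html'.format(page_num)

Dividing the total comment count `reply_count` by 30 (when comments are paginated, a page holds at most 30 comments) and rounding up gives the number of comment pages, and from that the list of comment page URLs. In code:

    pattern = re.compile(r'var pinglun_num=(.*?);')
    # Number of comments on the post
    reply_count = int(re.search(pattern, resp.text).group(1))
    page_num = math.ceil(reply_count / 30)

**Note: use the `ceil` function from the `math` module to round up.**

- Loop over every comment page to collect the comments
First check whether the post has any comments. If it does, iterate over the comment page URLs and return the post's comments.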
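The comment-page steps above can be sketched in the same way. The snippet uses only the script fragment and the example post URL quoted in this section; `build_comment_urls` is a hypothetical helper name, and the pattern uses `\d+` instead of `.*?` as a slightly stricter variant, which is an assumption about the page rather than something verified against the live site.

```python
import math
import re

COMMENTS_PER_PAGE = 30  # maximum number of comments on one page, as stated above
PATTERN = re.compile(r'var pinglun_num=(\d+);')


def build_comment_urls(post_url, html):
    """Return the comment page URLs of a post, or an empty list if it has no comments."""
    match = PATTERN.search(html)
    reply_count = int(match.group(1)) if match else 0
    if reply_count == 0:
        return []
    page_num = math.ceil(reply_count / COMMENTS_PER_PAGE)
    base = post_url.rsplit('.', 1)[0]   # strip the trailing ".html"
    return ['{}_{}.html'.format(base, page) for page in range(1, page_num + 1)]


sample_html = '{var num=40030; }var pinglun_num=105;var xgti="";'
comment_urls = build_comment_urls(
    'http://guba.eastmoney.com/news,meigu,613304918.html', sample_html)
print(len(comment_urls))   # 4
print(comment_urls[1])     # http://guba.eastmoney.com/news,meigu,613304918_2.html
```

Returning an empty list for a post without comments lets the caller skip the page loop without a separate special case.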

## Libraries used
- requests: sends the HTTP requests
- BeautifulSoup: parses the HTML
- re: extracts data from the page with regular expressions
- math: provides `ceil` for rounding up
- csv: saves the data to a CSV file
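As a small illustration of the `csv` step, the sketch below writes a header and one row with `csv.DictWriter`. The column names mirror the dictionary keys used by the crawler in the Code section; the sample row is made up, `write_rows` is a hypothetical helper, and an in-memory buffer stands in for the real `meigu.csv` file.

```python
import csv
import io

FIELDS = ['read_count', 'reply_count', 'release_time', 'article_url', 'article_title']


def write_rows(csv_file, rows):
    """Write a header line followed by one line per post dictionary."""
    writer = csv.DictWriter(csv_file, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample row, for illustration only.
sample = [{
    'read_count': '120',
    'reply_count': '3',
    'release_time': '05-26 10:00',
    'article_url': 'http://guba.eastmoney.com/news,meigu,646708357.html',
    'article_title': 'sample title',
}]

buffer = io.StringIO()   # stands in for open('meigu.csv', 'w', newline='')
write_rows(buffer, sample)
print(buffer.getvalue())
```

Because the post URLs on this site contain commas, the writer quotes that field automatically, which is one reason to use the `csv` module instead of joining values by hand.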

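## Crawling process
1. Start from http://guba.eastmoney.com/list,meigu.html as the entry point.
2. Get the total number of posts and compute the number of post list pages.
3. Build the list of post list URLs.
4. Iterate over the post list URLs to collect the post information:
   - For each post on a list page, read its read count, reply count, and title.
   - Collect the comments:
      - Start from the post URL, for example http://guba.eastmoney.com/news,meigu,646708357.html
      - Get the total number of comments and compute the number of comment pages
      - Build the list of comment page URLs
      - Iterate over the comment page URLs to collect the post's comments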

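## Code

    import requests
    from bs4 import BeautifulSoup
    import math
    import re
    import csv

    # First page of the post list, used as the crawl entry point
    start_url = 'http://guba.eastmoney.com/list,meigu_1.html'

    # Example post URL
    url = "http://guba.eastmoney.com/news,meigu,646708357.html"

    # Prefix for the relative links found in the post list
    base_url = "http://guba.eastmoney.com"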

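    # Get the information for all posts
    def get_articles_info(start_url):
        resp = get_html(start_url)
        if resp is None:
            return []
        soup = BeautifulSoup(resp.text, 'html.parser')
        # data-pager is '|'-separated: index 1 is the total post count, index 2 the posts per page
        page_data = soup.find(name='span', class_='pagernums').get('data-pager').split('|')
        page_nums = math.ceil(int(page_data[1]) / int(page_data[2]))
        print('{} pages in total'.format(page_nums))
        articles_infos = []
        # Write the header row and close the file before any post rows are appended
        with open('meigu.csv', 'a', newline='') as csv_file:
            writer = csv.writer(csv_file)
            writer.writerow(['Reads', 'Replies', 'Posted at', 'Post URL', 'Post title', 'Post comments'])
        for i in range(1, page_nums + 1):
            print('Crawling page {}...'.format(i))
            articles_url = start_url.split('_')[0] + '_' + str(i) + '.html'
            # Keep each page's result in its own variable so the accumulated list is not overwritten
            page_infos = parser_articles_info(articles_url)
            articles_infos.extend(page_infos)
        return articles_infos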

    # Get the information for every post on one list page: read count, reply count,
    # release time, post URL, post title, and all comments on the post
    # param: URL of one post list page
    def parser_articles_info(article_list_url):
        resp = get_html(article_list_url)
        if resp is None:
            return []
        articles_soup = BeautifulSoup(resp.text, 'html.parser')
        articles_infos = articles_soup.find_all(name='div', class_='articleh')
        articles = []
        for info in articles_infos:
            # Look the title link up once instead of repeating the same search
            link = info.find(name='span', class_='l3').find(name='a')
            # Skip rows without a link and links that do not contain '/news'
            if link is None or '/news' not in link.get('href', ''):
                continue
            article_url = base_url + link.get('href')
            article_resp = get_html(article_url)
            article_infos = {
                'read_count': info.find(name='span', class_='l1').text,
                'reply_count': info.find(name='span', class_='l2').text,
                'release_time': info.find(name='span', class_='l5').text,
                'article_url': article_url,
                'article_title': link.get('title'),
                # get_html returns None on failure, so guard before parsing
                'article_comments': parse_comment_page(article_resp) if article_resp else [],
            }
            with open('meigu.csv', 'a', newline='') as csv_file:
                writer = csv.writer(csv_file)
                writer.writerow(article_infos.values())
            articles.append(article_infos)
        return articles

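    # Fetch the HTML document for a URL; returns None for a non-200 response
    def get_html(url):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
        }
        # Pass the headers so that the User-Agent is actually sent
        resp = requests.get(url, headers=headers)
        if resp.status_code == 200:
            return resp
        return None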

    # Parse a post's HTML document and extract the data we need:
    # the post content and all comments on the post
    def parse_comment_page(resp):
        soup = BeautifulSoup(resp.text, 'html.parser')
        # Regular expression that captures the total number of comments
        pattern = re.compile(r'var pinglun_num=(.*?);')
        article_info = {}
        # Number of comments on the post
        article_info['reply_count'] = int(re.search(pattern, resp.text).group(1))
        # Post content
        article_info['article_content'] = soup.find(name='div', class_='stockcodec').text.strip()
        page_num = math.ceil(article_info['reply_count'] / 30)
        print('{} comments'.format(article_info['reply_count']), ',', '{} pages'.format(page_num))
        # Crawl all the comments
        article_comments = []
        if article_info['reply_count'] > 0:
            for i in range(1, page_num + 1):
                # Turn ".../news,meigu,<id>.html" into ".../news,meigu,<id>_<i>.html"
                comment_url = '.'.join(resp.url.split('.')[:-1]) + '_{}'.format(i) + '.html'
                print(comment_url)
                article_comments.extend(parser_article_comment(comment_url))
        else:
            article_comments.append('This post has no comments yet')
        return article_comments

    # Get the comments on one comment page of a post
    def parser_article_comment(comment_list_url):
        resp = get_html(comment_list_url)
        # Start with an empty list so a failed request returns [] instead of raising NameError
        comments = []
        if resp:
            comment_soup = BeautifulSoup(resp.text, 'html.parser')
            comments_infos = comment_soup.find_all(name='div', class_='zwlitxt')
            for info in comments_infos:
                nick_link = info.find(name='span', class_='zwnick').find('a')
                comment = {
                    # Some comments have no linked nickname
                    'commentator': nick_link.text if nick_link else None,
                    'reply_time': info.find(name='div', class_='zwlitime').text,
                    'reply_content': info.find(name='div', class_='zwlitext').text,
                }
                comments.append(comment)
        return comments

    def main():
        get_articles_info(start_url)


    if __name__ == '__main__':
        main()


## Results
![Crawl results](http://upload-images.jianshu.io/upload_images/5298387-f98fdbc345c75d57.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
> The crawler runs correctly, but the crawl is very slow.
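One likely reason for the slowness is that every page is requested one after another, so most of the time is spent waiting on the network. As a hedged sketch of one common remedy, and not something the crawler above does, the snippet below runs a fetch function over several URLs with a thread pool. `fetch_all` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for a real request so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL on a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    # Placeholder standing in for a real network call such as get_html(url).
    return 'html for ' + url


pages = fetch_all(['page_1', 'page_2', 'page_3'], fake_fetch)
print(pages)
```

If something like this were applied to a real site, a small `max_workers` value and a delay between requests would be a sensible way to avoid overloading the server.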

Last edited: 2017.12.07 18:20:42
