Course Assignment - Web Crawler Basics 04 - Building a Crawler - WilliamZeng - 20170729

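Class assignment

Crawl the list of all articles in the "解密大数据" (Decoding Big Data) collection and save the output to files.
Data to crawl for each article: author, title, article URL, abstract, thumbnail URL, read count, comment count, like count, and reward count.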

References

- Beautiful Soup 4.4.0 documentation (English version)

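Thanks to Mr. Zeng for his guidance and for sharing his material. Thanks also to the classmates who have already finished this assignment; some of their experience is worth borrowing. Joe's write-up is very thorough: it uses regular-expression matching and some new functions, and it applies what we learned in the business data analysis course to produce charts and analysis. It is well worth a look.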

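The crawler assignments are getting harder. I am still unfamiliar with Python's functions and with the official documentation of the crawling tools, so I ran into quite a few problems.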

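Decoding Chinese characters has been one of my main problems in the last two crawler assignments, and it has troubled quite a few other students as well. Will the Xinsheng University (新生大学) courses cover it? Is there a systematic tutorial or document I could consult?
Why do we add the suffix &page=%d to the seed page, or to the page that is actually fetched? Is it because this is a fairly common way to imitate dynamic (asynchronous) page loading, so that all the content of the target page can be loaded and read? Mr. Zeng mentioned this in the July 29 class, but from the screenshot I could not tell under what circumstances a link such as http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=1 appears. I could not observe it in Chrome on my 64-bit Windows 7 machine.
This assignment scrapes more fields than before and the paths to them are more complex, so it tests how carefully you inspect the page and write the code. With Joe's work as a model, I spent a lot of time on it, and most of the errors are now fixed.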
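
On the first question, one common way to think about encoding is to decode bytes into text once, right after downloading, and to encode again only when writing to disk. The sketch below is not from the course material. It assumes the page is UTF-8 encoded, and the sample string and the file name demo.txt are made up for illustration. It uses io.open, which behaves the same way on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch only: bytes come off the network, text lives inside the program,
# and bytes go back to disk. The sample data here is invented.
import io
import os
import tempfile

raw = u"解密大数据".encode("utf-8")  # stands in for urlopen(...).read()
text = raw.decode("utf-8")           # decode once, right after downloading

path = os.path.join(tempfile.mkdtemp(), "demo.txt")  # hypothetical file name
with io.open(path, "w", encoding="utf-8") as f:      # io.open encodes on write
    f.write(text)

with io.open(path, "r", encoding="utf-8") as f:      # and decodes on read
    round_trip = f.read()
```

With this habit, string concatenation inside the program only ever mixes text with text, which avoids most of the decode errors I ran into.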

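Code: beautifulsoup4 implementation

Importing modules
Basic download function: download
Function that grabs the article-list area of the collection page: crawl_list
Function that extracts the target tag information for each article: crawl_paper_tag
Function that writes the extracted information to one file per article: write_file
Function that replaces characters in the title that are unsuitable for file names: clean_title
Function that runs the crawl and writes the files: crawl_papers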

Importing modules

import os
import time
import urllib2
from bs4 import BeautifulSoup
import urlparse

The download function

def download(url, retry=2):
    """
    Download a page and return its complete HTML.
    :param url: URL to download
    :param retry: number of retries left
    :return: raw HTML, or None if the download failed
    """
    print "downloading: ", url
    # Set the request headers to imitate a browser
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'
    }
    try:  # the request may fail, so wrap it in try-except
        request = urllib2.Request(url, headers=header)  # build the request
        html = urllib2.urlopen(request).read()  # fetch the URL
    except urllib2.URLError as e:  # handle the failure
        print "download error: ", e.reason
        html = None
        if retry > 0:  # retries are still available
            if hasattr(e, 'code') and 500 <= e.code < 600:  # retry only on 5xx server errors
                print e.code
                return download(url, retry - 1)
    time.sleep(1)  # wait 1 s to go easy on the server and to avoid being blocked
    return html
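
The retry idea in download can be looked at on its own, without any network access. The following is a minimal sketch, not the course code: fetch_with_retry, FakeServerError and flaky_fetch are hypothetical names invented here, and the fake fetcher simply fails twice with a 503 before succeeding. It is written to run on both Python 2.7 and Python 3.

```python
import time


class FakeServerError(IOError):
    """Hypothetical stand-in for an HTTP error that carries a status code."""

    def __init__(self, code):
        IOError.__init__(self, "HTTP %d" % code)
        self.code = code


def fetch_with_retry(fetch, url, retry=2, delay=0.0):
    """Call fetch(url); retry up to `retry` more times, but only on 5xx codes."""
    for attempt in range(retry + 1):
        try:
            return fetch(url)
        except IOError as e:
            code = getattr(e, "code", None)
            is_server_error = code is not None and 500 <= code < 600
            if not is_server_error or attempt == retry:
                return None  # give up: not retryable, or out of retries
            time.sleep(delay)  # pause before the next attempt
    return None


calls = []


def flaky_fetch(url):
    """Fail with 503 on the first two calls, then return a page."""
    calls.append(url)
    if len(calls) < 3:
        raise FakeServerError(503)
    return "<html>ok</html>"


result = fetch_with_retry(flaky_fetch, "http://example.com/", retry=2)
```

A loop is used here instead of recursion only to make the number of attempts easy to see; the behaviour is meant to match the recursive version above.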

The crawl_list function

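def crawl_list(url):
    """
    Crawl the article list.
    :param url: address of the list page to download
    :return: the article-list tag, or None if the download failed
    """
    html = download(url)  # download the page
    if html is None:  # nothing was downloaded: treat it as the end of the list
        return None

    soup = BeautifulSoup(html, "html.parser")  # parse the downloaded page
    return soup.find(id='list-container').find('ul', {'class': 'note-list'})  # article list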

This part is based on the code from the teacher's class.

The crawl_paper_tag function

def crawl_paper_tag(article_list, url_root):
    """
    Extract the details of every article in the list.
    :param article_list: the article-list tag to crawl (None if the download failed)
    :param url_root: root URL of the site being crawled
    :return: a list of dicts, one per article
    """
    paperList = []  # attributes of every article
    if article_list is None:  # crawl_list found nothing, so there is nothing to parse
        return paperList
    for paperTag in article_list.find_all('li'):
        content = paperTag.find('div', {'class': 'content'})  # shared parent of most fields
        author = content.find('a', {'class': 'blue-link'}).text  # author
        title = content.find('a', {'class': 'title'}).text  # title
        paperURL = content.find('a', {'class': 'title'}).get('href')  # article URL
        abstract = content.find('p', {'class': 'abstract'}).text  # abstract
        imgLink = paperTag.find('a', {'class': 'wrap-img'})
        if imgLink is not None:
            pic = imgLink.find('img').get('src')  # thumbnail URL
        else:
            pic = 'No Pic'
        metaRead = content.find('i', {'class': 'iconfont ic-list-read'}).find_parent('a').text  # read count
        metaComment = content.find('i', {'class': 'iconfont ic-list-comments'}).find_parent('a').text  # comment count
        metaLike = content.find('i', {'class': 'iconfont ic-list-like'}).find_parent('span').text  # like count
        moneyIcon = content.find('i', {'class': 'iconfont ic-list-money'})
        if moneyIcon is not None:
            metaReward = moneyIcon.find_parent('span').text  # reward count
        else:
            metaReward = 0
        paperAttr = {
            'author': author,
            'title': title,
            'url': urlparse.urljoin(url_root, paperURL),
            'abstract': abstract,
            'pic': pic,
            'read': metaRead,
            'comment': metaComment,
            'like': metaLike,
            'reward': metaReward
        }
        write_file(title, paperAttr)
        paperList.append(paperAttr)
    return paperList

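This is based on the code Mr. Zeng provided in class, with additions and changes. I tried writing the list of dictionaries straight to a file, but the characters were not encoded correctly; it probably needs another loop to write them properly. Here I instead call the file-writing function directly to write each dictionary to its own file.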
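
One possible way to write the whole list to a single file with the characters intact is to serialize each dictionary as one JSON line. This is only a sketch of an alternative, not what the assignment code does: the sample records and the file name papers.jsonl are made up, and it relies on json.dumps with ensure_ascii=False together with io.open so that it runs on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import json
import os
import tempfile

# Made-up sample records shaped like the dictionaries built in crawl_paper_tag.
paper_list = [
    {"author": u"示例作者", "title": u"示例标题", "read": u"12"},
    {"author": u"someone", "title": u"another title", "read": u"3"},
]

path = os.path.join(tempfile.mkdtemp(), "papers.jsonl")  # hypothetical file name
with io.open(path, "w", encoding="utf-8") as f:
    for paper in paper_list:
        # ensure_ascii=False keeps Chinese characters readable in the file
        line = json.dumps(paper, ensure_ascii=False)
        if isinstance(line, bytes):  # Python 2 may return a byte string
            line = line.decode("utf-8")
        f.write(line + u"\n")

# Read the file back to check that nothing was lost.
with io.open(path, "r", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```

Printing a list of dictionaries in Python 2 shows escaped forms such as \u89e3 because it uses repr, which may be what I was seeing; going through json avoids that.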

The write_file function

def write_file(title, paperattr):
    if not os.path.exists('spider_output/'):  # make sure the output directory exists
        os.mkdir('spider_output/')
    cleaned_title = clean_title(title)
    file_name = 'spider_output/' + cleaned_title + '.txt'  # name of the file to write
    # if os.path.exists(file_name):
        # os.remove(file_name)  # delete the file
        # return  # do not rewrite a file that already exists
    content =  'Author:' + (unicode(paperattr['author']).encode('utf-8', errors='ignore')) + '\n' \
               + 'Title:' + (unicode(paperattr['title']).encode('utf-8', errors='ignore')) + '\n' \
               + 'URL:' + (unicode(paperattr['url']).encode('utf-8', errors='ignore')) + '\n' \
               + 'Abstract:' + (unicode(paperattr['abstract']).encode('utf-8', errors='ignore')) + '\n' \
               + 'ArticlePic:' + (unicode(paperattr['pic']).encode('utf-8', errors='ignore')) + '\n' \
               + 'Read:' + (unicode(paperattr['read']).encode('utf-8', errors='ignore')) + '\n' \
               + 'Comment:' + (unicode(paperattr['comment']).encode('utf-8', errors='ignore')) + '\n' \
               + 'Like:' + (unicode(paperattr['like']).encode('utf-8', errors='ignore')) + '\n' \
               + 'Reward:' + (unicode(paperattr['reward']).encode('utf-8', errors='ignore')) + '\n'
    with open(file_name, 'wb') as f:  # the with block closes the file even if writing fails
        f.write(content)

The clean_title function

def clean_title(title):
    """
    Replace special characters; otherwise the code that builds a file name from the article title fails.
    """
    title = title.replace('|', ' ')
    title = title.replace('"', ' ')
    title = title.replace('/', ',')
    title = title.replace('<', ' ')
    title = title.replace('>', ' ')
    title = title.replace('\x08', '')
    return title

The two functions above write the extracted tag information to a file named after the article title. The title-cleaning function is adapted from another classmate's code.
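
The chain of replace calls could also be expressed as a single regular expression. The sketch below is an alternative I have not run against the real collection: safe_filename and INVALID_FILENAME_CHARS are hypothetical names, and unlike clean_title it also covers the backslash, colon, asterisk and question mark, which Windows rejects in file names as well, and it replaces every offending character with a space.

```python
import re

# Characters Windows rejects in file names, plus ASCII control characters
# (the control range includes the \x08 handled by clean_title).
INVALID_FILENAME_CHARS = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_filename(title, replacement=u" "):
    """Return `title` with unsafe characters replaced, never an empty string."""
    cleaned = INVALID_FILENAME_CHARS.sub(replacement, title)
    return cleaned.strip() or u"untitled"


example = safe_filename(u'a/b|c')
```

The fallback to "untitled" is there because a title made only of unsafe characters would otherwise produce the file name ".txt".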

The crawl_papers function

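def crawl_papers(url_seed, url_root):
    """
    Crawl every page of the article list.
    :param url_seed: seed page address, with a %d placeholder for the page number
    :param url_root: root URL of the site being crawled
    :return:
    """
    i = 1
    flag = True  # whether to keep crawling
    while flag:
        url = url_seed % i  # the page actually crawled
        i += 1  # the page to crawl next time
        article_list = crawl_list(url)  # download the article list
        if article_list is None:  # the download failed, so there is nothing to parse
            break
        article_tag = crawl_paper_tag(article_list, url_root)
        if len(article_tag) == 0:  # an empty list means the last page has been passed
            flag = False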

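As I currently understand it from running the code, the page number in &page=%d has to be incremented on each call in order to crawl the complete article list. Crawling stops when a page yields no more article entries.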
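
The stopping rule can be tried out without touching the real site. This is a minimal sketch under assumptions: crawl_all_pages, fake_site and fake_fetch are hypothetical names, the example.com URLs are placeholders, and the fake site has three pages of made-up items followed by empty pages.

```python
def crawl_all_pages(url_seed, fetch_page, max_pages=1000):
    """Request page 1, 2, 3, ... and stop at the first page with no items."""
    results = []
    for page in range(1, max_pages + 1):
        url = url_seed % page      # fill the %d placeholder with the page number
        items = fetch_page(url)
        if not items:              # an empty page means we are past the last one
            break
        results.extend(items)
    return results


seed = "http://example.com/c/demo?order_by=added_at&page=%d"  # placeholder URL
fake_site = {
    seed % 1: ["a", "b"],
    seed % 2: ["c", "d"],
    seed % 3: ["e"],
}
requested = []


def fake_fetch(url):
    """Stand-in for crawl_list plus crawl_paper_tag: look the page up in a dict."""
    requested.append(url)
    return fake_site.get(url, [])


all_items = crawl_all_pages(seed, fake_fetch)
```

Note that the loop always makes one request more than there are pages with content, because the empty page is what signals the end. The max_pages cap is an extra safety net I added for the sketch.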

Calling the functions to run the crawl

url_root = 'http://www.reibang.com/'
url_seed = 'http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=%d'
crawl_papers(url_seed, url_root)

The output in the Python Console is as follows:

downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=1
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=2
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=3
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=4
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=5
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=6
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=7
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=8
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=9
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=10
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=11
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=12
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=13
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=14
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=15
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=16
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=17
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=18
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=19
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=20
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=21
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=22
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=23
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=24
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=25
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=26
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=27
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=28
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=29
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=30
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=31
downloading:  http://www.reibang.com/c/9b4685b6357c/?order_by=added_at&page=32

The resulting files are shown below:

Output files.png

Sample output file contents.png

Last edited: 2017.12.10 00:18:11
