Scrapy
What is Scrapy
- Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a range of programs, including data mining, information processing, and storing historical data. It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (for example, Amazon Associates Web Services) or as a general-purpose web crawler.
What is a web crawler
- Also known as a web spider or web robot (and, in the FOAF community, often called a web wanderer), a web crawler is a program or script that automatically fetches information from the World Wide Web according to a set of rules; it is widely used across the Internet. Search engines use crawlers to fetch web pages, documents, and even images, audio, and video, then organize this information with indexing techniques and make it searchable for users. Crawlers also give small and mid-sized sites an effective channel for promotion, and optimizing sites for search-engine crawlers was popular for a time.
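The "fetch according to rules" idea can be sketched as a small breadth-first traversal. In this toy sketch the web is faked as an in-memory dict of page links (all page names are invented for illustration); a real crawler would fetch URLs instead:

```python
from collections import deque

# A toy "web": page -> links found on that page (invented for illustration).
FAKE_WEB = {
    'home': ['a', 'b'],
    'a': ['b', 'c'],
    'b': ['home'],
    'c': [],
}

def crawl(start):
    """Breadth-first crawl: a frontier queue plus a visited set,
    the two rules almost every crawler follows."""
    visited = set()
    frontier = deque([start])
    order = []
    while frontier:
        page = frontier.popleft()
        if page in visited:
            continue                      # never fetch the same page twice
        visited.add(page)
        order.append(page)                # "process" the page here
        frontier.extend(FAKE_WEB.get(page, []))
    return order

print(crawl('home'))  # → ['home', 'a', 'b', 'c']
```

Scrapy supplies exactly these mechanics (scheduling, deduplication, fetching) so a spider only has to describe what to extract and which links to follow.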
Installing Scrapy
- First, type python in a terminal to check whether Ubuntu ships with Python; if it does not, install it first.
- After python starts, the interactive interpreter prompt appears.
- At the prompt, type import lxml
- Then type import OpenSSL
- If neither import raises an error, the bundled Python already has these dependencies.
Then run the following commands in the terminal, in order:
- sudo apt-get install python-dev
- sudo apt-get install libevent-dev
- sudo apt-get install python-pip
- sudo pip install Scrapy
Then run scrapy; if the installation succeeded, Scrapy prints its command-line usage screen.
With Scrapy installed, we can try some simple scraping (crawling data from the Jianshu home page).
- Create a new project: scrapy startproject XXX (for example: scrapy startproject jianshu)
- The command generates the project files as a tree structure.
- The spiders folder is where the crawling logic lives (the code that actually fetches and parses data); it is the core of the crawler.
- Under the spiders folder, create your own spider to crawl the popular articles on the Jianshu home page.
- scrapy.cfg is the project's configuration file.
- settings.py sets request parameters, proxy usage, where scraped data is saved, and so on.
- items.py: the project's item file; it declares the fields to be scraped, similar to the keys of a dict.
- pipelines.py: the project's pipelines file; it defines how the data is processed after scraping.
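For reference, the skeleton that scrapy startproject jianshu generates looks like this (the standard Scrapy layout):

```
jianshu/
    scrapy.cfg            # project configuration file
    jianshu/              # the project's Python module
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spiders live here
            __init__.py
```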
A simple crawl of the Jianshu home page
- Create the file jianshuSpider.py under the spiders folder and add the following:
```python
# coding=utf-8
import scrapy
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request
from jianshu.items import JianshuItem
import urllib

class Jianshu(CrawlSpider):
    name = 'jianshu'
    start_urls = ['http://www.reibang.com/top/monthly']
    url = 'http://www.reibang.com'

    def parse(self, response):
        selector = Selector(response)
        articles = selector.xpath('//ul[@class="article-list thumbnails"]/li')
        for article in articles:
            item = JianshuItem()
            title = article.xpath('div/h4/a/text()').extract()
            url = article.xpath('div/h4/a/@href').extract()
            author = article.xpath('div/p/a/text()').extract()
            # Download the thumbnails of all popular articles;
            # note that some articles have no image.
            try:
                image = article.xpath("a/img/@src").extract()
                urllib.urlretrieve(image[0], '/Users/apple/Documents/images/%s-%s.jpg' % (author[0], title[0]))
            except (IndexError, IOError):
                print('--no---image--')
            listtop = article.xpath('div/div/a/text()').extract()
            likeNum = article.xpath('div/div/span/text()').extract()
            item['title'] = title
            item['url'] = 'http://www.reibang.com/' + url[0]
            item['author'] = author
            item['readNum'] = listtop[0]
            # Some articles have comments disabled.
            try:
                item['commentNum'] = listtop[1]
            except IndexError:
                item['commentNum'] = ''
            item['likeNum'] = likeNum
            yield item
        # Follow the "load more" button to the next page of results.
        next_link = selector.xpath('//*[@id="list-container"]/div/button/@data-url').extract()
        if len(next_link) == 1:
            next_link = self.url + str(next_link[0])
            print("----" + next_link)
            yield Request(next_link, callback=self.parse)
```
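The XPath logic above can be tried offline. The snippet below runs the same kind of queries with the standard library's xml.etree.ElementTree against a tiny hand-written fragment that mimics the list structure (the markup here is invented for illustration, not Jianshu's real HTML):

```python
import xml.etree.ElementTree as ET

# Invented, well-formed fragment mimicking the article list.
HTML = """
<body>
  <ul class="article-list thumbnails">
    <li><div><h4><a href="/p/111">First post</a></h4>
        <p><a>alice</a></p></div></li>
    <li><div><h4><a href="/p/222">Second post</a></h4>
        <p><a>bob</a></p></div></li>
  </ul>
</body>
"""

root = ET.fromstring(HTML)
rows = []
for li in root.findall('.//ul[@class="article-list thumbnails"]/li'):
    a = li.find('div/h4/a')                 # same relative path as the spider's XPath
    author = li.find('div/p/a').text
    rows.append({'title': a.text,
                 'url': 'http://www.reibang.com' + a.get('href'),
                 'author': author})

print(rows[0]['title'])  # → First post
```

Scrapy's Selector accepts richer XPath (functions like string(.), text() nodes), but the path structure carries over unchanged.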
- In items.py:
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class JianshuItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()
    author = Field()
    url = Field()
    readNum = Field()
    commentNum = Field()
    likeNum = Field()
```
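JianshuItem behaves like a dict whose keys are restricted to the declared Field()s; assigning an undeclared key raises KeyError. A rough stdlib analogy of that behaviour (this is only an analogy, not Scrapy's actual implementation):

```python
class MiniItem(dict):
    """Dict that only accepts keys declared in `fields` (analogy only)."""
    fields = ('title', 'author', 'url')

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError('%s is not a declared field' % key)
        dict.__setitem__(self, key, value)

item = MiniItem()
item['title'] = 'hello'        # declared field: accepted
try:
    item['bogus'] = 1          # undeclared field: rejected
except KeyError as e:
    print('rejected:', e)
```

Declaring the fields up front catches typos early and documents exactly what the spider is expected to produce.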
- In settings.py:
```python
# -*- coding: utf-8 -*-

# Scrapy settings for jianshu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'jianshu'

SPIDER_MODULES = ['jianshu.spiders']
NEWSPIDER_MODULE = 'jianshu.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'jianshu (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#    'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'jianshu.middlewares.MyCustomSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'jianshu.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'jianshu.pipelines.SomePipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

FEED_URI = u'/Users/apple/Documents/jianshu-monthly.csv'
FEED_FORMAT = 'csv'
```
- The run still has a problem: the saved CSV file is 0 bytes. Things worth checking: the crawl log for blocked requests (ROBOTSTXT_OBEY = True will skip URLs disallowed by robots.txt), and whether the XPath expressions still match the page, since Jianshu's markup changes over time.
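While debugging feed export, it can help to write the CSV yourself and rule the exporter out. A minimal stand-alone sketch with the standard csv module, using the field names from items.py (the wiring into a Scrapy pipeline is omitted):

```python
import csv
import io

# Column order matches the fields declared in items.py.
FIELDS = ['title', 'author', 'url', 'readNum', 'commentNum', 'likeNum']

def items_to_csv(items, fileobj):
    """Write scraped items (plain dicts) as CSV rows under a header line."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    for item in items:
        writer.writerow(item)

# Demo with one fake item written to an in-memory buffer.
buf = io.StringIO()
items_to_csv([{'title': 't', 'author': 'a', 'url': 'u',
               'readNum': '1', 'commentNum': '2', 'likeNum': '3'}], buf)
print(buf.getvalue().splitlines()[0])  # → title,author,url,readNum,commentNum,likeNum
```

If this produces rows but the feed export stays empty, the spider is yielding items and the problem is in the export configuration rather than the crawl.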