Scrapy
What is Scrapy
- Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a range of programs, including data mining, information processing, and storing historical data. It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (for example, Amazon Associates Web Services) or as a general-purpose web crawler.
What is a web crawler
- Also known as a web spider or web robot (and, in the FOAF community, often called a web wanderer), a web crawler is a program or script that automatically fetches information from the World Wide Web according to a set of rules; it is widely used across the Internet. Search engines use crawlers to fetch web pages, documents, and even images, audio, and video, then organize this information with indexing techniques and make it searchable for users. Crawlers also give small and mid-sized sites an effective channel for promotion, and optimizing sites for search-engine crawlers was popular for a time.
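The "fetch according to rules" idea can be sketched as a small breadth-first traversal. In this toy sketch the web is faked as an in-memory dict of page links (all page names are invented for illustration); a real crawler would fetch URLs instead:

```python
from collections import deque

# A toy "web": page -> links found on that page (invented for illustration).
FAKE_WEB = {
    'home': ['a', 'b'],
    'a': ['b', 'c'],
    'b': ['home'],
    'c': [],
}

def crawl(start):
    """Breadth-first crawl: a frontier queue plus a visited set,
    the two rules almost every crawler follows."""
    visited = set()
    frontier = deque([start])
    order = []
    while frontier:
        page = frontier.popleft()
        if page in visited:
            continue                      # never fetch the same page twice
        visited.add(page)
        order.append(page)                # "process" the page here
        frontier.extend(FAKE_WEB.get(page, []))
    return order

print(crawl('home'))  # → ['home', 'a', 'b', 'c']
```

Scrapy supplies exactly these mechanics (scheduling, deduplication, fetching) so a spider only has to describe what to extract and which links to follow.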
Installing Scrapy
- First, type python in a terminal to check whether Ubuntu ships with Python; if it does not, install it first.
- After python starts, the interactive interpreter prompt appears.
- At the prompt, type import lxml
- Then type import OpenSSL
- If neither import raises an error, the bundled Python already has these dependencies.
Then run the following commands in the terminal, in order:
- sudo apt-get install python-dev
- sudo apt-get install libevent-dev
- sudo apt-get install python-pip
- sudo pip install Scrapy
Then run scrapy; if the installation succeeded, Scrapy prints its command-line usage screen.
With Scrapy installed, we can try some simple scraping (crawling data from the Jianshu home page).
- Create a new project: scrapy startproject XXX (for example: scrapy startproject jianshu)
- The command generates the project files as a tree structure.
- The spiders folder is where the crawling logic lives (the code that actually fetches and parses data); it is the core of the crawler.
- Under the spiders folder, create your own spider to crawl the popular articles on the Jianshu home page.
- scrapy.cfg is the project's configuration file.
- settings.py sets request parameters, proxy usage, where scraped data is saved, and so on.
- items.py: the project's item file; it declares the fields to be scraped, similar to the keys of a dict.
- pipelines.py: the project's pipelines file; it defines how the data is processed after scraping.
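For reference, the skeleton that scrapy startproject jianshu generates looks like this (the standard Scrapy layout):

```
jianshu/
    scrapy.cfg            # project configuration file
    jianshu/              # the project's Python module
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spiders live here
            __init__.py
```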
A simple crawl of the Jianshu home page
- Create the file jianshuSpider.py under the spiders folder and add the following:
```python
# coding=utf-8
import scrapy
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request
from jianshu.items import JianshuItem
import urllib

class Jianshu(CrawlSpider):
    name = 'jianshu'
    start_urls = ['http://www.reibang.com/top/monthly']
    url = 'http://www.reibang.com'

    def parse(self, response):
        selector = Selector(response)
        articles = selector.xpath('//ul[@class="article-list thumbnails"]/li')
        for article in articles:
            item = JianshuItem()
            title = article.xpath('div/h4/a/text()').extract()
            url = article.xpath('div/h4/a/@href').extract()
            author = article.xpath('div/p/a/text()').extract()
            # Download the thumbnails of all popular articles;
            # note that some articles have no image.
            try:
                image = article.xpath("a/img/@src").extract()
                urllib.urlretrieve(image[0], '/Users/apple/Documents/images/%s-%s.jpg' % (author[0], title[0]))
            except (IndexError, IOError):
                print('--no---image--')
            listtop = article.xpath('div/div/a/text()').extract()
            likeNum = article.xpath('div/div/span/text()').extract()
            item['title'] = title
            item['url'] = 'http://www.reibang.com/' + url[0]
            item['author'] = author
            item['readNum'] = listtop[0]
            # Some articles have comments disabled.
            try:
                item['commentNum'] = listtop[1]
            except IndexError:
                item['commentNum'] = ''
            item['likeNum'] = likeNum
            yield item
        # Follow the "load more" button to the next page of results.
        next_link = selector.xpath('//*[@id="list-container"]/div/button/@data-url').extract()
        if len(next_link) == 1:
            next_link = self.url + str(next_link[0])
            print("----" + next_link)
            yield Request(next_link, callback=self.parse)
```
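The XPath logic above can be tried offline. The snippet below runs the same kind of queries with the standard library's xml.etree.ElementTree against a tiny hand-written fragment that mimics the list structure (the markup here is invented for illustration, not Jianshu's real HTML):

```python
import xml.etree.ElementTree as ET

# Invented, well-formed fragment mimicking the article list.
HTML = """
<body>
  <ul class="article-list thumbnails">
    <li><div><h4><a href="/p/111">First post</a></h4>
        <p><a>alice</a></p></div></li>
    <li><div><h4><a href="/p/222">Second post</a></h4>
        <p><a>bob</a></p></div></li>
  </ul>
</body>
"""

root = ET.fromstring(HTML)
rows = []
for li in root.findall('.//ul[@class="article-list thumbnails"]/li'):
    a = li.find('div/h4/a')                 # same relative path as the spider's XPath
    author = li.find('div/p/a').text
    rows.append({'title': a.text,
                 'url': 'http://www.reibang.com' + a.get('href'),
                 'author': author})

print(rows[0]['title'])  # → First post
```

Scrapy's Selector accepts richer XPath (functions like string(.), text() nodes), but the path structure carries over unchanged.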
- In items.py:
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class JianshuItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()
    author = Field()
    url = Field()
    readNum = Field()
    commentNum = Field()
    likeNum = Field()
```
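JianshuItem behaves like a dict whose keys are restricted to the declared Field()s; assigning an undeclared key raises KeyError. A rough stdlib analogy of that behaviour (this is only an analogy, not Scrapy's actual implementation):

```python
class MiniItem(dict):
    """Dict that only accepts keys declared in `fields` (analogy only)."""
    fields = ('title', 'author', 'url')

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError('%s is not a declared field' % key)
        dict.__setitem__(self, key, value)

item = MiniItem()
item['title'] = 'hello'        # declared field: accepted
try:
    item['bogus'] = 1          # undeclared field: rejected
except KeyError as e:
    print('rejected:', e)
```

Declaring the fields up front catches typos early and documents exactly what the spider is expected to produce.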
- In settings.py:
```python
# -*- coding: utf-8 -*-

# Scrapy settings for jianshu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'jianshu'

SPIDER_MODULES = ['jianshu.spiders']
NEWSPIDER_MODULE = 'jianshu.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'jianshu (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#    'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'jianshu.middlewares.MyCustomSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'jianshu.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'jianshu.pipelines.SomePipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

FEED_URI = u'/Users/apple/Documents/jianshu-monthly.csv'
FEED_FORMAT = 'csv'
```
- The run still has a problem: the saved CSV file is 0 bytes. Things worth checking: the crawl log for blocked requests (ROBOTSTXT_OBEY = True will skip URLs disallowed by robots.txt), and whether the XPath expressions still match the page, since Jianshu's markup changes over time.
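While debugging feed export, it can help to write the CSV yourself and rule the exporter out. A minimal stand-alone sketch with the standard csv module, using the field names from items.py (the wiring into a Scrapy pipeline is omitted):

```python
import csv
import io

# Column order matches the fields declared in items.py.
FIELDS = ['title', 'author', 'url', 'readNum', 'commentNum', 'likeNum']

def items_to_csv(items, fileobj):
    """Write scraped items (plain dicts) as CSV rows under a header line."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    for item in items:
        writer.writerow(item)

# Demo with one fake item written to an in-memory buffer.
buf = io.StringIO()
items_to_csv([{'title': 't', 'author': 'a', 'url': 'u',
               'readNum': '1', 'commentNum': '2', 'likeNum': '3'}], buf)
print(buf.getvalue().splitlines()[0])  # → title,author,url,readNum,commentNum,likeNum
```

If this produces rows but the feed export stays empty, the spider is yielding items and the problem is in the export configuration rather than the crawl.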