Scrapy
Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and storing historical data.
Installing Scrapy
Scrapy supports Python 2.7 and Python 3.4. You can install it with pip:
pip install scrapy
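If the installation succeeded, the scrapy command is available on your PATH and you can verify it by printing the installed version:
scrapy version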
Creating a Scrapy Project
Creating a Scrapy project is straightforward. First, change into the directory where you want the project to live, then run the following in a terminal:
scrapy startproject your-project-name
This creates the following directories and files:
your-project-name/
    scrapy.cfg                # configuration file
    your-project-name/        # the project's Python module
        __init__.py
        items.py              # item (model) definitions
        middlewares.py        # middlewares
        pipelines.py          # item pipelines
        settings.py           # project settings
        spiders/              # directory where spider files are placed
            __init__.py
Next, go into the project directory you just created and run the following command to generate a spider file:
scrapy genspider xxx xxx.xxx.com   # first argument: the spider's name; second argument: the domain to crawl
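The generated file lands in the spiders/ directory and looks roughly like the skeleton below (the exact boilerplate varies a little between Scrapy versions); the two arguments become the spider's name and its allowed_domains/start_urls:
import scrapy

class XxxSpider(scrapy.Spider):
    name = 'xxx'                        # first genspider argument
    allowed_domains = ['xxx.xxx.com']   # second genspider argument
    start_urls = ['http://xxx.xxx.com/']

    def parse(self, response):
        pass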
With that, the basic project skeleton is in place.
Scrapy Architecture
Scrapy's architecture is built from a handful of components; the data flow between them during a crawl is as follows:
1. The Engine gets the initial crawl requests from the Spider;
2. The Engine puts the request into the Scheduler's queue and asks for the next request to crawl;
3. The Scheduler takes the next request from its queue and returns it to the Engine;
4. The Engine sends the request to the Downloader component, passing through the Downloader middlewares;
5. Once the page has finished downloading, the Downloader generates a response and sends it back to the Engine, again through the Downloader middlewares (a minimal middleware sketch follows this list);
6. The Engine receives the response from the Downloader and sends it to the Spider through the Spider middlewares;
7. The Spider processes the response, returns items, and then sends the next crawl requests to the Engine through the Spider middlewares;
8. The Engine sends the processed items to the Item Pipelines and asks the Scheduler for the next request;
9. The process repeats from step 1 until there are no more requests left in the Scheduler.
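The Downloader middlewares mentioned in steps 4 and 5 are hook points between the Engine and the Downloader. As a minimal sketch (not part of the demo below; the class name here is made up), a middleware placed in middlewares.py and enabled under DOWNLOADER_MIDDLEWARES in settings.py could log every request and response passing through:
class LoggingDownloaderMiddleware:
    """A hypothetical middleware that only logs the traffic passing through it."""

    def process_request(self, request, spider):
        # called for each request on its way to the Downloader (step 4)
        spider.logger.debug('Requesting %s', request.url)
        return None  # None means: keep processing this request normally

    def process_response(self, request, response, spider):
        # called for each response coming back from the Downloader (step 5)
        spider.logger.debug('Got %s for %s', response.status, request.url)
        return response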
A Scrapy Example
Using Anjuke's second-hand housing listings as an example, let's build a crawler demo.
1. Create a Scrapy project.
scrapy startproject Anjuke
2. Go into the project directory and generate the spider file.
scrapy genspider anjuke hangzhou.anjuke.com
3. Create the initial request. Open anjuke.py and add a headers attribute and a start_requests method to the AnjukeSpider class that genspider generated:
import scrapy

class AnjukeSpider(scrapy.Spider):
    name = 'anjuke'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
    }

    def start_requests(self):
        url = 'https://hangzhou.anjuke.com/sale/'
        yield scrapy.Request(url=url, headers=self.headers)
4. Define the item. Open items.py and declare the fields to extract on the AnjukeItem class (an item is roughly the model of the data).
import scrapy

class AnjukeItem(scrapy.Item):
    name = scrapy.Field()        # community name
    address = scrapy.Field()     # location
    totalPrice = scrapy.Field()  # total price
    unitPrice = scrapy.Field()   # price per square metre
    size = scrapy.Field()        # floor area
    floor = scrapy.Field()       # floor
    roomNum = scrapy.Field()     # number of rooms
    buildTime = scrapy.Field()   # year built
    publisher = scrapy.Field()   # listing publisher
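Note that the parse() method in the next step fills a plain Python dict for simplicity; both a dict and an AnjukeItem work with the pipeline in step 6. If you do instantiate the item class, it behaves like a dict restricted to the declared fields, which is what makes it model-like. A quick illustration:
item = AnjukeItem()
item['name'] = 'some community'   # OK: 'name' is a declared field
item['typo'] = 'oops'             # raises KeyError: the field is not declared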
5. Parse the item. Back in anjuke.py, parse the returned response and assign the extracted values to the item. XPath is a language for addressing parts of an XML document; you can use "Copy XPath" in Chrome's developer tools to quickly get the path to the information you need. Here the first 50 pages are parsed.
    # anjuke.py: these members go inside the AnjukeSpider class
    page = 1

    def parse(self, response):
        houseList = response.xpath('//*[@id="houselist-mod-new"]/li')
        for div in houseList:
            item = {}
            houseDetails = div.xpath('./div[@class="house-details"]')
            # the community name and its location share one title attribute
            commAddress = houseDetails.xpath('./div[3]/span/@title').extract_first()
            item['name'] = commAddress.split()[0]
            item['address'] = commAddress.split()[1]
            item['roomNum'] = houseDetails.xpath('./div[2]/span[1]/text()').extract_first()
            item['size'] = houseDetails.xpath('./div[2]/span[2]/text()').extract_first()
            item['floor'] = houseDetails.xpath('./div[2]/span[3]/text()').extract_first()
            item['buildTime'] = houseDetails.xpath('./div[2]/span[4]/text()').extract_first()
            item['publisher'] = houseDetails.xpath('./div[2]/span[5]/text()').extract_first()
            proPrice = div.xpath('./div[@class="pro-price"]')
            item['totalPrice'] = proPrice.xpath('./span[@class="price-det"]/strong/text()').extract_first() + \
                proPrice.xpath('./span[@class="price-det"]/text()').extract_first()
            item['unitPrice'] = proPrice.xpath('./span[@class="unit-price"]/text()').extract_first()
            yield item
        # follow the next-page link for the first 50 pages
        if self.page <= 50:
            self.page += 1
            next_url = response.xpath('//div[@class="multi-page"]/a[last()]/@href').extract_first()
            yield scrapy.Request(url=next_url, callback=self.parse, dont_filter=True)
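Before wiring XPath expressions into parse(), it is convenient to test them interactively with Scrapy's shell. Run it from inside the project so the settings from step 7 (User-Agent and so on) are picked up; the exact output naturally depends on the live page:
scrapy shell 'https://hangzhou.anjuke.com/sale/'
>>> houseList = response.xpath('//*[@id="houselist-mod-new"]/li')
>>> len(houseList)   # number of listings found on the page
>>> houseList[0].xpath('./div[@class="house-details"]/div[3]/span/@title').extract_first()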
6. Store the item. MySQL is used here to store the scraped data. Open pipelines.py. First, in open_spider, connect to the database, drop the old table, and create a new houseinfo table. Then, in process_item, clean up the item, insert the data into the table, and finally return the item.
import pymysql

class AnjukePipeline(object):

    def open_spider(self, spider):
        # connect to MySQL and recreate the houseinfo table
        self.conn = pymysql.connect(
            host="127.0.0.1",
            user="root",
            password="",
            database="anjuke",
            charset='utf8',
            cursorclass=pymysql.cursors.DictCursor)
        self.cursor = self.conn.cursor()
        self.cursor.execute("drop table if exists houseinfo")
        createsql = """create table houseinfo(
                           name VARCHAR(32) NOT NULL,
                           address VARCHAR(32) NOT NULL,
                           totalPrice VARCHAR(32) NOT NULL,
                           unitPrice VARCHAR(32) NOT NULL,
                           size VARCHAR(32) NOT NULL,
                           floor VARCHAR(32) NOT NULL,
                           roomNum VARCHAR(32) NOT NULL,
                           buildTime VARCHAR(32) NOT NULL,
                           publisher VARCHAR(32) NOT NULL,
                           area VARCHAR(8) NOT NULL)"""
        self.cursor.execute(createsql)

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

    def process_item(self, item, spider):
        # strip the units before storing the values
        item['totalPrice'] = item['totalPrice'].split('萬')[0]
        item['unitPrice'] = item['unitPrice'].split('元')[0]
        item['buildTime'] = item['buildTime'].split('年')[0]
        item['size'] = item['size'].split('m')[0]
        area = item['address'].split('-')[0]
        insertsql = """insert into houseinfo(name, address, totalPrice, unitPrice, size,
                                             floor, roomNum, buildTime, publisher, area)
                       values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"""
        try:
            # let pymysql quote the values instead of formatting the SQL string by hand
            self.cursor.execute(insertsql, (item['name'], item['address'], item['totalPrice'],
                                            item['unitPrice'], item['size'], item['floor'],
                                            item['roomNum'], item['buildTime'],
                                            item['publisher'], area))
            self.conn.commit()
        except Exception as e:
            self.conn.rollback()
            print(e)
        return item
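The pipeline connects to a database named anjuke but only creates the table, so the database itself has to exist beforehand. A one-off snippet along these lines (a sketch; adjust host and credentials to your MySQL setup) creates it:
import pymysql

# connect without selecting a database, then create it if it does not exist yet
conn = pymysql.connect(host="127.0.0.1", user="root", password="", charset="utf8")
conn.cursor().execute("CREATE DATABASE IF NOT EXISTS anjuke DEFAULT CHARACTER SET utf8")
conn.close()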
7. Finally, configure the crawl policy and settings. Open settings.py. Here USER_AGENT, ROBOTSTXT_OBEY, DOWNLOAD_DELAY, and a few other options are set as a first, basic line of defence against anti-scraping measures.
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'Anjuke.pipelines.AnjukePipeline': 300,
}
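With the pipeline registered in ITEM_PIPELINES, the crawler can be started from the project root. The spider name is the one passed to genspider in step 2:
scrapy crawl anjuke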
這樣葵蒂,一個簡單的爬蟲程序就完成了交播。詳細的代碼可以查看https://github.com/bigjar/Anjuke.git。