The source code is based on a Python 3 distributed Taobao crawler built with Scrapy. I made some modifications: broken paths were updated and some content was added. It uses a random User-Agent, scrapy-redis for distributed crawling, and a MySQL database to store the data.
Contents
Step 1: Create and configure the Scrapy project
Step 2: Export the data to a JSON file and to the MySQL database
Step 3: Set a random User-Agent request header
Step 4: Configure scrapy-redis for distributed crawling
Data analysis: Taobao foundation-makeup market data analysis, July 2018
Development environment
- OS: macOS High Sierra
- Python third-party libraries: scrapy, pymysql, scrapy-redis, redis, redis-py
- Python: Anaconda 4.5.8, with bundled Python 3.6.4
- Databases: MySQL 8.0.11, Redis 4.0.1
Step 1: Create the Scrapy project
In the terminal, run:
scrapy startproject taobao
cd taobao
scrapy genspider -t basic tb taobao.com
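This creates the project skeleton. Assuming Scrapy's default template, the generated layout looks roughly like this (tb.py is the spider created by genspider; the other files are edited in the steps below):
taobao/
    scrapy.cfg
    taobao/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            tb.py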
1. Writing the spider: tb.py
- Added crawling of sales volume and product-description fields on top of the original source code;
- Updated the way URLs are classified;
- The format of the comment-count page obtained by packet capture has changed, so the regular expression was updated (see the small demo after the spider code below).
# -*- coding: utf-8 -*-
import scrapy
import re
from scrapy.http import Request
from taobao.items import TaobaoItem
import urllib.request

class TbSpider(scrapy.Spider):
    name = 'tb'
    allowed_domains = ['taobao.com']
    start_urls = ['http://taobao.com/']

    def parse(self, response):
        key = input("請輸入你要爬取的關鍵詞\t")
        pages = input("請輸入你要爬取的頁數\t")
        print("\n")
        print("當前爬取的關鍵詞是", key)
        print("\n")
        for i in range(0, int(pages)):
            # each search results page holds 44 items, so the offset is 44 * page index
            url = "https://s.taobao.com/search?q=" + str(key) + "&s=" + str(44 * i)
            yield Request(url=url, callback=self.page)

    # search results page
    def page(self, response):
        body = response.body.decode('utf-8', 'ignore')
        pat_id = '"nid":"(.*?)"'                # item id
        pat_now_price = '"view_price":"(.*?)"'  # current price
        pat_address = '"item_loc":"(.*?)"'      # seller location
        pat_sale = '"view_sales":"(.*?)人付款"'  # sales volume
        all_id = re.compile(pat_id).findall(body)
        all_now_price = re.compile(pat_now_price).findall(body)
        all_address = re.compile(pat_address).findall(body)
        all_sale = re.compile(pat_sale).findall(body)
        for i in range(0, len(all_id)):
            this_id = all_id[i]
            now_price = all_now_price[i]
            address = all_address[i]
            sale_count = all_sale[i]
            url = "https://item.taobao.com/item.htm?id=" + str(this_id)
            yield Request(url=url, callback=self.next,
                          meta={'now_price': now_price, 'address': address, 'sale_count': sale_count})

    # item detail page
    def next(self, response):
        item = TaobaoItem()
        url = response.url
        # Taobao and Tmall load some information through different Ajax calls, so branch on the URL
        if 'tmall' in url:  # Tmall, Tmall Supermarket, Tmall Global
            title = response.xpath("//html/head/title/text()").extract()  # product title
            # price = response.xpath("//span[@class='tm-count']/text()").extract()
            # Original price: the XPath validates in XPath Finder but keeps returning an empty
            # value at runtime (reason unknown for now); it is commented out so that it does not
            # break the database insert later.
            # The fields below come from the product description area, matched by their text labels:
            brand = response.xpath("//li[@id='J_attrBrandName']/text()").re('品牌:\xa0(.*?)$')     # brand
            produce = response.xpath("//li[contains(text(),'產地')]/text()").re('產地:\xa0(.*?)$')  # place of origin
            effect = response.xpath("//li[contains(text(),'功效')]/text()").re('功效:\xa0(.*?)$')   # effect
            pat_id = 'id=(.*?)&'
            this_id = re.compile(pat_id).findall(url)[0]
        else:  # Taobao
            title = response.xpath("/html/head/title/text()").extract()  # product title
            # price = response.xpath("//em[@class='tb-rmb-num']/text()").extract()
            # Original price: commented out for the same reason as above
            brand = response.xpath("//li[contains(text(),'品牌')]/text()").re('品牌:\xa0(.*?)$')    # brand
            produce = response.xpath("//li[contains(text(),'產地')]/text()").re('產地:\xa0(.*?)$')  # place of origin
            effect = response.xpath("//li[contains(text(),'功效')]/text()").re('功效:\xa0(.*?)$')   # effect
            pat_id = 'id=(.*?)$'
            this_id = re.compile(pat_id).findall(url)[0]
        # fetch the total comment count
        comment_url = "https://rate.taobao.com/detailCount.do?callback=jsonp144&itemId=" + str(this_id)
        comment_data = urllib.request.urlopen(comment_url).read().decode('utf-8', 'ignore')
        each_comment = '"count":(.*?)}'
        comment = re.compile(each_comment).findall(comment_data)
        item['title'] = title
        item['link'] = url
        # item['price'] = price
        item['now_price'] = response.meta['now_price']
        item['comment'] = comment
        item['address'] = response.meta['address']
        item['sale_count'] = response.meta['sale_count']
        item['brand'] = brand
        item['produce'] = produce
        item['effect'] = effect
        yield item
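To make the comment-count extraction above concrete, here is a small, self-contained demo of the regular expression. The response body shown is only an assumed example of the jsonp format returned by detailCount.do; the real payload may contain additional fields.
import re

# Hypothetical response body from detailCount.do (format assumed for illustration only)
comment_data = 'jsonp144({"count":3579})'
each_comment = '"count":(.*?)}'
comment = re.compile(each_comment).findall(comment_data)
print(comment)  # ['3579'] -- a list of strings, which is why item['comment'] is indexed with [0] in the pipeline later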
2. Configuring settings.py
Set the user agent, disable obeying robots.txt, and disable cookies.
# -*- coding: utf-8 -*-
# Scrapy settings for taobao project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'taobao'
SPIDER_MODULES = ['taobao.spiders']
NEWSPIDER_MODULE = 'taobao.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0' # set the user-agent value
# Obey robots.txt rules
ROBOTSTXT_OBEY = False # do not obey robots.txt
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 100
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY = 0.25 # set a download delay
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
COOKIES_ENABLED = False # disable cookies
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'taobao.middlewares.TaobaoSpiderSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'taobao.middlewares.MyCustomDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'taobao.pipelines.TaobaoJsonPipeline': 300,  # export to a JSON file
    'taobao.pipelines.TaobaoPipeline': 200,      # export to MySQL (lower numbers run first)
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
3. Add the item container in items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy

class TaobaoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    # price = scrapy.Field()
    comment = scrapy.Field()
    now_price = scrapy.Field()
    address = scrapy.Field()
    sale_count = scrapy.Field()
    brand = scrapy.Field()
    produce = scrapy.Field()
    effect = scrapy.Field()
Step 2: Export the data to a JSON file and to MySQL
1. Export the data to JSON
Write the following in pipelines.py and enable it in settings.py (see settings.py above):
# -*- coding: utf-8 -*-
import json
import codecs

class TaobaoJsonPipeline:
    def __init__(self):
        self.file = codecs.open('taobao.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        lines = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(lines)
        return item

    def close_spider(self, spider):
        self.file.close()
Run the crawler by entering the following in the terminal:
scrapy crawl tb --nolog
The exported file is saved automatically in the crawler's directory:
2. Export the data to MySQL
1) First, download and install the MySQL database
Download link: the dmg package installs with one click. (During installation you are asked to set a password for the root user; choose the legacy password encryption, otherwise later connections will keep failing.)
After setup, start the database:
For a GUI, install MySQL Workbench.
Connect to the database in Workbench, create a new database, then create a table and define its fields:
2) Install the pymysql package for Python
In the terminal: conda install pymysql
or simply pip install pymysql
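For reference, the table the pipeline below writes to can also be created from Python with pymysql instead of Workbench. This is only a minimal sketch: the table name taobaokh and the column names come from the INSERT statement used later, but the column types and the database name/password are assumptions you should adapt.
import pymysql

# Assumed connection parameters -- replace db/user/passwd with your own
conn = pymysql.connect(host='127.0.0.1', user='root', passwd='數據庫密碼', db='數據庫名稱', charset='utf8')
with conn.cursor() as cursor:
    # Column types are illustrative guesses; the pipeline inserts everything as strings
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS taobaokh (
            title     VARCHAR(255),
            link      VARCHAR(255),
            comment   VARCHAR(64),
            now_price VARCHAR(64),
            address   VARCHAR(64),
            sale      VARCHAR(64),
            brand     VARCHAR(64),
            produce   VARCHAR(64),
            effect    VARCHAR(64)
        ) DEFAULT CHARSET=utf8
    """)
conn.commit()
conn.close()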
3) Configure pipelines.py
The database writes here are asynchronous, so that inserting rows does not fall behind the speed of crawling and parsing pages and cause blocking. Python's Twisted framework supports asynchronous operation and provides a connection pool (adbapi), through which the MySQL inserts can be made asynchronous. For a detailed tutorial see "Scrapy 入門筆記(4) --- 使用 Pipeline 保存數據".
Add the following code to pipelines.py and enable the corresponding pipeline in settings.py (see settings.py above):
# -*- coding: utf-8 -*-
import pymysql
import pymysql.cursors
from twisted.enterprise import adbapi

class TaobaoPipeline(object):
    # connect to the database
    def __init__(self):
        dbparms = dict(
            host='127.0.0.1',
            db='數據庫名稱',      # your database name
            user='root',
            passwd='數據庫密碼',  # your database password
            charset='utf8',
            cursorclass=pymysql.cursors.DictCursor,
            use_unicode=True,
        )
        # specify the driver module name and the connection parameters
        self.dbpool = adbapi.ConnectionPool("pymysql", **dbparms)

    # use twisted to make the MySQL insert asynchronous
    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)  # handle exceptions
        return item  # return the item so the next pipeline (JSON export) still receives it

    # handle exceptions from the asynchronous insert
    def handle_error(self, failure, item, spider):
        print(failure)

    # perform the actual insert
    def do_insert(self, cursor, item):
        # read the fields from the item
        title = item['title'][0]
        link = item['link']
        # price = item['price'][0]
        comment = item['comment'][0]
        now_price = item['now_price']
        address = item['address']
        sale = item['sale_count']
        brand = item['brand'][0]
        produce = item['produce'][0]
        effect = item['effect'][0]
        print('商品標題\t', title)
        print('商品鏈接\t', link)
        # print('商品原價\t', price)
        print('商品現價\t', now_price)
        print('商家地址\t', address)
        print('評論數量\t', comment)
        print('銷量\t', sale)
        print('品牌\t', brand)
        print('產地\t', produce)
        print('功效\t', effect)
        try:
            sql = "insert into taobaokh(title,link,comment,now_price,address,sale,brand,produce,effect) values(%s,%s,%s,%s,%s,%s,%s,%s,%s)"
            values = (title, link, comment, now_price, address, sale, brand, produce, effect)
            cursor.execute(sql, values)
            print('導入成功')
            print('------------------------------\n')
            return item
        except Exception as err:
            pass
Run the crawler:
scrapy crawl tb --nolog
At this point, the crawler can basically run normally.
Step 3: Set a random User-Agent
The goal is to switch to a different user-agent for every request, which disguises the crawler as a browser more convincingly.
1. The user-agent list from the original source was updated (desktop browsers); add it to the end of settings.py.
USER_AGENT_LIST = [ "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/603.2.4 (KHTML, like Gecko) Version/10.1.1 Safari/603.2.4",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0",
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0",
"Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:54.0) Gecko/20100101 Firefox/54.0",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/603.2.5 (KHTML, like Gecko) Version/10.1.1 Safari/603.2.5",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.104 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.15063",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.3; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0",
"Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36",
"Mozilla/5.0 (iPad; CPU OS 10_3_2 like Mac OS X) AppleWebKit/603.2.4 (KHTML, like Gecko) Version/10.0 Mobile/14F89 Safari/602.1",
"Mozilla/5.0 (Windows NT 6.1; rv:54.0) Gecko/20100101 Firefox/54.0",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:54.0) Gecko/20100101 Firefox/54.0",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.104 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
"Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.1 Safari/603.1.30",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
"Mozilla/5.0 (Windows NT 5.1; rv:52.0) Gecko/20100101 Firefox/52.0",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/58.0.3029.110 Chrome/58.0.3029.110 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/603.2.5 (KHTML, like Gecko) Version/10.1.1 Safari/603.2.5",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.104 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.104 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:53.0) Gecko/20100101 Firefox/53.0",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36 OPR/46.0.2597.32",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/59.0.3071.109 Chrome/59.0.3071.109 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) Gecko/20100101 Firefox/53.0",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 OPR/45.0.2552.898",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36 OPR/46.0.2597.39",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:54.0) Gecko/20100101 Firefox/54.0",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/601.7.7 (KHTML, like Gecko) Version/9.1.2 Safari/601.7.7",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/602.4.8 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.8",
"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko",
"Mozilla/5.0 (Windows NT 6.1; rv:52.0) Gecko/20100101 Firefox/52.0",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36",
]
DOWNLOADER_MIDDLEWARES = {
    'taobao.middlewares.ProcessHeaderMidware': 543,
}
There is also a dedicated user-agent plugin on GitHub that can be used directly (link).
2. Add the following code to middlewares.py:
# encoding: utf-8
from scrapy.utils.project import get_project_settings
import random

settings = get_project_settings()

class ProcessHeaderMidware():
    """process request add request info"""

    def process_request(self, request, spider):
        """
        Pick a user-agent at random from the list and set it on the request.
        """
        ua = random.choice(settings.get('USER_AGENT_LIST'))
        spider.logger.info(msg='now entering download midware')
        if ua:
            request.headers['User-Agent'] = ua
            # Add desired logging message here.
            spider.logger.info(u'User-Agent is : {} {}'.format(request.headers.get('User-Agent'), request))
Setup complete.
Step 4: Use Scrapy-redis to implement a distributed crawler
To further improve efficiency and resistance to anti-crawling measures, we turn to multiple processes and distributed crawling.
Another benefit of scrapy-redis is that it supports resuming an interrupted crawl: when Scrapy got stuck mid-crawl, I simply opened a new terminal, ran the crawl command again, and it carried on.
1. Setting up the Scrapy-redis environment:
Three packages need to be installed: redis, scrapy-redis, and redis-py.
1) redis
Install with conda install redis (or pip install redis).
2) scrapy-redis
Anaconda has no scrapy-redis package, so download the third-party zip package (download link). Installation: run the following in the terminal, in order:
cd /Users/用戶名/Downloads
unzip scrapy-redis-master.zip -d /Users/用戶名/Downloads/ # extract to the given path
cd scrapy-redis-master
python setup.py install # install the package
password:***** # enter your password
If you are not using Anaconda, pip install scrapy-redis in the terminal should also work.
3) redis-py
After installing Redis, the program kept failing with "ImportError: No module named redis". It turns out Python does not support Redis out of the box; redis-py must be installed before Redis can be called from Python. Download link.
Install it the same way as above.
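A quick way to verify that redis-py is installed and can reach the local server (a minimal sketch, assuming Redis is already running on the default port 6379):
import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379, db=0)
print(r.ping())  # prints True if the connection works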
2. Modify the Scrapy project files
1) Add the following to settings.py
SCHEDULER = "scrapy_redis.scheduler.Scheduler" # store the request queue with the Redis scheduler
SCHEDULER_PERSIST = True # do not clear the Redis queues, so the crawl can be paused/resumed
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter" # deduplicate all spiders through Redis
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
REDIS_HOST = '127.0.0.1' # can also be changed to localhost, depending on your setup
REDIS_PORT = 6379
REDIS_URL = None
2) Add the following to items.py
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst, Join

class TaobaoSpiderLoader(ItemLoader):
    default_item_class = TaobaoItem
    default_input_processor = MapCompose(lambda s: s.strip())
    default_output_processor = TakeFirst()
    description_out = Join()
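The loader above is not actually wired into tb.py in this article. If you want to use it, a minimal, hypothetical sketch inside a parse callback could look like this (the XPath is only an example, not part of the original spider):
def next(self, response):
    # Hypothetical use of the ItemLoader instead of filling TaobaoItem by hand
    loader = TaobaoSpiderLoader(response=response)
    loader.add_value('link', response.url)
    loader.add_xpath('title', '/html/head/title/text()')  # example XPath only
    yield loader.load_item()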
3) Modify tb.py
Import the relevant package:
from scrapy_redis.spiders import RedisSpider
Change the TbSpider class:
class TbSpider(RedisSpider):
    name = 'tb'
    # allowed_domains = ['taobao.com']
    # start_urls = ['http://taobao.com/']
    redis_key = 'Taobao:start_urls'
Configuration complete!
3. Run the distributed crawler
1) Open a terminal and start the Redis server with redis-server:
localhost:~ $ redis-server
3708:C 20 Jul 22:42:41.914 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
3708:C 20 Jul 22:42:41.915 # Redis version=4.0.10, bits=64, commit=00000000, modified=0, pid=3708, just started
3708:C 20 Jul 22:42:41.915 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
3708:M 20 Jul 22:42:41.916 * Increased maximum number of open files to 10032 (it was originally set to 256).
[Redis ASCII-art startup banner: Redis 4.0.10 (00000000/0) 64 bit, running in standalone mode, Port: 6379, PID: 3708, http://redis.io]
3708:M 20 Jul 22:42:41.920 # Server initialized
3708:M 20 Jul 22:42:41.920 * DB loaded from disk: 0.000 seconds
3708:M 20 Jul 22:42:41.920 * Ready to accept connections
Seeing this screen confirms that the server has started; close the window.
2) Open a new terminal and run the crawler:
scrapy crawl tb --nolog
The crawler now sits waiting; the start URL still needs to be set.
3) Open yet another terminal and enter:
redis-cli
127.0.0.1:6379>LPUSH Taobao:start_urls http://taobao.com
(integer) 1
A reply of (integer) 1 means the start URL was set successfully. (The Taobao:start_urls in the command corresponds to the setting redis_key = 'Taobao:start_urls' in tb.py. A small redis-py alternative is sketched below.)
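If you prefer to push the start URL from Python instead of redis-cli, a minimal sketch with redis-py (same key and URL as above) would be:
import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379)
# Push the start URL onto the list that RedisSpider reads via redis_key
r.lpush('Taobao:start_urls', 'http://taobao.com')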
4) The crawler now starts running. Unlike Windows, macOS does not pop up multiple terminal windows; everything runs in a single terminal, but the speed is noticeably faster.
5) To stop the crawler mid-run, press Ctrl+C.
If you then run scrapy crawl tb --nolog again, the crawl resumes from where it stopped, because SCHEDULER_PERSIST = True is set in settings.py.
To turn this behaviour off, change True to False.
6) After the crawl finishes, clear the Redis cache:
127.0.0.1:6379>flushdb
ok
Done!
Summary:
Using Python 3.6 and Scrapy, we built a crawler for Taobao products, made it distributed with scrapy-redis, and stored the data in MySQL.
Open issues
- The original price of products on Tmall links consistently fails to scrape: the XPath validates in XPath Finder but always returns an empty value at runtime. The page probably loads it asynchronously; to be investigated.
- While crawling Tmall links, many URLs are redirected (301, 302) and no data can be fetched; this is presumably an anti-crawling measure such as a forced redirect to a login page.
(Disclaimer: this article is for learning and exchange only and must not be used for any other purpose.)