Recap of previous topics
- Requesting sites: urllib / requests / selenium / scrapy
- Parsing the source: lxml / bs4 / re / scrapy (XPath)
- Storage: MySQL, MongoDB
- Anti-scraping measures: WOFF fonts, User-Agent checks, IP bans, AJAX loading, cookies, Referer checks
- Countermeasures seen so far: Dianping (WOFF fonts), Maoyan, fake_useragent, IP proxy pools, JS analysis, captchas
The Scrapy framework
- Scrapy is built on Twisted (asynchronous networking)
1. Installing Scrapy
- pip install scrapy --- if it errors, install pywin32 and Twisted first:
- easy_install pywin32-221.win-amd64-py3.6.exe
- install Twisted: pip install Twisted-19.2.1-cp37-cp37m-win_amd64.whl
- then install Scrapy again: pip install scrapy
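A quick sanity check that the install worked (a minimal sketch; importing scrapy also pulls in Twisted, so this verifies both):

import scrapy
print(scrapy.__version__)  # any version string printing means the install is usable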
2. Creating a project
- scrapy startproject TestSpider --- create the project
- cd TestSpider
- scrapy genspider qidian www.qidian.com --- generate a spider file named qidian.py that targets www.qidian.com
- scrapy crawl qidian --- run the spider
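genspider fills the new file from Scrapy's default "basic" template; the generated qidian.py looks roughly like this:

import scrapy


class QidianSpider(scrapy.Spider):
    name = 'qidian'                        # the name used by `scrapy crawl qidian`
    allowed_domains = ['www.qidian.com']   # off-site requests are filtered out
    start_urls = ['http://www.qidian.com/']

    def parse(self, response):
        # default callback for responses to start_urls
        pass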
3. Crawling Douban movies
- Scrapy execution flow:
"""__author__= 雍新有"""
from scrapy import Selector, Spider, Request
class DouBanSpider(Spider):
# 爬蟲名
name = 'douban'
# 爬取地址, 爬蟲默認(rèn)從start_urls列表中取地址進行爬取,
# 寫一個parse不指定該解析哪個response,所有分開寫
# start_urls = ['正在上映url', '即將上映url', '即將上映全部電影url']
# 正在上映url
nowplaying_url = 'https://movie.douban.com/cinema/nowplaying/chengdu/'
# 即將上映url
later_url = 'https://movie.douban.com/cinema/later/chengdu/'
# 即將上映全部電影url
coming_url = 'https://movie.douban.com/coming'
def start_requests(self):
# 自定義發(fā)送的請求频蛔,請求地址的響應(yīng)通過callback參數(shù)來指定
yield Request(url=self.nowplaying_url,
callback=self.parse_nowplaying)
yield Request(url=self.coming_url,
callback=self.parse_coming)
yield Request(url=self.later_url,
callback=self.parse_later)
def parse_nowplaying(self, response):
sel = Selector(response)
# 拿到電影列表
nowplaying_movies = sel.xpath('//*[@id="nowplaying"]/div[2]/ul/li')
for movie in nowplaying_movies:
# 第一個a標(biāo)簽灵迫,電影鏈接
href = movie.xpath('./ul/li/a/@href').extract_first()
yield Request(url=href, callback=self.parse_detail)
def parse_coming(self, response):
sel = Selector(response)
# 即將上映的電影列表
coming_movies = sel.xpath('//*[@id="content"]/div/div[1]/table/tbody/tr')
for movie in coming_movies:
href = movie.xpath('./td[2]/a/@href').extract_first()
yield Request(url=href, callback=self.parse_detail)
def parse_later(self, response):
sel = Selector(response)
later_movies = sel.xpath('//*[@id="showing-soon"]/div')
for movie in later_movies:
href = movie.xpath('./a/@href').extract_first('')
yield Request(url=href, callback=self.parse_detail)
def parse_detail(self, response):
# 回調(diào)用于解析電影詳情內(nèi)容
sel = Selector(response)
# 電影名稱
name = sel.xpath('//*[@id="content"]/h1/span[1]/text()').extract_first()
# 上映時間
coming_time = sel.xpath('//*[@property="v:initialReleaseDate"]/text()').extract_first()
print(f'{name}上映時間為{coming_time}')
item = TestspiderItem()
item['name'] = name
item['coming_time'] = coming_time
# 這里的yield是把數(shù)據(jù)返回給通道然后存在數(shù)據(jù)庫
yield item
- Crawling multi-page sites: build the page number into the URL and yield one Request per page (two snippets from different spiders; a sketch of a matching parse callback follows below)

Snippet 1, Boss Zhipin (%-style formatting):

name = 'jobs'
boss_url = 'https://www.zhipin.com/c101270100/?query=python&page=%s&ka=page-%s'

def start_requests(self):
    for i in range(1, 6):
        print(self.boss_url % (i, i))
        yield Request(url=self.boss_url % (i, i),
                      callback=self.parse_boss)

Snippet 2, Guazi (str.format):

name = 'guazi'
guazi_urls = 'https://www.guazi.com/cd/buy/o{page}/#bread'

def start_requests(self):
    for i in range(1, 51):
        print(self.guazi_urls.format(page=i))
        yield Request(url=self.guazi_urls.format(page=i),
                      callback=self.parse_guazi)
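The parse_boss / parse_guazi callbacks are not shown in the notes; a minimal sketch of what one could look like (the XPath below is illustrative, not taken from the real page):

def parse_guazi(self, response):
    # hypothetical selectors -- guazi.com's real markup may differ
    for car in response.xpath('//ul[@class="carlist"]/li'):
        title = car.xpath('./a/@title').extract_first()
        price = car.xpath('.//div[@class="t-price"]/p/text()').extract_first()
        # plain dicts are also accepted as items by Scrapy
        yield {'title': title, 'price': price}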
3.1 settings.py parameters to change
- around line 19: set a real browser User-Agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
To switch IPs (use a proxy):
- edit settings.py
- around line 55, enable the middleware:
DOWNLOADER_MIDDLEWARES = {
    # 'TestSpider.middlewares.TestspiderDownloaderMiddleware': 543,
    'TestSpider.middlewares.ProxyMiddleware': 543,
}
- add to middlewares.py:
class ProxyMiddleware():
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://213.178.38.246:51967'
        # or fetch one from a local proxy pool, e.g.:
        # res = requests.get('http://127.0.0.1:500/get')
        # request.meta['proxy'] = 'http://' + res.text
        # returning None lets the request continue through the middleware chain
        return None
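A slightly fuller version of the pool idea (a sketch, assuming a local proxy-pool service whose /get endpoint returns a bare "ip:port" string -- both the service and its response format are assumptions here):

import requests


class ProxyPoolMiddleware:
    POOL_URL = 'http://127.0.0.1:500/get'  # assumed proxy-pool endpoint

    def process_request(self, request, spider):
        try:
            # assumed response body: a plain 'ip:port' string
            proxy = requests.get(self.POOL_URL, timeout=3).text.strip()
            request.meta['proxy'] = 'http://' + proxy
        except requests.RequestException:
            spider.logger.warning('proxy pool unreachable, sending request directly')
        return None  # continue processing the (possibly proxied) request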
4. Defining the item and exporting data as JSON
- an item is the equivalent of a model
in items.py:
import scrapy


class TestspiderItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    coming_time = scrapy.Field()
- add at the end of the douban spider file (inside parse_detail, as shown above):
item = TestspiderItem()
item['name'] = name
item['coming_time'] = coming_time
# yielding the item sends it down the pipeline for storage
yield item
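To actually persist the items, an item pipeline is needed; a minimal sketch that writes JSON lines to a local file in place of the database step (the class name matches the generated pipelines.py, but the filename movies.jl is an assumption):

# in TestSpider/pipelines.py
import json


class TestspiderPipeline:
    def open_spider(self, spider):
        self.f = open('movies.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        self.f.close()

# enable it in settings.py:
# ITEM_PIPELINES = {'TestSpider.pipelines.TestspiderPipeline': 300}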
后臺運行 -- 導(dǎo)入json數(shù)據(jù)
- cd TestSpider
- scrapy crawl douban -o douban.json
- settings最后添加下列代碼 - 設(shè)置存儲中文到j(luò)son文件中的格式
FEED_EXPORT_ENCODING='utf-8'