Day07 Review
selenium+phantomjs/chrome/firefox
- Features
1. Simple: no need to capture and analyze network packets in detail; it drives a real browser
2. Must wait for page elements to load, which takes time, so it is less efficient
- Basic workflow
from selenium import webdriver
# create the browser object
browser = webdriver.Firefox()
# get() waits for the page to load fully before the next statement runs
browser.get('https://www.jd.com/')
# locate a node (XPath left as a placeholder)
node = browser.find_element_by_xpath('')
node.send_keys('')
node.click()
# get the node's text content
content = node.text
# close the browser
browser.quit()
- Headless mode (chromedriver | firefox)
options = webdriver.ChromeOptions()
options.add_argument('--headless')
browser = webdriver.Chrome(options=options)
browser.get(url)
- Executing a JS script via the browser object
browser.execute_script(
'window.scrollTo(0,document.body.scrollHeight)'
)
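A common pattern built on execute_script (a sketch, not from the original notes): keep scrolling until the page height stops growing, which exhausts lazily-loaded pages. Assumes the browser object from above.
import time

# scroll until document.body.scrollHeight stops changing
last_height = browser.execute_script('return document.body.scrollHeight')
while True:
    browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
    time.sleep(1)  # crude fixed wait; an explicit wait would be more robust
    new_height = browser.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height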
- Common selenium operations
# 1. Keyboard actions
from selenium.webdriver.common.keys import Keys
node.send_keys(Keys.SPACE)
node.send_keys(Keys.CONTROL, 'a')
node.send_keys(Keys.CONTROL, 'c')
node.send_keys(Keys.CONTROL, 'v')
node.send_keys(Keys.ENTER)
# 2. Mouse actions
from selenium.webdriver import ActionChains
mouse_action = ActionChains(browser)
mouse_action.move_to_element(node)
mouse_action.perform()
# 3. Switching window handles (see the sketch after this list)
all_handles = browser.window_handles
browser.switch_to.window(all_handles[1])
# 4. iframe sub-frames (the method is switch_to.frame, not switch_to.iframe)
browser.switch_to.frame(iframe_element)
# 5. Web client authentication
url = 'http://username:password@host'
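A minimal runnable sketch of handle switching (item 3 above); the URLs are only examples:
from selenium import webdriver

browser = webdriver.Firefox()
browser.get('https://www.baidu.com/')
# opening a second tab via JS creates a new window handle
browser.execute_script("window.open('https://www.jd.com/')")
all_handles = browser.window_handles       # typically [first tab, new tab]
browser.switch_to.window(all_handles[-1])  # switch to the new tab
print(browser.title)
browser.switch_to.window(all_handles[0])   # switch back to the first tab
browser.quit()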
Using the execjs module
# 1. Install
sudo pip3 install pyexecjs
# 2. Usage
import execjs
with open('file.js', 'r') as f:
    js = f.read()
obj = execjs.compile(js)
result = obj.eval('string')
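A self-contained example, assuming pyexecjs and a JS runtime (e.g. Node) are installed:
import execjs

# compile an inline JS snippet instead of reading it from a file
ctx = execjs.compile('function add(a, b) { return a + b; }')
print(ctx.call('add', 1, 2))    # 3    -- call a compiled JS function
print(ctx.eval('"ab" + "cd"'))  # abcd -- evaluate a JS expression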
Day08 Notes
The scrapy framework
- Definition
An asynchronous processing framework that is highly configurable and extensible; the most widely used crawler framework in Python
- Installation
# Ubuntu
1. Install the dependencies
   1. sudo apt-get install libffi-dev
   2. sudo apt-get install libssl-dev
   3. sudo apt-get install libxml2-dev
   4. sudo apt-get install python3-dev
   5. sudo apt-get install libxslt1-dev
   6. sudo apt-get install zlib1g-dev
   7. sudo pip3 install -I -U service_identity
2. Install the scrapy framework
   1. sudo pip3 install Scrapy
# Windows
cmd (run as administrator): python -m pip install Scrapy
- The five core Scrapy components
1. Engine: the core of the whole framework
2. Scheduler: maintains the request queue
3. Downloader: fetches response objects
4. Spider: parses and extracts data
5. Item Pipeline: stores and post-processes the data
**********************************
# Downloader Middlewares: sit between engine and downloader; wrap requests (random proxies, etc.)
# Spider Middlewares: sit between engine and spider; can modify response object attributes
- scrapy crawl workflow
# Project startup
1. The engine asks the spider for the first URL to crawl and hands it to the scheduler to enqueue
2. The scheduler dequeues the request and passes it through the downloader middlewares to the downloader
3. The downloader obtains a response object and passes it through the spider middlewares to the spider
4. The spider extracts the data:
   1. Extracted data is handed to the pipeline for storage
   2. URLs that need following are handed back to the scheduler to enqueue, and the cycle repeats
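A minimal spider skeleton showing where these steps land in code (hypothetical names; complete examples follow below):
import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['http://example.com/']  # step 1: the engine takes these and enqueues them

    def parse(self, response):  # step 3: the downloaded response arrives here
        # step 4.1: extracted data goes to the pipelines
        yield {'title': response.xpath('//title/text()').get()}
        # step 4.2: follow-up URLs go back to the scheduler
        next_url = response.xpath('//a/@href').get()
        if next_url:
            yield response.follow(next_url, callback=self.parse)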
- Common scrapy commands
# 1. Create a crawler project
scrapy startproject <project_name>
# 2. Create a spider file
scrapy genspider <spider_name> <domain>
# 3. Run a spider
scrapy crawl <spider_name>
- scrapy project layout
Baidu                      # project folder
├── Baidu                  # project package
│   ├── items.py           # data structure definitions
│   ├── middlewares.py     # middlewares
│   ├── pipelines.py       # data processing
│   ├── settings.py        # global settings
│   └── spiders
│       └── baidu.py       # spider file
└── scrapy.cfg             # basic project config
- The global settings file settings.py explained
# 1. Define the User-Agent
USER_AGENT = 'Mozilla/5.0'
# 2. Whether to obey robots.txt; usually set to False
ROBOTSTXT_OBEY = False
# 3. Maximum concurrency; the default is 16
CONCURRENT_REQUESTS = 32
# 4. Download delay in seconds
DOWNLOAD_DELAY = 1
# 5. Default request headers; the User-Agent may also be added here
DEFAULT_REQUEST_HEADERS = {}
# 6. Item pipelines
ITEM_PIPELINES = {
    '<project_package>.pipelines.<ClassName>': 300
}
- Steps to build a crawler project
1. Create the project: scrapy startproject <project_name>
2. cd into the project folder
3. Create the spider file: scrapy genspider <name> <domain>
4. Define the targets (items.py)
5. Write the spider (<name>.py)
6. Write the pipeline file (pipelines.py)
7. Configure the global settings (settings.py)
8. Run the spider: scrapy crawl <spider_name>
- Running a scrapy project from PyCharm
1. Create begin.py (in the same directory as scrapy.cfg)
2. Contents of begin.py:
from scrapy import cmdline
cmdline.execute('scrapy crawl maoyan'.split())
Warm-up exercise
- Goal
Open the Baidu home page, grab the title '百度一下，你就知道' and print it to the terminal
- Steps
- Create the project Baidu and the spider file baidu
1. scrapy startproject Baidu
2. cd Baidu
3. scrapy genspider baidu www.baidu.com
- Write the spider file baidu.py and extract the data with xpath
# -*- coding: utf-8 -*-
import scrapy

class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        result = response.xpath('/html/head/title/text()').extract_first()
        print('*' * 50)
        print(result)
        print('*' * 50)
- Global settings (settings.py)
USER_AGENT = 'Mozilla/5.0'
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
- Create begin.py (same directory as scrapy.cfg)
from scrapy import cmdline
cmdline.execute('scrapy crawl baidu'.split())
- Start the spider
Simply run the begin.py file
Think through what happens during the run
Maoyan movies example
- Goal
URL: Baidu search -> Maoyan Movies -> Rankings -> Top 100
Fields: movie title, starring cast, release date
- Steps
- Create the project and the spider file
# create the crawler project
scrapy startproject Maoyan
cd Maoyan
# create the spider file
scrapy genspider maoyan maoyan.com
- Define the data structure to crawl (items.py)
import scrapy

class MaoyanItem(scrapy.Item):
    name = scrapy.Field()
    star = scrapy.Field()
    time = scrapy.Field()
- Write the spider file (maoyan.py)
1. Base xpath matching the list of movie info nodes:
   dd_list = response.xpath('//dl[@class="board-wrapper"]/dd')
2. for dd in dd_list:
       name = dd.xpath('./a/@title')
       star = dd.xpath('.//p[@class="star"]/text()')
       time = dd.xpath('.//p[@class="releasetime"]/text()')
Implementation 1
# -*- coding: utf-8 -*-
import scrapy
from ..items import MaoyanItem

class MaoyanSpider(scrapy.Spider):
    # spider name
    name = 'maoyan'
    # allowed domains
    allowed_domains = ['maoyan.com']
    offset = 0
    # start URL
    start_urls = ['https://maoyan.com/board/4?offset=0']

    def parse(self, response):
        # base xpath matching the list of movie info nodes
        dd_list = response.xpath('//dl[@class="board-wrapper"]/dd')
        # dd_list : [<element dd at xxx>, <...>]
        for dd in dd_list:
            # create the item object
            item = MaoyanItem()
            # dd.xpath('') returns a list of selectors: [<selector xpath='' data='霸王別姬'>, ...]
            # .extract() serializes every selector in the list into a unicode string
            # .extract_first() : take the first string
            item['name'] = dd.xpath('./a/@title').extract_first().strip()
            item['star'] = dd.xpath('.//p[@class="star"]/text()').extract()[0].strip()
            item['time'] = dd.xpath('.//p[@class="releasetime"]/text()').extract()[0]
            yield item

        # this approach is not recommended; it is inefficient
        self.offset += 10
        if self.offset <= 90:
            url = 'https://maoyan.com/board/4?offset={}'.format(self.offset)
            yield scrapy.Request(
                url=url,
                callback=self.parse
            )
Implementation 2
# -*- coding: utf-8 -*-
import scrapy
from ..items import MaoyanItem

class MaoyanSpider(scrapy.Spider):
    # spider name
    name = 'maoyan2'
    # allowed domains
    allowed_domains = ['maoyan.com']
    # start URL
    start_urls = ['https://maoyan.com/board/4?offset=0']

    def parse(self, response):
        for offset in range(0, 91, 10):
            url = 'https://maoyan.com/board/4?offset={}'.format(offset)
            # hand the URL to the scheduler to enqueue
            yield scrapy.Request(
                url=url,
                callback=self.parse_html
            )

    def parse_html(self, response):
        # base xpath matching the list of movie info nodes
        dd_list = response.xpath('//dl[@class="board-wrapper"]/dd')
        # dd_list : [<element dd at xxx>, <...>]
        for dd in dd_list:
            # create the item object
            item = MaoyanItem()
            # .extract() serializes every selector in the list into a unicode string
            # .extract_first() : take the first string
            item['name'] = dd.xpath('./a/@title').extract_first().strip()
            item['star'] = dd.xpath('.//p[@class="star"]/text()').extract()[0].strip()
            item['time'] = dd.xpath('.//p[@class="releasetime"]/text()').extract()[0]
            yield item
Implementation 3
# Override start_requests() and hand all page URLs to the scheduler at once
# -*- coding: utf-8 -*-
import scrapy
from ..items import MaoyanItem

class MaoyanSpider(scrapy.Spider):
    # spider name
    name = 'maoyan_requests'
    # allowed domains
    allowed_domains = ['maoyan.com']

    def start_requests(self):
        for offset in range(0, 91, 10):
            url = 'https://maoyan.com/board/4?offset={}'.format(offset)
            # hand the URL to the scheduler to enqueue
            yield scrapy.Request(url=url, callback=self.parse_html)

    def parse_html(self, response):
        # base xpath matching the list of movie info nodes
        dd_list = response.xpath('//dl[@class="board-wrapper"]/dd')
        for dd in dd_list:
            # create the item object
            item = MaoyanItem()
            # .get() is equivalent to .extract_first()
            item['name'] = dd.xpath('./a/@title').get().strip()
            item['star'] = dd.xpath('.//p[@class="star"]/text()').extract()[0].strip()
            item['time'] = dd.xpath('.//p[@class="releasetime"]/text()').extract()[0]
            yield item
- Define the pipeline file (pipelines.py)
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql
from . import settings

class MaoyanPipeline(object):
    def process_item(self, item, spider):
        print('*' * 50)
        print(dict(item))
        print('*' * 50)
        return item

# a second pipeline class that stores items in MySQL
class MaoyanMysqlPipeline(object):
    # runs once when the spider opens
    def open_spider(self, spider):
        print('open_spider called')
        # typically used to open the database connection
        # (keyword arguments; newer pymysql versions reject positional ones)
        self.db = pymysql.connect(
            host=settings.MYSQL_HOST,
            user=settings.MYSQL_USER,
            password=settings.MYSQL_PWD,
            database=settings.MYSQL_DB,
            charset='utf8'
        )
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        ins = 'insert into film(name,star,time) values(%s,%s,%s)'
        L = [
            item['name'].strip(),
            item['star'].strip(),
            item['time'].strip()
        ]
        self.cursor.execute(ins, L)
        # commit to the database
        self.db.commit()
        return item

    # runs once when the spider closes
    def close_spider(self, spider):
        # typically used to close the database connection
        print('close_spider called')
        self.cursor.close()
        self.db.close()
- Global settings file (settings.py)
USER_AGENT = 'Mozilla/5.0'
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
ITEM_PIPELINES = {
    'Maoyan.pipelines.MaoyanPipeline': 300,
    # enable this one to store into MySQL:
    # 'Maoyan.pipelines.MaoyanMysqlPipeline': 200,
}
- Create and run the file (begin.py)
from scrapy import cmdline
cmdline.execute('scrapy crawl maoyan'.split())
Knowledge recap
- node.xpath('')
1. Returns a list of selectors: [<selector data='A'>]
2. list.extract(): serializes every selector in the list into a unicode string: ['A','B','C']
3. list.extract_first() or .get(): returns the first serialized element (a string)
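A quick illustration with a standalone Selector (the same API that response exposes):
from scrapy.selector import Selector

sel = Selector(text='<p>A</p><p>B</p>')
sel.xpath('//p/text()')                  # [<Selector ... data='A'>, <Selector ... data='B'>]
sel.xpath('//p/text()').extract()        # ['A', 'B']
sel.xpath('//p/text()').extract_first()  # 'A'
sel.xpath('//p/text()').get()            # 'A' (alias of extract_first)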
- pipelines.py must define a method named process_item
def process_item(self, item, spider):
    return item
# item must be returned here; the return value is passed to process_item of the next pipeline
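A sketch of why the return value matters: two hypothetical pipelines chained by priority, where the second receives whatever the first returned:
class CleanPipeline(object):  # registered with priority 100, so it runs first
    def process_item(self, item, spider):
        item['name'] = item['name'].strip()
        return item  # handed on to the next pipeline

class SavePipeline(object):  # registered with priority 200, receives the cleaned item
    def process_item(self, item, spider):
        print(dict(item))
        return item
# ITEM_PIPELINES = {'proj.pipelines.CleanPipeline': 100,
#                   'proj.pipelines.SavePipeline': 200}  # lower number = earlier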
- Logging variables and levels (settings.py)
# logging variables
LOG_LEVEL = ''
LOG_FILE = '<name>.log'
# log levels
5 CRITICAL : critical errors
4 ERROR    : ordinary errors
3 WARNING  : warnings
2 INFO     : general information
1 DEBUG    : debugging information
# Note: only messages at the current level or more severe are shown
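Example values (placeholders, not from the notes): log WARNING and above to a file:
LOG_LEVEL = 'WARNING'    # suppresses DEBUG and INFO messages
LOG_FILE = 'maoyan.log'  # without this, log output goes to the terminal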
- Using pipeline files
1. In the spider, instantiate the class from items.py and fill the object with the scraped data
   from ..items import MaoyanItem
   item = MaoyanItem()
2. Write the pipeline file (pipelines.py)
3. Enable the pipeline (settings.py)
   ITEM_PIPELINES = { '<project_package>.pipelines.<ClassName>': priority }
Persisting data (MySQL)
Steps
1. Define the relevant variables in settings.py (see the example after this list)
2. Import the settings module in pipelines.py and implement:
   def open_spider(self, spider):
       # runs once when the spider starts; open the DB connection here
   def close_spider(self, spider):
       # runs once when the spider ends; close the DB connection here
3. Register the pipeline in settings.py
   ITEM_PIPELINES = {'': 200}
# Note: process_item() must return item ***
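The variable names below match what the MySQL pipeline above reads from settings (the values are placeholders):
# settings.py
MYSQL_HOST = 'localhost'
MYSQL_USER = 'root'
MYSQL_PWD = '123456'
MYSQL_DB = 'maoyandb'
ITEM_PIPELINES = {
    'Maoyan.pipelines.MaoyanMysqlPipeline': 200,
}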
Exercise
Store the Maoyan movie data in a MySQL database
Saving to csv / json files
- Command format
scrapy crawl maoyan -o maoyan.csv
scrapy crawl maoyan -o maoyan.json
# set the export encoding in settings.py
FEED_EXPORT_ENCODING = 'utf-8'
Daomu Biji novel scraping example (three-level pages)
- Goal
# Scrape the full text of every chapter of 盜墓筆記 volumes 1-8 from the target site and save it to local files
1. URL: http://www.daomubiji.com/
- XPath preparation
1. Level-1 page xpath: //li[contains(@class,"menu-item-20")]/a/@href
   (alternative: /html/body/section/article/a/@href)
2. Level-2 page xpath:
   base xpath: //article
   inside the for loop:
   name = article.xpath('./a/text()').get()
   link = article.xpath('./a/@href').get()
3. Level-3 page xpath: response.xpath('//article[@class="article-content"]//p/text()').extract()
- Implementation
- Create the project and the spider file
Create the project: scrapy startproject Daomu
Create the spider: scrapy genspider daomu www.daomubiji.com
- Define the data structure to crawl (hands the data to the pipeline)
import scrapy

class DaomuItem(scrapy.Item):
    # volume name
    juan_name = scrapy.Field()
    # chapter number
    zh_num = scrapy.Field()
    # chapter name
    zh_name = scrapy.Field()
    # chapter link
    zh_link = scrapy.Field()
    # chapter text
    zh_content = scrapy.Field()
- Spider file: implement the data scraping
# -*- coding: utf-8 -*-
import scrapy
from ..items import DaomuItem

class DaomuSpider(scrapy.Spider):
    name = 'daomu'
    allowed_domains = ['www.daomubiji.com']
    start_urls = ['http://www.daomubiji.com/']

    # parse the level-1 page: extract the links for 盜墓筆記 1, 2, 3, ...
    def parse(self, response):
        one_link_list = response.xpath('//ul[@class="sub-menu"]/li/a/@href').extract()
        print(one_link_list)
        # hand the links to the scheduler to enqueue
        for one_link in one_link_list:
            yield scrapy.Request(url=one_link, callback=self.parse_two_link, dont_filter=True)

    # parse the level-2 page
    def parse_two_link(self, response):
        # base xpath matching the list of chapter nodes
        article_list = response.xpath('/html/body/section/div[2]/div/article')
        # process each chapter in turn
        for article in article_list:
            # create the item object
            item = DaomuItem()
            info = article.xpath('./a/text()').extract_first().split()
            # info : ['七星魯王', '第一章', '血尸']
            item['juan_name'] = info[0]
            item['zh_num'] = info[1]
            item['zh_name'] = info[2]
            item['zh_link'] = article.xpath('./a/@href').extract_first()
            # hand the chapter link to the scheduler
            yield scrapy.Request(
                url=item['zh_link'],
                # pass the item along to the next parse function
                meta={'item': item},
                callback=self.parse_three_link,
                dont_filter=True
            )

    # parse the level-3 page
    def parse_three_link(self, response):
        item = response.meta['item']
        # get the chapter text, e.g. '\n'.join(['para 1', 'para 2', 'para 3'])
        item['zh_content'] = '\n'.join(response.xpath(
            '//article[@class="article-content"]//p/text()'
        ).extract())
        yield item
- Pipeline file: process the data
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
class DaomuPipeline(object):
    def process_item(self, item, spider):
        filename = '/home/tarena/aid1902/{}-{}-{}.txt'.format(
            item['juan_name'],
            item['zh_num'],
            item['zh_name']
        )
        # a context manager guarantees the file is closed
        with open(filename, 'w') as f:
            f.write(item['zh_content'])
        return item
Today's homework
1. What are the major components of the scrapy framework, and how do they work together?
2. Debug the Daomu errors (look for the pattern in the data and add a conditional check)
3. Try rewriting the Tencent recruitment crawler in scrapy
   response.text : get the page response as text
4. Try rewriting the Douban movies crawler in scrapy