Day06 Review
Multiple threads writing to the same file
Be sure to use a thread lock
from threading import Lock

lock = Lock()            # one Lock instance shared by all writer threads
f = open('xxx.txt','a')
lock.acquire()           # take the lock so writes from different threads do not interleave
f.write(string)          # `string` is whatever data the thread produced
lock.release()
f.close()
Simulating login with cookies
1. Applicable sites: pages that can only be accessed after logging in; without logging in the real response data cannot be obtained
2. Method 1 (use the cookie)
1. Log in successfully once and capture the Cookie that carries the login state (put it into headers)
2. Send requests to the URL with those prepared headers
3. Method 2 (keep the login state with a session) - see the sketch after this list
1. Instantiate a session object
session = requests.session()
2. POST first : session.post(post_url,data=post_data,headers=headers)
1. Log in and find the POST address: the address in the form -> action attribute
2. Define the form dict and send the request through the session instance
# dict keys : the name values of the <input> tags (email, password)
# post_data = {'email':'','password':''}
3. Then GET : session.get(url,headers=headers)
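A minimal sketch of method 2; the login URL, profile URL, and the email/password field names are placeholders for illustration, not a real site:
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
post_url = 'http://example.com/login'                 # placeholder: the form action address
post_data = {'email': 'xxx', 'password': 'xxx'}       # keys = name attributes of the <input> tags

session = requests.session()
# POST the login form first; the session keeps the returned cookies
session.post(post_url, data=post_data, headers=headers)
# then GET a page that requires login through the same session
res = session.get('http://example.com/profile', headers=headers)  # placeholder URL
print(res.status_code)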
Three pools (a minimal usage sketch follows this list)
1. User-Agent pool
2. Proxy IP pool
3. Cookie pool
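A minimal sketch of how the pools are used per request; the User-Agent strings and proxy addresses below are placeholders, not working values:
import random
import requests

ua_pool = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',      # placeholder UA strings
    'Mozilla/5.0 (X11; Linux x86_64)',
]
proxy_pool = [
    {'http': 'http://1.1.1.1:8888'},                  # placeholder proxy addresses
    {'http': 'http://2.2.2.2:8888'},
]

# pick a random entry for every request so requests do not share one fingerprint
headers = {'User-Agent': random.choice(ua_pool)}
proxies = random.choice(proxy_pool)
res = requests.get('http://httpbin.org/get', headers=headers, proxies=proxies, timeout=5)
print(res.status_code)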
Parsing modules summary
re, lxml+xpath, json
# re
import re
pattern = re.compile(r'',re.S)
r_list = pattern.findall(html)
# lxml+xpath
from lxml import etree
parse_html = etree.HTML(html)
r_list = parse_html.xpath('')
# json
import json
# convert the JSON response into a Python object
html = json.loads(res.text)
# save the scraped data to a JSON file
with open('xxx.json','a') as f:
    json.dump(item_list,f,ensure_ascii=False)
# or
f = open('xxx.json','a')
json.dump(item_list,f,ensure_ascii=False)
f.close()
selenium+phantomjs/chrome/firefox
- Features
1. Simple: no need to capture and analyze network packets in detail; a real browser is driven
2. Has to wait for page elements to load, which takes time, so it is less efficient
- Installation
1. Download and unzip the driver
2. Add it to the system PATH
# Windows: copy it into the Scripts directory of the Python installation
# Linux  : copy it into /usr/bin
3. On Linux, fix the permissions
# sudo -i
# cd /usr/bin/
# chmod +x phantomjs
- Usage flow (a minimal sketch follows this list)
from selenium import webdriver
# 1. create the browser object
# 2. open the URL
# 3. find nodes
# 4. perform the corresponding operations
# 5. close the browser
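A minimal sketch of the five steps above, using Chrome and the same find_element_by_* API as the rest of these notes; the search term is just an example:
from selenium import webdriver

browser = webdriver.Chrome()                      # 1. create the browser object
browser.get('http://www.baidu.com/')              # 2. open the URL
node = browser.find_element_by_id('kw')           # 3. find a node (the Baidu search box)
node.send_keys('selenium')                        # 4. perform the operation
browser.quit()                                    # 5. close the browser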
- Key points
1. browser.page_source
2. browser.page_source.find('')
3. node.send_keys('')
4. node.click()
5. find_element AND find_elements (a single node vs. a list of nodes)
6. browser.execute_script('javascript')
7. browser.quit()
Day07 Notes
JD spider case
- Goal
1. Target URL : https://www.jd.com/
2. Data to scrape : product name, product price, number of reviews, shop name
- Implementation steps
1. Find the nodes
1. Home-page search box : //*[@id="key"]
2. Home-page search button : //*[@id="search"]/div/div[2]/button
3. List of product-info node objects on the result page : //*[@id="J_goodsList"]/ul/li
2. Execute a JS script to load the dynamically rendered data
browser.execute_script(
'window.scrollTo(0,document.body.scrollHeight)'
)
3弥奸、代碼實(shí)現(xiàn)
from selenium import webdriver
import time
class JdSpider(object):
def __init__(self):
self.browser = webdriver.Chrome()
self.url = 'https://www.jd.com/'
self.i = 0
# get the product listing page
def get_page(self):
self.browser.get(self.url)
# find the two nodes (search box and search button)
self.browser.find_element_by_xpath('//*[@id="key"]').send_keys('爬蟲書籍')
self.browser.find_element_by_xpath('//*[@id="search"]/div/div[2]/button').click()
time.sleep(2)
# parse the page
def parse_page(self):
# scroll the page to the bottom by executing a JS script
self.browser.execute_script(
'window.scrollTo(0,document.body.scrollHeight)'
)
time.sleep(2)
# match the list of all product node objects
li_list = self.browser.find_elements_by_xpath('//*[@id="J_goodsList"]/ul/li')
item = {}
for li in li_list:
item['name'] = li.find_element_by_xpath('.//div[@class="p-name"] | .//div[@class="p-name p-name-type-2"]').text
item['price'] = li.find_element_by_xpath('.//div[@class="p-price"]').text
item['commit'] = li.find_element_by_xpath('.//div[@class="p-commit"]').text
item['shopname'] = li.find_element_by_xpath('.//div[@class="p-shopnum"]/a | .//div[@class="p-shopnum"]/span').text
print(item)
self.i += 1
# main function
def main(self):
self.get_page()
while True:
self.parse_page()
# decide whether to click next page; if 'pn-next disabled' is not found, this is not the last page
if self.browser.page_source.find('pn-next disabled') == -1:
self.browser.find_element_by_class_name('pn-next').click()
time.sleep(2)
else:
break
print(self.i)
if __name__ == '__main__':
spider = JdSpider()
spider.main()
Headless mode for chromedriver
from selenium import webdriver
options = webdriver.ChromeOptions()
# add the headless option
options.add_argument('--headless')
browser = webdriver.Chrome(options=options)
browser.get('http://www.baidu.com/')
browser.save_screenshot('baidu.png')
selenium - keyboard actions
Example
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
browser = webdriver.Chrome()
browser.get('http://www.baidu.com/')
# 1. type the search term into the search box
browser.find_element_by_id('kw').send_keys('趙麗穎')
# 2. type a space
browser.find_element_by_id('kw').send_keys(Keys.SPACE)
# 3. Ctrl+a to select all
browser.find_element_by_id('kw').send_keys(Keys.CONTROL, 'a')
# 4. Ctrl+c to copy
browser.find_element_by_id('kw').send_keys(Keys.CONTROL, 'c')
# 5. Ctrl+v to paste
browser.find_element_by_id('kw').send_keys(Keys.CONTROL, 'v')
# 6. press Enter instead of clicking the search button
browser.find_element_by_id('kw').send_keys(Keys.ENTER)
selenium - mouse actions
Example
from selenium import webdriver
# import the mouse action class
from selenium.webdriver import ActionChains
driver = webdriver.Chrome()
driver.get('http://www.baidu.com/')
# type the search term and search
driver.find_element_by_id('kw').send_keys('趙麗穎')
driver.find_element_by_id('su').click()
# hover over the settings icon; perform() actually executes the action and is required
element = driver.find_element_by_name('tj_settingicon')
ActionChains(driver).move_to_element(element).perform()
# click the Ajax element that pops up, located by its link text
driver.find_element_by_link_text('高級(jí)搜索').click()
selenium - switching windows
- Applicable sites
Clicking a link opens a new window, but the browser object still refers to the previous page
- Solution
# get all current window handles
all_handles = browser.window_handles
# switch the browser to the new window and work with that window's object
browser.switch_to.window(all_handles[1])
- Ministry of Civil Affairs website case
Goal
Scrape the administrative division codes into a database, split by level (separate tables: province, city, county)
Create the database tables
# create the database
create database govdb charset utf8;
use govdb;
# create the tables
create table province(
p_name varchar(20),
p_code varchar(20)
)charset=utf8;
create table city(
c_name varchar(20),
c_code varchar(20),
c_father_code varchar(20)
)charset=utf8;
create table county(
x_name varchar(20),
x_code varchar(20),
x_father_code varchar(20)
)charset=utf8;
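# Assumed DDL for the version table used by the incremental-crawl check in the code
# below (it only stores the link of the last crawled second-level page):
create table version(
link varchar(100)
)charset=utf8;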
Approach
1. Open the first-level page with selenium+Chrome and extract the link to the latest second-level page
2. Incremental crawl: compare the link with the version table in the database to decide whether it was crawled before (i.e. whether there is an update)
3. If there is no update, just tell the user and stop
4. If there is an update, delete the old table records, re-crawl and insert into the database tables
5. When finished: disconnect from the database and close the browser
Code
from selenium import webdriver
import time
import pymysql
class GovementSpider(object):
def __init__(self):
self.browser = webdriver.Chrome()
self.one_url = 'http://www.mca.gov.cn/article/sj/xzqh/2019/'
# create the database connection objects
self.db = pymysql.connect(
host='localhost',user='root',password='123456',database='govdb',charset='utf8'
)
self.cursor = self.db.cursor()
# define 3 lists so executemany() can be used
self.province_list = []
self.city_list = []
self.county_list = []
# get the first page and extract the second-level page link (a pseudo link that must be clicked)
def get_false_url(self):
self.browser.get(self.one_url)
# extract the second-level page link node so it can be clicked
td_list = self.browser.find_elements_by_xpath(
'//td[@class="arlisttd"]/a[contains(@title,"代碼")]'
)
if td_list:
# keep the node object because we need to click() it
two_url_element = td_list[0]
# incremental crawl: compare the link against the version table in the database
two_url = two_url_element.get_attribute('href')
sel = 'select * from version where link=%s'
self.cursor.execute(sel,[two_url])
result = self.cursor.fetchall()
if result:
print('Data is already up to date, no need to crawl')
else:
# click it
two_url_element.click()
time.sleep(5)
# switch the browser to the new window
all_handles = self.browser.window_handles
self.browser.switch_to.window(all_handles[1])
# scrape the data
self.get_data()
# after finishing, insert two_url into the version table
ins = 'insert into version values(%s)'
self.cursor.execute(ins,[two_url])
self.db.commit()
# extract the administrative division codes from the second-level page
def get_data(self):
# base xpath
tr_list = self.browser.find_elements_by_xpath(
'//tr[@height="19"]'
)
for tr in tr_list:
code = tr.find_element_by_xpath('./td[2]').text.strip()
name = tr.find_element_by_xpath('./td[3]').text.strip()
print(name,code)
# decide the level and append the row to the matching table list (fields per table below)
# province: p_name p_code
# city : c_name c_code c_father_code
# county : x_name x_code x_father_code
if code[-4:] == '0000':
self.province_list.append([name, code])
# the 4 municipalities are additionally put into the city table
if name in ['北京市', '天津市', '上海市', '重慶市']:
city = [name, code, code]
self.city_list.append(city)
elif code[-2:] == '00':
city = [name, code, code[:2] + '0000']
self.city_list.append(city)
else:
# for counties in the 4 municipalities the parent code is xx0000
if code[:2] in ['11','12','31','50']:
county = [name,code,code[:2]+'0000']
# for ordinary counties the parent code is xxxx00
else:
county = [name, code, code[:4] + '00']
self.county_list.append(county)
# same indentation as the for loop: run executemany() once after all rows are collected
self.insert_mysql()
def insert_mysql(self):
# when updating, the existing table records must be deleted first
del_province = 'delete from province'
del_city = 'delete from city'
del_county = 'delete from county'
self.cursor.execute(del_province)
self.cursor.execute(del_city)
self.cursor.execute(del_county)
# insert the new data
ins_province = 'insert into province values(%s,%s)'
ins_city = 'insert into city values(%s,%s,%s)'
ins_county = 'insert into county values(%s,%s,%s)'
self.cursor.executemany(ins_province,self.province_list)
self.cursor.executemany(ins_city,self.city_list)
self.cursor.executemany(ins_county,self.county_list)
self.db.commit()
print('Scraping finished, data saved to the database')
def main(self):
self.get_false_url()
# disconnect after all data has been processed
self.cursor.close()
self.db.close()
# close the browser
self.browser.quit()
if __name__ == '__main__':
spider = GovementSpider()
spider.main()
SQL practice
# 1. Query all province/city/county info (multi-table query)
select province.p_name,city.c_name,county.x_name from province,city,county where province.p_code=city.c_father_code and city.c_code=county.x_father_code;
# 2. Query all province/city/county info (join query)
select province.p_name,city.c_name,county.x_name from province inner join city on province.p_code=city.c_father_code inner join county on city.c_code=county.x_father_code;
selenium - web client authentication (HTTP basic auth)
How do you enter the username and password in the popup?
You don't type them; putting them in the URL is enough
Example: download one day's notes
from selenium import webdriver
url = 'http://tarenacode:code_2013@code.tarena.com.cn/AIDCode/aid1904/15-spider/spider_day06_note.zip'
browser = webdriver.Chrome()
browser.get(url)
selenium - iframe sub-frames
Characteristics
A page embedded inside another page; switch into the iframe sub-frame first, then perform the other operations
Method
browser.switch_to.frame(iframe_element)
Example - logging in to QQ Mail
from selenium import webdriver
import time
driver = webdriver.Chrome()
driver.get('https://mail.qq.com/')
# switch into the iframe sub-frame
login_frame = driver.find_element_by_id('login_frame')
driver.switch_to.frame(login_frame)
# username + password + login
driver.find_element_by_id('u').send_keys('2621470058')
driver.find_element_by_id('p').send_keys('password')
driver.find_element_by_id('login_button').click()
# allow time for the page to load
time.sleep(5)
# extract data
ele = driver.find_element_by_id('useralias')
print(ele.text)
Baidu Translate sign-cracking case
Goal
Reverse-engineer the Baidu Translate interface and scrape the translation result data
Implementation steps
1. Capture packets with F12, find the JSON address and inspect the query parameters
1. POST address: https://fanyi.baidu.com/v2transapi
2. Form data (the fields that change between captures):
from: zh
to: en
sign: 54706.276099                        # how is this generated?
token: a927248ae7146c842bb4a94457ca35ee   # basically fixed, but still needs to be fetched
2. Grab the relevant JS file
Top-right corner - Search - "sign:" - locate the specific JS file (index_c8a141d.js) - pretty-print it
3. Find the code that generates sign in the JS
1. Search the pretty-printed JS for sign: and find this code: sign: m(a),
2. Set a breakpoint to locate the m(a) function, i.e. the function that actually generates sign
# 1. a is the word to be translated
# 2. hover over m(a) and click to jump into the m(a) function body
4. The m(a) function that generates sign is shown below (it sits inside a large define block)
function a(r) {
if (Array.isArray(r)) {
for (var o = 0, t = Array(r.length); o < r.length; o++)
t[o] = r[o];
return t
}
return Array.from(r)
}
function n(r, o) {
for (var t = 0; t < o.length - 2; t += 3) {
var a = o.charAt(t + 2);
a = a >= "a" ? a.charCodeAt(0) - 87 : Number(a),
a = "+" === o.charAt(t + 1) ? r >>> a : r << a,
r = "+" === o.charAt(t) ? r + a & 4294967295 : r ^ a
}
return r
}
function e(r) {
var o = r.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g);
if (null === o) {
var t = r.length;
t > 30 && (r = "" + r.substr(0, 10) + r.substr(Math.floor(t / 2) - 5, 10) + r.substr(-10, 10))
} else {
for (var e = r.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/), C = 0, h = e.length, f = []; h > C; C++)
"" !== e[C] && f.push.apply(f, a(e[C].split(""))),
C !== h - 1 && f.push(o[C]);
var g = f.length;
g > 30 && (r = f.slice(0, 10).join("") + f.slice(Math.floor(g / 2) - 5, Math.floor(g / 2) + 5).join("") + f.slice(-10).join(""))
}
// var u = void 0
// , l = "" + String.fromCharCode(103) + String.fromCharCode(116) + String.fromCharCode(107);
// u = null !== i ? i : (i = window[l] || "") || "";
// set a breakpoint to debug, then take the window.gtk value from the page source
var u = '320305.131321201'
for (var d = u.split("."), m = Number(d[0]) || 0, s = Number(d[1]) || 0, S = [], c = 0, v = 0; v < r.length; v++) {
var A = r.charCodeAt(v);
128 > A ? S[c++] = A : (2048 > A ? S[c++] = A >> 6 | 192 : (55296 === (64512 & A) && v + 1 < r.length && 56320 === (64512 & r.charCodeAt(v + 1)) ? (A = 65536 + ((1023 & A) << 10) + (1023 & r.charCodeAt(++v)),
S[c++] = A >> 18 | 240,
S[c++] = A >> 12 & 63 | 128) : S[c++] = A >> 12 | 224,
S[c++] = A >> 6 & 63 | 128),
S[c++] = 63 & A | 128)
}
for (var p = m, F = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(97) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(54)), D = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(51) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(98)) + ("" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(102)), b = 0; b < S.length; b++)
p += S[b],
p = n(p, F);
return p = n(p, D),
p ^= s,
0 > p && (p = (2147483647 & p) + 2147483648),
p %= 1e6,
p.toString() + "." + (p ^ m)
}
5. Save the code into a local js file and run it with the pyexecjs module for debugging
import execjs

with open('node.js','r') as f:
    js_data = f.read()
# create the execution object
exec_object = execjs.compile(js_data)
sign = exec_object.eval('e("hello")')
print(sign)
Getting the token
# in the JS: token: window.common.token
# get this value from the response of the page below
token_url = 'https://fanyi.baidu.com/?aldtype=16047'
regex: "token: '(.*?)'"
Full implementation
import requests
import re
import execjs

class BaiduTranslateSpider(object):
    def __init__(self):
        self.token_url = 'https://fanyi.baidu.com/?aldtype=16047'
        self.post_url = 'https://fanyi.baidu.com/v2transapi'
        self.headers = {
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
            # 'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'zh-CN,zh;q=0.9',
            'cache-control': 'no-cache',
            'cookie': 'BAIDUID=52920E829C1F64EE98183B703F4E37A9:FG=1; BIDUPSID=52920E829C1F64EE98183B703F4E37A9; PSTM=1562657403; to_lang_often=%5B%7B%22value%22%3A%22en%22%2C%22text%22%3A%22%u82F1%u8BED%22%7D%2C%7B%22value%22%3A%22zh%22%2C%22text%22%3A%22%u4E2D%u6587%22%7D%5D; REALTIME_TRANS_SWITCH=1; FANYI_WORD_SWITCH=1; HISTORY_SWITCH=1; SOUND_SPD_SWITCH=1; SOUND_PREFER_SWITCH=1; delPer=0; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BCLID=6890774803653935935; BDSFRCVID=4XAsJeCCxG3DLCbwbJrKDGwjNA0UN_I3KhXZ3J; H_BDCLCKID_SF=tRk8oIDaJCvSe6r1MtQ_M4F_qxby26nUQ5neaJ5n0-nnhnL4W46bqJKFLtozKMoI3C7fotJJ5nololIRy6CKjjb-jaDqJ5n3bTnjstcS2RREHJrg-trSMDCShGRGWlO9WDTm_D_KfxnkOnc6qJj0-jjXqqo8K5Ljaa5n-pPKKRAaqD04bPbZL4DdMa7HLtAO3mkjbnczfn02OP5P5lJ_e-4syPRG2xRnWIvrKfA-b4ncjRcTehoM3xI8LNj405OTt2LEoDPMJKIbMI_rMbbfhKC3hqJfaI62aKDs_RCMBhcqEIL4eJOIb6_w5gcq0T_HttjtXR0atn7ZSMbSj4Qo5pK95p38bxnDK2rQLb5zah5nhMJS3j7JDMP0-4rJhxby523i5J6vQpnJ8hQ3DRoWXPIqbN7P-p5Z5mAqKl0MLIOkbC_6j5DWDTvLeU7J-n8XbI60XRj85-ohHJrFMtQ_q4tehHRMBUo9WDTm_DoTttt5fUj6qJj855jXqqo8KMtHJaFf-pPKKRAashnzWjrkqqOQ5pj-WnQr3mkjbn5yfn02OpjPX6joht4syPRG2xRnWIvrKfA-b4ncjRcTehoM3xI8LNj405OTt2LEoC0XtIDhMDvPMCTSMt_HMxrKetJyaR0JhpjbWJ5TEPnjDUOdLPDW-46HBM3xbKQw5CJGBf7zhpvdWhC5y6ISKx-_J68Dtf5; ZD_ENTRY=baidu; PSINO=2; H_PS_PSSID=26525_1444_21095_29578_29521_28518_29098_29568_28830_29221_26350_29459; locale=zh; Hm_lvt_64ecd82404c51e03dc91cb9e8c025574=1563426293,1563996067; from_lang_often=%5B%7B%22value%22%3A%22zh%22%2C%22text%22%3A%22%u4E2D%u6587%22%7D%2C%7B%22value%22%3A%22en%22%2C%22text%22%3A%22%u82F1%u8BED%22%7D%5D; Hm_lpvt_64ecd82404c51e03dc91cb9e8c025574=1563999768; yjs_js_security_passport=2706b5b03983b8fa12fe756b8e4a08b98fb43022_1563999769_js',
            'pragma': 'no-cache',
            'upgrade-insecure-requests': '1',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
        }

    # get the token (and gtk) from the page source
    def get_token(self):
        token_url = 'https://fanyi.baidu.com/?aldtype=16047'
        # request the page using the prepared headers
        r = requests.get(self.token_url,headers=self.headers)
        token = re.findall(r"token: '(.*?)'",r.text)
        window_gtk = re.findall(r"window.*?gtk = '(.*?)';</script>",r.text)
        if token:
            return token[0],window_gtk[0]

    # get the sign by running the extracted JS
    def get_sign(self,word):
        with open('百度翻譯.js','r') as f:
            js_data = f.read()
        exec_object = execjs.compile(js_data)
        sign = exec_object.eval('e("{}")'.format(word))
        return sign

    # main function
    def main(self,word,fro,to):
        token,gtk = self.get_token()
        sign = self.get_sign(word)
        # form data found in the capture; sign and token are obtained above
        form_data = {
            'from': fro,
            'to': to,
            'query': word,
            'transtype': 'realtime',
            'simple_means_flag': '3',
            'sign': sign,
            'token': token
        }
        r = requests.post(self.post_url,data=form_data,headers=self.headers)
        print(r.json()['trans_result']['data'][0]['dst'])

if __name__ == '__main__':
    spider = BaiduTranslateSpider()
    choice = input('1. English to Chinese  2. Chinese to English : ')
    word = input('Enter the word to translate: ')
    if choice == '1':
        fro,to = 'en','zh'
    elif choice == '2':
        fro,to = 'zh','en'
    spider.main(word,fro,to)
The scrapy framework
- Definition
An asynchronous processing framework that is highly configurable and extensible; the most widely used crawler framework in Python
- Installation
# Ubuntu
1. Install the dependency packages
1. sudo apt-get install libffi-dev
2. sudo apt-get install libssl-dev
3. sudo apt-get install libxml2-dev
4. sudo apt-get install python3-dev
5. sudo apt-get install libxslt1-dev
6. sudo apt-get install zlib1g-dev
7. sudo pip3 install -I -U service_identity
2. Install the scrapy framework
1. sudo pip3 install Scrapy
# Windows
In an administrator cmd prompt: python -m pip install Scrapy
- The five main Scrapy components
1. Engine     : the core of the whole framework
2. Scheduler  : maintains the request queue
3. Downloader : fetches the response objects
4. Spider     : parses and extracts the data
5. Pipeline   : processes and stores the data
**********************************
# Downloader Middlewares : between engine and downloader; wrap requests (random proxies, etc.)
# Spider Middlewares     : between engine and spider; can modify attributes of the response object
- scrapy crawl workflow
# when the crawl starts
1. The engine asks the spider for the first URL to crawl and hands it to the scheduler to enqueue
2. The scheduler dequeues the request and passes it through the downloader middlewares to the downloader
3. Once the downloader has the response object, it passes it through the spider middlewares to the spider
4. The spider extracts data:
1. Extracted items are handed to the pipeline for storage
2. URLs that need further crawling go back to the scheduler to be enqueued, and the cycle repeats
- Common scrapy commands
# 1. create a crawl project
scrapy startproject <project_name>
# 2. create a spider file
scrapy genspider <spider_name> <domain>
# 3. run a spider
scrapy crawl <spider_name>
- scrapy project directory layout
Baidu                       # project folder
├── Baidu                   # project directory
│   ├── items.py            # define the data structure
│   ├── middlewares.py      # middlewares
│   ├── pipelines.py        # data processing
│   ├── settings.py         # global settings
│   └── spiders
│       ├── baidu.py        # spider file
└── scrapy.cfg              # basic project configuration file
- Global configuration file settings.py explained
# 1. define the User-Agent
USER_AGENT = 'Mozilla/5.0'
# 2. whether to obey the robots.txt protocol; usually set to False
ROBOTSTXT_OBEY = False
# 3. maximum concurrency, default 16
CONCURRENT_REQUESTS = 32
# 4. download delay in seconds
DOWNLOAD_DELAY = 1
# 5. request headers; the User-Agent can also be added here
DEFAULT_REQUEST_HEADERS={}
# 6. item pipelines
ITEM_PIPELINES={
'<project_dir>.pipelines.<ClassName>':300
}
- Steps to create a crawl project
1. Create the project : scrapy startproject <project_name>
2. cd into the project folder
3. Create the spider file : scrapy genspider <file_name> <domain>
4. Define the target items (items.py) - see the sketch after this list
5. Write the spider code (<file_name>.py)
6. Write the pipeline (pipelines.py) - also covered by the sketch below
7. Set the global configuration (settings.py)
8. Run the spider : scrapy crawl <spider_name>
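A minimal sketch of steps 4 and 6; the item class, field names, and pipeline class are placeholders, not code given in these notes:
# items.py - define the data structure
import scrapy

class DemoItem(scrapy.Item):
    name = scrapy.Field()      # placeholder field
    price = scrapy.Field()     # placeholder field

# pipelines.py - handle every item yielded by the spider
class DemoPipeline(object):
    def process_item(self, item, spider):
        print(item)            # store to a database or file here
        return item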
- Running the project from PyCharm
1. Create begin.py (in the same directory as scrapy.cfg)
2. Contents of begin.py:
from scrapy import cmdline
cmdline.execute('scrapy crawl maoyan'.split())
First try
- Goal
Open the Baidu home page, scrape the title '百度一下,你就知道' and print it in the terminal
- Implementation steps
- Create the project Baidu and the spider file baidu
1. scrapy startproject Baidu
2. cd Baidu
3. scrapy genspider baidu www.baidu.com
- Write the spider file baidu.py and extract the data with xpath
# -*- coding: utf-8 -*-
import scrapy
class BaiduSpider(scrapy.Spider):
name = 'baidu'
allowed_domains = ['www.baidu.com']
start_urls = ['http://www.baidu.com/']
def parse(self, response):
result = response.xpath('/html/head/title/text()').extract_first()
print('*'*50)
print(result)
print('*'*50)
- Global configuration settings.py
USER_AGENT = 'Mozilla/5.0'
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
}
- Create begin.py (same directory as scrapy.cfg)
from scrapy import cmdline
cmdline.execute('scrapy crawl baidu'.split())
- Start the spider
Just run the begin.py file
Think about what happens during the run
Today's homework
1. Memorize the answers to the following questions
1. What are the main components of the scrapy framework?
2. How do the components work with each other?
2. Install scrapy on Windows
Windows :python -m pip install Scrapy
# Error:Microsoft Visual C++ 14.0 is required
# Fix : download and install Microsoft Visual C++ 14.0