Device and software preparation --- environment installation and configuration
Editing tools --- PyCharm, Sublime Text 3, etc.
Runtime environment --- Python 3.x
Virtual environment --- virtualenv (mainly to keep the package versions required by different projects from conflicting with each other)
Writing the crawler --- a plain crawler script
Request methods --- requests, urlopen, etc.
Parsing libraries --- xpath (lxml), bs4, pyquery, etc.
URL addresses:
1. When setting up pagination for your crawl, analyze URLs that follow a regular pattern.
(1) In Python, for example, run a for loop, convert the counter to a string on each iteration, and concatenate it into the URL; if you are unsure whether a request succeeded, verify it before extracting data (see the sketch after the loop below). For example:
for i in range(1, 1000):
    url = "https://www.kuaidaili.com/free/inha/" + str(i) + "/"
Request header parameters:
1. The User-Agent string (some sites use the UA for anti-crawling checks).
2. The Cookie parameter (usually you only need to log in once, grab a valid cookie, and put it in the request headers).
3. The page compression format (this header is set to avoid garbled Chinese in the crawled data, which mostly happens because the page source arrives compressed and needs to be decompressed; see the note after the headers below).
There are a few other parameters I won't list; these are the ones used most.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
    "Cookie": "channelid=bdtg_a3_a3a2; sid=1588040045318406; _ga=GA1.2.1750935230.1588040048; _gid=GA1.2.9785359.1588040048; Hm_lvt_7ed65b1cc4b810e9fd37959c9bb51b31=1588040048; Hm_lpvt_7ed65b1cc4b810e9fd37959c9bb51b31=1588040059",
}
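A note on point 3: requests already decompresses gzip and deflate transparently, so garbled Chinese is just as often a charset problem as a compression one. A small sketch of both fixes, assuming the headers dict above (advertising "br" without the brotli package installed can itself produce undecodable bodies, so it is left out here):

import requests

headers["Accept-Encoding"] = "gzip, deflate"      # compression formats requests can always decode
response = requests.get("https://www.kuaidaili.com/free/inha/1/", headers=headers)
response.encoding = response.apparent_encoding    # re-guess the charset from the body itself
print(response.text[:200])                        # should now print readable Chinese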
Data persistence --- storage
Storing the data --- MySQL, Redis, MongoDB, txt, csv, doc, xlsx
1. For a database I recommend Redis: it is memory-based, reads are fast, and it is somewhat more efficient than MySQL. There is also no elaborate database/table setup; set the relevant connection parameters from Python and you can create the store and write data right away (see the sketch after this list).
2. For product information, you can store the data in csv or xlsx, which also makes later analysis easier.
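A minimal storage sketch, assuming a local Redis server, the redis-py package, and hypothetical names (the ip_port list and items.csv file):

import csv
import redis

# no schema needed: a Redis list is created on the first push
con = redis.Redis(host='localhost', port=6379, db=0, decode_responses=True)
con.rpush('ip_port', '1.2.3.4:8080')
print(con.lrange('ip_port', 0, -1))       # read back every stored proxy

# product rows go to csv just as easily, ready for later analysis
with open('items.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'price'])
    writer.writeheader()
    writer.writerow({'title': 'example item', 'price': '9.99'})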
IP proxies
Problem: I have saved IP proxies for last, but they are very important. Nearly every site today bans IPs, temporarily blocking addresses whose request frequency is too high.
Solutions:
1. Configure a proxy in Python and send requests through it while the proxy IP is valid.
2. Many proxy IPs are free, but you cannot avoid periodically checking their availability so the crawler can keep running normally.
Here is a concrete example:
import requests
import redis

con = redis.Redis(decode_responses=True)  # connection used to store proxies that pass the check

def dailisss():
    try:
        requests.get('https://www.baidu.com', proxies={"http": zongdaili}, timeout=5)
    except requests.RequestException:
        print('connect failed')
    else:
        print('success')
        con.rpush('ip_port', zongdaili)  # push the working proxy onto the ip_port list
Here I use exception handling: proxies that pass the check are stored in Redis, the rest are discarded, and a usable proxy can be picked up by the crawler right away. The details follow.
Crawling the proxy website:
from lxml import etree

def save_data(html):
    global zongdaili
    datalist = etree.HTML(html)
    datas = datalist.xpath('//*[@id="list"]')
    for data in datas:
        dailis = data.xpath('./table/tbody/tr/td[1]/text()')        # IP column
        daili_ports = data.xpath('./table/tbody/tr/td[2]/text()')   # port column
        for daili, daili_port in zip(dailis, daili_ports):
            zongdaili = daili + ":" + daili_port  # keep the most recent ip:port pair
    return zongdaili
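How the two pieces fit together is only implied above, so here is a hypothetical driver (the page URL and call order are assumptions based on the earlier snippets):

import requests

if __name__ == '__main__':
    response = requests.get("https://www.kuaidaili.com/free/inha/1/")
    if response.status_code == 200:
        zongdaili = save_data(response.text)  # parse the newest ip:port from the page
        dailisss()                            # validate it and store it in Redis if it works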
Using the proxy in an actual crawler:
def start_requests():  # initial request
    url = "https://www.qidian.com/search?kw=的"
    headers = {
        "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
        "Cookie": "gender=male; _csrfToken=ELmDEqeeVHqs1gkfYhqhJuHFHOwB2yIUkxzEDrdO; newstatisticUUID=1587804570_1531161900; tf=1; _qda_uuid=dc63dbaa-1d85-c654-e645-62d1d4be4e45; e1=%7B%22pid%22%3A%22qd_P_Searchresult%22%2C%22eid%22%3A%22%22%7D; e2=%7B%22pid%22%3A%22qd_P_Searchresult%22%2C%22eid%22%3A%22qd_S05%22%2C%22l1%22%3A3%7D",
        "Accept-Encoding": "gzip, deflate, br",  # declare supported compression so the crawled text is not garbled
    }
    response = requests.get(url=url, headers=headers, proxies={"http": zongdaili})  # fetch the list page
    if response.status_code == 200:
        html = response.text
        save_datas(html)  # hand the list-page source to the parsing function (save_datas, defined elsewhere)
Tmall crawler --- data scraping
Tmall shows a slider CAPTCHA to users whose request frequency is too high. We can work around this with the IP proxy pool described above, and also space out the requests, for example pausing five seconds before each one; each request then fetches one page of the product list.
Simple code follows:
import requests
from lxml import etree
from time import sleep


def start_requests():
    """
    Having seen the vast sea, no other water will do: this spider takes on Tmall.
    """
    url = "https://list.tmall.com/search_product.htm?spm=a220m.1000858.0.0.65a43cdbn9Hln5&s=60&q=%C0%B2%C0%B2%C0%B2&sort=s&style=g&from=mallfp..pc_1_searchbutton&active=2&type=pc#J_Filter"
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1.2 Safari/605.1.15",
        "Cookie": "isg=BHZ2m9vaNG5aosCKRF7ACNetxap4l7rRQxbNtuBddtnLIxS9SCW64TYVO39PkLLp; l=eBP5sIUmQkd9eQsaBO5BEurza779uQA0zkPzaNbMiIHca1kl6aOmgNQcxCZJJdtfgt5bYetrue0RedFHWXULStGDbLTqUWqSDGv68Bf..; cq=ccp%3D1; _med=dw:1920&dh:1080&pw:1920&ph:1080&ist:0; pnm_cku822=098%23E1hvnvvUvbpvUvCkvvvvvjiPn2LZlj3WP25OQjthPmPOQjinRFLWtjtEPFdpljiniQhvCvvvpZptvpvhvvCvpvyCvhQvAqAGj70fdigXaZmOD46Od3ODNrBl5F%2FAdcwu%2BE7reTTJwyNZeb8rV8t%2Bm7zvdigM%2B3%2B%2BafmAdXKKNB3rl8gL%2BulQbfmxdBkK5FtfKphv8vvvpHwvvUmivvCH4pvv9fwvvhNjvvvmjvvvBGwvvUjuvvCH4pvv9x4EvpvV39CmpwLhuphvmvvvpLCzimdovphvC9mvphvvvvGCvvpvvPMM; res=scroll%3A1324*5419-client%3A1324*879-offset%3A1324*5419-screen%3A1920*1080; tk_trace=1; _tb_token_=58e58f774635d; cookie2=131f479ba88a52baff9dc2251421fe6e; csg=97afaf69; dnk=t_1478447548675_0; enc=4H74Y0r44SGPk5Wz56qVq1SNCjfXbNZIpUgAGnFvo%2BnuMbh8lla0QOPlfSDDwyumkwvmCu8w8l1oW%2FXR9F%2BX0g%3D%3D; lgc=t_1478447548675_0; lid=t_1478447548675_0; sgcookie=Eg58fDyEYk1ICwiZr1rwa; t=880253ed7dfa4bf082072960264a00fb; tracknick=t_1478447548675_0; uc1=cookie16=UIHiLt3xCS3yM2h4eKHS9lpEOw%3D%3D&cookie15=VFC%2FuZ9ayeYq2g%3D%3D&pas=0&cookie14=UoTUMtUJHXDCsg%3D%3D&cookie21=WqG3DMC9EdFmJgke4t0pDw%3D%3D&existShop=false; uc3=id2=UUpniZ1PwvL0Hg%3D%3D&nk2=F6k3HMWzu19fNT832586Rbk%3D&vt3=F8dBxGXNshiZNPQH4%2Bg%3D&lg2=V32FPkk%2Fw0dUvg%3D%3D; uc4=nk4=0%40FbMocpOBNkY7DhFu8FNFi1vK8W6a7WWqnAZhpQ%3D%3D&id4=0%40U2gtGPf5WP7MCXHzaAWz1AslnFcb; cna=kUI2F9A6ZnUCAW/Bcl/c+rlO",
        "Accept-Encoding": "br, gzip, deflate",
    }
    sleep(5)  # pause five seconds before the request to stay under the rate limit
    response = requests.get(url=url, headers=headers)
    if response.status_code == 200:
        html = response.text
        save_data(html)


def save_data(html):
    datalist = etree.HTML(html)
    datas = datalist.xpath('//div[@class="product-iWrap"]')
    for data in datas:
        items = {}
        items['標題'] = data.xpath('./p[@class="productTitle"]/a/@title')   # product title
        items['價格'] = data.xpath('./p[@class="productPrice"]/em/@title')  # product price
        print(items)


if __name__ == '__main__':
    start_requests()
That completes the walkthrough of the Tmall crawler. The whole process was fairly smooth and simple; the point is to experiment boldly instead of backing off because someone says Alibaba's anti-crawling is strong. We should be willing to push upstream through the difficulties.