Preface
"The road of life cannot be reversed; no one can start over or choose again."
In life, everyone grows and matures in their own way, and no one has it easier than anyone else.
Hands-on
General approach of the crawler
Step 1: Request the URL to get the data returned by the site.
Step 2: Parse the data. Here I chose regular expressions combined with XPath.
Step 3: Persist (save) the data.
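The three steps above can be sketched offline as follows. This is only a minimal illustration, not the crawler itself: the markup in `SAMPLE_LISTING`, the response in `SAMPLE_STATUS`, and the helper names `parse_listing`, `extract_src_url` and `save` are all hypothetical, and the standard library's `xml.etree.ElementTree` (which supports a limited XPath subset) stands in for lxml so that the sketch runs without network access or third-party packages.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path

# Step 1 stand-in: hypothetical, well-formed markup in place of a fetched page.
SAMPLE_LISTING = """
<ul>
  <li><div><a href="video_1001"><div>cover</div><div>First clip</div></a></div></li>
  <li><div><a href="video_1002"><div>cover</div><div>Second clip</div></a></div></li>
</ul>
"""

# Hypothetical response text in place of what a status endpoint might return.
SAMPLE_STATUS = '{"videos":{"srcUrl":"https://example.com/mp4/1625000000000-1001-hd.mp4"}}'


def parse_listing(markup):
    """Step 2a: walk the list items with XPath-style paths, collecting (id, title)."""
    root = ET.fromstring(markup)
    results = []
    for li in root.findall('./li'):
        link = li.find('./div/a')
        video_id = link.get('href').replace('video_', '')
        title = link.find('./div[2]').text  # second <div> inside the link
        results.append((video_id, title))
    return results


def extract_src_url(status_text):
    """Step 2b: pull the srcUrl field out of the raw response with a regex."""
    return re.findall(r'"srcUrl":"(.*?)"', status_text)[0]


def save(folder, title, data):
    """Step 3: persist the bytes as <folder>/<title>.mp4 and return the path."""
    path = Path(folder) / f'{title}.mp4'
    path.write_bytes(data)
    return path
```

XPath handles the structured part (which list item, which link), while the regular expression handles a single field buried in a response that is easier to treat as plain text.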
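Source file overview
This is code I wrote a long time ago; I tested it and it still works. You can use it as a guide and explore the site yourself. As usual, open the F12 developer tools to capture the network traffic, analyze the page structure, and work out where the data comes from.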
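import os
import re
import time

import requests
from lxml import etree

# Category IDs and the names shown in the menu
menu = {1: 'Banner', 2: 'New Knowledge', 3: 'Travel', 4: 'Sports', 5: 'Life',
        6: 'Technology', 7: 'Entertainment', 8: 'Automobile', 9: 'Food', 10: 'Music'}


def request(url, r_url='https://www.pearvideo.com/'):
    """Send a GET request with a browser User-Agent and a Referer header."""
    ua = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.64',
        'Referer': r_url}
    r = requests.get(url, headers=ua)
    return r


def analysis(r):
    """Parse the HTML text of a category page and download both video lists."""
    soup = etree.HTML(r)
    list_1 = soup.xpath('//*[@id="listvideoListUl"]/li')
    list_2 = soup.xpath('//*[@id="categoryList"]/li')
    spider(list_1)
    spider(list_2)


def spider(items):
    """Download the video behind every <li> element in items."""
    for i in items:
        href = i.xpath('./div/a/@href')[0]
        r_url = 'https://www.pearvideo.com/' + href
        title = i.xpath('./div/a/div[2]/text()')[0]
        video_id = str(href).replace('video_', '')
        video_url = 'https://www.pearvideo.com/videoStatus.jsp?contId=' + video_id + '&mrd=0.27731227756239263'
        try:
            time.sleep(1)  # pause between requests
            l = request(video_url, r_url).text
            url = re.findall('"srcUrl":"(.*?)"', l)[0]
            # srcUrl carries a numeric timestamp where the real file name has 'cont-<id>'.
            # The original pattern hard-coded the prefix 162; \d+ matches any timestamp.
            url = url.replace(re.findall(r'/(\d+)-', url)[0], 'cont-' + video_id)
            video = request(url, r_url).content
            write(title, video)
            print(f'Downloaded {title} successfully!')
        except Exception as e:
            # Report the failing request and move on to the next video
            print(f'Failed to download {video_url}: {e}')
            continue


def spider_2(num, page):
    """Crawl `page` pages of category `num`; each page holds 12 videos."""
    for i in range(12, 12 * page + 1, 12):
        url = 'https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=' + num + '&start=' + str(i) + '&mrd=0.9948502649054862'
        soup = etree.HTML(request(url).text)
        items = soup.xpath('/html/body/li')
        spider(items)


def write(title, video):
    """Save the video bytes as <title>.mp4 in the output folder."""
    os.makedirs('梨_短視頻', exist_ok=True)  # create the folder on first use
    safe_title = re.sub(r'[\\/:*?"<>|]', '_', title).strip()  # drop characters not allowed in file names
    with open('梨_短視頻/' + safe_title + '.mp4', 'wb') as f:
        f.write(video)


if __name__ == '__main__':
    for key, value in menu.items():
        print(f'{key}:{value}', end=' ')
    num = input('\nSelect the category to crawl (enter its number): ')
    page = int(input('Enter the number of pages to crawl (12 videos per page): '))
    spider_2(num, page)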
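The least obvious line in spider is the one that rewrites srcUrl before downloading: the numeric timestamp segment in the returned address is swapped for 'cont-' plus the video ID. The sketch below isolates that step so it can be tried without touching the network. The function name `real_video_url` and the example URL are hypothetical, and the pattern assumes the timestamp is the first all-digit path segment followed by a hyphen, which is only what the code above implies.

```python
import re


def real_video_url(src_url, video_id):
    """Replace the timestamp segment of src_url with 'cont-<video_id>'.

    Returns None when no '/<digits>-' segment is found, so the caller
    can skip the video instead of crashing on an IndexError.
    """
    match = re.search(r'/(\d+)-', src_url)
    if match is None:
        return None
    # Replace only the first occurrence so a repeated digit run elsewhere is untouched
    return src_url.replace(match.group(1), 'cont-' + video_id, 1)


if __name__ == '__main__':
    # Hypothetical address shaped like the ones the code above expects
    sample = 'https://example.com/mp4/20210705/1625000000000-15710123-hd.mp4'
    print(real_video_url(sample, '1734567'))
```

Returning None on a failed match is a design choice of this sketch; it makes the "no timestamp found" case explicit rather than relying on a broad except block.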