With web scraping so popular these days, having learned a bit of regex I figured I could use `(.*?)` to take every lazy shortcut. A couple of days ago I took an interest in the site Movie Heaven (电影天堂, dy2018.com) and thought: why not try scraping a whole page of movie download links in one go? Below is a simplified screenshot of the site. The task is to scrape out every movie link in the red box on the far right, so they can all be handed to Thunder (迅雷) for batch download into a single folder.
[Image: screenshot of the site's movie listing page]
In 360 Speed Browser, right-click and choose View Source, then locate the first film, 《遗传厄运》 (Hereditary). Its HTML looks like this:
[Image: HTML source for the first film's entry]
Clearly this listing page carries no download link; we have to follow each movie's own URL and continue scraping there for its ftp link. Clicking through, the detail page opens at https://www.dy2018.com/i/99901.html, while line 523 of the list page's HTML source reads:
<a href="/i/99901.html" class="ulink" title="2018年美國7.6分恐怖片《遺傳厄運(yùn)》BD中英雙字">
After checking a few more entries, the pattern for each movie's detail link is confirmed as:
https://www.dy2018.com/i/ + <the movie's own id> + .html
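This URL scheme can be captured in a tiny helper (a sketch of my own; `detail_url` is not a name used anywhere on the site):

```python
def detail_url(movie_id):
    # Each film's detail page is the site root + /i/ + its id + .html
    return 'https://www.dy2018.com/i/' + str(movie_id) + '.html'
```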
From there it is easy to parse out each film's ftp download link and magnet link:
[Image: detail page source showing the ftp and magnet links]
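To make the parsing step concrete, here is a minimal sketch that runs the list-page regex against the sample anchor shown above (the anchor text is taken from the screenshot; the non-greedy groups mirror the approach used in the full script):

```python
import re

# Sample anchor copied from line 523 of the list page's source
sample = ('<b><a href="/i/99901.html" class="ulink" '
          'title="2018年美国7.6分恐怖片《遗传厄运》BD中英双字">'
          '2018年美国7.6分恐怖片《遗传厄运》BD中英双字</a></b>')

# Non-greedy groups capture the numeric id and the title attribute
pattern = re.compile(r'<a href="/i/(.*?)\.html" class="ulink" title="(.*?)">')
movie_id, title = pattern.search(sample).groups()
```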
理論部分講解完成后籍滴,接下來的Python實(shí)現(xiàn)代碼如下:
# -*- coding:utf-8 -*-
import re
import time

import requests
import requests_cache

# Cache responses locally so repeated test runs don't re-hit the site
requests_cache.install_cache('demo_cache')

url = 'https://www.dy2018.com/html/gndy/dyzz/index.html'
# Send a browser-like User-Agent so the site serves the normal page
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}

fp = None
try:
    r = requests.get(url, headers=headers)
    print r.status_code
    # The site is GBK-encoded; requests guesses the charset wrong, so set it
    r.encoding = 'gbk'
    content = r.text

    fp = open(unicode('temp_pachong.txt', 'utf-8'), 'w')  # unicode name avoids mojibake
    fp.write(content.encode('utf-8'))
    fp.close()

    # Target anchors look like:
    # <a href="/i/99901.html" class="ulink" title="2018年美国7.6分恐怖片《遗传厄运》BD中英双字">...</a>
    pattern = re.compile('<b>.*?<a href="/i/(.*?).html" class="ulink" title="(.*?)">.*?</a>.*?</b>', re.S)
    items = re.findall(pattern, content)

    fp = open(unicode('电影天堂爬虫.txt', 'utf-8'), 'w')  # unicode name avoids mojibake
    localtime = time.strftime('%Y-%m-%d-%H:%M:%S', time.localtime(time.time()))
    fp.write(('********************' + localtime + '********************').encode('utf-8') + '\n')
    print 'Resources on this page: ' + str(len(items))
    count = 0
    for item in items:
        count += 1
        temp = str(count) + ': ' + item[1]
        print temp
        fp.write(temp.encode('utf-8') + '\n')

        # Follow the detail page and pull its download link
        detail_url = 'https://www.dy2018.com/i/' + item[0] + '.html'
        print detail_url
        r = requests.get(detail_url, headers=headers)
        r.encoding = 'gbk'
        detail = r.text
        link_pattern = re.compile('<td style=".*?"><a href="(.*?)">.*?</a></td>', re.S)
        links = re.findall(link_pattern, detail)
        print links[0]
        fp.write(links[0].encode('utf-8') + '\n')
    fp.write(('********************' + localtime + '********************').encode('utf-8'))
    fp.close()
except requests.exceptions.RequestException, e:
    # requests raises its own exception types, not urllib2.URLError
    print e
    if fp is not None and not fp.closed:
        fp.close()
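One caveat: the `<td style=...><a href=...>` regex in the script matches any anchor inside a styled table cell, not only download links. A small filter (my own addition, not part of the original script; the sample hrefs below are hypothetical) can keep just the ftp and magnet URLs:

```python
def is_download_link(href):
    # Keep only the ftp:// and magnet: schemes the detail pages
    # use for downloads; drop internal navigation links.
    return href.startswith('ftp://') or href.startswith('magnet:')

# Hypothetical hrefs for illustration:
candidates = ['ftp://example/movie.mkv', 'magnet:?xt=urn:btih:0123abcd', '/i/99901.html']
downloads = [h for h in candidates if is_download_link(h)]
```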
The actual result looks like this:
[Image: view-source screenshot of https://www.dy2018.com/i/99901.html]