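1. Motivation
I'm trying to buy a home, but in greater Xi'an right now a flat is almost impossible to get: everyone is scrambling to queue up, submit documents, and enter the purchase lottery. So far I have taken part in the lottery/unit selection for 6 developments, but for various reasons I still haven't bought a suitable home. All I can do is let the song "Liang Liang" echo in my heart~
Price announcements spark heated discussion in the home-buyers' group chat every time, because a newly published price list means more units are about to go on sale, and everyone's enthusiasm for buying is rekindled~
Then an idea struck me: if I could monitor the Price Bureau's official website in real time, download the zip archives automatically, and get notified, wouldn't that be great? So the little I knew about web crawling came in handy~
So, let's get to work!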
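2. Approach
First, use selenium (PhantomJS) to scrape the download links from the website
Then, use the urlretrieve() function from Python's urllib.request module to download the zip archive
Next, use the extractall() method from Python's zipfile module to extract it
Finally, run the script on a schedule and pop up a notification when a new download arrives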
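The download and extraction steps above can be sketched roughly as follows. This is only a minimal illustration, not the script from the repository: the function name `download_and_extract` and its parameters are made up for this example, and the caller is assumed to choose where the archive and its contents go.

```python
import os
import zipfile
from urllib import request


def download_and_extract(url, zip_path, extract_dir):
    """Download `url` to `zip_path`, extract it into `extract_dir`.

    Returns True on success. Returns False if the downloaded file is not
    a valid zip; in that case the file is left in place for inspection.
    (Hypothetical helper, named for illustration only.)
    """
    request.urlretrieve(url, zip_path)
    try:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(path=extract_dir)
    except zipfile.BadZipFile:
        return False
    # Delete the archive only after the ZipFile handle is closed,
    # otherwise Windows refuses to remove the open file.
    os.remove(zip_path)
    return True
```

Returning a boolean instead of printing lets the caller decide whether a broken archive should trigger a notification or just be logged.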
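3. Notes
Software components
- Python3.6 + PhantomJS + .vbs script + .bat script
- PhantomJS: real-time monitoring means the link scraping has to run in the background, and PhantomJS is a headless browser that Selenium can drive without showing a window (this is only my informal understanding; please look up a more precise description and usage yourself)
- Scheduled runs: Jenkins continuous integration came to mind first, but since I'm not very familiar with Jenkins and for now only want to run on Windows, I chose a simpler method, namely adding a scheduled task in Windows. Look up the exact steps yourself; it's easy
- Because calling the Python script from a .bat batch script opens a cmd window every time, a .vbs script is used to call the Python script instead
- Update notification: sending an email would work, but I used the simplest method: when there is an update, the openFolder.bat batch script is called to open the house_prices folder automatically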
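For the email alternative mentioned in the last point, a rough sketch using only the standard library might look like this. Everything here is an assumption for illustration: the helper names, the addresses, and the SMTP host and port are placeholders, and the original script does not send mail at all.

```python
import smtplib
from email.message import EmailMessage


def build_update_mail(sender, recipient, new_urls):
    """Build a plain-text mail listing the newly found links (hypothetical helper)."""
    msg = EmailMessage()
    msg["Subject"] = "House price notice: {} new item(s)".format(len(new_urls))
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(new_urls))
    return msg


def send_mail(msg, host, port=25):
    """Send the message through an SMTP server.

    `host` and `port` are placeholders; a real mail provider will
    usually also require TLS and a login.
    """
    with smtplib.SMTP(host, port) as server:
        server.send_message(msg)
```

Building the message separately from sending it keeps the first half easy to check without a mail server.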
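4. Implementation
First, two screenshots of the Price Bureau's official website
First-level download page
Second-level download page
- downloadZip.py
- The main program has two parts. The get_url() function scrapes links from the first-level download page. On the first run, the script scrapes every link on the first page of the first-level download page and writes them to a txt file; on later runs, only the newest first-level links are appended to the txt. A return value of 0 from get_url() means no update, 1 means this is the first run, and any other value means a partial update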
......
from selenium import webdriver


def get_url():
    """
    Fetch the URLs and write them to the txt file
    :return: download_flag
    """
    # The page content lives inside an iframe
    driver.switch_to.frame('iframecenter')
    date_list = driver.find_elements_by_xpath('.//*[@id="tablelist"]/tbody/tr/td[3]/span')
    fw = open(cur_path + "/house_prices/url_list.txt", 'a')
    fr = open(cur_path + "/house_prices/url_list.txt", 'r')
    download_flag = 0
    f_list = fr.readlines()
    fr.close()
    if len(f_list) == 0:
        # First run: write in reverse page order, so the newest link ends up last
        for i in reversed(range(len(date_list))):
            fw.write(driver.find_elements_by_id('linkId')[i].get_attribute('href') + '\n')
        download_flag = 1
    else:
        # trid of the newest first-level link already stored in the txt
        f_latest_num = int(f_list[-1].split('=')[1])
        for i in reversed(range(0, 5)):
            # Take the 5 newest first-level links on the site and their trid;
            # append a link to the txt if its trid > the newest trid in the txt
            latest_url = driver.find_elements_by_id('linkId')[i].get_attribute('href')
            latest_url_num = int(latest_url.split('=')[1])
            if f_latest_num < latest_url_num:
                fw.write(latest_url + '\n')
                # Caveat: exactly one new link also yields 1, the same value as a first run
                download_flag = download_flag + 1
    fw.close()
    return download_flag
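Since get_url() reports both "first run" and "one new link" as 1, the comparison logic is easier to reason about when separated from the browser code. The following is only a sketch under assumptions: the function names are invented, and the links are assumed to end in `...?trid=<number>` as the comments above suggest.

```python
def trid_of(url):
    """Return the numeric trid from a link such as '...?trid=123'."""
    return int(url.strip().split('=')[1])


def select_new_urls(known_urls, site_urls):
    """Decide which links need to be appended to url_list.txt.

    known_urls: lines already stored in the txt, oldest first.
    site_urls:  links currently shown on the site, newest first.
    Returns (is_first_run, urls_to_append) with urls_to_append oldest first.
    (Hypothetical helper, not part of the original script.)
    """
    oldest_first = list(reversed(site_urls))
    if not known_urls:
        return True, oldest_first
    latest_known = trid_of(known_urls[-1])
    return False, [u for u in oldest_first if trid_of(u) > latest_known]
```

Returning the first-run flag and the list of new links separately avoids overloading a single integer, so a first run and a single update can no longer be confused.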
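- The second part of the main program: the download_zip() function uses the return value of get_url() to decide whether to visit the second-level download pages and download the zip archives. Only when a download actually happens does it call openFolder.bat to open the house_prices folder as a notification; otherwise the download code does not run and the folder is not opened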
......
from urllib import request
import zipfile
import os
......
def download_zip():
    """
    Read the URLs from the txt file and download the zip archives
    :return:
    """
    flag = get_url()
    if flag == 1:
        # First run: download from every stored link
        fr = open(cur_path + "/house_prices/url_list.txt", 'r')
        all_lines = fr.readlines()
        fr.close()
        for line_url in all_lines:
            driver.get(line_url.strip())
            driver.implicitly_wait(15)
            # The page content lives inside an iframe
            driver.switch_to.frame('showconent1')
            # Link text on the page, meaning "commodity housing price"
            download_url = driver.find_element_by_partial_link_text('商品住房价格')
            download_url = download_url.get_attribute('href')
            zipname = cur_path + '/house_prices/' + download_url.split('/')[6]
            filename = os.path.splitext(zipname)[0]
            request.urlretrieve(download_url, zipname)
            # Extract, then delete the archive
            try:
                with zipfile.ZipFile(zipname) as zfile:
                    zfile.extractall(path=filename)
                if os.path.exists(zipname):
                    os.remove(zipname)
            except zipfile.BadZipFile:
                print(filename + " is a bad zip file, please check!")
        # New data arrived: open the folder
        os.system(cur_path + "/openFolder.bat")
    elif flag > 1:
        # Partial update: only the last `flag` lines of the txt are new
        fr = open(cur_path + "/house_prices/url_list.txt", 'r')
        all_lines = fr.readlines()
        fr.close()
        for line_url in range(flag):
            driver.get(all_lines[-line_url-1].strip())
            driver.implicitly_wait(15)
            driver.switch_to.frame('showconent1')
            download_url = driver.find_element_by_partial_link_text('商品住房价格')
            download_url = download_url.get_attribute('href')
            zipname = cur_path + '/house_prices/' + download_url.split('/')[6]
            filename = os.path.splitext(zipname)[0]
            request.urlretrieve(download_url, zipname)
            # Extract, then delete the archive
            try:
                with zipfile.ZipFile(zipname) as zfile:
                    zfile.extractall(path=filename)
                if os.path.exists(zipname):
                    os.remove(zipname)
            except zipfile.BadZipFile:
                print(filename + " is a bad zip file, please check!")
        # New data arrived: open the folder
        os.system(cur_path + "/openFolder.bat")
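The code above takes the archive name from a fixed position in the link (`split('/')[6]`), which only works while the site keeps the same URL depth. A slightly more tolerant way to derive the local paths could look like the sketch below; the helper name `local_paths` and the example URL are made up for illustration.

```python
import os
import posixpath
from urllib.parse import unquote, urlparse


def local_paths(download_url, base_dir):
    """Return (zip_path, extract_dir) for a download link.

    The archive name is the last segment of the URL path, whatever its
    depth; any query string is ignored and percent-escapes are decoded.
    (Hypothetical helper, not part of the original script.)
    """
    name = posixpath.basename(unquote(urlparse(download_url).path))
    zip_path = os.path.join(base_dir, name)
    # Strip only the final extension, so dots elsewhere in the path survive
    extract_dir = os.path.splitext(zip_path)[0]
    return zip_path, extract_dir
```

For example, `local_paths("http://example.com/a/b/prices.zip", "house_prices")` yields a `prices.zip` path and a `prices` folder, both inside `house_prices`.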
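- autoDownloadZip.vbs
Calls the Python script without opening a cmd window
' Folder that contains this .vbs file
currentpath = createobject("Scripting.FileSystemObject").GetFile(Wscript.ScriptFullName).ParentFolder.Path
' Window style 0 hides the window; the path is quoted in case it contains spaces
createobject("wscript.shell").run """" & currentpath & "\downloadZip.py""", 0
Script added to the scheduled task
Scheduled task settings
- openFolder.bat
Opens the house_prices folder
rem %~dp0 already ends with a backslash; the empty "" is the window title expected by start
start "" "%~dp0house_prices"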
5. Finally
Git repository: https://gitee.com/freedomlidi/autoDownloadZip.git