Python 2.7
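IDE: PyCharm 5.0.3
For details on Selenium and PhantomJS, see "Python+Selenium+PIL+Tesseract: truly automatic CAPTCHA recognition for one-click login".
For some automation examples, see "Selenium+PhantomJS: automatically renewing library books".
For a first introduction to GUIs, see "A two-input rule engine in Python with Tkinter (bare-bones edition)".
For a more complete GUI example, see "A Python-based reference generator 1.0".
BTW, the follow-up advanced post (part 2) is out:
"Python: crawling and storing custom Douban movie categories, rankings, and reviews (advanced, part 2)"
On reflection, I decided to make things a bit more user-friendly and build a finished GUI.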
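Motivation
No way around it: I announced on Zhihu that I would build a GUI, and having bragged, I have to deliver. Next time I'll be more careful. It was also a chance to review GUI work... Really it's just the earlier script tweaked, with the input and output swapped, so don't worry, let's do this!
Goals
1. Building on "Python: crawling and storing custom Douban movie categories, rankings, and reviews (beginner)", add a GUI (me and my big mouth) so there is less typing and more clicking.
2. Keeping everything from 1, add an option for loading reviews, put short and long reviews together, and restructure the code so that (in my view) it is easier to extend and to set conventions for when more crawl targets are added later.
3. And of course package it as an exe in the end, otherwise how would it benefit my friends? For packaging, see "How to package Python into an exe file".
Approach
Implemented with Tkinter + PhantomJS + Selenium + Firefox.
Implementation steps
1. After getting the home page, click the chosen category, then sort as requested. The input here comes from clicking values in a Listbox.
2. Grab each movie and its link, follow the link, and grab that movie's hot reviews and long review.
3. When the requested TOP number exceeds the 20 movies on the first page, click "load more" to show 20 more movies, then repeat step 2.
4. Write the output to the output frame, to a txt file, and so on.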
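The arithmetic behind step 3 can be sketched as below. This is only an illustration: load_more_clicks and PAGE_SIZE are hypothetical names, and the ceiling formula is not the author's. The full program derives the click count with integer division on number+1, which can click once more than strictly needed.

```python
import math

# Assumption taken from the steps above: the first page shows 20 movies
# and every "load more" click reveals 20 more.
PAGE_SIZE = 20

def load_more_clicks(top_n, page_size=PAGE_SIZE):
    """Return how many "load more" clicks are needed to show top_n movies."""
    if top_n <= 0:
        raise ValueError("top_n must be positive")
    # Number of 20-movie pages needed, rounded up.
    pages = int(math.ceil(top_n / float(page_size)))
    # The first page is already visible, so it costs no click.
    return pages - 1

print(load_more_clicks(22))  # → 1
```

A TOP20 request needs no click at all under this formula, while TOP21 through TOP40 need one.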
Results
Running the .py file (TV is not implemented yet; selecting it still crawls movies)
Running the packaged exe file
If you don't want the cmd window, only the GUI, pass the -w option when packaging:
pyinstaller -F -w Selenium_PhantomJS_doubanMvGUI.py
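For the details, see "How to package Python into an exe file".
Program structure
Dumping such a long program on you straight away might be confusing, so I drew a simple diagram.
As for how things nest internally, I couldn't be bothered to draw that. Knowing roughly what these modules do is enough to read the program, which is really simple...
Code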
# -*- coding: utf-8 -*-
# Author: 哈士奇说喵
# GUI version: crawl high-scoring Douban movies and their hot reviews
from selenium import webdriver
import selenium.webdriver.support.ui as ui
import time
from Tkinter import *
print "---------------system loading...please wait...---------------"
# Get each movie's title and URL
def getURL_Title():
    global save_name
    SUMRESOURCES=0
    url="https://movie.douban.com/"
    driver_item=webdriver.Firefox()
    wait = ui.WebDriverWait(driver_item,15)
    # Lookup dictionaries: map each selected label to its position on the page
    Kind_Dict={'Hot':1,'Newest':2,'Classics':3,'Playable':4,'High Scores':5,
               'Wonderful but not popular':6,'Chinese film':7,'Hollywood':8,
               'Korea':9,'Japan':10,'Action movies':11,'Comedy':12,'Love story':13,
               'Science fiction':14,'Thriller':15,'Horror film':16,'Whatever':17}
    # The last category on the site keeps changing, argh
    Sort_Dict={'Sort by hot':1,'Sort by time':2,'Sort by score':3}
    Ask_Dict={'No film reviews':0,'I like film reviews':1}
    # Translate the GUI selections into values
    kind=Kind_Dict[Kind_Select.get(Kind_Select.curselection()).encode('utf-8')]
    sort = Sort_Dict[Sort_Select.get(Sort_Select.curselection()).encode('utf-8')]
    number = int(input_Top.get())
    ask_comments = Ask_Dict[Comment_Select.get(Comment_Select.curselection()).encode('utf-8')]
    save_name=input_SN.get()
    Ans.insert(END,"#####################################################################")
    Ans.insert(END," Reloading ")
    Ans.insert(END,"#####################################################################")
    Ans.insert(END,"---------------------------------------system loading...please wait...------------------------------------------")
    Ans.insert(END,"----------------------------------------------crawling----------------------------------------------")
    Write_txt('\n##########################################################################################','\n##########################################################################################',save_name)
    print "---------------------crawling...---------------------"
    ##############################################################################
    # After getting the page, first simulate a click on the movie category, then on the sort order.
    # Then wait a moment until every element has loaded before crawling; elements that are still hidden cannot be fetched.
    # wait.until blocks until the element has finished loading!
    ##############################################################################
    # Parameters chosen, start crawling
    driver_item.get(url)
    wait.until(lambda driver: driver.find_element_by_xpath("//div[@class='fliter-wp']/div/form/div/div/label[%s]"%kind))
    driver_item.find_element_by_xpath("//div[@class='fliter-wp']/div/form/div/div/label[%s]"%kind).click()
    wait.until(lambda driver: driver.find_element_by_xpath("//div[@class='fliter-wp']/div/form/div[3]/div/label[%s]"%sort))
    driver_item.find_element_by_xpath("//div[@class='fliter-wp']/div/form/div[3]/div/label[%s]"%sort).click()
    num=number+1 # range() excludes its end, so for TOP22 we need 23 here
    time.sleep(2)
    # How many times to click "load more"
    num_time = num/20+1
    wait.until(lambda driver: driver.find_element_by_xpath("//div[@class='list-wp']/a[@class='more']"))
    for times in range(1,num_time):
        time.sleep(2)
        driver_item.find_element_by_xpath("//div[@class='list-wp']/a[@class='more']").click()
        #print 'clicked \'load more\' once'
    # wait.until keeps retrying until the element can be located, roughly a try/except inside a while loop
    for i in range(1,num):
        wait.until(lambda driver: driver.find_element_by_xpath("//div[@class='list']/a[%d]"%num))
        list_title=driver_item.find_element_by_xpath("//div[@class='list']/a[%d]"%i)
        print '----------------------------------------------'+'NO' + str(SUMRESOURCES +1)+'----------------------------------------------'
        print u'Title: ' + list_title.text
        print u'Link: ' + list_title.get_attribute('href')
        # print converts unicode to utf-8 automatically
        # list_title.text is unicode and must be encoded before it is written to the txt file
        list_title_wr=list_title.text.encode('utf-8')
        list_title_url_wr=list_title.get_attribute('href')
        # Write to the GUI output box
        Ans.insert(END,'\n------------------------------------------------'+'NO' + str(SUMRESOURCES +1)+'----------------------------------------------',list_title_wr,list_title_url_wr)
        # Write to the txt file
        Write_txt('\n----------------------------------------------'+'NO' + str(SUMRESOURCES +1)+'----------------------------------------------','',save_name)
        Write_txt(list_title_wr,list_title_url_wr,save_name)
        SUMRESOURCES = SUMRESOURCES +1
        # Fetch the details and reviews; href is each movie's own URL
        try:
            getDetails(str(list_title.get_attribute('href')),ask_comments)
        except:
            print 'can not get the details!'
    # Close the browser once crawling is done; keep only the GUI for the next run
    driver_item.quit()
##############################################################################
# Once a movie is chosen, follow its link; only then can the details be fetched.
# Don't forget to wait for elements to load here too.
# When loading the long review, simulate one click on the small triangle, otherwise the content may stay hidden.
##############################################################################
def getDetails(url,comments):
    driver_detail = webdriver.PhantomJS(executable_path="phantomjs.exe")
    wait1 = ui.WebDriverWait(driver_detail,15)
    driver_detail.get(url)
    wait1.until(lambda driver: driver.find_element_by_xpath("//div[@id='link-report']/span"))
    drama = driver_detail.find_element_by_xpath("//div[@id='link-report']/span")
    print u"Synopsis: "+drama.text
    drama_wr=drama.text.encode('utf-8')
    # Write to the GUI output box
    Ans.insert(END,drama_wr)
    # Write to the txt file
    Write_txt(drama_wr,'',save_name)
    # Load reviews
    if comments == 1:
        print "--------------------------------------------Hot comments TOP----------------------------------------------"
        # Load four short reviews
        for i in range(1,5):
            try:
                comments_hot = driver_detail.find_element_by_xpath("//div[@id='hot-comments']/div[%s]/div/p"%i)
                print u"Latest hot review: "+comments_hot.text
                comments_hot_wr=comments_hot.text.encode('utf-8')
                Ans.insert(END,"--------------------------------------------Hot comments TOP%d----------------------------------------------"%i,comments_hot_wr)
                Write_txt("--------------------------------------------Hot comments TOP%d----------------------------------------------"%i,'',save_name)
                Write_txt(comments_hot_wr,'',save_name)
            except:
                print 'can not caught the comments!'
        # Try to load the long review
        try:
            driver_detail.find_element_by_xpath("//img[@class='bn-arrow']").click()
            #wait.until(lambda driver: driver.find_element_by_xpath("//div[@class='review-bd']/div[2]/div/div"))
            time.sleep(1)
            # Work around the spoiler notice that otherwise blocks the long review
            # (the literal is Douban's notice: "Note: this review may contain spoilers")
            comments_get = driver_detail.find_element_by_xpath("//div[@class='review-bd']/div[2]/div")
            if comments_get.text.encode('utf-8')=='提示: 这篇影评可能有剧透':
                comments_deep=driver_detail.find_element_by_xpath("//div[@class='review-bd']/div[2]/div[2]")
            else:
                comments_deep = comments_get
            print "--------------------------------------------long-comments---------------------------------------------"
            print u"In-depth long review: "+comments_deep.text
            comments_deep_wr=comments_deep.text.encode('utf-8')
            # Write to the GUI output box
            Ans.insert(END,"--------------------------------------------long-comments---------------------------------------------\n",comments_deep_wr)
            Write_txt("--------------------------------------------long-comments---------------------------------------------\n",'',save_name)
            Write_txt(comments_deep_wr,'',save_name)
        except:
            print 'can not caught the deep_comments!'
    # Close PhantomJS so that one process is not left running per movie
    driver_detail.quit()
##############################################################################
# Also write what print outputs to a txt file (it can still be viewed in cmd); the newlines are cosmetic
##############################################################################
def Write_txt(text1='',text2='',title='douban.txt'):
    with open(title,"a") as f:
        for i in text1:
            f.write(i)
        f.write("\n")
        for j in text2:
            f.write(j)
        f.write("\n")
def Clea(): # clear everything
    input_Top.delete(0,END) # Entry.delete takes index 0
    input_SN.delete(0,END)
    Ans.delete(0,END) # Ans is a Listbox; a Text widget would use '1.0' instead
root=Tk()
root.title('Douban Movie/TV Crawler beta -- by 哈士奇说喵')
#------------------------------------------Input frame--------------------------------------
frame_select=Frame(root)
title_label=Label(root,text='Douban Movie/TV TOP Crawler')
title_label.pack()
#---------------Choose movies / TV-------------------
Mov_Tv=Listbox(frame_select,exportselection=False,width=9,height=4)
list_item1 = ['Movies','TV']
for i in list_item1:
    Mov_Tv.insert(END,i)
scr_MT = Scrollbar(frame_select)
Mov_Tv.configure(yscrollcommand = scr_MT.set)
scr_MT['command']=Mov_Tv.yview
#---------------Movies / TV: choose the category-------------------
Kind_Select=Listbox(frame_select,exportselection=False,width=12,height=4)
list_item2 = ['Hot','Newest','Classics','Playable','High Scores',
              'Wonderful but not popular','Chinese film','Hollywood',
              'Korea','Japan','Action movies','Comedy','Love story',
              'Science fiction','Thriller','Horror film','Whatever']
for i in list_item2:
    Kind_Select.insert(END,i)
scr_Kind = Scrollbar(frame_select)
Kind_Select.configure(yscrollcommand = scr_Kind.set)
scr_Kind['command']=Kind_Select.yview
#---------------Movies / TV: choose the sort order-------------------
Sort_Select=Listbox(frame_select,exportselection=False,width=12,height=4)
list_item3 = ['Sort by hot','Sort by time','Sort by score']
for i in list_item3:
    Sort_Select.insert(END,i)
scr_Sort = Scrollbar(frame_select)
Sort_Select.configure(yscrollcommand = scr_Sort.set)
scr_Sort['command']=Sort_Select.yview
#---------------Movies / TV: load reviews or not-------------------
Comment_Select=Listbox(frame_select,exportselection=False,width=16,height=4)
list_item4 = ['No film reviews','I like film reviews']
for i in list_item4:
    Comment_Select.insert(END,i)
scr_Com = Scrollbar(frame_select)
Comment_Select.configure(yscrollcommand = scr_Com.set)
scr_Com['command']=Comment_Select.yview
#---------------Movies / TV: choose the TOP number-------------------
Label_TOP=Label(frame_select, text='TOP(xx)', font=('',10))
var_Top = StringVar()
input_Top = Entry(frame_select, textvariable=var_Top,width=8)
#---------------Movies / TV: save file name-------------------
Label_SN=Label(frame_select, text='SAVE_NAME(xx.txt)', font=('',10))
var_SN = StringVar()
input_SN = Entry(frame_select, textvariable=var_SN,width=8)
#----------------------------------------------Output frame-----------------------------------------
frame_output=Frame(root)
out_label=Label(frame_output,text='Details')
Ans = Listbox(frame_output,selectmode=MULTIPLE, height=15,width=80) # a Text widget would also work; the advantage of Listbox is one item per line
# Clicking crawl_button calls getURL_Title(); clicking clear_button calls Clea()
crawl_button = Button(frame_output,text='crawl', command=getURL_Title)
clear_button = Button(frame_output,text='clear', command=Clea)
# Vertical scrollbar
scr_Out_y = Scrollbar(frame_output)
Ans.configure(yscrollcommand = scr_Out_y.set)
scr_Out_y['command']=Ans.yview
# Horizontal scrollbar
scr_Out_x = Scrollbar(frame_output,orient='horizontal')#ans x
Ans.configure(xscrollcommand = scr_Out_x.set)
scr_Out_x['command']=Ans.xview
#----------------------------------------------Layout-----------------------------------------
#----------------Show the selection frame--------------
frame_select.pack()
# Movies/TV list
Mov_Tv.pack(side=LEFT)
scr_MT.pack(side=LEFT)
# Category list
Kind_Select.pack(side=LEFT)
scr_Kind.pack(side=LEFT)
# Sort list
Sort_Select.pack(side=LEFT)
scr_Sort.pack(side=LEFT)
# Review option list
Comment_Select.pack(side=LEFT)
scr_Com.pack(side=LEFT)
# TOP input
Label_TOP.pack()
input_Top.pack()
# SAVE NAME input
Label_SN.pack()
input_SN.pack()
#----------------Show the output frame--------------
frame_output.pack()
out_label.pack()
crawl_button.pack(side=LEFT)
clear_button.pack(side=RIGHT)
scr_Out_y.pack(side=RIGHT)
Ans.pack()
scr_Out_x.pack()
#----------------Start the root window's event loop--------------
root.mainloop()
I won't walk through the code; reading the comments carefully should be enough.
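One helper may still deserve a note. Write_txt in the listing loops over each string character by character and expects text that has already been encoded to UTF-8. A possible alternative, offered only as a sketch and not as the author's code, is to pass unicode text and let io.open do the encoding. The name write_txt and the temporary file in the demo are hypothetical.

```python
import io
import os
import tempfile

def write_txt(text1=u'', text2=u'', title='douban.txt'):
    """Append text1 and text2 to the file, each on its own line, as UTF-8."""
    # io.open encodes on write, so callers pass unicode text
    # and do not call .encode('utf-8') themselves.
    with io.open(title, 'a', encoding='utf-8') as f:
        f.write(text1 + u'\n')
        f.write(text2 + u'\n')

# Demo: append a title and a link to a throwaway file.
demo_path = os.path.join(tempfile.mkdtemp(), 'douban_demo.txt')
write_txt(u'Inception', u'https://movie.douban.com/', demo_path)
```

As in the original helper, an empty second argument still produces a blank line, so the layout of the txt file stays the same.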
Problems, fixes & tips
1. The post "Python: crawling and storing custom Douban movie categories, rankings, and reviews (beginner)" left out the synopsis output. That was careless of me, so here it is. Adding the following code is enough (already fixed in the earlier post):
drama_wr=drama.text.encode('utf-8')
Write_txt(drama_wr,'',save_name)
2. The notice "提示: 这篇影评可能有剧透" ("Note: this review may contain spoilers") appears and fetching the long review fails (already fixed in the earlier post), as shown in the screenshot.
The problem is in this statement:
comments_deep=driver_detail.find_element_by_xpath("//div[@class='review-bd']/div[2]/div")
2. Fix: inspect the page elements to see what is actually causing trouble.
OK, one look shows that the markup has gained an extra div for no apparent reason. Easy: add a conditional to handle it.
# Work around the spoiler notice that otherwise blocks the long review
comments_get = driver_detail.find_element_by_xpath("//div[@class='review-bd']/div[2]/div")
if comments_get.text.encode('utf-8')=='提示: 这篇影评可能有剧透':
    comments_deep=driver_detail.find_element_by_xpath("//div[@class='review-bd']/div[2]/div[2]")
else:
    comments_deep = comments_get
3. Kind has 17 options. Writing one if statement per option is tedious and redundant; it would look like this, 17 times over:
if Kind_Select.get(Kind_Select.curselection()).encode('utf-8')=='Hot':
    kind = 1
3. Fix: use a dictionary! For mapping keys to values, is there anything better than a dictionary? Taking the Kind input as an example:
# Lookup dictionary: map each selected label to its position on the page
Kind_Dict={'Hot':1,'Newest':2,'Classics':3,'Playable':4,'High Scores':5,
           'Wonderful but not popular':6,'Chinese film':7,'Hollywood':8,
           'Korea':9,'Japan':10,'Action movies':11,'Comedy':12,'Love story':13,
           'Science fiction':14,'Thriller':15,'Horror film':16,'Whatever':17}
# The last category on the site keeps changing, argh
kind=Kind_Dict[Kind_Select.get(Kind_Select.curselection()).encode('utf-8')]
4. Only movie crawling is implemented so far. TV is not done; the option is just a placeholder, so please don't select "TV" when testing because there is nothing behind it. I'll probably fill it in later if I find the time. Another hole dug for myself 0.0
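A small follow-up to tip 3: the program spells out the 17 category labels twice, once in the Listbox item list and once in Kind_Dict. One possible way to keep the two in step, shown here as a sketch rather than as what the program does, is to derive the dictionary from the list with enumerate. KIND_LABELS, KIND_DICT and kind_index are hypothetical names.

```python
# The same 17 labels, in the same order as the Listbox items in the listing.
KIND_LABELS = ['Hot', 'Newest', 'Classics', 'Playable', 'High Scores',
               'Wonderful but not popular', 'Chinese film', 'Hollywood',
               'Korea', 'Japan', 'Action movies', 'Comedy', 'Love story',
               'Science fiction', 'Thriller', 'Horror film', 'Whatever']

# XPath positions are 1-based, so numbering starts at 1.
KIND_DICT = {label: position
             for position, label in enumerate(KIND_LABELS, start=1)}

def kind_index(label):
    """Return the 1-based position of a category label."""
    try:
        return KIND_DICT[label]
    except KeyError:
        raise ValueError("unknown category: %r" % (label,))

print(kind_index('Science fiction'))  # → 14
```

With this layout, adding or reordering a category only touches the list.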
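A strange problem
After packaging, the exe fails to fetch the long review for some movies. I have ruled out a bug in the program: the very same source works when run in the Python environment, but the exe cannot fetch certain long reviews... No solution so far.
See the screenshot, using Inception as the example.
Yet the same program does fetch the long review when run under Python.
I really have no answer for this; it may be a PyInstaller bug.
Download the finished EXE
It is a bundle that also includes the source of the previous cmd version.
Python-based custom Douban movie crawler, GUI version
Finally
Testing took quite a while, mainly because Selenium is rather slow and Firefox uses a lot of resources, which makes this approach poorly suited to crawling large volumes of data. Does anyone know how to crawl dynamic data at scale? If so, please leave a comment.
PS
Provinces everywhere are flooded, and Harbin has finally had heavy rain too. Stay safe when you go out. Speaking of which, will I still have a home to go back to...
Acknowledgements
@MrLevo520 -- A partial workaround for PhantomJS being unable to simulate click() when called from Selenium
@MrLevo520 -- Saving Python print output to a txt file
@MrLevo520 -- Several ways to fix elements that cannot be located (NoSuchElementException: Unable to locate element)
@Eastmount -- [Python crawler] Dynamically fetching CSDN download resource info and comments with Selenium+Phantomjs
@Eastmount -- [Python crawler] Installing PIP+Phantomjs+Selenium on Windows
@MrLevo520 -- Fixing Selenium being unable to locate elements in a newly opened page (Unable to locate element)
@MrLevo520 -- Python: crawling and storing custom Douban movie categories, rankings, and reviews (beginner)
@MrLevo520 -- A two-input rule engine in Python with Tkinter (bare-bones edition)
@MrLevo520 -- A Python-based reference generator 1.0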