Automated Testing: Selenium
What is Selenium?
Selenium automates browsers. That's it! What you do with that power is entirely up to you. Primarily, it is for automating web applications for testing purposes, but is certainly not limited to just that. Boring web-based administration tasks can (and should!) be automated as well.
Selenium has the support of some of the largest browser vendors who have taken (or are taking) steps to make Selenium a native part of their browser. It is also the core technology in countless other browser automation tools, APIs and frameworks.
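Background
In many situations testers need automation tools to test more efficiently, and Selenium is a tool built specifically for browser automation. It can fully simulate the operations a user performs in a browser, which frees programmers from handling cookies, headers, requests and the like by hand.
Why did I need Selenium? I picked up a small job on 小灯神的心愿: a junior schoolmate asked me to scrape every paper (title, source, keywords) of a particular scholar from the IEEEXplore website. The site loads its content asynchronously, so an ordinary crawler gets no data at all. A quick search suggested capturing the JS requests instead, but I have barely learned any JS, so I dropped that approach. I had heard that Selenium could fetch the data automatically, so I started learning Selenium.
Environment Setup
Download the driver that matches your browser from the Selenium website. I use Chrome, so I downloaded chromedriver from https://sites.google.com/a/chromium.org/chromedriver/downloads. You may need a proxy to reach that page, or you can look for a domestic mirror.
Put chromedriver.exe in the project root directory. Next, let's see how to operate the driver.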
-
The official site has a getting-started guide: https://sites.google.com/a/chromium.org/chromedriver/getting-started. Here is its Python version:
# Python:
import time

from selenium import webdriver
import selenium.webdriver.chrome.service as service

service = service.Service('/path/to/chromedriver')
service.start()
capabilities = {'chrome.binary': '/path/to/custom/chrome'}
driver = webdriver.Remote(service.service_url, capabilities)
driver.get('http://www.google.com/xhtml')
time.sleep(5)  # Let the user actually see something!
driver.quit()
-
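In practice it does not need to be as complicated as the official tutorial. The following code directly opens a Chrome window controlled by the automation tool:
from selenium import webdriver

driver = webdriver.Chrome(executable_path='chromedriver.exe')
Run the two lines above with the exe file in the same folder, and a Chrome browser window opens:
[Figure: 20171118-auto]
At this point the environment is set up.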
Basic Selenium Operations
Someone has produced a Chinese translation of the docs, which is worth consulting: http://python-selenium-zh.readthedocs.io/zh_CN/latest/
-
Open a web page:
driver.get("http://www.baidu.com")
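The driver.get method opens the requested URL. WebDriver returns only after the page has finished loading, so the script blocks at this call until the page's content has loaded and then continues. Note: if the page relies heavily on Ajax, content fetched after the initial load may still be missing, because the program cannot tell whether everything has finished loading.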
-
Find a single page element:
find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector
Find a group of page elements:
find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector
Suppose the page contains an input box like this:
<input type="text" name="passwd" id="passwd-id" />
Any of the following calls can find it (although the match is not necessarily unique):
element = driver.find_element_by_id("passwd-id")
element = driver.find_element_by_name("passwd")
element = driver.find_element_by_tag_name("input")  # returns the first <input> on the page
element = driver.find_element_by_xpath("//input[@id='passwd-id']")
-
The element object itself is not what we are after; the valuable part is the text or link it contains:
text = element.text
link = element.get_attribute('href')
-
Once you have a text field element, the natural next step is to type into it, which the following method does:
element.send_keys("some text")
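You can also use the Keys class to simulate pressing a particular key.
from selenium.webdriver.common.keys import Keys

element.send_keys("and some", Keys.ARROW_DOWN)
Typed text is appended to whatever the field already contains. Use the following method to clear the field.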
element.clear()
-
Drop-down lists can be handled with the Select class:
from selenium.webdriver.support.ui import Select

select = Select(driver.find_element_by_name('name'))
select.select_by_index(index)
select.select_by_visible_text("text")
select.select_by_value(value)
select.deselect_all()
all_selected_options = select.all_selected_options
-
Submit a form:
driver.find_element_by_id("submit").click()
-
Cookie handling:
cookie = {'name': 'foo', 'value': 'bar'}
driver.add_cookie(cookie)
driver.get_cookies()
-
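Waiting for the page:
This part is very important. More and more pages use Ajax, so the program cannot be sure when a given element has fully loaded. That makes elements harder to locate and raises the chance of an ElementNotVisibleException.
Selenium therefore offers two kinds of waits: implicit waits and explicit waits.
An implicit wait sets a maximum time that WebDriver keeps looking for an element that is not found right away: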
driver.implicitly_wait(10) # seconds
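An explicit wait names a condition and continues as soon as that condition holds. Commonly used conditions:
title_is: the title equals the given text
title_contains: the title contains the given text
presence_of_element_located: the element is present in the DOM; takes a locator tuple such as (By.ID, 'p')
visibility_of_element_located: the element is visible; takes a locator tuple
visibility_of: the element is visible; takes an element object
presence_of_all_elements_located: all matching elements are present in the DOM
text_to_be_present_in_element: the element's text contains the given text
text_to_be_present_in_element_value: the element's value contains the given text
frame_to_be_available_and_switch_to_it: the frame has loaded; switches to it
invisibility_of_element_located: the element is not visible
element_to_be_clickable: the element is clickable
staleness_of: checks whether an element is still attached to the DOM; useful for telling whether the page has refreshed
element_to_be_selected: the element is selected; takes an element object
element_located_to_be_selected: the element is selected; takes a locator tuple
element_selection_state_to_be: takes an element object and a state; returns True if they match, otherwise False
element_located_selection_state_to_be: takes a locator tuple and a state; returns True if they match, otherwise False
alert_is_present: whether an alert is showing
Official API: http://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.support.expected_conditions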
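The idea behind an explicit wait can be tried out without a browser: poll a condition until it returns something truthy or a timeout expires. The helper below is a hypothetical stand-in written for illustration (wait_until and WaitTimeout are names made up here), not Selenium's own implementation. With Selenium itself the same role is played by WebDriverWait(driver, timeout).until(condition) combined with one of the conditions listed above.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became truthy (hypothetical name)."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Usage: a fake "element" that only shows up on the third poll.
attempts = {"count": 0}


def element_appeared():
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None


found = wait_until(element_appeared, timeout=2.0, poll_interval=0.01)
```

Unlike a fixed time.sleep, this returns as soon as the condition holds, which is why explicit waits tend to suit Ajax-heavy pages.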
-
Browser back and forward:
driver.back()
driver.forward()
IEEEXplore in Practice
-
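[Figure: 20171118-zhangbo]
The page lists all articles by the scholar Zhang Bo (split across two pages). The first things to scrape are the paper titles, which is fairly easy, and the sources are easy too. For example, the first article in the figure above is titled "Smale Horseshoes and Symbolic Dynamics in the Buck–Boost DC–DC Converter" and its source is "IEEE Transactions on Industrial Electronics".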
-
A group of such elements can be found with find_elements_by_css_selector:
# result lists, filled in page by page
article_names, article_links, article_sources = [], [], []

# get the title elements of all articles on this page
article_name_ele_list = driver.find_elements_by_css_selector("h2 a.ng-binding.ng-scope")
for article_name_ele in article_name_ele_list:
    # for each title element, extract the title text (a string) and the article url
    article_name = article_name_ele.text
    article_link = article_name_ele.get_attribute('href')
    article_names.append(article_name)
    print("article_name = ", article_name)
    article_links.append(article_link)
    print("article_link = ", article_link)

# get the source elements of all articles on this page
article_source_ele_list = driver.find_elements_by_css_selector("div.description.u-mb-1 a.ng-binding.ng-scope")
for article_source_ele in article_source_ele_list:
    # for each source element, extract the source text (a string)
    article_source = article_source_ele.text
    article_sources.append(article_source)
    print("article_source =", article_source)
-
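Paging through the results is awkward. There is a page-number toolbar at the bottom, but every button uses an on-click handler that calls a custom function, which is JS territory again and a bit of a hassle. Then I noticed a pattern in how the URL changes.
The entry point (that is, the first page) looks like this: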
http://ieeexplore.ieee.org/search/searchresult.jsp?queryText=(%22Authors%22:Zhang%20Bo)&refinements=4224983357&matchBoolean=true&searchField=Search_All
The second page looks like this:
http://ieeexplore.ieee.org/search/searchresult.jsp?queryText=(%22Authors%22:Zhang%20Bo)&refinements=4224983357&matchBoolean=true&pageNumber=2&searchField=Search_All
The only difference is an extra pageNumber parameter. So what happens if pageNumber is manually set to 3?
[Figure: 20171118-notfound]
-
This means I can ignore the page-number toolbar entirely and page through the results just by changing the URL.
import time

pageNumber = 1
while True:
    driver.get(
        'http://ieeexplore.ieee.org/search/searchresult.jsp?queryText=(%22Authors%22:Zhang%20Bo)&refinements=4224983357&matchBoolean=true&pageNumber='
        + str(pageNumber) + '&searchField=Search_All')
    time.sleep(5)
    print("start to check if this is the last page !!!")
    try:
        # on a normal results page this element is missing, so an exception is raised
        driver.find_element_by_css_selector("p.List-results-none--lg.u-mb-0")
    except Exception as e:
        print("This page is good to go !!!")
    else:
        print("The last page !!!")
        break
    article_name_ele_list = driver.find_elements_by_css_selector("h2 a.ng-binding.ng-scope")
    for article_name_ele in article_name_ele_list:
        article_name = article_name_ele.text
        article_link = article_name_ele.get_attribute('href')
        article_names.append(article_name)
        print("article_name = ", article_name)
        article_links.append(article_link)
        print("article_link = ", article_link)
    article_source_ele_list = driver.find_elements_by_css_selector("div.description.u-mb-1 a.ng-binding.ng-scope")
    for article_source_ele in article_source_ele_list:
        article_source = article_source_ele.text
        article_sources.append(article_source)
        print("article_source =", article_source)
    pageNumber += 1
-
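Explanation:
Start at page 1 and enter the while loop. Each iteration first checks whether the current page is the notfound page. If it is, the previous page was the last one, so the loop exits. Otherwise the article titles, links and sources are collected, and finally pageNumber is incremented by one.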
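The control flow described above can be separated from the browser calls, which makes it possible to try out the loop without Selenium. The following is a minimal sketch under that assumption: build_page_url, crawl_all_pages and the three callables are hypothetical names introduced here, and the fake site stands in for the real pages. The URL pieces are copied from the two URLs shown earlier.

```python
BASE_URL = (
    "http://ieeexplore.ieee.org/search/searchresult.jsp"
    "?queryText=(%22Authors%22:Zhang%20Bo)"
    "&refinements=4224983357&matchBoolean=true"
)


def build_page_url(page_number):
    """Return the search-result URL for the given 1-based page number."""
    return BASE_URL + "&pageNumber=" + str(page_number) + "&searchField=Search_All"


def crawl_all_pages(load_page, is_not_found, scrape_page, max_pages=1000):
    """Visit pages 1, 2, 3, ... until the not-found page is reached.

    load_page(url) opens a page, is_not_found() reports whether the
    current page is the "no results" page, and scrape_page() returns
    the records found on the current page.
    """
    records = []
    for page_number in range(1, max_pages + 1):
        load_page(build_page_url(page_number))
        if is_not_found():
            break  # the previous page was the last one
        records.extend(scrape_page())
    return records


# Usage with a fake two-page site instead of a real browser.
fake_site = {1: ["paper A", "paper B"], 2: ["paper C"]}
state = {"page": None}


def fake_load(url):
    # Recover the page number from the URL built above.
    state["page"] = int(url.split("pageNumber=")[1].split("&")[0])


def fake_not_found():
    return state["page"] not in fake_site


def fake_scrape():
    return fake_site[state["page"]]


all_records = crawl_all_pages(fake_load, fake_not_found, fake_scrape)
```

With a real driver, load_page could simply be driver.get, and is_not_found could wrap the lookup of "p.List-results-none--lg.u-mb-0" in a try/except. The max_pages cap is an added safety net so the loop cannot run forever if the not-found check ever stops matching.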
Getting Article Keywords
-
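Good, the hardest part is getting started. We now have links to this scholar's 20 papers, and we need to open each one and collect its keywords. Opening the first article's link, however, shows the "Abstract" tab by default, and "Keywords" has to be clicked to see them.
[Figures: 20171118-abstract_url, 20171118-Keywords_url]
Looking at the URL, though, we are in luck: appending 'keywords' to the article URL (which already ends in '/') is all it takes.
But how do we actually extract the keywords? Note that this article has two keyword categories: IEEE Keywords and Author Keywords. Some articles have more than these two and may also include INSPEC: Controlled Indexing and INSPEC: Non-Controlled Indexing.
Even once we know these four categories, the keywords themselves are not fixed. It seems the only thing that ties a keyword to its category is the element hierarchy.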
-
This is where XPath needs an introduction.
XPath, the XML Path Language, is a language for addressing parts of an XML document.
XPath is based on the tree structure of XML and provides a way to find nodes in that tree. It was originally proposed as a common syntax model shared between XPointer and XSL, but developers soon adopted it as a small query language. Here, every keyword sits in a group of nodes that follows its keyword category, so the following-sibling axis can be used to select that group of keyword elements.
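To make the following-sibling idea concrete without a browser, here is a small sketch using Python's standard xml.etree.ElementTree. That module does not support the following-sibling axis, so the helper below emulates it by slicing the parent's children, which is a swapped-in technique rather than real XPath evaluation. The markup and the keywords in it are made-up sample data, simplified from the structure implied by the XPath used later in this post.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified markup: each category <li> holds a <strong> label
# followed by a sibling <ul> with the keyword links. Sample data only.
SAMPLE = """
<div id="7926377">
  <div>
    <ul>
      <li>
        <strong>IEEE Keywords</strong>
        <ul>
          <li><a>Orbits</a></li>
          <li><a>Switches</a></li>
        </ul>
      </li>
      <li>
        <strong>Author Keywords</strong>
        <ul>
          <li><a>Chaos</a></li>
        </ul>
      </li>
    </ul>
  </div>
</div>
"""


def following_siblings(parent, node):
    """Return the children of parent that come after node (like following-sibling::*)."""
    children = list(parent)
    return children[children.index(node) + 1:]


def keywords_by_category(root):
    """Map each category label to the keyword texts in its following siblings."""
    result = {}
    for item in root.findall("./div/ul/li"):
        strong = item.find("strong")
        keywords = []
        for sibling in following_siblings(item, strong):
            keywords.extend(a.text for a in sibling.findall("./li/a"))
        result[strong.text] = keywords
    return result


root = ET.fromstring(SAMPLE)
keywords = keywords_by_category(root)
```

In Selenium the browser evaluates the XPath itself, so strong/following-sibling::*/li/a does in one expression what following_siblings and the inner findall do here.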
-
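The URLs of all articles were stored in article_links above. Here we also need a regular expression to extract each article's article_id:
import re
import time

article_keywords = []  # one dict of keywords per article

# get into articles page
for article_link in article_links:
    driver.get(article_link + "keywords")
    article_id = re.findall("[0-9]+", article_link)[0]
    time.sleep(3)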
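The regular expression above takes the first run of digits anywhere in the URL, which works for the links collected here. A slightly stricter variant, sketched below with hypothetical helper names (keywords_url, extract_article_id), anchors the match on the /document/ path segment seen in those links, so digits elsewhere in a URL could not be picked up by mistake.

```python
import re


def keywords_url(article_link):
    """Return the keywords-tab URL; article links here already end in '/'."""
    return article_link + "keywords"


def extract_article_id(article_link):
    """Return the numeric id that follows '/document/' in an article link."""
    match = re.search(r"/document/(\d+)", article_link)
    if match is None:
        raise ValueError("no article id found in %r" % article_link)
    return match.group(1)


link = "http://ieeexplore.ieee.org/document/7926377/"
article_id = extract_article_id(link)
tab_url = keywords_url(link)
```

Raising an error on an unexpected link is a design choice for this sketch; the original indexing with [0] would fail with an IndexError in the same situation.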
-
Create a dictionary with four keys, one list for each keyword category:
    # (still inside the for loop) get into keywords page
    dic = {}
    dic['IEEE Keywords'] = []
    dic['INSPEC: Controlled Indexing'] = []
    dic['INSPEC: Non-Controlled Indexing'] = []
    dic['Author Keywords'] = []
-
First find the keyword-category elements, then use following-sibling to find the actual keywords under each of them:
    # (still inside the for loop)
    # ['IEEE Keywords', 'INSPEC: Controlled Indexing', 'INSPEC: Non-Controlled Indexing', 'Author Keywords']
    keywords_type_list = driver.find_elements_by_css_selector("li.doc-keywords-list-item.ng-scope strong.ng-binding")
    for i in range(len(keywords_type_list)):
        # locate each keyword category, then extract all keywords under it
        li = []
        keywords_ele_list = driver.find_elements_by_xpath(
            ".//*[@id=" + article_id + "]/div/ul/li[" + str(i+1) + "]/strong/following-sibling::*/li/a")
        for j in keywords_ele_list:
            li.append(j.text)
        dic[keywords_type_list[i].text] = li
    article_keywords.append(dic)
-
Finally, write everything out to a csv file:
import csv
from pprint import pprint

# already get all data, now output to the csv file
pprint(article_keywords)
with open("ieee_zhangbo_.csv", "w", newline="") as f:
    csvwriter = csv.writer(f, dialect="excel")
    csvwriter.writerow(['article_name', 'article_source', 'article_link', 'IEEE Keywords',
                        'INSPEC: Controlled Indexing', 'INSPEC: Non-Controlled Indexing', 'Author Keywords'])
    for i in range(len(article_names)):
        csvwriter.writerow([article_names[i], article_sources[i], article_links[i],
                            article_keywords[i]['IEEE Keywords'],
                            article_keywords[i]['INSPEC: Controlled Indexing'],
                            article_keywords[i]['INSPEC: Non-Controlled Indexing'],
                            article_keywords[i]['Author Keywords']])
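One detail of the code above: each keyword cell receives a Python list, so the csv module writes its repr (brackets and quotes included). If plain text cells are preferred, the lists can be joined first. The sketch below shows one way to do that, writing to an in-memory buffer so it runs without touching the disk. write_rows and the "; " separator are choices made for this illustration, and the keywords k1 and k2 are placeholder sample data.

```python
import csv
import io

CATEGORIES = [
    "IEEE Keywords",
    "INSPEC: Controlled Indexing",
    "INSPEC: Non-Controlled Indexing",
    "Author Keywords",
]


def write_rows(out, names, sources, links, keywords):
    """Write one csv row per article, joining each keyword list with '; '."""
    writer = csv.writer(out, dialect="excel")
    writer.writerow(["article_name", "article_source", "article_link"] + CATEGORIES)
    for name, source, link, kw in zip(names, sources, links, keywords):
        # A category missing from the dict becomes an empty cell.
        cells = ["; ".join(kw.get(category, [])) for category in CATEGORIES]
        writer.writerow([name, source, link] + cells)


buffer = io.StringIO(newline="")
write_rows(
    buffer,
    ["A Z-Source Half-Bridge Converter"],
    ["IEEE Transactions on Industrial Electronics"],
    ["http://ieeexplore.ieee.org/document/6494636/"],
    [{"IEEE Keywords": ["k1", "k2"], "Author Keywords": []}],  # placeholder keywords
)
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
```

Using zip instead of indexing also means a length mismatch between the lists truncates the output rather than raising an IndexError, which may or may not be what you want.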
-
Output:
"C:\Program Files\Python36\python.exe" D:/PythonProject/immoc/IEEEXplorer_get_article.py
start to check if this is the last page !!!
This page is good to go !!!
article_name = Smale Horseshoes and Symbolic Dynamics in the Buck–Boost DC–DC Converter
article_link = http://ieeexplore.ieee.org/document/7926377/
article_name = A Novel Single-Input–Dual-Output Impedance Network Converter
article_link = http://ieeexplore.ieee.org/document/7827092/
article_name = A Z-Source Half-Bridge Converter
article_link = http://ieeexplore.ieee.org/document/6494636/
article_name = Design of Analogue Chaotic PWM for EMI Suppression
article_link = http://ieeexplore.ieee.org/document/5590287/
article_name = A novel H5-D topology for transformerless photovoltaic grid-connected inverter application
article_link = http://ieeexplore.ieee.org/document/7512376/
article_name = A Common Grounded Z-Source DC–DC Converter With High Voltage Gain
article_link = http://ieeexplore.ieee.org/document/7378484/
article_name = Frequency Splitting Phenomena of Magnetic Resonant Coupling Wireless Power Transfer
article_link = http://ieeexplore.ieee.org/document/6971783/
article_name = Modeling and analysis of the stable power supply based on the magnetic flux leakage transformer
article_link = http://ieeexplore.ieee.org/document/7037927/
article_name = On Thermal Impact of Chaotic Frequency Modulation SPWM Techniques
article_link = http://ieeexplore.ieee.org/document/7736981/
article_name = Extended Switched-Boost DC-DC Converters Adopting Switched-Capacitor/Switched-Inductor Cells for High Step-up Conversion
article_link = http://ieeexplore.ieee.org/document/7790823/
article_source = IEEE Transactions on Industrial Electronics
article_source = IEEE Journal of Emerging and Selected Topics in Power Electronics
article_source = IEEE Transactions on Industrial Electronics
article_source = IEEE Transactions on Electromagnetic Compatibility
article_source = 2016 IEEE 8th International Power Electronics and Motion Control Conference (IPEMC-ECCE Asia)
article_source = IEEE Transactions on Industrial Electronics
article_source = IEEE Transactions on Magnetics
article_source = 2014 International Power Electronics and Application Conference and Exposition
article_source = IEEE Transactions on Industrial Electronics
article_source = IEEE Journal of Emerging and Selected Topics in Power Electronics
start to check if this is the last page !!!
This page is good to go !!!
article_name = Common-Mode Electromagnetic Interference Calculation Method for a PV Inverter With Chaotic SPWM
article_link = http://ieeexplore.ieee.org/document/7120165/
article_name = Stability Analysis of the Coupled Synchronous Reluctance Motor Drives
article_link = http://ieeexplore.ieee.org/document/7460928/
article_name = A modified AGREE reliability allocation method research in power converter
article_link = http://ieeexplore.ieee.org/document/7107251/
article_name = A single-switch high step-up converter without coupled inductor
article_link = http://ieeexplore.ieee.org/document/7512635/
article_name = Hybrid Z-Source Boost DC–DC Converters
article_link = http://ieeexplore.ieee.org/document/7563395/
article_name = A study of hybrid control algorithms for buck-boost converter based on fixed switching frequency
article_link = http://ieeexplore.ieee.org/document/6566548/
article_name = Bifurcation and Border Collision Analysis of Voltage-Mode-Controlled Flyback Converter Based on Total Ampere-Turns
article_link = http://ieeexplore.ieee.org/document/5729352/
article_name = Frequency, Impedance Characteristics and HF Converters of Two-Coil and Four-Coil Wireless Power Transfer
article_link = http://ieeexplore.ieee.org/document/6783963/
article_name = Sneak circuit analysis for a DCM flyback DC-DC converter considering parasitic parameters
article_link = http://ieeexplore.ieee.org/document/7512450/
article_name = Detecting bifurcation types in DC-DC switching converters by duplicate symbolic sequence
article_link = http://ieeexplore.ieee.org/document/6572495/
article_source = IEEE Transactions on Magnetics
article_source = IEEE Transactions on Circuits and Systems II: Express Briefs
article_source = 2014 10th International Conference on Reliability, Maintainability and Safety (ICRMS)
article_source = 2016 IEEE 8th International Power Electronics and Motion Control Conference (IPEMC-ECCE Asia)
article_source = IEEE Transactions on Industrial Electronics
article_source = 2013 IEEE 8th Conference on Industrial Electronics and Applications (ICIEA)
article_source = IEEE Transactions on Circuits and Systems I: Regular Papers
article_source = IEEE Journal of Emerging and Selected Topics in Power Electronics
article_source = 2016 IEEE 8th International Power Electronics and Motion Control Conference (IPEMC-ECCE Asia)
article_source = 2013 IEEE International Symposium on Circuits and Systems (ISCAS2013)
start to check if this is the last page !!!
The last page !!!
-
The csv file:
[Figure: 20171118-zhangbocsv]