Web Crawling
Crawling by Links
Starting from an entry page, the crawler extracts all the links it can find. It supports proxies, a configurable crawl depth, and link deduplication; concurrency is not handled yet.
The code is as follows:
import urlparse
import urllib2
import re
import Queue
# Download a page, with optional retries and proxy support
def page_download(url, num_retry=2, user_agent='zhxfei', proxy=None):
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    opener = urllib2.build_opener()
    if proxy:
        # route the request through the proxy for this URL's scheme
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        html = opener.open(request).read()
    except urllib2.URLError as e:
        print 'Download error!', e.reason
        html = None
        if num_retry > 0:
            # retry on 5xx server errors while retries remain
            if hasattr(e, 'code') and 500 <= e.code <= 600:
                return page_download(url, num_retry - 1, user_agent, proxy)
    if html is None:
        print '%s Download failed' % url
    else:
        print '%s has Download' % url
    return html
# Use a regular expression to extract all link targets from a page
def get_links_by_html(html):
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return webpage_regex.findall(html)

# Check whether a crawled link belongs to the same site as the seed page
def same_site(url1, url2):
    return urlparse.urlparse(url1).netloc == urlparse.urlparse(url2).netloc
def link_crawler(seed_url, link_regex, max_depth=-1):
    crawl_link_queue = Queue.deque([seed_url])
    seen = {seed_url: 0}  # maps each discovered URL to the depth it was found at
    depth = 0
    while crawl_link_queue:
        url = crawl_link_queue.pop()
        depth = seen.get(url)
        if depth > max_depth:
            continue
        html = page_download(url)
        if html is None:
            continue
        links = []
        links.extend(urlparse.urljoin(seed_url, x) for x in get_links_by_html(html) if re.match(link_regex, x))
        for link in links:
            if link not in seen:
                seen[link] = depth + 1
                if same_site(link, seed_url):
                    crawl_link_queue.append(link)
    print '----All Done----', len(seen)
    return seen

if __name__ == '__main__':
    all_links = link_crawler('http://www.zhxfei.com', r'/.*', max_depth=1)
Output:
http://www.zhxfei.com/archives has Download
http://www.zhxfei.com/2016/08/04/lvs/ has Download
...
...
http://www.zhxfei.com/2016/07/22/app-store-審核-IPv6-Olny/#more has Download
http://www.zhxfei.com/archives has Download
http://www.zhxfei.com/2016/07/22/HDFS/#comments has Download
----All Done----
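Incidentally, since page_download already accepts a proxy argument, the same crawl can be routed through a proxy; a minimal sketch (the proxy address below is only a placeholder):
# Fetch the seed page through a local HTTP proxy; the address is a placeholder
html = page_download('http://www.zhxfei.com', proxy='http://127.0.0.1:8080')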
Crawling by Sitemap
A sitemap is essentially a map of a website; the related robots.txt normally sits alongside it in the site root. Both exist specifically for spiders, helping the site get indexed by search engines in a friendlier way, and they define the crawling rules that well-behaved crawlers are expected to follow.
So we can also play it this way: pull the URLs out of the sitemap XML file and crawl the site directly from them. This is the most convenient approach (even though the site owner may not exactly want us to do it).
#!/usr/bin/env python
# _*_encoding:utf-8 _*_
# description: this module drives the crawler from a SITEMAP
import re
from download import page_download

def load_crawler(url):
    # download the sitemap itself
    sitemap = page_download(url)
    # pull every URL out of the <loc> elements
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    for link in links:
        page_download(link)
        if link == links[-1]:
            print 'All links done'
    # print links

load_crawler('http://example.webscraping.com/sitemap.xml')
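For reference, the <loc> extraction above is nothing more than a regular expression applied to the XML text; a standalone sketch on a made-up sitemap fragment:
import re

# A made-up sitemap fragment, only to show what re.findall pulls out of the <loc> tags
sample_sitemap = '''<urlset>
  <url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>
  <url><loc>http://example.webscraping.com/view/Aland-Islands-2</loc></url>
</urlset>'''
print re.findall('<loc>(.*?)</loc>', sample_sitemap)
# ['http://example.webscraping.com/view/Afghanistan-1', 'http://example.webscraping.com/view/Aland-Islands-2']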
Recap
Good, the crawler can now fetch pages. However, it doesn't actually do anything with them other than download, so we still need to process the data, that is, extract the information we want from each page.
Data Extraction
Extracting with lxml
There are three common ways to extract information from a page:
- Regular expressions with the re module. This is the fastest option, and by default it caches compiled patterns (the cache can be cleared with re.purge()), but it is also the most complex to write (unless you are already an old hand at regex).
- BeautifulSoup. This is the most user-friendly option because it is very simple to work with, but it gets slow on large amounts of data, so it is generally not recommended when scraping many pages.
- lxml. This is the middle ground: fairly simple to use as well, and it is the one we pick here to process the downloaded pages.
lxml can be used in two ways, XPath and cssselect, and both are fairly easy to pick up. XPath, much like BeautifulSoup's find and find_all, matches a pattern describing where the DOM nodes and data sit, written as a chained, path-like expression. cssselect instead uses jQuery-style CSS selectors for matching, which is friendlier if you have a front-end background.
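As a quick illustration of the two styles, the sketch below locates the same table cell both ways; the tiny HTML fragment is made up, but it mirrors the markup of the page scraped in the demo that follows:
import lxml.html

# A made-up fragment mirroring the kind of table row we scrape below
fragment = '<table><tr id="places_area__row"><td class="w2p_fw">244,820 square kilometres</td></tr></table>'
tree = lxml.html.fromstring(fragment)

# CSS selector style (needs the cssselect package installed)
print tree.cssselect('tr#places_area__row > td.w2p_fw')[0].text_content()
# XPath style, selecting the same node
print tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]')[0].text_content()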
Let's try a quick demo first. The page we are about to scrape is http://example.webscraping.com/places/view/United-Kingdom-239
The page contains a <table>, and the information we want all lives in that table in the body. You can inspect the elements with the browser's developer tools, or with Firebug (a Firefox add-on), to look at the DOM structure.
import lxml.html
import cssselect
from download import page_download

example_url = 'http://example.webscraping.com/places/view/United-Kingdom-239'

def demo():
    html = page_download(example_url, num_retry=2)
    result = lxml.html.fromstring(html)
    print type(result)
    td = result.cssselect('tr#places_area__row > td.w2p_fw')
    print type(td)
    print len(td)
    css_element = td[0]
    print type(css_element)
    print css_element.text_content()

demo()
Output:
http://example.webscraping.com/places/view/United-Kingdom-239 has Download
<class 'lxml.html.HtmlElement'>
<type 'list'>
1
<class 'lxml.html.HtmlElement'>
244,820 square kilometres
As you can see, the css selector gave us a list of length 1 (the length obviously depends on the selector pattern you define). Every item in that list is an HtmlElement, which has a text_content method that returns the text of the node, and with that we have the data we wanted.
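The same page has one such row per field. As a small aid, here is a helper sketch of my own (the name list_rows is not from the original code) that can be called with the result tree from demo() to see which row ids are available for building selectors; the exact output depends on the live page:
# List each table row's id together with its value cell, to discover the
# places_xxx__row ids that selectors can be built from
def list_rows(tree):
    for row in tree.cssselect('tr'):
        cells = row.cssselect('td.w2p_fw')
        if cells:
            print row.get('id'), ':', cells[0].text_content()
Calling list_rows(result) inside demo() prints ids such as places_area__row, which is exactly the pattern the callback class below relies on.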
Callback Handling
Next we can define a callback for the crawler above, so that a little extra work is done every time a page is downloaded.
Clearly we should modify the link_crawler function and pass a reference to the callback in as a parameter, so that different pages can get different callback handling, like this:
def link_crawler(seed_url, link_regex, max_depth=-1, scrape_callback=None):
    ...
        html = page_download(url)  # same as before
        if scrape_callback:
            scrape_callback(url, html)
        links.extend(urlparse.urljoin(seed_url, x) for x in get_links_by_html(html) if re.match(link_regex, x))  # same as before
    ...
Next we write the callback. Since Python's object orientation is quite powerful, we implement it as a callback class here. Because we need to call instances of that class, we override its __call__ method so that, whenever the instance is called, the extracted data is saved in csv format, which can be opened as a spreadsheet in WPS. You could of course write it into a database instead; more on that later.
import csv

class ScrapeCallback():
    def __init__(self):
        self.writer = csv.writer(open('contries.csv', 'w+'))
        self.rows_name = ('area','population','iso','country','capital','tld','currency_code','currency_name','phone','postal_code_format','postal_code_regex','languages','neighbours')
        self.writer.writerow(self.rows_name)

    def __call__(self, url, html):
        # only country detail pages (/view/...) carry the fields listed above
        if re.search('/view/', url):
            tree = lxml.html.fromstring(html)
            rows = []
            for row in self.rows_name:
                rows.append(tree.cssselect('#places_{}__row > td.w2p_fw'.format(row))[0].text_content())
            self.writer.writerow(rows)
As you can see, the callback class has three notable members:
self.rows_name: holds the names of the fields we want to scrape
self.writer: acts like a file handle for the csv output
self.writer.writerow: the method that writes one row of data into the csv table
With that, our data can be persisted to disk.
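If the csv module is unfamiliar: writerow does nothing more than serialize a sequence into one comma-separated line. A tiny standalone sketch, writing into an in-memory buffer:
import csv
import StringIO

# Write one header row into a buffer just to show what the csv output looks like
buf = StringIO.StringIO()
writer = csv.writer(buf)
writer.writerow(('area', 'population', 'iso'))
print buf.getvalue()  # prints: area,population,iso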
Now change link_crawler's definition to:
def link_crawler(seed_url,link_regex,max_depth=-1,scrape_callback=ScrapeCallback()):
Run it and look at the result:
zhxfei@zhxfei-HP-ENVY-15-Notebook-PC:~/桌面/py_tran$ python crawler.py
http://example.webscraping.com has Download
http://example.webscraping.com/index/1 has Download # /index does not match the /view check in __call__, so no data is extracted for it
http://example.webscraping.com/index/2 has Download
http://example.webscraping.com/index/0 has Download
http://example.webscraping.com/view/Barbados-20 has Download
http://example.webscraping.com/view/Bangladesh-19 has Download
http://example.webscraping.com/view/Bahrain-18 has Download
...
...
http://example.webscraping.com/view/Albania-3 has Download
http://example.webscraping.com/view/Aland-Islands-2 has Download
http://example.webscraping.com/view/Afghanistan-1 has Download
----All Done---- 35
zhxfei@zhxfei-HP-ENVY-15-Notebook-PC:~/桌面/py_tran$ ls
contries.csv crawler.py
Open the csv and you can see that all the data has been saved:
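A quick way to double-check from code is to read the file back with the csv module (this assumes the crawl above has just produced contries.csv in the current directory):
import csv

# Print the header plus the first few data rows of the file the crawler wrote
with open('contries.csv') as f:
    for i, row in enumerate(csv.reader(f)):
        print row
        if i >= 3:
            break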
The complete code:
#!/usr/bin/env python
# _*_encoding:utf-8 _*_
import urlparse
import urllib2
import re
import time
import Queue
import lxml.html
import csv

class ScrapeCallback():
    def __init__(self):
        self.writer = csv.writer(open('contries.csv', 'w+'))
        self.rows_name = ('area','population','iso','country','capital','tld','currency_code','currency_name','phone','postal_code_format','postal_code_regex','languages','neighbours')
        self.writer.writerow(self.rows_name)

    def __call__(self, url, html):
        # only country detail pages (/view/...) carry the fields listed above
        if re.search('/view/', url):
            tree = lxml.html.fromstring(html)
            rows = []
            for row in self.rows_name:
                rows.append(tree.cssselect('#places_{}__row > td.w2p_fw'.format(row))[0].text_content())
            self.writer.writerow(rows)

def page_download(url, num_retry=2, user_agent='zhxfei', proxy=None):
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    opener = urllib2.build_opener()
    if proxy:
        # route the request through the proxy for this URL's scheme
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        html = opener.open(request).read()
    except urllib2.URLError as e:
        print 'Download error!', e.reason
        html = None
        if num_retry > 0:
            # retry on 5xx server errors while retries remain
            if hasattr(e, 'code') and 500 <= e.code <= 600:
                return page_download(url, num_retry - 1, user_agent, proxy)
    if html is None:
        print '%s Download failed' % url
    else:
        print '%s has Download' % url
    return html

def same_site(url1, url2):
    return urlparse.urlparse(url1).netloc == urlparse.urlparse(url2).netloc

def get_links_by_html(html):
    # match the href value of every <a> tag
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return webpage_regex.findall(html)

def link_crawler(seed_url, link_regex, max_depth=-1, scrape_callback=ScrapeCallback()):
    crawl_link_queue = Queue.deque([seed_url])
    # seen maps every discovered URL to the depth at which it was found,
    # starting with {seed_url: 0}
    seen = {seed_url: 0}
    depth = 0
    while crawl_link_queue:
        url = crawl_link_queue.pop()
        depth = seen.get(url)
        if depth > max_depth:
            continue
        html = page_download(url)
        if html is None:
            continue
        if scrape_callback:
            scrape_callback(url, html)
        links = []
        links.extend(urlparse.urljoin(seed_url, x) for x in get_links_by_html(html) if re.match(link_regex, x))
        for link in links:
            if link not in seen:
                seen[link] = depth + 1
                if same_site(link, seed_url):
                    crawl_link_queue.append(link)
    print '----All Done----', len(seen)
    return seen

if __name__ == '__main__':
    all_links = link_crawler('http://example.webscraping.com', '/(index|view)', max_depth=2)