Course Assignment
- Use the URL selected in the second course assignment
- Scrape every element on that page that can be scraped; at a minimum, the article body must be captured
- Optionally, try scraping with lxml
Assignment URL
http://www.reibang.com/p/e0bd6bfad10b
Web Scraping
The scraping was done twice, once with Beautiful Soup and once with lxml:
- Collect all links on the main page and write them to the _all_links.txt file
- Crawl each of those links, extract the article body and title, and save the body to a file named after the title
- For links with no title or no body, write the url to the Title_Is_None.txt file
Final output (screenshot):
Framework structure:
spidercomm.py: common helper functions, such as writing files and downloading pages, that are independent of the scraping library actually used
spiderbs4.py: the functions needed by the BeautifulSoup implementation, such as collecting links and scraping page content
spiderlxml.py: the functions needed by the lxml implementation, such as collecting links and scraping page content
bs4.py: the scraping client implemented with BeautifulSoup
lxml.py: the scraping client implemented with lxml
Resulting folder layout (screenshot):
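Roughly, the project and output folders look like the sketch below; the spider_res/lxml/ path and the per-article file names are assumptions based on the bs4 client settings.

.
├── spidercomm.py
├── spiderbs4.py
├── spiderlxml.py
├── bs4.py
├── lxml.py
└── spider_res/
    ├── bs4/
    │   ├── _all_links.txt
    │   ├── Title_Is_None.txt
    │   └── <article title>.txt ...
    └── lxml/
        └── ...

One caveat about the naming: client scripts called bs4.py and lxml.py can shadow the installed bs4 and lxml packages when the scripts are run from this directory, so renaming them (for example to client_bs4.py and client_lxml.py) may avoid import problems.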
BeautifulSoup Implementation
BeautifulSoup implementation code (spiderbs4.py)
1 Import the spidercomm and BeautifulSoup modules
# -*- coding: utf-8 -*-
import spidercomm as common
from bs4 import BeautifulSoup
2 Use BeautifulSoup to find all a tags; the link-collecting function in the spidercomm module uses the return value.
# get all <a> tags from a single url; attrs optionally filters by attributes
def a_links(url_seed, attrs={}):
    html = common.download(url_seed)
    soup = BeautifulSoup(html, 'html.parser')
    alinks = soup.find_all('a', attrs)
    return alinks
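For illustration, a minimal usage sketch (the URL is the assignment page; the actual hrefs depend on the live page):

# usage sketch: print the href of every <a> tag found on the assignment page
for a in a_links('http://www.reibang.com/p/e0bd6bfad10b'):
    print a.get('href')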
3 Use BeautifulSoup to scrape a single url, focusing on the title and the article body.
def crawled_page(crawled_url):
    html = common.download(crawled_url)
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.find('h1', {'class': 'title'})
    if title is None:
        return "Title_Is_None", crawled_url
    content = soup.find('div', {'class': 'show-content'})
    if content is None:
        return title.text, "Content_Is_None"
    return title.text, content.text
4 Determine whether the article is split across multiple pages
def isMultiPaged(url):
    html_page1 = common.download(url % 1)
    soup = BeautifulSoup(html_page1, 'html.parser')
    body1 = soup.find('body')
    # strip <script> tags so dynamic noise does not affect the comparison
    for s in body1.find_all('script'):
        s.decompose()
    html_page2 = common.download(url % 2)
    if html_page2 is None:
        return False
    soup = BeautifulSoup(html_page2, "html.parser")
    body2 = soup.find('body')
    for s in body2.find_all('script'):
        s.decompose()
    # page 2 rendering the same body as page 1 means the article has only one page
    if str(body1) == str(body2):
        return False
    else:
        return True
5 Get the total number of pages
def getNumberOfPages(url):
    count = 1
    if isMultiPaged(url):
        while True:
            # format a fresh per-page url each time instead of overwriting url,
            # which would lose the %d placeholder after the first pass
            page_url = url % count
            count += 1
            html = common.download(page_url)
            if html is None:
                break
    return count
BeautifulSoup client code (bs4.py)
1 Import the spidercomm and spiderbs4 modules
# -*- coding: utf-8 -*-
import os
import spiderbs4 as bs4
import spidercomm as common
2 Set up the required variables and create the output directory
# set up
url_root = 'http://www.reibang.com/'
url_seed = 'http://www.reibang.com/p/e0bd6bfad10b?page=%d'
spider_path = 'spider_res/bs4/'
if not os.path.exists(spider_path):
    os.makedirs(spider_path)
3 Call isMultiPaged and getNumberOfPages from spiderbs4 to check whether the target page is paginated and to get the page count
# get total number of pages
print "url %s has multiple pages? %r" % (url_seed,bs4.isMultiPaged(url_seed))
page_count = bs4.getNumberOfPages(url_seed)
print "page_count is %s" % page_count
4 Call to_be_crawled_links from spidercomm to collect all links on the target page; if the article is paginated, collect the links of every page and merge them, then write all links to the _all_links.txt file
# get all links to be crawled and write to file
links_to_be_crawled = set()
for count in range(page_count):
    links = common.to_be_crawled_links(bs4.a_links(url_seed % count), count, url_root, url_seed)
    print "Total number of all links is %d" % len(links)
    links_to_be_crawled = links_to_be_crawled | links
with open(spider_path + "_all_links.txt", 'w+') as f:
    f.write("\n".join(unicode(link).encode('utf-8', 'ignore') for link in links_to_be_crawled))
5 Loop over the collected links, scrape each page's title and article body, and write the body to a file named after the title; if a page has no title, its url is written to the Title_Is_None.txt file.
# capture desired contents from crawled_urls
if len(links_to_be_crawled) >= 1:
    for link in links_to_be_crawled:
        title, content = bs4.crawled_page(link)
        file_name = spider_path + title + '.txt'
        common.writePage(file_name, content)
lxml Implementation
lxml implementation code (spiderlxml.py)
1 Import the required modules
# -*- coding: utf-8 -*-
import spidercomm as common
import urlparse
from lxml import etree
2 Use lxml to find all a tags; the link-collecting function in the spidercomm module uses the return value.
# get all <a> tags from a single url; attrs is kept for interface parity with
# the bs4 version but is not used by the xpath query
def a_links(url_seed, attrs={}):
    html = common.download(url_seed)
    tree = etree.HTML(html)
    alinks = tree.xpath("//a")
    return alinks
3 Use lxml to scrape a single url, focusing on the title and the article body.
def crawled_page(crawled_url):
    html = common.download(crawled_url)
    tree = etree.HTML(html)
    title = tree.xpath("/html/body/div[1]/div[1]/div[1]/h1")
    if title is None or len(title) == 0:
        return "Title_Is_None", crawled_url
    contents = tree.xpath("/html/body/div[1]/div[1]/div[1]/div[2]/*")
    if contents is None or len(contents) == 0:
        # xpath returns a list, so take the first match for the title text
        return title[0].text, "Content_Is_None"
    content = ''
    for x in contents:
        if x.text is not None:
            content = content + x.xpath('string()')
    return title[0].text, content
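The absolute XPath expressions above are tied to Jianshu's exact DOM structure. As a rough alternative sketch, assuming the page exposes the same class names the BeautifulSoup version relies on ('title' on the h1 and 'show-content' on the body div), class-based XPath could be used instead:

# hedged alternative to the absolute paths above, assuming class="title" on the
# <h1> and class="show-content" on the body <div>, as in the bs4 implementation
def crawled_page_by_class(crawled_url):
    html = common.download(crawled_url)
    tree = etree.HTML(html)
    title = tree.xpath("//h1[contains(@class, 'title')]")
    if not title:
        return "Title_Is_None", crawled_url
    contents = tree.xpath("//div[contains(@class, 'show-content')]")
    if not contents:
        return title[0].text, "Content_Is_None"
    # string() flattens the text of the whole subtree, including nested tags
    return title[0].text, contents[0].xpath('string()')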
4 Determine whether the article is split across multiple pages
def isMultiPaged(url):
    html_page1 = common.download(url % 1)
    tree = etree.HTML(html_page1)
    xp1 = tree.xpath("/html/body/div[1]/div[1]/div[1]/div[2]/*")
    # x.text may be None for elements without direct text, so fall back to ''
    xp1 = ",".join((x.text or '') for x in xp1)
    html_page2 = common.download(url % 2)
    if html_page2 is None:
        return False
    tree = etree.HTML(html_page2)
    xp2 = tree.xpath("/html/body/div[1]/div[1]/div[1]/div[2]/*")
    xp2 = ",".join((x.text or '') for x in xp2)
    # identical bodies on page 1 and page 2 mean the article is not paginated
    return xp1 != xp2
5 Get the total number of pages
def getNumberOfPages(url):
    count = 1
    if isMultiPaged(url):
        while True:
            # format a fresh per-page url so the %d placeholder is preserved
            page_url = url % count
            print "url: %s" % page_url
            count += 1
            html = common.download(page_url)
            if html is None:
                break
    return count
lxml client code (lxml.py)
Same as the bs4 client above, except that it imports and calls the spiderlxml module instead; a sketch follows below.
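A minimal sketch of that client, assuming it mirrors bs4.py step by step; the spider_res/lxml/ output path and the slx alias are assumptions.

# -*- coding: utf-8 -*-
# sketch of lxml.py, mirroring bs4.py; spider_res/lxml/ is an assumed output path
import os
import spiderlxml as slx
import spidercomm as common

url_root = 'http://www.reibang.com/'
url_seed = 'http://www.reibang.com/p/e0bd6bfad10b?page=%d'
spider_path = 'spider_res/lxml/'
if not os.path.exists(spider_path):
    os.makedirs(spider_path)

# get total number of pages
page_count = slx.getNumberOfPages(url_seed)

# get all links to be crawled and write to file
links_to_be_crawled = set()
for count in range(page_count):
    links = common.to_be_crawled_links(slx.a_links(url_seed % count), count, url_root, url_seed)
    links_to_be_crawled = links_to_be_crawled | links
with open(spider_path + "_all_links.txt", 'w+') as f:
    f.write("\n".join(unicode(link).encode('utf-8', 'ignore') for link in links_to_be_crawled))

# capture desired contents from crawled urls
for link in links_to_be_crawled:
    title, content = slx.crawled_page(link)
    common.writePage(spider_path + title + '.txt', content)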
Common helper code (spidercomm.py)
1 Import the required modules
# -*- coding: utf-8 -*-
import urllib2
import time
import urlparse
2 Download a page
def download(url, retry=2):
    header = {
        'User-Agent': 'Mozilla/5.0'
    }
    try:
        req = urllib2.Request(url, headers=header)
        html = urllib2.urlopen(req).read()
    except urllib2.HTTPError as e:
        print "download error: %s" % e.reason
        html = None
        # retry only on server-side (5xx) errors
        if retry > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            time.sleep(1)
            return download(url, retry - 1)
    return html
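For reference, a minimal usage sketch (the URL is the assignment page):

# usage sketch: download returns the raw html, or None after retries fail
html = download('http://www.reibang.com/p/e0bd6bfad10b')
if html is None:
    print "download failed"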
3 Write the scraped content to a file
def writePage(filename, content):
    content = unicode(content).encode('utf-8', 'ignore') + "\n"
    # urls without a title are appended to the shared Title_Is_None.txt;
    # everything else overwrites a file named after the article title
    if 'Title_Is_None.txt' in filename:
        with open(filename, 'a') as f:
            f.write(content)
    else:
        with open(filename, 'wb+') as f:
            f.write(content)
4 Collect all outgoing links from a single url
# get urls to be crawled
# :param alinks: list of 'a' tags; the element type depends on the implementation (bs4 or lxml)
def to_be_crawled_link(alinks, url_seed, url_root):
    links_to_be_crawled = set()
    if len(alinks) == 0:
        return links_to_be_crawled
    print "len of alinks is %d" % len(alinks)
    for link in alinks:
        link = link.get('href')
        if link is not None and 'javascript:' not in link:
            if link not in links_to_be_crawled:
                realUrl = urlparse.urljoin(url_root, link)
                links_to_be_crawled.add(realUrl)
    return links_to_be_crawled
5 Collect all outgoing links from a given page of the article
def to_be_crawled_links(alinks, count, url_root, url_seed):
    url = url_seed % count
    # the page url is passed as the url_root argument of to_be_crawled_link,
    # so urljoin resolves relative hrefs against the page being crawled
    links = to_be_crawled_link(alinks, url_root, url)
    links.add(url)
    return links
Closing remarks:
The implementation is still rough and the scraped content is quite simple; I hope to discuss it with everyone and improve the framework further.
The page content scraped by the two implementations looks roughly the same, with some differences in formatting; the bs4 output is a bit better, which is probably because my own code doesn't handle things well yet and needs more study. There is still a lot to improve, but I'll submit the assignment first.