Preface
In the previous article we hopped randomly between article pages inside Wikipedia and ignored links pointing to external sites. This article handles a site's external links and tries to collect some data from them. Unlike crawling a single domain, sites on different domains vary wildly in structure, which means our code has to be more flexible to cope with different site layouts.
We will therefore write the code as a set of functions that can be combined to cover different kinds of scraping tasks.
Randomly Hopping to External Links
With this group of functions, roughly 50 lines of code are enough to crawl external links.
Sample code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import datetime
import random
from urllib.parse import quote
pages = set()
random.seed(datetime.datetime.now().timestamp())  # seed with a number; newer Python versions reject datetime objects

''' Get every link found on a page '''
# Get all internal links on a page
def get_internal_links(soup, include_url):
    internal_links = []
    # Find all links that begin with a '/' or contain the current domain
    print(include_url)
    for link in soup.find_all('a',
            href=re.compile(r'^((/|.)*' + include_url + ')')):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internal_links:
                internal_links.append(link.attrs['href'])
    return internal_links

# Retrieves a list of all external links found on a page
def get_external_links(soup, exclude_url):
    external_links = []
    # Find all links that start with 'http' or 'www' and do not contain the
    # current URL
    for link in soup.find_all('a',
            href=re.compile(r'^(http|www)((?!' + exclude_url + ').)*$')):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in external_links:
                external_links.append(link.attrs['href'])
    return external_links

# Split a URL to get its main domain
def split_address(address):
    # Only strips 'http://', so an 'https://' URL keeps 'https:' as its first part
    address_parts = address.replace('http://', '').split('/')
    return address_parts

# Hop to a random external link
def get_random_external_link(starting_page):
    html = urlopen(starting_page)
    soup = BeautifulSoup(html, 'lxml')
    external_links = get_external_links(
        soup, split_address(starting_page)[0])  # exclude the current domain
    if len(external_links) == 0:
        # No external links on this page: recurse into a random internal link
        internal_links = get_internal_links(soup, starting_page)
        print(len(internal_links))
        return get_random_external_link(
            internal_links[random.randint(0, len(internal_links) - 1)])
    else:
        return external_links[random.randint(0, len(external_links) - 1)]

hop_count = set()

# Follow external links only; loop sets the number of hops (5 by default)
def follow_external_only(starting_site, loop=5):
    global hop_count
    external_link = get_random_external_link(
        quote(starting_site, safe='/:?='))
    print('Random external link is: ' + external_link)
    while len(hop_count) < loop:
        hop_count.add(external_link)
        print(len(hop_count))
        follow_external_only(external_link)

follow_external_only("http://www.baidu.com")
Because the code has no exception handling and no countermeasures against anti-scraping defenses, it is bound to error out sooner or later. Since the hops are random, run it a few times; if you are interested, refine the code based on whatever error each run produces (a hardened sketch follows the output below).
Output:
Random external link is: http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8&word=
1
Random external link is: http://baishi.baidu.com/watch/6388818335201070269.html
2
Random external link is: http://v.baidu.com/tv/
3
Random external link is: http://player.baidu.com/yingyin.html
4
Random external link is: http://help.baidu.com/question?prod_en=player
5
Random external link is: http://home.baidu.com
[Finished in 6.3s]
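Since the script will eventually crash without exception handling, here is a minimal sketch of one way to harden the fetch step: wrap the request in try/except and send a browser-style User-Agent header. The helper name safe_open, the header value, and the 10-second timeout are assumptions of mine, not part of the code above.

from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

# Hypothetical helper: fetch a URL, returning None instead of raising on failure
def safe_open(url):
    try:
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})  # assumed header value
        return urlopen(req, timeout=10)
    except (HTTPError, URLError, ValueError) as e:
        print('Failed to open ' + url + ': ' + str(e))
        return None

get_random_external_link could then call safe_open(starting_page), return None when the fetch fails, and follow_external_only could skip that hop instead of crashing.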
Collecting All External Links from a Site
The benefit of writing the code as functions is that they can easily be modified or extended for new requirements without breaking anything. For example:
Goal: crawl all external links across the whole site and record each one.
We can add the following function:
# Collects a list of all external URLs found on the site
all_ext_links = set()
all_int_links = set()

def get_all_external_links(site_url):
    html = urlopen(site_url)
    soup = BeautifulSoup(html, 'lxml')
    print(split_address(site_url)[0])
    internal_links = get_internal_links(soup, split_address(site_url)[0])
    external_links = get_external_links(soup, split_address(site_url)[0])
    for link in external_links:
        if link not in all_ext_links:
            all_ext_links.add(link)
            print(link)
    for link in internal_links:
        if link not in all_int_links:
            print('About to get link: ' + link)
            all_int_links.add(link)
            get_all_external_links(link)

# follow_external_only("http://www.baidu.com")
get_all_external_links('http://oreilly.com')
The output looks like this (the lone 'https:' lines come from split_address, which only strips 'http://', so an 'https://' URL keeps 'https:' as its first part):
oreilly.com
oreilly.com
https://cdn.oreillystatic.com/pdf/oreilly_high_performance_organizations_whitepaper.pdf
http://twitter.com/oreillymedia
http://fb.co/OReilly
https://www.linkedin.com/company/oreilly-media
https://www.youtube.com/user/OreillyMedia
About to get link: https://www.oreilly.com
https:
https:
https://www.oreilly.com
http://www.oreilly.com/ideas
https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170601+nav
http://www.oreilly.com/conferences/
http://shop.oreilly.com/
http://members.oreilly.com
https://www.oreilly.com/topics
https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170505+homepage+get+started+now
https://www.safaribooksonline.com/accounts/login/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170203+homepage+sign+in
https://www.safaribooksonline.com/live-training/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+take+a+live+online+course
https://www.safaribooksonline.com/learning-paths/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+follow+a+path
https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170505+homepage+unlimited+access
http://www.oreilly.com/live-training/?view=grid
https://www.safaribooksonline.com/your-experience/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+safari+platform
https://www.oreilly.com/ideas/8-data-trends-on-our-radar-for-2017?utm_medium=referral&utm_source=oreilly.com&utm_campaign=lgen&utm_content=link+2017+trends
https://www.oreilly.com/ideas?utm_medium=referral&utm_source=oreilly.com&utm_campaign=lgen&utm_content=link+read+latest+articles
http://www.oreilly.com/about/
http://www.oreilly.com/work-with-us.html
http://www.oreilly.com/careers/
http://shop.oreilly.com/category/customer-service.do
http://www.oreilly.com/about/contact.html
http://www.oreilly.com/emails/newsletters/
http://www.oreilly.com/terms/
http://www.oreilly.com/privacy.html
http://www.oreilly.com/about/editorial_independence.html
About to get link: https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170601+nav
https:
https:
https://www.oreilly.com/
About to get link: https://www.oreilly.com/
https:
https:
About to get link: https://www.oreilly.com/topics
......
The program keeps recursing until it hits Python's default recursion limit (sys.getrecursionlimit(), 1000 frames by default); if you are interested, you can add a hop limit like the loop=5 default in the earlier code, as sketched below.
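To bound that recursion, one minimal sketch, assuming the imports and the helpers split_address, get_internal_links and get_external_links defined earlier in this article, is to thread a depth budget through the recursive calls; the parameter name depth and its default of 5 are my own choices, not part of the original code.

all_ext_links = set()
all_int_links = set()

def get_all_external_links(site_url, depth=5):
    # Stop descending once the depth budget is used up
    if depth <= 0:
        return
    html = urlopen(site_url)
    soup = BeautifulSoup(html, 'lxml')
    domain = split_address(site_url)[0]
    internal_links = get_internal_links(soup, domain)
    external_links = get_external_links(soup, domain)
    for link in external_links:
        if link not in all_ext_links:
            all_ext_links.add(link)
            print(link)
    for link in internal_links:
        if link not in all_int_links:
            all_int_links.add(link)
            print('About to get link: ' + link)
            get_all_external_links(link, depth - 1)

get_all_external_links('http://oreilly.com')

Every recursive call passes depth - 1, so each crawl branch stops after at most five levels of internal links, regardless of how many pages the site has.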