Web Scraping, Lecture 5: The BeautifulSoup HTML Parsing Library

BeautifulSoup

BeautifulSoup is a flexible, convenient HTML parsing library. It is efficient and supports several parsers, and it lets you extract information from a web page without writing regular expressions.

Installing BeautifulSoup

pip3 install beautifulsoup4

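Using BeautifulSoup

  • Parsers

    Parser | Usage | Advantages | Disadvantages
    Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Less tolerant in versions before Python 2.7.3 and Python 3.2.2
    lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires the lxml C library
    lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only XML parser supported | Requires the lxml C library
    html5lib | BeautifulSoup(markup, "html5lib") | Most tolerant; parses the document the way a browser does; produces HTML5 output | Slow; requires the external html5lib Python package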
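
Since lxml and html5lib are optional installs, a script can fall back to whichever parser is available. The sketch below is only an illustration: make_soup is a hypothetical helper name, not part of BeautifulSoup. It relies on bs4 raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `parsers` that is installed.

    Hypothetical helper for illustration; not part of bs4 itself.
    """
    for name in parsers:
        try:
            return BeautifulSoup(markup, name)
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("none of the requested parsers is available")


soup = make_soup("<p class='title'>Hello</p>")
print(soup.p.string)
```

Different parsers can build different trees from the same malformed markup, so a fallback like this is best kept for cases where any reasonable tree will do.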

Basic usage

import requests
from bs4 import BeautifulSoup
response = requests.get('http://www.baidu.com').text
soup = BeautifulSoup(response,'lxml')
print(soup.prettify())  # prettify() pretty-prints the parsed tree; the parser has already filled in any missing closing tags
print(soup.title.string)  # print the text of the title inside head
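
A slightly more defensive version of the same idea might look like the sketch below. The helper names fetch_html and page_title are made up for this example, and the 10-second timeout is an arbitrary choice. It uses html.parser so that it runs without lxml installed.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its decoded text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail on 4xx/5xx instead of parsing an error page
    # If the server declares no charset, requests may guess the wrong one;
    # apparent_encoding is detected from the response body instead.
    response.encoding = response.apparent_encoding
    return response.text


def page_title(html, parser="html.parser"):
    """Return the text of <title>, or None if the page has no title."""
    soup = BeautifulSoup(html, parser)
    return soup.title.string if soup.title is not None else None
```

With these two helpers, page_title(fetch_html('http://www.baidu.com')) would do the same job as the snippet above, with the network and parsing steps separated.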

Tag selectors
Selecting elements

import requests
from bs4 import BeautifulSoup
#response = requests.get('http://www.baidu.com').text
html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="title" name="dropmouse"><b>The doc story</b></p>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a  class="sister" id="link1"><!---Elsa---></a>
    <a  class="sister" id="link2">Lacie</a> and
    <a  class="sister" id="link3">Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.title)  # the title element, printed together with its tags
print(type(soup.title))  # <class 'bs4.element.Tag'>
print(soup.head)  # the head element
print(soup.p)  # only the first p tag found
print(soup.p.name)  # the tag's name, which here is simply 'p'

Getting the name
See the example above.

Getting attributes
This is somewhat similar to jQuery.


import requests
from bs4 import BeautifulSoup
#response = requests.get('http://www.baidu.com').text
html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="title" name="dropmouse"><b>The doc story</b></p>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a  class="sister" id="link1"><!---Elsa---></a>
    <a  class="sister" id="link2">Lacie</a> and
    <a  class="sister" id="link3">Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.p.attrs['name'])  # value of the name attribute on the first p tag found: 'dropmouse'. soup.p.attrs is the dict of all attributes: {'class': ['title'], 'name': 'dropmouse'}
print(soup.p['name'])  # also 'dropmouse'; same result as the line above
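
Two details are easy to miss here: class comes back as a list because it can hold several values, and indexing a missing attribute raises KeyError. A small sketch, with markup invented for this example and html.parser used so it runs without lxml:

```python
from bs4 import BeautifulSoup

html = '<p class="title big" name="dropmouse">The doc story</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

print(p["class"])            # class is multi-valued, so it is returned as a list
print(p.get("name"))         # like dict.get: the value, or None if missing
print(p.get("id"))           # the attribute is absent, so this prints None
print(p.get("id", "no-id"))  # a default can be supplied

try:
    p["id"]
except KeyError:
    print("p['id'] raises KeyError when the attribute is missing")
```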

Getting content, for example the content of a p tag

import requests
from bs4 import BeautifulSoup
#response = requests.get('http://www.baidu.com').text
html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="title" name="dropmouse"><b>The doc story</b></p>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a  class="sister" id="link1"><!---Elsa---></a>
    <a  class="sister" id="link2">Lacie</a> and
    <a  class="sister" id="link3">Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.p.string)  # .string returns the text inside the selected tag, without any HTML markup

Nested selection
A bs4.element.Tag can itself be used to select the child tags inside that Tag. For example:

import requests
from bs4 import BeautifulSoup
#response = requests.get('http://www.baidu.com').text
html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="title" name="dropmouse"><b>The doc story</b></p>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a  class="sister" id="link1"><!---Elsa---></a>
    <a  class="sister" id="link2">Lacie</a> and
    <a  class="sister" id="link3">Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.body.p.string)  # chained access, also similar to jQuery

Child and descendant nodes

import requests
from bs4 import BeautifulSoup
#response = requests.get('http://www.baidu.com').text
html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a  class="sister" id="link1"><!---Elsa---></a>
    <a  class="sister" id="link2">Lacie</a> and
    <a  class="sister" id="link3">Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.p.contents)  # everything directly inside the p tag, including text nodes with line breaks, as a list
print(soup.p.string)  # None: the p tag contains several child nodes, not a single string, so .string has nothing unambiguous to return
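
The behaviour of .string can be compared with get_text() and stripped_strings in a small sketch. The markup is invented for this example, and html.parser is used so that it runs without lxml.

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup(
    "<p><b>only child</b></p>"
    "<p>Hello <b>world</b>!</p>"
    "<a><!--Elsa--></a>",
    "html.parser",
)
one, many = soup.find_all("p")  # the two p tags, in document order

print(one.string)                   # a single chain of children: 'only child'
print(many.string)                  # several children: None
print(many.get_text())              # all text joined: 'Hello world!'
print(list(many.stripped_strings))  # each piece of text, whitespace-stripped
print(isinstance(soup.a.string, Comment))  # a comment is returned as a Comment object
```

When a tag may contain mixed text and markup, get_text() is usually the safer choice, since it never returns None.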

Another way to get child nodes

import requests
from bs4 import BeautifulSoup
html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a  class="sister" id="link1"><!---Elsa---></a>
    <a  class="sister" id="link2">Lacie</a> and
    <a  class="sister" id="link3">Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.p.children)  # an iterator over the direct child nodes
for i,child in enumerate(soup.p.children):
    print(i,child)

Output:

<list_iterator object at 0x7fda5c186c88>
0 Once upon a time there were three little sisters;and their names lll
  
1 <a class="sister" id="link1"><!---Elsa---></a>
2 

3 <a class="sister" id="link2">Lacie</a>
4  and
    
5 <a class="sister" id="link3">Tille</a>
6 ;
    and They lived at the bottom of a well.

Descendant nodes

import requests
from bs4 import BeautifulSoup
#response = requests.get('http://www.baidu.com').text
html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a  class="sister" id="link1">
        <span>Elsle</span>
    </a>
    <a  class="sister" id="link2">Lacie</a> and
    <a  class="sister" id="link3">Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.p.descendants)
for i,child in enumerate(soup.p.descendants):
    print(i,child)

This returns every descendant node of the first p tag found.

<generator object descendants at 0x7f0b04eceaf0>
0 Once upon a time there were three little sisters;and their names lll
    
1 <a class="sister" id="link1">
<span>Elsle</span>
</a>
2 

3 <span>Elsle</span>
4 Elsle
5 

6 

7 <a class="sister" id="link2">Lacie</a>
8 Lacie
9  and
    
10 <a class="sister" id="link3">Tille</a>
11 Tille
12 ;
    and They lived at the bottom of a well.
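
As the output shows, both iterators yield text nodes (including whitespace) alongside tags. If only the tags are wanted, one option is to filter on the node type. A minimal sketch with made-up markup, using html.parser so it runs without lxml:

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<p>one <a><span>x</span></a> two</p>", "html.parser")
p = soup.p

children = list(p.children)        # direct children only: 'one ', <a>, ' two'
descendants = list(p.descendants)  # every level: 'one ', <a>, <span>, 'x', ' two'

# Keep only the Tag nodes and drop the text nodes.
tag_children = [node.name for node in p.children if isinstance(node, Tag)]
tag_descendants = [node.name for node in p.descendants if isinstance(node, Tag)]

print(len(children), len(descendants))
print(tag_children, tag_descendants)
```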

Parent and ancestor nodes

import requests
from bs4 import BeautifulSoup
html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a  class="sister" id="link1">
        <span>Elsle</span>
    </a>
    <a  class="sister" id="link2">Lacie</a> and
    <a  class="sister" id="link3">Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.a.parent)

Result: this finds the first a tag, then that tag's parent node, and prints the whole p tag together with everything inside it.

<p class="story">Once upon a time there were three little sisters;and their names lll
    <a class="sister" id="link1">
<span>Elsle</span>
</a>
<a class="sister" id="link2">Lacie</a> and
    <a class="sister" id="link3">Tille</a>;
    and They lived at the bottom of a well.</p>

Ancestor nodes

soup.a.parents  # the ancestors of the first a tag found, as an iterator that walks outward one level at a time: the p tag, the body tag, the html tag, and finally the BeautifulSoup object itself
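
Collecting the names of the ancestors makes the order visible. A minimal sketch with made-up markup, using html.parser so it runs without lxml:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><p><a>link</a></p></body></html>", "html.parser"
)

# Walk outward from the first a tag; the last item is the BeautifulSoup
# object itself, whose name is the placeholder '[document]'.
names = [ancestor.name for ancestor in soup.a.parents]
print(names)
```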

Sibling nodes

import requests
from bs4 import BeautifulSoup
html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a  class="sister" id="link1">
        <span>Elsle</span>
    </a>
    <a  class="sister" id="link2">Lacie</a> and
    <a  class="sister" id="link3">Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.next_siblings)))  # all sibling nodes after the first a tag
print(list(enumerate(soup.a.previous_siblings)))  # all sibling nodes before the first a tag
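
Siblings include the text between tags, which is often not what one expects. The sketch below, with markup invented for this example and html.parser used so it runs without lxml, shows the text siblings and one way to keep only the tags.

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup(
    '<p>Names: <a id="link1">Elsie</a>, <a id="link2">Lacie</a>'
    ' and <a id="link3">Tillie</a>.</p>',
    "html.parser",
)
first = soup.a

print(repr(first.previous_sibling))  # the text before the first link
print(repr(first.next_sibling))      # the text right after it, not the next a tag

# Keep only the Tag siblings to skip the text in between.
following = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]
print(following)
```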

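The tag selectors described above make it hard to pick out one specific element (they usually give you only the first match), so BeautifulSoup also provides standard selectors that, like CSS selectors, can search the document by tag name, attributes, and text content.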

Standard selectors

find_all(name,attrs,recursive,text,**kwargs)

name: the tag name
import requests
from bs4 import BeautifulSoup
html = """
<div class="pannel">
    <div class="pannel-heading">
        <h4>Hello</h4>
    </div>
    <div class="pannel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">That's ok</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">FOO</li>
            <li class="element">BAR</li>
        </ul>
    </div>
</div>            
"""
soup = BeautifulSoup(html,'lxml')
print(soup.find_all('ul'))  # find_all returns a list; here it holds every ul element together with its contents
print(type(soup.find_all('ul')[0]))

Output:

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">That's ok</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">FOO</li>
<li class="element">BAR</li>
</ul>]
<class 'bs4.element.Tag'>

Because each item in the list returned by find_all is a bs4.element.Tag, you can search inside it as well, which lets you nest lookups level by level.

import requests
from bs4 import BeautifulSoup
html = """
<div class="pannel">
    <div class="pannel-heading">
        <h4>Hello</h4>
    </div>
    <div class="pannel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">That's ok</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">FOO</li>
            <li class="element">BAR</li>
        </ul>
    </div>
</div>            
"""
soup = BeautifulSoup(html,'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))

Result: the li elements under each ul

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">That's ok</li>]
[<li class="element">FOO</li>, <li class="element">BAR</li>]

attrs: find_all(attrs={'name':'element'}) finds every element whose name attribute has the value element

import requests
from bs4 import BeautifulSoup
html = """
<div class="pannel">
    <div class="pannel-heading">
        <h4>Hello</h4>
    </div>
    <div class="pannel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">That's ok</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">FOO</li>
            <li class="element">BAR</li>
        </ul>
    </div>
</div>            
"""
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(attrs={"class":"list"}))  # attributes can also be passed as keyword arguments, e.g. id="list-1"; class is a Python keyword, so write class_="list"
print(soup.find_all(attrs={"id":"list-1"}))
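
The keyword forms mentioned in the comment can be sketched as follows. The markup is a shortened version of the example above, and html.parser is used so it runs without lxml. Note that a class filter matches any one of an element's classes.

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1"><li class="element">Foo</li></ul>
<ul class="list list-small" id="list-2"><li class="element">FOO</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find_all("ul", class_="list")            # both ul tags carry the class 'list'
by_id = soup.find_all(id="list-1")                       # keyword form of attrs={"id": "list-1"}
by_attrs = soup.find_all(attrs={"class": "list-small"})  # only the second ul

print([ul["id"] for ul in by_class])
print([ul["id"] for ul in by_id])
print([ul["id"] for ul in by_attrs])
```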

text: find_all(text="Foo")

import requests
from bs4 import BeautifulSoup
html = """
<div class="pannel">
    <div class="pannel-heading">
        <h4>Hello</h4>
    </div>
    <div class="pannel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">That's ok</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">FOO</li>
            <li class="element">BAR</li>
        </ul>
    </div>
</div>            
"""
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(text="Foo"))

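Return value: ['Foo']
This returns the matching strings, not the tags that contain them, so on its own it mostly tells you whether the text is present. It is of limited use.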
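
A text match can be made more useful by combining it with a regular expression, a tag name, or .parent. The sketch below assumes bs4 4.4.0 or later, where the same filter is also available under the argument name string. The markup is shortened from the example above, and html.parser is used so it runs without lxml.

```python
import re
from bs4 import BeautifulSoup

html = (
    '<ul><li class="element">Foo</li>'
    '<li class="element">Bar</li>'
    '<li class="element">FOO</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

# Case-insensitive match on the text; the results are strings, not tags.
matches = soup.find_all(string=re.compile(r"^foo$", re.IGNORECASE))
print(matches)

# .parent climbs from each matched string to the tag that contains it.
parents = [s.parent.name for s in matches]
print(parents)

# Combined with a tag name, find_all returns the tags themselves.
tags = soup.find_all("li", string="Foo")
print(tags)
```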

find(name,attrs,recursive,text,**kwargs)

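Returns the first matching element, or None if nothing matches; find_all returns a list of all matching elements.
Usage mirrors find_all, so it is not walked through in detail here.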
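
The practical difference is what comes back when nothing matches. A minimal sketch with made-up markup, using html.parser so it runs without lxml:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>Foo</li><li>Bar</li></ul>", "html.parser")

first_li = soup.find("li")           # the first match only
missing = soup.find("table")         # no match: None
all_tables = soup.find_all("table")  # no match: an empty list

print(first_li.get_text())
print(missing)
print(all_tables)

# Guard before chaining; calling a method on None raises AttributeError.
caption = missing.get_text() if missing is not None else ""
```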

find_parents(), find_parent(): used in the same way as find_all() and find()

Return all matching ancestor nodes, and the nearest matching ancestor (by default, the direct parent).

find_next_siblings(), find_next_sibling()

Return all following sibling nodes, and the first following sibling node.

find_previous_siblings(), find_previous_sibling()

Return all preceding sibling nodes, and the first preceding sibling node.

find_all_next(), find_next()

Return all matching nodes after this node, and the first matching node after it.

find_all_previous(), find_previous()

Return all matching nodes before this node, and the first matching node before it.
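
These methods all start from an existing node and search relative to it. A short sketch of a few of them, with markup invented for this example and html.parser used so it runs without lxml:

```python
from bs4 import BeautifulSoup

html = (
    '<ul id="list-1"><li>Foo</li><li>Bar</li><li>Baz</li></ul>'
    '<ul id="list-2"><li>FOO</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")
bar = soup.find("li", string="Bar")  # starting point: the middle li of list-1

parent_id = bar.find_parent("ul")["id"]                              # nearest ul ancestor
after = [li.get_text() for li in bar.find_next_siblings("li")]       # later siblings only
before = [li.get_text() for li in bar.find_previous_siblings("li")]  # earlier siblings only
all_after = [li.get_text() for li in bar.find_all_next("li")]        # later in the whole document
next_ul = bar.find_next("ul")["id"]                                  # first ul after this node

print(parent_id, after, before, all_after, next_ul)
```

The sibling methods stay inside the same parent, while find_all_next() and find_next() continue through the rest of the document.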

CSS selectors

Pass a CSS selector directly to select() to make the selection.

import requests
from bs4 import BeautifulSoup
html = """
<div class="pannel">
    <div class="pannel-heading">
        <h4>Hello</h4>
    </div>
    <div class="pannel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">That's ok</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">FOO</li>
            <li class="element">BAR</li>
        </ul>
    </div>
</div>            
"""
soup = BeautifulSoup(html,'lxml')
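print(soup.select('.pannel .pannel-heading'))  # elements with class pannel-heading inside an element with class pannel
print(soup.select('ul li'))  # li elements inside ul elements, including their contents
print(soup.select('#list-2 .element'))  # elements with class element inside the element with id list-2

Output:

[<div class="pannel-heading">
<h4>Hello</h4>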
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">That's ok</li>, <li class="element">FOO</li>, <li class="element">BAR</li>]
[<li class="element">FOO</li>, <li class="element">BAR</li>]
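
A few more selector forms, sketched under the assumption of a bs4 version that provides select_one() (4.4.0 or later). The markup is shortened from the example above, and html.parser is used so it runs without lxml.

```python
from bs4 import BeautifulSoup

html = """
<div class="pannel-body">
  <ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
  <ul class="list list-small" id="list-2"><li class="element">FOO</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one("#list-1 li")            # first match only, like find()
small = soup.select("ul.list.list-small > li")   # two classes plus the child combinator
by_attr = soup.select('ul[id="list-2"] li')      # attribute selector
missing = soup.select_one("table")               # no match: None

print(first.get_text())
print([li.get_text() for li in small])
print([li.get_text() for li in by_attr])
print(missing)
```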

Getting attributes


import requests
from bs4 import BeautifulSoup
html = """
<div class="pannel">
    <div class="pannel-heading">
        <h4>Hello</h4>
    </div>
    <div class="pannel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">That's ok</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">FOO</li>
            <li class="element">BAR</li>
        </ul>
    </div>
</div>            
"""
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul['id'])  # the value of the id attribute of each ul
    print(ul.attrs['id'])  # the same value via attrs; either form works for any attribute

Getting text with get_text()

import requests
from bs4 import BeautifulSoup
html = """
<div class="pannel">
    <div class="pannel-heading">
        <h4>Hello</h4>
    </div>
    <div class="pannel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">That's ok</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">FOO</li>
            <li class="element">BAR</li>
        </ul>
    </div>
</div>            
"""
soup = BeautifulSoup(html,'lxml')
for li in soup.select('li'):
    print(li.get_text())

Output:

Foo
Bar
That's ok
FOO
BAR
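
get_text() also accepts separator and strip arguments, which help when the text of a whole container is wanted in one string. A minimal sketch on part of the markup above, using html.parser so it runs without lxml:

```python
from bs4 import BeautifulSoup

html = """
<ul id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">That's ok</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

raw = ul.get_text()                               # keeps the line breaks and indentation
joined = ul.get_text(separator=", ", strip=True)  # strips each piece, then joins them
items = [li.get_text(strip=True) for li in ul.find_all("li")]

print(repr(raw))
print(joined)
print(items)
```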

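Summary

  • The lxml parser is recommended; fall back to html.parser or html5lib when needed
  • Tag selectors are fast but offer little filtering
  • Use find() and find_all() to match a single result or multiple results
  • If you are comfortable with CSS selectors, use select()
  • Remember the common ways to get attributes and text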