BeautifulSoup
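BeautifulSoup is a flexible and convenient library for parsing web pages. It is efficient and supports several parsers. With it you can extract information from a page without writing regular expressions.
Installing BeautifulSoup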
pip3 install beautifulsoup4
Using BeautifulSoup
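Parsers
| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(markup,"html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 and Python 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup,"lxml") | Fast; tolerant of malformed documents | Requires installing a C library |
| lxml XML parser | BeautifulSoup(markup,"xml") | Fast; the only parser that supports XML | Requires installing a C library |
| html5lib | BeautifulSoup(markup,"html5lib") | Best tolerance of malformed documents; parses the way a browser does; produces HTML5 documents | Slow; pure Python (no C extension needed), but must be installed as a separate package |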
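Because lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when they are missing. The following is a minimal sketch under that assumption; `make_soup` and its `preferred` parameter are hypothetical names introduced only for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    `make_soup` and `preferred` are illustrative names, not part of bs4.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # Raised by Beautiful Soup when the requested parser is not installed.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup = make_soup("<p>Hello")
print(soup.p.string)  # Hello
```

Different parsers can build slightly different trees from malformed markup, so a fallback like this is only safe when the code does not depend on parser-specific repairs.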
Basic usage
import requests
from bs4 import BeautifulSoup
response = requests.get('http://www.baidu.com').text
soup = BeautifulSoup(response,'lxml')
print(soup.prettify())  # prettify() formats the output with indentation; tags left unclosed in the source appear closed
print(soup.title.string)  # print the text of the title inside head
Tag selectors
Selecting elements
import requests
from bs4 import BeautifulSoup
#response = requests.get('http://www.baidu.com').text
html = """
<html><head><title>This is a test Html code</title></head>
<body>
<p class="title" name="dropmouse"><b>The doc story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1"><!---Elsa---></a>
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tille</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
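print(soup.title)  # the title element, printed together with its tags
print(type(soup.title))  # <class 'bs4.element.Tag'>
print(soup.head)  # the head element
print(soup.p)  # only the first p tag found
print(soup.p.name)  # the name of the tag, here simply 'p'
Getting the name
See the example above (soup.p.name).
Getting attributes
This is somewhat similar to jQuery.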
import requests
from bs4 import BeautifulSoup
#response = requests.get('http://www.baidu.com').text
html = """
<html><head><title>This is a test Html code</title></head>
<body>
<p class="title" name="dropmouse"><b>The doc story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1"><!---Elsa---></a>
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tille</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
# soup.p.attrs is a dict of the tag's attributes: {'class': ['title'], 'name': 'dropmouse'}
print(soup.p.attrs['name'])  # value of the name attribute of the first p tag found: dropmouse
print(soup.p['name'])  # also dropmouse, same result as the line above
Getting content
For example, getting the content of a p tag:
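A few details of attribute access are easy to trip over: `class` comes back as a list, and indexing a missing attribute raises `KeyError`. The sketch below illustrates this on a small made-up snippet; it uses "html.parser" only so that it runs without extra installs.

```python
from bs4 import BeautifulSoup

html = '<p class="title intro" name="dropmouse"><b>The doc story</b></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

print(p["name"])           # dropmouse
print(p["class"])          # ['title', 'intro'] (class is multi-valued, so it is a list)
print(p.get("id"))         # None, whereas p["id"] would raise KeyError
print(p.has_attr("name"))  # True
```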
import requests
from bs4 import BeautifulSoup
#response = requests.get('http://www.baidu.com').text
html = """
<html><head><title>This is a test Html code</title></head>
<body>
<p class="title" name="dropmouse"><b>The doc story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1"><!---Elsa---></a>
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tille</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.p.string)  # appending .string to a selection gives the text inside the tag, without any HTML tags
Nested selection
A 'bs4.element.Tag' can itself be used to select child tags inside that Tag. For example:
import requests
from bs4 import BeautifulSoup
#response = requests.get('http://www.baidu.com').text
html = """
<html><head><title>This is a test Html code</title></head>
<body>
<p class="title" name="dropmouse"><b>The doc story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1"><!---Elsa---></a>
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tille</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.body.p.string)  # chained access, again similar to jQuery
Child and descendant nodes
import requests
from bs4 import BeautifulSoup
#response = requests.get('http://www.baidu.com').text
html = """
<html><head><title>This is a test Html code</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1"><!---Elsa---></a>
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tille</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.p.contents)  # everything directly inside the p tag, including newline text nodes, as a list
print(soup.p.string)  # None: the p tag contains several child nodes (text and nested tags), so there is no single string to return
Another way to get child nodes
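When `.string` returns None because a tag has several children, `get_text()` is the usual way to collect the text anyway. The snippet below is a small sketch of the difference on made-up markup, parsed with "html.parser" so it needs no extra install.

```python
from bs4 import BeautifulSoup

html = '<p class="story">Once upon a time <a id="link2">Lacie</a> and <a id="link3">Tille</a> lived.</p>'
soup = BeautifulSoup(html, "html.parser")

print(soup.p.string)      # None: p has several children
print(soup.a.string)      # Lacie: a has exactly one string child
print(soup.p.get_text())  # Once upon a time Lacie and Tille lived.
```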
import requests
from bs4 import BeautifulSoup
html = """
<html><head><title>This is a test Html code</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1"><!---Elsa---></a>
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tille</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.p.children)  # an iterator over the direct child nodes
for i, child in enumerate(soup.p.children):
    print(i, child)
Output:
<list_iterator object at 0x7fda5c186c88>
0 Once upon a time there were three little sisters; and their names were
1 <a class="sister" id="link1"><!---Elsa---></a>
2
3 <a class="sister" id="link2">Lacie</a>
4 and
5 <a class="sister" id="link3">Tille</a>
6 ;
and they lived at the bottom of a well.
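As the output shows, direct children include text nodes (often just whitespace) as well as tags. If only the tags are wanted, one option is to filter by type. This is a minimal sketch on simplified markup, using "html.parser" to avoid extra installs.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<p class="story">Once upon a time
<a class="sister" id="link1">Elsa</a>
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tille</a>;
</p>"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.p.contents))  # 7: four text nodes and three <a> tags

# Keep only the children that are tags, dropping the text nodes.
child_tags = [child for child in soup.p.children if isinstance(child, Tag)]
print([tag["id"] for tag in child_tags])  # ['link1', 'link2', 'link3']
```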
Descendant nodes
import requests
from bs4 import BeautifulSoup
#response = requests.get('http://www.baidu.com').text
html = """
<html><head><title>This is a test Html code</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">
<span>Elsle</span>
</a>
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tille</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.p.descendants)  # a generator over all descendant nodes
for i, child in enumerate(soup.p.descendants):
    print(i, child)
This returns every descendant node of the first p found. Output:
<generator object descendants at 0x7f0b04eceaf0>
0 Once upon a time there were three little sisters; and their names were
1 <a class="sister" id="link1">
<span>Elsle</span>
</a>
2
3 <span>Elsle</span>
4 Elsle
5
6
7 <a class="sister" id="link2">Lacie</a>
8 Lacie
9 and
10 <a class="sister" id="link3">Tille</a>
11 Tille
12 ;
and they lived at the bottom of a well.
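Since `.descendants` walks tags and text nodes alike, it often helps to separate the two. The sketch below, on simplified made-up markup, lists the descendant tags in document order and then the non-empty strings via `stripped_strings`.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<p class="story">Three sisters:
<a id="link1"><span>Elsle</span></a>
<a id="link2">Lacie</a> and
<a id="link3">Tille</a>.</p>"""
soup = BeautifulSoup(html, "html.parser")

# Names of every descendant that is a tag, in document order.
tag_names = [node.name for node in soup.p.descendants if isinstance(node, Tag)]
print(tag_names)  # ['a', 'span', 'a', 'a']

# All text below <p>, with surrounding whitespace stripped and empty strings skipped.
texts = list(soup.p.stripped_strings)
print(texts)  # ['Three sisters:', 'Elsle', 'Lacie', 'and', 'Tille', '.']
```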
Parent and ancestor nodes
import requests
from bs4 import BeautifulSoup
html = """
<html><head><title>This is a test Html code</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">
<span>Elsle</span>
</a>
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tille</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.a.parent)
Output: this finds the first a tag, then that tag's parent node, and prints the whole p tag together with everything inside it.
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">
<span>Elsle</span>
</a>
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tille</a>;
and they lived at the bottom of a well.</p>
Ancestor nodes
soup.a.parents  # an iterator over all ancestors of the first a tag, walking outward level by level: the p tag, the body tag, the html tag, and finally the BeautifulSoup object itself
Sibling nodes
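Because `.parents` is an iterator, it has to be looped over or converted to a list to see what it yields. A minimal sketch on a reduced document, parsed with "html.parser":

```python
from bs4 import BeautifulSoup

html = "<html><body><p class='story'><a id='link1'>Elsa</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.a.parent.name)  # p

ancestor_names = [tag.name for tag in soup.a.parents]
print(ancestor_names)  # ['p', 'body', 'html', '[document]']
```

The last entry, `[document]`, is the name of the BeautifulSoup object at the top of the tree.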
import requests
from bs4 import BeautifulSoup
html = """
<html><head><title>This is a test Html code</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">
<span>Elsle</span>
</a>
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tille</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
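print(list(enumerate(soup.a.next_siblings)))  # all sibling nodes that come after the first a tag
print(list(enumerate(soup.a.previous_siblings)))  # all sibling nodes that come before the first a tag
With the tag selectors above it is hard to pick out one specific element precisely (they usually return only the first match), so BeautifulSoup also provides standard selectors which, like CSS selectors, can search the document by tag name, attributes, and content.
Standard selectors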
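Sibling iteration also yields text nodes, so the immediate `next_sibling` of a tag is frequently a piece of text rather than the next tag. The following sketch, on simplified markup, shows one way to keep only the sibling tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p><a id="link1">Elsa</a>, <a id="link2">Lacie</a> and <a id="link3">Tille</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.a
print(repr(first.next_sibling))  # ', ' (a text node, not the next <a>)

# Filter the following siblings down to tags only.
later_tags = [s for s in first.next_siblings if isinstance(s, Tag)]
print([t["id"] for t in later_tags])  # ['link2', 'link3']

print(list(first.previous_siblings))  # [] because the first <a> has nothing before it
```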
find_all(name,attrs,recursive,text,**kwargs)
name--tag name
import requests
from bs4 import BeautifulSoup
html = """
<div class="pannel">
<div class="pannel-heading">
<h4>Hello</h4>
</div>
<div class="pannel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">That's ok</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">FOO</li>
<li class="element">BAR</li>
</ul>
</div>
</div>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.find_all('ul'))  # find_all returns a list; here it holds every ul found, including everything inside each ul
print(type(soup.find_all('ul')[0]))
Output:
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">That's ok</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">FOO</li>
<li class="element">BAR</li>
</ul>]
<class 'bs4.element.Tag'>
Because every item in the list returned by find_all is an element.Tag, you can go on to search the child nodes of each Tag. This allows nested, level-by-level searches.
import requests
from bs4 import BeautifulSoup
html = """
<div class="pannel">
<div class="pannel-heading">
<h4>Hello</h4>
</div>
<div class="pannel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">That's ok</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">FOO</li>
<li class="element">BAR</li>
</ul>
</div>
</div>
"""
soup = BeautifulSoup(html,'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
Output: all li elements under each ul
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">That's ok</li>]
[<li class="element">FOO</li>, <li class="element">BAR</li>]
attrs--attributes
find_all(attrs={'name':'element'}) finds all elements whose name attribute has the value element
import requests
from bs4 import BeautifulSoup
html = """
<div class="pannel">
<div class="pannel-heading">
<h4>Hello</h4>
</div>
<div class="pannel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">That's ok</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">FOO</li>
<li class="element">BAR</li>
</ul>
</div>
</div>
"""
soup = BeautifulSoup(html,'lxml')
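print(soup.find_all(attrs={"class":"list"}))  # special attributes such as class and id can also be passed as class_="list" and id="list-1"
print(soup.find_all(attrs={"id":"list-1"}))
text--text content
find_all(text="Foo") (since Beautiful Soup 4.4.0 this argument is also available under the name string)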
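The keyword forms mentioned in the comment above can be sketched as follows. The markup is a shortened version of the sample, and "html.parser" is used only so the snippet runs without extra installs.

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1"><li class="element">Foo</li></ul>
<ul class="list list-small" id="list-2"><li class="element">FOO</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find_all(class_="list")  # matches both <ul>: any one of a tag's classes may match
by_id = soup.find_all(id="list-1")
by_attrs = soup.find_all(attrs={"class": "list-small"})

print([ul["id"] for ul in by_class])  # ['list-1', 'list-2']
print([ul["id"] for ul in by_id])     # ['list-1']
print([ul["id"] for ul in by_attrs])  # ['list-2']
```

The trailing underscore in `class_` is needed because `class` is a reserved word in Python.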
import requests
from bs4 import BeautifulSoup
html = """
<div class="pannel">
<div class="pannel-heading">
<h4>Hello</h4>
</div>
<div class="pannel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">That's ok</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">FOO</li>
<li class="element">BAR</li>
</ul>
</div>
</div>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(text="Foo"))
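Return value: ['Foo']
This matches the text itself rather than the element: the result is a list of the matching strings, so on its own it mostly tells you whether the target text is present. It is of limited use for locating elements.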
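Text matching becomes more useful when combined with a tag name or when the matched string's `.parent` is followed back to its element. The sketch below assumes Beautiful Soup 4.4.0 or later, where the argument is named `string`, and uses a regular expression for a case-insensitive match.

```python
import re

from bs4 import BeautifulSoup

html = """
<ul id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul id="list-2"><li class="element">FOO</li><li class="element">BAR</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Strings whose text matches the pattern, regardless of case.
matches = soup.find_all(string=re.compile("foo", re.IGNORECASE))
print(matches)                           # ['Foo', 'FOO']
print([s.parent.name for s in matches])  # ['li', 'li']

# Combining a tag name with string= returns the tags instead of the strings.
tags = soup.find_all("li", string="Foo")
print(tags)  # [<li class="element">Foo</li>]
```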
find(name,attrs,recursive,text,**kwargs)
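Returns the first matching element, or None if nothing matches; find_all returns a list of all matching elements.
Its usage mirrors find_all, so it is not walked through in detail here.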
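The one behavior worth keeping in mind is the None result, since calling a method on it raises an AttributeError. A brief sketch on made-up markup:

```python
from bs4 import BeautifulSoup

html = '<ul id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>'
soup = BeautifulSoup(html, "html.parser")

first_li = soup.find("li")
print(first_li.string)  # Foo

table = soup.find("table")
print(table)  # None, because the document has no <table>

# Guard against None before chaining further lookups.
rows = table.find_all("tr") if table is not None else []
print(rows)  # []
```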
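find_parents(), find_parent(): these work like find_all() and find()
Return all matching ancestor nodes, and the nearest matching ancestor (the direct parent when no filter is given), respectively
find_next_siblings(), find_next_sibling()
Return all following sibling nodes, and the first following sibling node, respectively
find_previous_siblings(), find_previous_sibling()
Return all preceding sibling nodes, and the first preceding sibling node, respectively
find_all_next(), find_next()
Return all matching nodes after the current node, and the first matching node after it, respectively
find_all_previous(), find_previous()
Return all matching nodes before the current node, and the first matching node before it, respectively
CSS selectors
Pass a CSS selector straight to select() to make a selection.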
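These methods all start from an existing node rather than from the whole document. The sketch below exercises several of them on a shortened version of the sample markup; the explicit tag-name filters are a choice made for this example.

```python
from bs4 import BeautifulSoup

html = """
<div class="pannel-body">
<ul class="list" id="list-1"><li>Foo</li><li>Bar</li><li>Baz</li></ul>
<ul class="list list-small" id="list-2"><li>FOO</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
first_li = soup.find("li")
second_ul = soup.find(id="list-2")

print(first_li.find_parent("div")["class"])                     # ['pannel-body']
print(first_li.find_next_sibling("li").string)                  # Bar
print([li.string for li in first_li.find_next_siblings("li")])  # ['Bar', 'Baz']
print(second_ul.find_previous_sibling("ul")["id"])              # list-1
print(first_li.find_next("ul")["id"])                           # list-2
print([li.string for li in first_li.find_all_next("li")])       # ['Bar', 'Baz', 'FOO']
```

Note the difference between the sibling methods, which stay at one level of the tree, and find_next() / find_all_next(), which follow document order and can leave the current parent.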
import requests
from bs4 import BeautifulSoup
html = """
<div class="pannel">
<div class="pannel-heading">
<h4>Hello</h4>
</div>
<div class="pannel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">That's ok</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">FOO</li>
<li class="element">BAR</li>
</ul>
</div>
</div>
"""
soup = BeautifulSoup(html,'lxml')
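print(soup.select('.pannel .pannel-heading'))  # elements with class pannel-heading inside an element with class pannel
print(soup.select('ul li'))  # li tags nested under ul tags, including their content
print(soup.select('#list-2 .element'))  # elements with class element under the element with id=list-2
Output:
[<div class="pannel-heading">
<h4>Hello</h4>
</div>]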
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">That's ok</li>, <li class="element">FOO</li>, <li class="element">BAR</li>]
[<li class="element">FOO</li>, <li class="element">BAR</li>]
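select() always returns a list; when only the first match is needed, `select_one()` returns a single tag or None. The following sketch, on a shortened version of the sample markup, also shows a child combinator and a compound class selector.

```python
from bs4 import BeautifulSoup

html = """
<div class="pannel">
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list list-small" id="list-2"><li class="element">FOO</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_item = soup.select_one("#list-1 > li")  # first <li> that is a direct child of #list-1
print(first_item.get_text())  # Foo

small_lists = soup.select("ul.list.list-small")  # <ul> carrying both classes
print([ul["id"] for ul in small_lists])  # ['list-2']

print(soup.select("table"))      # [] when nothing matches
print(soup.select_one("table"))  # None when nothing matches
```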
Getting attributes
import requests
from bs4 import BeautifulSoup
html = """
<div class="pannel">
<div class="pannel-heading">
<h4>Hello</h4>
</div>
<div class="pannel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">That's ok</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">FOO</li>
<li class="element">BAR</li>
</ul>
</div>
</div>
"""
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul['id'])  # the value of the id attribute of each ul
    print(ul.attrs['id'])  # same value as the line above; either form works for any attribute
Getting content with get_text()
import requests
from bs4 import BeautifulSoup
html = """
<div class="pannel">
<div class="pannel-heading">
<h4>Hello</h4>
</div>
<div class="pannel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">That's ok</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">FOO</li>
<li class="element">BAR</li>
</ul>
</div>
</div>
"""
soup = BeautifulSoup(html,'lxml')
for li in soup.select('li'):
    print(li.get_text())
Output:
Foo
Bar
That's ok
FOO
BAR
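get_text() can also be called on a container, in which case the whitespace between child tags ends up in the result. Its `separator` and `strip` arguments help with that; the sketch below shows them on one list from the sample markup.

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">That's ok</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

# Strip each piece of text, drop the empty ones, and join the rest.
joined = ul.get_text(separator=", ", strip=True)
print(joined)  # Foo, Bar, That's ok

# Per-element text, as in the loop above, collected into a list.
items = [li.get_text() for li in ul.select("li")]
print(items)  # ['Foo', 'Bar', "That's ok"]
```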
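Summary
- Prefer the lxml parser; fall back to html.parser or html5lib when necessary
- Tag selectors are fast but offer only weak filtering
- Use find() and find_all() to match a single result or multiple results
- If you are comfortable with CSS selectors, use select()
- Remember the common ways of getting attributes and text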