BS4
BeautifulSoup is a Python library for extracting data from HTML or XML. It converts a document into a tree structure (DOM) in which every node is a Python object of one of the following four types:
- BeautifulSoup
<class 'bs4.BeautifulSoup'>
- Tag
<class 'bs4.element.Tag'>
- NavigableString
<class 'bs4.element.NavigableString'>
- Comment
<class 'bs4.element.Comment'>
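One way to picture how these four classes relate is in terms of sets (this is not strictly accurate as a description of the classes themselves):
- BeautifulSoup is the universal set (a BeautifulSoup object is created by passing the document in as an argument) and contains the Tag subset
- Tag contains the NavigableString subset
- Comment is a special kind of NavigableString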
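The set picture is only an intuition. As a rough check of how the four classes are actually related in Python, the following sketch inspects the class hierarchy directly. It assumes nothing beyond bs4 being installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag, NavigableString, Comment

# BeautifulSoup is itself a kind of Tag (the root of the tree).
soup_is_tag = issubclass(BeautifulSoup, Tag)

# Comment is a specialised NavigableString.
comment_is_string = issubclass(Comment, NavigableString)

# A NavigableString is not a Tag: it holds text, not child nodes.
string_is_tag = issubclass(NavigableString, Tag)

print([soup_is_tag, comment_is_string, string_is_tag])  # [True, True, False]
```

In class terms the containment runs the other way round from the set picture: BeautifulSoup derives from Tag, which is why every Tag method shown later also works on the soup object.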
Usage
The first argument to BeautifulSoup is the document; the second specifies which parser to use for it.
from bs4 import BeautifulSoup
import requests, re
url = 'http://m.kdslife.com/club/'
# get whole HTTP response
response = requests.get(url)
# first argument: the HTML document; second argument: the parser ('lxml'); returns a BeautifulSoup object
soup = BeautifulSoup( response.text, 'lxml')
print soup.name
# [document]
print type(soup)
# <class 'bs4.BeautifulSoup'>
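The listing above needs network access and the lxml package. For experimenting offline, here is a minimal sketch that parses an inline document with Python's built-in 'html.parser'. The HTML snippet is made up for illustration and only imitates the structure of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.text (not the real page).
html = """
<html><head><title>KDS Life</title></head>
<body><section class="forum-head-hot clearfix"><p>hello</p></section></body>
</html>
"""

# 'html.parser' ships with Python, so no extra parser has to be installed.
soup = BeautifulSoup(html, 'html.parser')

print(soup.name)              # [document]
print(soup.title.string)      # KDS Life
print(soup.section['class'])
```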
Sample code for Tag objects
# BeautifulSoup --> Tag
# get the Tag object (title)
res = soup.title
print res
# <title>KDS Life</title>
res = soup.title.name
print res
# title
# attributes of a Tag object
res = soup.section
print type(res)
# <class 'bs4.element.Tag'>
print res['class']
# ['forum-head-hot', 'clearfix']
# all the attributes of the section Tag object, returned as a dict
print res.attrs
#{'class': ['forum-head-hot', 'clearfix']}
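Attribute access behaves much like a dictionary. The sketch below illustrates a few related calls on a made-up one-tag document: reading a missing attribute safely, testing for presence, and assigning a new attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document, parsed with the built-in parser.
soup = BeautifulSoup('<section class="forum-head-hot clearfix"></section>',
                     'html.parser')
tag = soup.section

# Indexing a missing attribute raises KeyError; .get() returns None instead.
missing = tag.get('id')

# has_attr() tests for presence without fetching the value.
has_class = tag.has_attr('class')

# Attributes can be assigned like dictionary entries.
tag['id'] = 'hot'

print(missing)    # None
print(has_class)  # True
print(tag)
```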
Sample code for NavigableString objects
# a NavigableString object holds the string inside a Tag object
res = soup.title
print res.string
# KDS Life
print type(res.string)
# <class 'bs4.element.NavigableString'>
Sample code for Comment objects
# Comment is a special kind of NavigableString
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'lxml')
comment = soup.b.string
print type(comment)
# <class 'bs4.element.Comment'>
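Because Comment is a subclass of NavigableString, code that collects text has to test for Comment first if comments should be kept apart. A minimal sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical fragment mixing ordinary text and a comment.
markup = "<p>visible text<!--hidden note--></p>"
soup = BeautifulSoup(markup, 'html.parser')

visible = []
comments = []
for node in soup.p.children:
    # Check Comment first: every Comment is also a NavigableString.
    if isinstance(node, Comment):
        comments.append(str(node))
    elif isinstance(node, NavigableString):
        visible.append(str(node))

print(visible)
print(comments)
```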
BS4 Parser
When no parser is specified, BeautifulSoup picks one automatically in this order of priority: 'lxml' --> 'html5lib' --> 'html.parser'
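Since the automatic choice depends on which parsers happen to be installed, the same script can build different trees on different machines. A small sketch of naming the parser explicitly, using the built-in 'html.parser' and a deliberately malformed snippet:

```python
from bs4 import BeautifulSoup

# Naming the parser keeps results the same on every machine;
# 'html.parser' is always available because it is in the standard library.
soup = BeautifulSoup('<a></p>', 'html.parser')

# html.parser ignores the stray </p> and closes the open <a>.
print(soup)
```

Other parsers may repair the same input differently, so it is worth fixing the parser before comparing results.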
Commonly used Tag object methods
find_all()
find_all(name,attrs,recursive,text,**kwargs)
No explanation needed; go straight to the code:
# filter; returns a list of matches
# returns [] if nothing matches
title = soup.find_all('title')
print title
#[<title>Google</title>]
# a string passed as the second argument is matched against the CSS class
res = soup.find_all('div', 'topAd')
print res
# find all the elements whose id is 'topAd'
res = soup.find_all(id='topAd')
print res
#[<div id="topAd">...</div>]
# find all the 'img' tags whose 'src' attribute matches the given pattern
res = soup.find_all('img', src=re.compile(r'^http://club-img',re.I))
print res
# [<img src="..."/>, ...]
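The filters shown above can be tried without the live page. The sketch below runs a few find_all() variations on a made-up fragment; the example.com URLs are placeholders, not addresses from the real site.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment; the URLs are placeholders.
html = """
<div id="topAd" class="topAd">ad</div>
<img src="http://club-img.example.com/a.png">
<img src="http://other.example.com/b.png">
<img alt="no source">
"""
soup = BeautifulSoup(html, 'html.parser')

# class_ (with a trailing underscore) filters on the CSS class.
by_class = soup.find_all('div', class_='topAd')

# A compiled regular expression is matched against the attribute value.
by_regex = soup.find_all('img', src=re.compile(r'^http://club-img', re.I))

# src=True keeps every <img> that has a src attribute at all.
with_src = soup.find_all('img', src=True)

# limit stops the search after the given number of matches.
first_img = soup.find_all('img', limit=1)

print([len(by_class), len(by_regex), len(with_src), len(first_img)])  # [1, 1, 2, 1]
```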
select()
# CSS selector
# select the element whose id is 'wrapperto'
res = soup.select('#wrapperto')
print res
# [<div class="swiper-wrapper clearfix" id="wrapperto"></div>]
# select the 'img' tags that have a 'src' attribute
res = soup.select('img[src]')
print res
# [<img src="http://icon.pch-img.net/kds/club_m/club/icon/user1.png"/>, <img src="http://club-img.kdslife.com/attach/1k0/gs/a/o41gty-1coa.png@0o_1l_600w_90q.src"/>]
# select the 'img' tags whose 'src' attribute equals the given URL (the value must be quoted)
res = soup.select('img[src="http://icon.pch-img.net/kds/club_m/club/icon/user1.png"]')
print res
# [<img src="http://icon.pch-img.net/kds/club_m/club/icon/user1.png"/>]
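A few more selector forms, sketched on a made-up fragment (the example.com URLs are placeholders). It assumes a bs4 version recent enough to provide select_one() and the standard CSS attribute operators.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the URLs are placeholders.
html = """
<div class="swiper-wrapper clearfix" id="wrapperto"></div>
<img src="http://icon.example.com/user1.png">
<img src="http://club-img.example.com/a.png">
"""
soup = BeautifulSoup(html, 'html.parser')

# select_one() returns the first match (or None) instead of a list.
wrapper = soup.select_one('#wrapperto')

# Quote attribute values that contain characters such as ':' or '/'.
exact = soup.select('img[src="http://icon.example.com/user1.png"]')

# ^= matches attribute values that start with the given prefix.
prefixed = soup.select('img[src^="http://club-img"]')

print(wrapper['class'])
print(exact)
print(prefixed)
```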
Other
# get_text()
markup = '<a >\n a link to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup,'lxml')
res = soup.get_text()
print res
# a link to example.com
res = soup.i.get_text()
print res
# example.com
# .stripped_strings
res = soup.stripped_strings
print list(res)
# [u'a link to', u'example.com']
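get_text() also accepts a separator and a strip flag, which together give roughly the same pieces as .stripped_strings joined into one string. A short sketch on the same markup, parsed with the built-in parser:

```python
from bs4 import BeautifulSoup

markup = '<a>\n a link to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')

# The separator is placed between text pieces; strip=True trims each
# piece and drops the ones that are empty after trimming.
joined = soup.get_text('|', strip=True)

# The same result assembled by hand from .stripped_strings.
by_hand = '|'.join(soup.stripped_strings)

print(joined)  # a link to|example.com
```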
Finally, here is a simple KDS image crawler.
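The crawler listing itself is not part of this text. As a stand-in, here is a minimal sketch of the extraction step such a crawler might use, written for Python 3. The function name extract_image_urls, its parameters and the sample fragment are all hypothetical, and no request is actually sent.

```python
import re
from urllib.parse import urljoin  # Python 3 location of urljoin

from bs4 import BeautifulSoup


def extract_image_urls(html, base_url, pattern=r'^http://club-img'):
    """Return absolute URLs of the <img> tags whose src matches pattern."""
    soup = BeautifulSoup(html, 'html.parser')
    urls = []
    for img in soup.find_all('img', src=re.compile(pattern, re.I)):
        # urljoin leaves absolute URLs untouched and resolves relative ones.
        urls.append(urljoin(base_url, img['src']))
    return urls


# Hypothetical fragment standing in for a downloaded forum page.
sample = """
<img src="http://club-img.kdslife.com/attach/a.png">
<img src="http://icon.pch-img.net/kds/club_m/club/icon/user1.png">
"""
found = extract_image_urls(sample, 'http://m.kdslife.com/club/')
print(found)
```

To use it on a real page, the html argument would be requests.get(url).text, and each returned URL could then be fetched and written to disk.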
Note
- BeautifulSoup detects the document's encoding and converts the document to Unicode automatically. The soup.original_encoding attribute holds the encoding it detected.
- Input is converted to Unicode; output is encoded as UTF-8.
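- BeautifulSoup has no XPath support of its own; when XPath expressions are needed, it can be used alongside lxml, which does evaluate them.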
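The encoding notes can be checked with a short sketch. The byte string below is made up, and it declares its encoding in a meta tag so that detection does not depend on which optional detection libraries are installed.

```python
from bs4 import BeautifulSoup

# UTF-8 bytes with the encoding declared in a <meta> tag (made-up document).
raw = (b'<html><head><meta charset="utf-8"></head>'
       b'<body><h1>Sacr\xc3\xa9 bleu!</h1></body></html>')
soup = BeautifulSoup(raw, 'html.parser')

# The encoding BeautifulSoup settled on while converting bytes to Unicode.
print(soup.original_encoding)

# Text nodes are Unicode strings after parsing.
heading = soup.h1.string

# encode() serialises back to bytes: UTF-8 unless another encoding is given.
as_utf8 = soup.h1.encode()
as_latin1 = soup.h1.encode('latin-1')
```

When the input is already a Unicode string rather than bytes, no detection takes place and original_encoding is None.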