Example: crawling a static web page:
from bs4 import BeautifulSoup
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
}
url = 'http://news.baidu.com/'

# Get the news titles
def craw2(url):
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    # Print the link texts found inside every <div>
    for title_href in soup.find_all('div'):
        print([title.get_text()
               for title in title_href.find_all('a')])

craw2(url)
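The example above retrieves the content present in the page source, and it works even without setting the request headers.
However, the same code stops working when applied to a dynamic website.
On a dynamic page, the displayed content is often not delivered in the HTML itself; instead, JavaScript (or a similar mechanism) fetches the data from a database and renders it into the page.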
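To make the contrast concrete, here is a minimal sketch that parses two hypothetical HTML snippets: one where the titles are part of the markup, and one that only ships an empty container for a script to fill. The snippets, the `LinkTextCollector` class, and the `link_texts` helper are illustrative assumptions, and the sketch uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra installs.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect the visible text of every <a> element."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # greater than 0 while inside an <a> element
        self._buffer = []    # text pieces of the link being read
        self.texts = []      # finished, non-empty link texts

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._buffer).strip()
                if text:
                    self.texts.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def link_texts(html):
    """Return the text of all links found in an HTML string."""
    parser = LinkTextCollector()
    parser.feed(html)
    parser.close()
    return parser.texts


# Hypothetical static page: the titles are part of the markup.
STATIC_HTML = '<div><a href="/a">First title</a><a href="/b">Second title</a></div>'

# Hypothetical dynamic page: an empty container that a script fills later.
DYNAMIC_HTML = '<div id="app"></div><script src="/static/app.js"></script>'

print(link_texts(STATIC_HTML))   # ['First title', 'Second title']
print(link_texts(DYNAMIC_HTML))  # []
```

The second call returns an empty list because the titles never appear in the HTML that the server sends; they would only exist after the browser runs the script.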
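Example: crawling a dynamic web page:
Take https://www.infoq.cn/ as the example. Crawling the site's HTML directly does not return the article content we want. Right-click the page and view its source: the article content is not in the HTML, so the site can be identified as dynamic.
The approach below is taken from this article.
Find the URL that serves the content
Capture the traffic and note that the content is fetched with a POST request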
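As a rough illustration of what such a POST request carries, the sketch below form-encodes a payload with the standard library. The `form_body` helper is a hypothetical name; the payload is the one this post sends to the recommendation endpoint. `requests.post(url, data=...)` encodes a dict in this same `application/x-www-form-urlencoded` form and computes the `Content-Length` header from the resulting body.

```python
from urllib.parse import urlencode


def form_body(payload):
    """Encode a dict as an application/x-www-form-urlencoded request body."""
    return urlencode(payload)


body = form_body({"start": 33, "offset": 33})
print(body)                       # start=33&offset=33
print(len(body.encode("utf-8")))  # 18 (the Content-Length of this body)
```

Because the length depends on the payload, a `Content-Length` value copied from the browser is only valid for the exact body the browser sent.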
Crawl the content by simulating the client's POST request. First find the POST request URL:
import requests
from bs4 import BeautifulSoup

def test0():
    url = "https://www.infoq.cn/public/v1/config/getAdList"
    # Content-Length is omitted: requests computes it from the request body
    headers = {
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'keep-alive',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Cookie': 'replace with your own cookie',
        'Host': 'www.infoq.cn',
        'Origin': 'https://www.infoq.cn',
        'Referer': 'https://www.infoq.cn/public/v1/config/getAdList',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
    }
    # Named "response" so the standard re module is not shadowed
    response = requests.post(url, headers=headers, data={'start': 0})
    print(response.text)

def test1():
    url = "https://www.infoq.cn/public/v1/my/recommond"
    # Content-Length is omitted: requests computes it from the request body
    headers = {
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'keep-alive',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Cookie': 'replace with your own cookie',
        'Host': 'www.infoq.cn',
        'Origin': 'https://www.infoq.cn',
        'Referer': 'https://www.infoq.cn/public/v1/my/recommond',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
    }
    response = requests.post(url, headers=headers, data={'start': 33, 'offset': 33})
    print(response.text)
test0()
# Output
{'code': 0,
'data':
{'list':
[{'name': 'QCon',
'link': 'https://2019.qconbeijing.com/track?utm_source=infoq&utm_medium=banner&term=lusu',
'image': 'https://static001.geekbang.org/resource/image/97/70/97c80c4d51d01cbebebcf397e5e63f70.png'},
{'name': 'QCon廣州',
'link': 'https://qconguangzhou.geekbang.org/?utm_source=infoq&utm_medium=banner',
'image': 'https://static001.infoq.cn/resource/image/0a/60/0a7516e6e8ed699d14fcfaf412475960.jpg'},
{'name': '西云數據',
'link': 'https://www.bagevent.com/event/2356848?bag_track=Banner3',
'image': 'https://static001.geekbang.org/resource/image/0c/fd/0c3a140ba910744136d9fe0e726945fd.jpg'},
{'name': 'GMTC',
'link': 'https://gmtc2019.geekbang.org/?utm_source=infoq&utm_medium=banner&utm_campaign=7',
'image': 'https://static001.geekbang.org/resource/image/72/fa/726dfd8346cdf7a0de843b32df3ccbfa.jpg'},
{'name': 'GTLC',
'link': 'https://gtlc2019.geekbang.org/?utm_source=infoq&utm_medium=guanwang&utm_campaign=banner',
'image': 'https://static001.geekbang.org/resource/image/b6/87/b6375457793da7b7b5fbc9eba8530a87.jpg'},
{'name': '極客大學',
'link': 'https://time.geekbang.org/special/arithmetic?utm_source=infoq_web&utm_medium=banner',
'image': 'https://static001.geekbang.org/resource/image/53/dd/53c9dca735254ffa687984ffef0f93dd.jpg'},
{'name': '企業賬號',
'link': 'https://service.geekbang.org/goods/list?category=7&page=1#utm_source=website&utm_medium=infoq&utm_campaign=banner&utm_term=0221',
'image': 'https://static001.infoq.cn/resource/image/37/a7/375a5847c1f0d397d2280f23e077c6a7.jpg'}],
'offsets': 0},
'error': {},
'extra': {'cost': 0.001510902,
'request-id': '396b8defe24e713f8610038737671eba@2@infoq'}}
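Once the response is decoded into a dict (for example with `response.json()`), the ad entries can be pulled out with a few lines. This is a minimal sketch: the `ad_links` helper is a hypothetical name, the sample is a trimmed copy of the output above, and treating `code == 0` as success is an assumption based only on that output.

```python
def ad_links(payload):
    """Return (name, link) pairs from a response shaped like the output above."""
    # Assumption: code == 0 signals success, as in the sample output.
    if payload.get("code") != 0:
        raise ValueError(f"unexpected response code: {payload.get('code')!r}")
    return [(ad["name"], ad["link"]) for ad in payload["data"]["list"]]


# Trimmed copy of the response shown above (two of the seven entries).
sample = {
    "code": 0,
    "data": {
        "list": [
            {"name": "QCon",
             "link": "https://2019.qconbeijing.com/track?utm_source=infoq&utm_medium=banner&term=lusu",
             "image": "https://static001.geekbang.org/resource/image/97/70/97c80c4d51d01cbebebcf397e5e63f70.png"},
            {"name": "GMTC",
             "link": "https://gmtc2019.geekbang.org/?utm_source=infoq&utm_medium=banner&utm_campaign=7",
             "image": "https://static001.geekbang.org/resource/image/72/fa/726dfd8346cdf7a0de843b32df3ccbfa.jpg"},
        ],
        "offsets": 0,
    },
    "error": {},
}

for name, link in ad_links(sample):
    print(name, link)
```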
test1()
# Output
{'code': 0,
'data': [{'aid': 22233,
'article_cover': 'https://static001.geekbang.org/resource/image/e4/71/e4a978422d6c045c7a541b78bbca1f71.jpeg',
'article_cover_point': '{"big":{"point":{"x":0,"y":189,"w":2815,"h":1451}},"small":{"point":{"x":0,"y":0,"w":2816,"h":2091}},"width":2816,"height":2112}',
'article_sharetitle': '北大AI公開課2019 | 商湯科技沈徽:AI創新與落地',
'article_subtitle': '',
'article_summary': '北大AI公開課第四講如期開講,商湯科技集團副總裁、商業與數據洞察事業群總裁、工程院院長沈徽帶來了《AI創新與落地》的分享',
'article_title': '北大AI公開課2019 | 商湯科技沈徽:AI創新與落地',
'author':
[{'uid': 1277332,
'nickname': '蔡芳芳',
'avatar': ''}],
'ctime': 1552714233280,
'is_collect': False,
'no_author': '',
'publish_time': 1552716000000,
'score': 1552716000000,
'sub_author': [],
'sub_topic': [],
'topic':
[{'id': 31,
'name': 'AI'},
{'id': 3,
'name': '文化 & 方法'},
{'id': 127,
'name': '計算機視覺'}],
'type': 1,
'utime': 1552716005424,
'uuid': 'GoNSrsxpC0AT6a3V-6NQ',
'views': 0},
{'aid': 22232,
'article_cover': 'https://static001.geekbang.org/resource/image/ee/47/ee719a85e81ae073f79039ba9da3df47.jpg',
'article_cover_point': '{"big":{"point":{"x":0,"y":309,"w":5184,"h":2672}},"small":{"point":{"x":506,"y":332,"w":4040,"h":3001}},"width":5184,"height":3456}',
'article_sharetitle': '3·15曝光丨智能機器人一年撥打40億個騷擾電話,6億人信息已遭泄露!',
'article_subtitle': '',
'article_summary': '在昨晚的315晚會上,一條探針盒子+數據匹配+智能外呼機器人的灰色產業鏈遭到曝光。據報道,智能外呼機器人一年撥打電話可達40多億個,探針盒子公司收集有全國6億用戶的各類信息!',
'article_title': '3·15曝光丨智能機器人一年撥打40億個騷擾電話,6億人信息已遭泄露!',
'author':
[{'uid': 1000106,
'nickname': '小智',
'avatar': 'https://static001.geekbang.org/account/avatar/00/0f/42/aa/b9a67c2e.jpg'}],
'ctime': 1552698015145,
'is_collect': False,
'no_author': '',
'publish_time': 1552698000000,
'score': 1552698000000,
'sub_author': [],
'sub_topic': [],
'topic': [{'id': 21,
'name': '安全'},
{'id': 15,
'name': '大數據'},
{'id': 148,
'name': '信息泄露'}],
'type': 1,
'utime': 1552698015145,
'uuid': 'NgG*uSOwwI2OVhA80o0G',
'views': 0},
{'aid': 22230,
'article_cover': 'https://static001.geekbang.org/resource/image/30/6f/3087f86b2cbe7b3222c3bb5d7557126f.jpg',
'article_cover_point': '{"big":{"point":{"x":0,"y":0,"w":1279,"h":659}},"small":{"point":{"x":51,"y":0,"w":1148,"h":852}},"width":1280,"height":853}',
'article_sharetitle': '這可能是人工智能領域覆蓋最全的一份技術趨勢報告',
'article_subtitle': '',
'article_summary': '這份報告對AI領域的技術預測可謂面面俱到,無論是對于AI企業、研究者,還是AI學習者來說都有一定參考價值',
'article_title': '這可能是人工智能領域覆蓋最全的一份技術趨勢報告',
'author': [{'uid': 1462160,
'nickname': '未來今日研究所',
'avatar': ''}],
'ctime': 1552698005153,
'is_collect': False,
'no_author': '',
'publish_time': 1552698000000,
'score': 1552698000000,
'sub_author': [],
'sub_topic': [],
'topic':
[{'id': 31,
'name': 'AI'},
{'id': 1,
'name': '語言 & 開發'},
{'id': 45,
'name': '物聯網'}],
'translator': [{'uid': 1282296,
'nickname': 'Debra',
'avatar': ''}],
'type': 1,
'utime': 1552698005153,
'uuid': 'A315uodoMbWrNrZh*MzP',
'views': 0},
......
{'aid': 22225,
'article_cover': 'https://static001.geekbang.org/resource/image/6f/a2/6f3b4ff25e9e2e926e8e2bd4436f96a2.jpg',
'article_cover_point': '{"big":{"point":{"x":0,"y":10,"w":1000,"h":515}},"small":{"point":{"x":136,"y":12,"w":761,"h":565}},"width":1000,"height":600}',
'article_sharetitle': 'Google 和 Facebook 披露全球范圍宕機原因',
'article_subtitle': '',
'article_summary': '昨日,Google、Facebook兩巨頭在同一天相繼發生全球大規模宕機,其中Facebook的斷電時常更是超過10小時之久。',
'article_title': 'Google 和 Facebook 披露全球范圍宕機原因',
'author':
[{'uid': 1278039,
'nickname': '張嬋',
'avatar': ''}],
'ctime': 1552642208366,
'is_collect': False,
'no_author': '',
'publish_time': 1552642207254,
'score': 1552642207254,
'sub_author': [],
'sub_topic': [],
'topic':
[{'id': 3,
'name': '文化 & 方法'},
{'id': 147,
'name': '企業動態'},
{'id': 48,
'name': '方法論'}],
'type': 1,
'utime': 1552642208366,
'uuid': 'e-NCah5RTmJMrvmrmbCU',
'views': 0}],
'error': {},
'extra':
{'cost': 0.028648169,
'request-id': '1ceebf53fd24003586f1272cc881b7fd@2@infoq'}}
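The article list can be reduced to the fields that usually matter for a crawler. The sketch below is written under stated assumptions: `summarize_articles` is a hypothetical helper, the sample is a trimmed copy of the first article above, and `publish_time` is assumed to be milliseconds since the Unix epoch (the values are consistent with March 2019, but the API's documentation was not consulted).

```python
from datetime import datetime, timezone


def summarize_articles(payload):
    """Return id, title, publish time (UTC, ISO 8601) and topic names per article."""
    rows = []
    for article in payload["data"]:
        # Assumption: publish_time is in milliseconds since the Unix epoch.
        published = datetime.fromtimestamp(
            article["publish_time"] / 1000, tz=timezone.utc
        )
        rows.append({
            "aid": article["aid"],
            "title": article["article_title"],
            "published": published.isoformat(),
            "topics": [topic["name"] for topic in article.get("topic", [])],
        })
    return rows


# Trimmed copy of the first article in the response shown above.
sample = {
    "code": 0,
    "data": [
        {"aid": 22233,
         "article_title": "北大AI公開課2019 | 商湯科技沈徽:AI創新與落地",
         "publish_time": 1552716000000,
         "topic": [{"id": 31, "name": "AI"},
                   {"id": 3, "name": "文化 & 方法"},
                   {"id": 127, "name": "計算機視覺"}]},
    ],
}

for row in summarize_articles(sample):
    print(row["aid"], row["published"], row["title"], row["topics"])
```

For the sample entry, the publish time comes out as `2019-03-16T06:00:00+00:00`, which is 14:00 on the same day in Beijing time.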