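This is my first small demo while learning to write web scrapers in Python. Here are some notes for later review.
A Python scraper can be split into two parts: 1. requesting the content of the target page; 2. processing the content that comes back.
I give the notes for each step first, then the complete source code at the end.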
I. Import the required libraries
import requests
from lxml import etree
II. Request the target page content
1. Set the request headers
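- Some sites have anti-scraping measures. Setting request headers can get past some of them, so that the data is returned successfully.
- In most cases, set User-Agent and Referer. If those two are not enough to get past a site's anti-scraping measures, use Chrome's developer tools to see which headers the browser sends and set more of them.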
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    "Referer": "https://www.douban.com/"
}
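To check which headers would actually be sent, without touching the network, one option is to build and prepare the request by hand. This is only a minimal sketch: the shortened User-Agent string and the variable names `request` and `prepared` are illustrative assumptions, not part of the original demo.

```python
import requests

# Illustrative headers; the User-Agent value is shortened for readability.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://www.douban.com/",
}
douban_url = "https://movie.douban.com/cinema/nowplaying/shanghai/"

# Build the request object and prepare it. Nothing is sent at this point.
request = requests.Request("GET", douban_url, headers=headers)
prepared = request.prepare()

# The prepared headers are stored in a case-insensitive mapping.
for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

Preparing the request by hand skips the defaults that a requests session would normally merge in, so only the headers set explicitly show up here.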
2. Request the page content
douban_url = "https://movie.douban.com/cinema/nowplaying/shanghai/"
response = requests.get(douban_url, headers=headers)
douban_text = response.text
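The request can fail or be rejected even with headers set, so it may help to check the response before parsing it. The sketch below wraps the request in a hypothetical helper, `fetch_html`. The name, the 10-second timeout, and the injectable `get` parameter are all assumptions made for illustration; `get` only exists so the helper can be tried without network access.

```python
import requests


def fetch_html(url, headers=None, timeout=10, get=requests.get):
    """Fetch a page and return its text, raising on HTTP error statuses.

    `get` defaults to requests.get; a stand-in can be passed for offline tests.
    """
    response = get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses,
    # so a blocked request fails here instead of later during parsing.
    response.raise_for_status()
    return response.text
```

With this helper, the step could be written as `douban_text = fetch_html(douban_url, headers=headers)`.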
III. Process the requested content
- This step uses lxml with XPath to collect each movie's information into a dictionary, and finally stores all the movies in a list.
# Parse the HTML text into an element tree
html_element = etree.HTML(douban_text)
# The first <ul class="lists"> holds the movies that are now playing
ul = html_element.xpath('//ul[@class="lists"]')[0]
lis = ul.xpath('./li')
movies = []
for li in lis:
    # Most fields are stored as data-* attributes on each <li>
    title = li.xpath('./@data-title')[0]
    score = li.xpath('./@data-score')[0]
    star = li.xpath('./@data-star')[0]
    duration = li.xpath('./@data-duration')[0]
    region = li.xpath('./@data-region')[0]
    director = li.xpath('./@data-director')[0]
    actors = li.xpath('./@data-actors')[0]
    # The poster URL comes from the <img> inside the <li>
    post = li.xpath('.//img/@src')[0]
    movie = {
        "title": title,
        "score": score,
        "star": star,
        "duration": duration,
        "region": region,
        "director": director,
        "actors": actors,
        "post": post
    }
    movies.append(movie)
for movie in movies:
    print(movie)
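Indexing an XPath result with `[0]` raises IndexError when an attribute is missing from a list item. The sketch below shows one way to guard against that, using a small inline HTML sample instead of the live page. The sample markup, the helper name `first`, and the "n/a" default are assumptions for illustration only.

```python
from lxml import etree

# Hypothetical sample markup that imitates the structure the scraper expects.
SAMPLE_HTML = """
<ul class="lists">
  <li data-title="Movie A" data-score="8.1"><img src="https://example.com/a.jpg"/></li>
  <li data-title="Movie B"><img src="https://example.com/b.jpg"/></li>
</ul>
"""


def first(values, default=""):
    """Return the first XPath match, or a default when nothing matched."""
    return values[0] if values else default


root = etree.HTML(SAMPLE_HTML)
sample_movies = []
for li in root.xpath('//ul[@class="lists"]/li'):
    sample_movies.append({
        "title": first(li.xpath("./@data-title")),
        # The second <li> has no data-score attribute, so the default is used.
        "score": first(li.xpath("./@data-score"), default="n/a"),
        "post": first(li.xpath(".//img/@src")),
    })

for sample_movie in sample_movies:
    print(sample_movie)
```

The same helper could wrap each `li.xpath(...)` call in the loop, so that a single list item with incomplete data does not stop the whole run.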
IV. Complete code
# Import the required libraries
import requests
from lxml import etree

# 1. Request the content of the target page
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    "Referer": "https://www.douban.com/"
}
douban_url = "https://movie.douban.com/cinema/nowplaying/shanghai/"
response = requests.get(douban_url, headers=headers)
douban_text = response.text

# 2. Process the scraped data
html_element = etree.HTML(douban_text)
ul = html_element.xpath('//ul[@class="lists"]')[0]
lis = ul.xpath('./li')
movies = []
for li in lis:
    title = li.xpath('./@data-title')[0]
    score = li.xpath('./@data-score')[0]
    star = li.xpath('./@data-star')[0]
    duration = li.xpath('./@data-duration')[0]
    region = li.xpath('./@data-region')[0]
    director = li.xpath('./@data-director')[0]
    actors = li.xpath('./@data-actors')[0]
    post = li.xpath('.//img/@src')[0]
    movie = {
        "title": title,
        "score": score,
        "star": star,
        "duration": duration,
        "region": region,
        "director": director,
        "actors": actors,
        "post": post
    }
    movies.append(movie)

for movie in movies:
    print(movie)