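Crawler workflow (how a crawler works): url -> html -> model (data cleaning) -> analysis
- Required packages
  requests // sends the request and fetches the page
  pyquery // a Python implementation of jQuery; it wraps the response content in a PyQuery object so that CSS selectors can be used to analyze the page
- Fetch the page data
- Loop over the URLs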
import os
import requests
from pyquery import PyQuery as pq
The as clause gives the imported name a shorter alias.
class Model(object):
    def __repr__(self):
        name = self.__class__.__name__
        properties = ('{}=({})'.format(k, v) for k, v in self.__dict__.items())
        s = '\n<{} \n {}>'.format(name, '\n '.join(properties))
        return s
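- Model is the base class that shapes the scraped data. Note the return at the end of __repr__(): it hands back the formatted string, so printing an object shows its actual data; otherwise the printed objects show only their type. Chapter 3 of the socket notes has a screenshot.
- __repr__() does not need to be called explicitly. print calls it automatically when it outputs the object; methods like this are also known as magic methods.
- Class attributes: __class__.__name__ returns the class name; __dict__ returns the instance's attributes as a dictionary.
- Use of (): the parentheses around the for expression create a generator expression.
- The three \n: every string has a join() method whose argument is the sequence of elements to join, so '\n '.join(properties) puts each property on its own line.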
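The notes above can be tried out in isolation. The following is a minimal sketch: the Model class repeats the one defined earlier, while the Point subclass and its x and y fields are hypothetical names used only for illustration.

```python
class Model(object):
    def __repr__(self):
        name = self.__class__.__name__
        # The parentheses make this a generator expression:
        # each 'key=(value)' string is produced lazily for join().
        properties = ('{}=({})'.format(k, v) for k, v in self.__dict__.items())
        return '\n<{} \n {}>'.format(name, '\n '.join(properties))


class Point(Model):  # hypothetical subclass, for illustration only
    def __init__(self, x, y):
        self.x = x
        self.y = y


p = Point(1, 2)
print(p.__class__.__name__)  # the class name
print(p.__dict__)            # the instance attributes as a dict
print([p])                   # printing a list calls __repr__ on each element
```

Because the subclass inherits __repr__(), every model defined this way prints its fields without any extra code.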
class Movie(Model):
    def __init__(self):
        self.name = ''
        self.score = 0
        self.quote = ''
        self.cover_url = ''
        self.ranking = 0
These attributes (fields) store the scraped data.
def movie_from_div(div):
    e = pq(div)
    m = Movie()
    m.name = e('.title').text()
    m.score = e('.rating_num').text()
    m.quote = e('.inq').text()
    m.cover_url = e('img').attr('src')
    m.ranking = e('.pic em').text()
    return m
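Every time you want to run CSS selectors, wrap the target with pq() first. In movies_from_url() the wrapped object is the whole page; here it is a single div, so selectors match only elements inside that div.
Use the .text() method to get an element's text.
Use the .attr() method to get an attribute.
If the target element has no class or id, select it through its parent element.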
def movies_from_url(url):
    r = requests.get(url)
    page = r.content
    e = pq(page)
    items = e('.item')
    movies = [movie_from_div(i) for i in items]
    return movies
requests.get() downloads the page at url, and the content attribute holds the page body (the HTML); these two steps download the page.
pq(page) returns an object that supports CSS selector syntax.
def main():
    url = 'https://movie.douban.com/top250'
    movies = movies_from_url(url)
    print('top250 movies', movies)


if __name__ == '__main__':
    main()
By observing how the URL changes, you can crawl multiple pages.
def main():
    # Click "next page" on the site, watch how the URL changes, and find the pattern
    for i in range(0, 250, 25):
        url = 'https://movie.douban.com/top250?start={}'.format(i)
        movies = movies_from_url(url)
        print('top250 movies', movies)
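The URL pattern used in the loop above can be isolated and inspected without touching the network. This is a small sketch; page_urls and its total and per_page parameters are hypothetical names introduced here.

```python
def page_urls(total=250, per_page=25):
    """Yield one list-page URL per page, following the start=N pattern."""
    base = 'https://movie.douban.com/top250?start={}'
    for start in range(0, total, per_page):
        yield base.format(start)


urls = list(page_urls())
print(len(urls))    # number of pages
print(urls[0])      # first page
print(urls[-1])     # last page
```

Keeping URL generation separate from downloading makes it easy to confirm the pattern before sending any requests.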
Basic crawler: saving the data to MongoDB
import os
import requests
from pyquery import PyQuery as pq
from pymongo import MongoClient


class Model(object):
    # Connects to the local MongoDB server with default settings
    db = MongoClient().web16_4_pachong

    def __repr__(self):
        name = self.__class__.__name__
        properties = ('{0} : ({1})'.format(k, v) for k, v in self.__dict__.items())
        s = '\n<{0} \n {1}>'.format(name, '\n '.join(properties))
        return s

    def save(self):
        # One collection per model class, named after the class
        name = self.__class__.__name__
        # Collection.save() was removed in PyMongo 4; insert_one() stores the document
        _id = self.db[name].insert_one(self.__dict__).inserted_id


class Movie(Model):
    @classmethod
    def valid_names(cls):
        names = [
            # (field name, type, default value)
            ('name', str, ''),
            ('score', int, 0),
            ('quote', str, ''),
            ('cover_url', str, ''),
            ('ranking', int, 0),
        ]
        return names


def movie_from_div(div):
    e = pq(div)
    m = Movie()
    m.name = e('.title').text()
    m.score = e('.rating_num').text()
    m.quote = e('.inq').text()
    m.cover_url = e('img').attr('src')
    m.ranking = e('.pic em').text()
    m.save()
    return m


def movies_from_url(url):
    r = requests.get(url)
    page = r.content
    e = pq(page)
    items = e('.item')
    movies = [movie_from_div(i) for i in items]
    return movies


def main():
    url = 'https://movie.douban.com/top250'
    movies = movies_from_url(url)
    print('top250 movies', movies)


if __name__ == '__main__':
    main()
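The valid_names() table lists a type and a default for each field, but the code above stores every value as the text returned by the selectors. One way such a table might be used is sketched below. The coerce_fields function and the FIELDS list are hypothetical, and score is declared as float here on the assumption that ratings are decimal strings such as '9.7'; the original table declares it as int.

```python
def coerce_fields(raw, valid_names):
    """Return a dict whose values are converted to the declared types."""
    cleaned = {}
    for name, type_, default in valid_names:
        value = raw.get(name, default)
        try:
            cleaned[name] = type_(value)
        except (TypeError, ValueError):
            cleaned[name] = default  # fall back when the text cannot be converted
    return cleaned


# Assumption: score is a float here; the original table uses int.
FIELDS = [
    ('name', str, ''),
    ('score', float, 0.0),
    ('quote', str, ''),
    ('cover_url', str, ''),
    ('ranking', int, 0),
]

raw = {'name': 'Example Movie', 'score': '9.7',
       'cover_url': 'cover.jpg', 'ranking': '1'}
movie = coerce_fields(raw, FIELDS)
print(movie)
```

Missing fields take their default, and text that cannot be converted also falls back to the default rather than raising, which keeps one malformed entry from stopping the whole crawl.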