I haven't written many crawlers lately, mainly because I don't know how to put the scraped data to use. Today I'll just post a simple one.
import time
from urllib.parse import quote

import pymongo
import requests

# Connect to a local MongoDB instance and select the douban.movie collection
client = pymongo.MongoClient('localhost', 27017)
douban = client['douban']
movie = douban['movie']
tag_list = ['熱門', '最新', '經典', '可播放', '豆瓣高分', '冷門佳片', '華語', '歐美', '韓國', '日本', '動作',
            '喜劇', '愛情', '科幻', '懸疑', '恐怖', '成長']
# One URL per (tag, offset) pair: offsets 0, 20, ..., 480 with 20 results per page
url_list = ['https://movie.douban.com/j/search_subjects?type=movie&tag={}&'
            'sort=recommend&page_limit=20&page_start={}'.format(quote(tag), page)
            for tag in tag_list for page in range(0, 500, 20)]
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'}
def get_item(url):
    """Fetch one page of search results and store each movie in MongoDB."""
    r = requests.get(url, headers=headers)
    wb_data = r.json()
    # Nothing is stored when 'subjects' is empty or missing
    for value in wb_data.get('subjects', []):
        data = {
            'title': value['title'],
            'id': value['id'],
            'url': value['url'],
            'images': value['cover'],
            'rate': value['rate']
        }
        movie.insert_one(data)
for url in url_list:
    get_item(url)
    time.sleep(1)  # pause for a second between requests

# Total number of documents stored in the collection
print(movie.count_documents({}))
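The request URLs in the script are assembled with string formatting and `quote`. As a rough alternative sketch, the same URLs can be built with `urlencode` from the standard library, which percent-encodes the Chinese tag and joins the parameters. The names `BASE_URL` and `build_urls` are made up for this illustration and are not part of the script above.

```python
from urllib.parse import urlencode

# Hypothetical constant for this sketch: the endpoint used in the script above
BASE_URL = 'https://movie.douban.com/j/search_subjects'


def build_urls(tags, max_start=500, page_size=20):
    """Return one search URL per (tag, page offset) pair."""
    urls = []
    for tag in tags:
        for start in range(0, max_start, page_size):
            # urlencode percent-encodes non-ASCII values such as the tag
            query = urlencode({
                'type': 'movie',
                'tag': tag,
                'sort': 'recommend',
                'page_limit': page_size,
                'page_start': start,
            })
            urls.append(f'{BASE_URL}?{query}')
    return urls


urls = build_urls(['科幻', '日本'])
print(len(urls))  # 2 tags x 25 offsets
print(urls[0])
```

With the defaults, each tag yields 25 URLs (offsets 0 through 480), so the 17 tags in the script give 425 requests in total.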
The crawl only collected a few thousand records, and some of them are duplicates. Plenty of shortcomings; still learning.
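The duplicates presumably come from the same film showing up under more than one tag. One possible way to deal with them, sketched here in plain Python, is to keep only the first record seen for each `id` before storing anything. The helper `dedupe_by_id` and the sample records are invented for this example and are not real crawl output.

```python
def dedupe_by_id(records):
    """Keep the first record seen for each 'id', preserving order."""
    seen = set()
    unique = []
    for record in records:
        if record['id'] in seen:
            continue  # this movie was already collected under another tag
        seen.add(record['id'])
        unique.append(record)
    return unique


# Made-up sample data: the movie with id '1' appears twice
sample = [
    {'id': '1', 'title': 'A', 'rate': '8.0'},
    {'id': '2', 'title': 'B', 'rate': '7.5'},
    {'id': '1', 'title': 'A', 'rate': '8.0'},
]
unique = dedupe_by_id(sample)
print([r['id'] for r in unique])
```

An alternative that keeps the deduplication in MongoDB would be a unique index on `id` combined with `update_one(..., upsert=True)` in place of `insert_one`. I have not tried that here.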