Python Crawler 06: Analyzing Ajax Requests to Scrape Toutiao Street-Snap Photos into MongoDB

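Yesterday I learned how to analyze Ajax requests in order to scrape dynamically loaded content, and today I am sharing the results.
Ajax is a technique, not a language. It sends requests to the server in the background (historically exchanging XML, hence the name) and then uses JavaScript to render the page, so the content changes while the page URL stays the same. In short, it is an asynchronous loading technique.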

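More and more sites use this technique, and separating the front end from the back end is a major trend in web development. As a result, the page source we get back from requests may contain little more than an empty <body></body> tag, with the whole page rendered by JavaScript. That makes scraping the data harder.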

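When analyzing Ajax, pay attention to the parameters that are sent. If they are too complex, skip the Ajax analysis and use Selenium together with ChromeDriver to grab the fully rendered page directly: what you see is what you get.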

I tried Weibo, but there were too many parameters and I could not work out the pattern.

So today I will use Toutiao's street-snap search as the example for Ajax-based scraping.

First open Toutiao, type 街拍 (street snap) into the search box, and press Enter to search:

image.png

This takes you to the following page:

image.png

Next open the developer tools, click the Network tab, select the XHR filter, and refresh the page. Keep scrolling down and you will see something like this:

image.png

Click the first entry to see that request's headers, response, and other details:

image.png

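Look at the Request URL: this is the link the page requests as we scroll down. Clicking through the next few entries shows that only offset and timestamp change; the other parameters stay the same. offset is the pagination offset and increases by 20 each time, while timestamp is the integer part of our computer's clock time multiplied by 1000 (the Unix time in milliseconds), that is: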
time.time()*1000//1
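
As a quick check of the timestamp expression above, here is a minimal sketch. The helper name current_timestamp_ms is my own and does not appear in the crawler; it only wraps the same calculation and shows that the floor division by 1 is not strictly needed once int() is applied.

```python
import time


def current_timestamp_ms():
    """Return the current Unix time in milliseconds as an int."""
    # int() already truncates the fractional part, so `//1` is optional here
    return int(time.time() * 1000)


ts = current_timestamp_ms()
print(ts)            # a millisecond timestamp, currently 13 digits long
print(type(ts))      # <class 'int'>
```

Either form should be accepted by the endpoint, since both produce the same integer for positive clock values.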

So we can build the parameters for requesting one page:

    params = {'aid': '24',
              'app_name': 'web_search',
              'offset': offset,
              'format': 'json',
              'keyword': '街拍',
              'autoload': 'true',
              'count': '20',
              'en_qc': '1',
              'cur_tab': '1',
              'from': 'search_tab',
              'pd': 'synthesis',
              'timestamp': int(time.time()*1000//1)
              }

We then encode them with urlencode() from urllib.parse, append the result to the base URL to form the request URL, request the page, and return the response:

def get_page(offset):
    '''Fetch one page of Toutiao search results as parsed JSON.'''
    params = {'aid': '24',
              'app_name': 'web_search',
              'offset': offset,
              'format': 'json',
              'keyword': '街拍',
              'autoload': 'true',
              'count': '20',
              'en_qc': '1',
              'cur_tab': '1',
              'from': 'search_tab',
              'pd': 'synthesis',
              'timestamp': int(time.time()*1000//1)
              }
    headers = {
        'Accept': 'application/json, text/javascript',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-Hans-CN, zh-Hans; q=0.5',
        'Cache-Control': 'max-age=0',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Host': 'www.toutiao.com',
        'Referer': 'https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36(KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763',
        'X-Requested-With': 'XMLHttpRequest'
    }
    base_url = 'https://www.toutiao.com/api/search/content/?'
    url=base_url+urlencode(params)
    try:
        response = requests.get(url,headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print("error", e.args)
    # Non-200 status or connection failure: tell the caller there is no data
    return None
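
To make the URL-building step concrete, the sketch below runs urlencode() on a trimmed-down parameter dictionary. Only two of the parameters are kept, purely for illustration; the real request uses the full set shown in get_page().

```python
from urllib.parse import urlencode

base_url = 'https://www.toutiao.com/api/search/content/?'

# Trimmed to two fields for illustration; the crawler sends many more
params = {'offset': 20, 'keyword': '街拍'}

query = urlencode(params)   # percent-encodes the UTF-8 bytes of non-ASCII values
url = base_url + query

print(query)   # offset=20&keyword=%E8%A1%97%E6%8B%8D
print(url)
```

The encoded keyword matches the one visible in the Referer header. An alternative would be to pass the dictionary through the params argument of requests.get() (with the trailing ? removed from the base URL) and let requests do the encoding.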

Next comes parsing the JSON data we get back. Looking at the Preview tab, we find that the article titles and street-snap images we want are all stored in data:

image.png

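The title field is the title we want, and the images are in image_list. There may only be one or two images per entry, but we will not dig into each article to find more; we simply save these few. That gives the following parsing function: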

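def parse_page(page):
    '''Yield a {'title', 'image'} dict for every image in one page of results.'''
    # get_page() returns None on failure, so guard before calling .get()
    if not page or not page.get('data'):
        return
    for item in page.get('data'):
        try:
            title = item.get('title')
            images = item.get('image_list')
        except AttributeError:
            # Skip entries that are not dictionaries
            continue
        if title is None or images is None:
            continue
        for image in images:
            yield {
                'title': title,
                'image': image.get('url')
            }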

This returns a generator object, which is memory-efficient and convenient to use.
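
To try the parsing logic without touching the network, here is a small sketch that feeds a hand-written dictionary through an equivalent generator. The sample payload is hypothetical: it only mimics the data / title / image_list shape described above and is not real Toutiao output. The name iter_images is also my own.

```python
def iter_images(page):
    """Yield one {'title', 'image'} dict per image in a search response."""
    for item in (page or {}).get('data') or []:
        title = item.get('title')
        images = item.get('image_list')
        if title is None or images is None:
            continue  # entries lacking a title or an image list are skipped
        for image in images:
            yield {'title': title, 'image': image.get('url')}


# Hypothetical sample shaped like the response described in the text
sample = {
    'data': [
        {'title': 'demo post',
         'image_list': [{'url': 'http://example.com/a.jpg'},
                        {'url': 'http://example.com/b.jpg'}]},
        {'title': 'no images here'},                          # skipped
        {'image_list': [{'url': 'http://example.com/c.jpg'}]},  # skipped
    ]
}

results = list(iter_images(sample))
for row in results:
    print(row)
```

Because it is a generator, nothing is parsed until the caller iterates, and a failed request (None) simply yields nothing.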

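Now that we have the title and the image URLs, we can start saving images. Here I use each image's MD5 hash as its file name, which removes duplicates (with this few images you could skip it). After that, each record is saved to a MongoDB database, which turns out to be quite pleasant to use.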

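def save_img(item):
    '''Download one image into a folder named after the article title.'''
    title = item.get('title')
    image = item.get('image')
    # exist_ok avoids a race when several worker processes create the same folder
    os.makedirs(title, exist_ok=True)
    try:
        response = requests.get(image)
        if response.status_code == 200:
            # Name the file after the MD5 of its bytes so identical images map to one file
            file_path = "{0}/{1}.{2}".format(title,
                                             md5(response.content).hexdigest(),
                                             'jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(response.content)
            else:
                print("Image already exists", file_path)
    except requests.ConnectionError:
        print("Failed to save image")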
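
The MD5-based naming can be checked in isolation. The sketch below is an assumption-laden stand-in, not the crawler's own function: save_bytes is a hypothetical helper that writes raw bytes into a temporary folder, so no download is needed to see the deduplication at work.

```python
import os
import tempfile
from hashlib import md5


def save_bytes(folder, content, ext='jpg'):
    """Write content to folder/<md5>.<ext>; return (path, written)."""
    os.makedirs(folder, exist_ok=True)
    name = '{0}.{1}'.format(md5(content).hexdigest(), ext)
    path = os.path.join(folder, name)
    if os.path.exists(path):
        return path, False      # same bytes were already saved
    with open(path, 'wb') as f:
        f.write(content)
    return path, True


folder = tempfile.mkdtemp()
first = save_bytes(folder, b'fake image bytes')
second = save_bytes(folder, b'fake image bytes')   # duplicate content
third = save_bytes(folder, b'other bytes')

print(first[1], second[1], third[1])   # True False True
print(len(os.listdir(folder)))         # 2
```

Identical bytes always hash to the same name, so the second write is skipped even if the image came from a different URL.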

Saving to the database:

def insert_into_mongodb(item, collection):
    '''Insert one item (a dict) into the given collection.'''
    result = collection.insert_one(item)
    print(result)

The main() function takes an offset value and then fetches the page, parses it, saves the images, and stores the records in the database:

def main(offset):
    print("main",offset)
    client = MongoClient('mongodb://localhost:27017')
    db = client.toutiao
    collection = db.jiepai
    page = get_page(offset)
    for item in parse_page(page):
        print(item)
        save_img(item)
        insert_into_mongodb(item,collection)

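This time I managed to use multiprocessing, implemented with a Pool() process pool, though it was not entirely smooth: running the multiprocessing code inside PyCharm hangs, while running it from cmd (that is, double-clicking the file) works fine, which is odd. The multiprocessing code also has to sit under if __name__ == '__main__': to run correctly, because on Windows each worker process re-imports the main module: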

GROUP_START = 0
GROUP_STOP = 20
if __name__ == '__main__':
    freeze_support()
    pool = Pool()
    group = ([x*20 for x in range(GROUP_START, GROUP_STOP+1)])
    print(group)
    pool.map(main, group)
    pool.close()
    pool.join()
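
For reference, this is what the offset list handed to pool.map() looks like. The constant PAGE_SIZE is a name introduced here for illustration; it stands for the literal 20 in the list comprehension, which matches the count parameter.

```python
GROUP_START = 0
GROUP_STOP = 20
PAGE_SIZE = 20  # illustrative name for the step of 20 used above

# range() excludes its end, hence GROUP_STOP + 1 to include the last group
offsets = [x * PAGE_SIZE for x in range(GROUP_START, GROUP_STOP + 1)]

print(len(offsets))    # 21
print(offsets[:3])     # [0, 20, 40]
print(offsets[-1])     # 400
```

So one run asks for 21 pages, with offsets from 0 to 400, and each offset becomes one call to main() in a worker process.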

And here is the result of the run. Because the cmd window closes as soon as the script finishes, I added an input() call to keep it open until I close it.

image.png

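But in the end it still exited right away... Because I had just run it once before, the images show up as duplicates; if you run it tomorrow it should go smoothly.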

Here is what was scraped: 155 records in total across yesterday's and today's runs:

Folders:

image.png

The first picture turned out to be Zhu Yilong...
Database:

image.png

All in all, a success!

Keep it up!

The full code follows:

import os
import time
from hashlib import md5
from multiprocessing import Pool, freeze_support
from urllib.parse import urlencode

import requests
from pymongo import MongoClient


def get_page(offset):
    '''Fetch one page of Toutiao search results as parsed JSON.'''
    params = {'aid': '24',
              'app_name': 'web_search',
              'offset': offset,
              'format': 'json',
              'keyword': '街拍',
              'autoload': 'true',
              'count': '20',
              'en_qc': '1',
              'cur_tab': '1',
              'from': 'search_tab',
              'pd': 'synthesis',
              'timestamp': int(time.time()*1000//1)
              }
    headers = {
        'Accept': 'application/json, text/javascript',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-Hans-CN, zh-Hans; q=0.5',
        'Cache-Control': 'max-age=0',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Host': 'www.toutiao.com',
        'Referer': 'https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36(KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763',
        'X-Requested-With': 'XMLHttpRequest'
    }
    base_url = 'https://www.toutiao.com/api/search/content/?'
    url=base_url+urlencode(params)

    try:
        response = requests.get(url,headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print("error", e.args)
    # Non-200 status or connection failure: tell the caller there is no data
    return None

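def parse_page(page):
    '''Yield a {'title', 'image'} dict for every image in one page of results.'''
    # get_page() returns None on failure, so guard before calling .get()
    if not page or not page.get('data'):
        return
    for item in page.get('data'):
        try:
            title = item.get('title')
            images = item.get('image_list')
        except AttributeError:
            # Skip entries that are not dictionaries
            continue
        if title is None or images is None:
            continue
        for image in images:
            yield {
                'title': title,
                'image': image.get('url')
            }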

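def save_img(item):
    '''Download one image into a folder named after the article title.'''
    title = item.get('title')
    image = item.get('image')
    # exist_ok avoids a race when several worker processes create the same folder
    os.makedirs(title, exist_ok=True)
    try:
        response = requests.get(image)
        if response.status_code == 200:
            # Name the file after the MD5 of its bytes so identical images map to one file
            file_path = "{0}/{1}.{2}".format(title,
                                             md5(response.content).hexdigest(),
                                             'jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(response.content)
            else:
                print("Image already exists", file_path)
    except requests.ConnectionError:
        print("Failed to save image")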

def insert_into_mongodb(item, collection):
    result = collection.insert_one(item)
    print(result)

def main(offset):
    print("main",offset)
    client = MongoClient('mongodb://localhost:27017')
    db = client.toutiao
    collection = db.jiepai
    page = get_page(offset)
    for item in parse_page(page):
        print(item)
        save_img(item)
        insert_into_mongodb(item,collection)



GROUP_START = 0
GROUP_STOP = 20
if __name__ == '__main__':
    freeze_support()
    pool = Pool()
    group = ([x*20 for x in range(GROUP_START, GROUP_STOP+1)])
    print(group)
    pool.map(main, group)
    pool.close()
    pool.join()
    input()

Before running it, make sure the required libraries are installed, along with MongoDB and a GUI tool for viewing it.

This crawler turned out really well: the pieces of code are loosely coupled, so it is easy to maintain!