Today is April 13.
Yesterday I finished the conference paper, well enough to call the task done, and submitted it. I still haven't found an internship, so for the next while I'll probably spend whole days in the lab learning Python, especially since it has been pouring rain for more than a week and I can't go anywhere (as if I would actually go out and wander around even if it weren't raining). I had thought about asking Professor Song whether there is a project I could join, but I was worried it would hold back my Python study, so for the next half month I'll just finish this course first. Also, my laptop really can't handle PyCharm, so I'm waiting for the notebook from home to be shipped over; and I have to mention I'm also waiting for news about the submitted paper. Wish there is a good result.
As usual, before pasting the code, here is a summary of what I newly learned in practice and the problems I ran into.
(1) The shortcut Ctrl+/ comments out multiple lines at once; after selecting lines, Tab indents the whole selection and Shift+Tab outdents it.
(2) Note that select('') and split('') both return lists, so you have to index the result with [number] afterwards (see the first sketch after this list).
(3) X.stripped_strings strips the spaces and blank lines from the text inside tag X; note that it returns a generator, so wrap it with list().
(4) When there are several possible cases, it is best to branch with if statements. To test whether a particular string s1 is contained in another string s2, use if s1 in s2.
(5) Pay attention to whether the data you want is embedded in the page itself or comes back as JSON from a separate request; JSON responses are usually dict-like. For JS-loaded data such as view counts, first find the relevant request under Network - JS in the browser's developer tools, then parse it.
The parsing goes like this: pull out the id of the page you are querying, substitute it with format() into the corresponding JS endpoint to build a new URL; then request it with requests.get() just like an ordinary page; finally, since the JS response body is a plain string, take js.text and cut out what you want with split() or a similar method.
One more thing to watch out for: when requesting JS data, remember to add headers including 'Referer' and 'User-Agent' (see the second sketch below).
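A minimal sketch of points (2) to (4), using a tiny made-up HTML snippet rather than a real 58.com page:

from bs4 import BeautifulSoup

html = '<div class="su_con"><span> 朝陽(yáng) - 望京 </span></div>'
soup = BeautifulSoup(html, 'lxml')

# select() returns a list, so index it with [0] before reading the tag
span = soup.select('div.su_con > span')[0]

# stripped_strings is a generator; wrap it in list() to get the cleaned pieces
district = list(span.stripped_strings)

# split() also returns a list and is indexed the same way
url = 'http://bj.58.com/shouji/25683386143296x.shtml?psid=123'
clean_url = url.split('?')[0]

# membership test with `in` to branch between cases
if 'bj.58.com' in clean_url:
    print(clean_url, district)

And a sketch of point (5). The counter endpoint and the infoid parameter below are placeholders I made up to show the shape of the request, not the real 58.com API; the actual URL has to be copied from the Network - JS panel:

import requests

# hypothetical JS counter endpoint (placeholder; copy the real one from Network - JS)
api = 'http://jst1.58.com/counter?infoid={}'

def get_views(detail_url):
    # the numeric id sits in the last path segment, e.g. .../25683386143296x.shtml
    info_id = detail_url.split('/')[-1].split('x.shtml')[0]
    headers = {
        'User-Agent': 'Mozilla/5.0',
        'Referer': detail_url,  # some counters refuse requests without a Referer
    }
    js = requests.get(api.format(info_id), headers=headers)
    # the response body is a plain string, so slice out the number with split()
    return js.text.split('total:')[-1].strip(' })')

# print(get_views('http://bj.58.com/shouji/25683386143296x.shtml'))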
Part 1
#!/usr/bin/env python
# _*_ coding: utf-8 _*_
__author__ = 'guohuaiqi'
from bs4 import BeautifulSoup
import requests

url = 'http://bj.58.com/sale.shtml'
host = 'http://bj.58.com'

# Collect the links of all goods categories and save them
def get_cate_link(url):
    web_data = requests.get(url)
    soup = BeautifulSoup(web_data.text, 'lxml')
    allurl = soup.select('#ymenu-side > ul > li > ul > li > b > a')
    for item in allurl:
        cate_link = host + item.get('href')
        # print(cate_link)

# get_cate_link(url)
cate_list="""
http://bj.58.com/shouji/
http://bj.58.com/tongxunyw/
http://bj.58.com/danche/
http://bj.58.com/fzixingche/
http://bj.58.com/diandongche/
http://bj.58.com/sanlunche/
http://bj.58.com/peijianzhuangbei/
http://bj.58.com/diannao/
http://bj.58.com/bijiben/
http://bj.58.com/pbdn/
http://bj.58.com/diannaopeijian/
http://bj.58.com/zhoubianshebei/
http://bj.58.com/shuma/
http://bj.58.com/shumaxiangji/
http://bj.58.com/mpsanmpsi/
http://bj.58.com/youxiji/
http://bj.58.com/jiadian/
http://bj.58.com/dianshiji/
http://bj.58.com/ershoukongtiao/
http://bj.58.com/xiyiji/
http://bj.58.com/bingxiang/
http://bj.58.com/binggui/
http://bj.58.com/chuang/
http://bj.58.com/ershoujiaju/
http://bj.58.com/bangongshebei/
http://bj.58.com/diannaohaocai/
http://bj.58.com/bangongjiaju/
http://bj.58.com/ershoushebei/
http://bj.58.com/yingyou/
http://bj.58.com/yingeryongpin/
http://bj.58.com/muyingweiyang/
http://bj.58.com/muyingtongchuang/
http://bj.58.com/yunfuyongpin/
http://bj.58.com/fushi/
http://bj.58.com/nanzhuang/
http://bj.58.com/fsxiemao/
http://bj.58.com/xiangbao/
http://bj.58.com/meirong/
http://bj.58.com/yishu/
http://bj.58.com/shufahuihua/
http://bj.58.com/zhubaoshipin/
http://bj.58.com/yuqi/
http://bj.58.com/tushu/
http://bj.58.com/tushubook/
http://bj.58.com/wenti/
http://bj.58.com/yundongfushi/
http://bj.58.com/jianshenqixie/
http://bj.58.com/huju/
http://bj.58.com/qiulei/
http://bj.58.com/yueqi/
http://bj.58.com/tiaozao/
"""
Part 2
#!/usr/bin/env python
# _*_ coding: utf-8 _*_
__author__ = 'guohuaiqi'
from bs4 import BeautifulSoup
import requests
import time
import pymongo

client = pymongo.MongoClient('localhost', 27017)
tongcheng = client['tongcheng']
urllist = tongcheng['urllist']
content = tongcheng['content']

# Crawl all item links under one category and save them; cate_url comes from cate_list
def get_content_links(cate_url, page):
    # e.g. http://bj.58.com/danche/pn2/ -- the paged URL has to be built here,
    # otherwise the incoming category link only covers its first page
    page_list = '{}pn{}/'.format(cate_url, str(page))
    web_data = requests.get(page_list)
    soup = BeautifulSoup(web_data.text, 'lxml')
    time.sleep(1)
    if soup.find('td', 't'):
        allurl = soup.select('td.t a.t')
        for url1 in allurl:
            content_link = url1.get('href').split('?')[0]
            if 'bj.58.com' not in content_link:
                pass
            else:
                urllist.insert_one({'url': content_link})
                # print(content_link)
                get_item_content(content_link)
    else:
        pass

# cate_url='http://bj.58.com/youxiji/'
# get_content_links(cate_url,20)

# Crawl the detail page of each item: title, date, price, district
def get_item_content(content_link):
    # Only keep listings hosted on 58 itself; anything from the featured ("精品") or Zhuanzhuan ("轉(zhuǎn)轉(zhuǎn)") feeds is dropped upstream
    try:
        web_data1 = requests.get(content_link)
        soup = BeautifulSoup(web_data1.text, 'lxml')
        # deleted listings redirect to a page whose script src contains "404"
        page_not_exist = '404' in soup.find('script', type='text/javascript').get('src').split('/')
        if page_not_exist:
            pass
        else:
            if '區(qū)域' in soup.select('#content > div.person_add_top.no_ident_top > div.per_ad_left > div.col_sub.sumary > ul > li:nth-of-type(2) > div.su_tit')[0].get_text():
                if soup.select('#content > div.person_add_top.no_ident_top > div.per_ad_left > div.col_sub.sumary > ul > li:nth-of-type(2) > div.su_con > span'):
                    district = list(soup.select('#content > div.person_add_top.no_ident_top > div.per_ad_left > div.col_sub.sumary > ul > li:nth-of-type(2) > div.su_con > span')[0].stripped_strings)
                else:
                    district = list(soup.select('#content > div.person_add_top.no_ident_top > div.per_ad_left > div.col_sub.sumary > ul > li:nth-of-type(2) > div.su_con')[0].stripped_strings)
            elif '區(qū)域' in soup.select('#content > div.person_add_top.no_ident_top > div.per_ad_left > div.col_sub.sumary > ul > li:nth-of-type(3) > div.su_tit')[0].get_text():
                if soup.select('#content > div.person_add_top.no_ident_top > div.per_ad_left > div.col_sub.sumary > ul > li:nth-of-type(3) > div.su_con > span'):
                    district = list(soup.select('#content > div.person_add_top.no_ident_top > div.per_ad_left > div.col_sub.sumary > ul > li:nth-of-type(3) > div.su_con > span')[0].stripped_strings)
                else:
                    district = list(soup.select('#content > div.person_add_top.no_ident_top > div.per_ad_left > div.col_sub.sumary > ul > li:nth-of-type(3) > div.su_con')[0].stripped_strings)
            else:
                district = None
            data = {
                'url': content_link,  # keep the source url so the resume step in Part 3 can diff urllist against content
                'goods_cate': soup.select('#header > div.breadCrumb.f12 > span:nth-of-type(3) > a')[0].text.strip(),
                'title': soup.select('#content h1')[0].text.strip(),
                'date': soup.select('#content li.time')[0].text.replace('.', '-'),
                'price': soup.select('span.price.c_f50')[0].text.replace('元', '').strip() if '面議' not in soup.select('span.price.c_f50')[0].text else None,
                'district': district
            }
            content.insert_one(data)
            # print(data)
    except requests.ConnectionError as e:
        print(e.response)

# b=['http://bj.58.com/shuma/23190415633187x.shtml','http://bj.58.com/yishu/25471342844357x.shtml','http://bj.58.com/shouji/25683386143296x.shtml','http://bj.58.com/shuma/23425779899550x.shtml']
# get_item_content(b)
# get_content_links('http://bj.58.com/shouji/',20)
Part 3
#!/usr/bin/env python
# _*_ coding: utf-8 _*_
__author__ = 'guohuaiqi'
from multiprocessing import Pool
from get_cate_link import cate_list
from get_all_contents import get_content_links, urllist, content

# Resume-from-breakpoint mechanism: after an interruption, swap rest_list in for the
# category list passed to pool.map() (see the note after this snippet)
db_urllist = [item['url'] for item in urllist.find()]
content_urllist = [item['url'] for item in content.find()]
x = set(db_urllist)
y = set(content_urllist)
rest_list = x - y

def get_all_links(cate_url):
    for page in range(1, 101):
        get_content_links(cate_url, page)

if __name__ == '__main__':
    pool = Pool()
    pool.map(get_all_links, cate_list.split())
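For the resume step itself, the comment above says to swap rest_list into the pool.map() call. Since rest_list holds detail-page URLs rather than category URLs, the worker would have to be get_item_content instead of get_all_links. A minimal sketch of my reading of that comment (assuming get_item_content is importable from get_all_contents and rest_list has been built as above; this is not code from the original run):

# hypothetical resume run after an interruption: feed the not-yet-parsed detail URLs
# straight to get_item_content instead of re-walking every category page
from get_all_contents import get_item_content

if __name__ == '__main__':
    pool = Pool()
    pool.map(get_item_content, list(rest_list))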
Part 4
Finally, add a small counting script that keeps printing the number of items in the database.
#!/usr/bin/env python
# _*_ coding: utf-8 _*_
__author__ = 'guohuaiqi'
import time
from get_all_contents1 import content

while True:
    print(content.find().count())
    time.sleep(3)
One more thing to note: always, always put these two lines at the very top of the file before writing any code:
#!/usr/bin/env python
# _*_ coding: utf-8 _*_
I stopped the program manually after it had crawled 10,745 records; it took roughly 12 minutes in total.