Distributed crawling of 房天下 (Fang.com) new-house and second-hand-house listings with scrapy-redis

Note: this article is intended only for beginners to study and exchange ideas; please do not use it for any other purpose.

1. Analysis

  • Some quick analysis shows that, with the exception of Beijing, every city's new-house and second-hand-house URLs follow the same pattern. Taking Shanghai as an example, the new-house listings live at https://sh.newhouse.fang.com/house/s/ and the second-hand listings at https://sh.esf.fang.com/; only the city abbreviation differs. So once we have the full list of cities we can crawl the new-house and second-hand listings for every one of them (a small sketch of this pattern follows right after this list).
  • Open the Fang.com homepage and click "more cities". The city-list page that appears is the start page for the crawl: https://www.fang.com/SoufunFamily.htm
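
A minimal sketch of the URL pattern (the helper below is purely illustrative and not part of the project; the spider later builds these URLs from the scraped city links instead):

# Illustrative only: build the two listing URLs from a city abbreviation
# such as 'sh' (Shanghai); Beijing is the one special case.
def city_listing_urls(city_abbr):
    if city_abbr == 'bj':
        return 'https://newhouse.fang.com/house/s/', 'http://esf.fang.com/'
    newhouse_url = 'https://{}.newhouse.fang.com/house/s/'.format(city_abbr)
    esf_url = 'https://{}.esf.fang.com/'.format(city_abbr)
    return newhouse_url, esf_url

print(city_listing_urls('sh'))
# ('https://sh.newhouse.fang.com/house/s/', 'https://sh.esf.fang.com/')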

2. Coding

From here on it is mostly code. The bulk of the work is figuring out the XPath expressions for the fields we want to extract; once you get the hang of it, it turns into fairly mechanical work (a quick way to experiment with the XPaths is sketched after the spider code)...

# -*- coding: utf-8 -*-
"""
items.py
"""

import scrapy


class NewHouseItem(scrapy.Item):
    province = scrapy.Field()    # province
    city = scrapy.Field()        # city
    name = scrapy.Field()        # project name
    price = scrapy.Field()       # price
    rooms = scrapy.Field()       # number of rooms
    ares = scrapy.Field()        # area
    address = scrapy.Field()     # address
    district = scrapy.Field()    # district
    sale = scrapy.Field()        # sale status
    origin_url = scrapy.Field()  # original url


class ESFHouseItem(scrapy.Item):
    province = scrapy.Field()    # province
    city = scrapy.Field()        # city
    name = scrapy.Field()        # listing title
    price = scrapy.Field()       # total price
    rooms = scrapy.Field()       # number of rooms
    floor = scrapy.Field()       # floor
    toward = scrapy.Field()      # orientation
    year = scrapy.Field()        # year built
    ares = scrapy.Field()        # area
    address = scrapy.Field()     # address
    unit = scrapy.Field()        # price per square metre
    origin_url = scrapy.Field()  # original url

The spider code:

# -*- coding: utf-8 -*-
"""
soufang.py
"""
import re

import scrapy
from scrapy_redis.spiders import RedisSpider
from fang.items import NewHouseItem, ESFHouseItem


class SoufangSpider(RedisSpider):
    name = 'soufang'
    allowed_domains = ['fang.com']
    # start_urls = ['https://www.fang.com/SoufunFamily.htm']
    redis_key = "soufang:start_urls"

    def parse(self, response):
        trs = response.xpath("//div[@class='outCont']//tr")
        province = ''
        for tr in trs:
            tds = tr.xpath(".//td[not(@class)]")
            province_td = tds[0]
            province_text = province_td.xpath(".//text()").get()
            province_text = re.sub(r"\s", "", province_text)
            if province_text:
                province = province_text
            if province == '其它':
                continue
            city_td = tds[1]
            city_links = city_td.xpath(".//a")
            for city_link in city_links:
                city = city_link.xpath(".//text()").get()
                city_url = city_link.xpath(".//@href").get()
                url_module = city_url.split("//")  # e.g. 'https://sh.fang.com/' -> ['https:', 'sh.fang.com/']
                scheme = url_module[0]
                domain = url_module[1]
                if 'bj.' in domain:
                    newhouse_url = 'https://newhouse.fang.com/house/s/'
                    esf_url = 'http://esf.fang.com/'
                else:
                    newhouse_url = scheme + '//' + 'newhouse.' + domain + 'house/s/'
                    esf_url = scheme + '//' + 'esf.' + domain

                yield scrapy.Request(url=newhouse_url, callback=self.parse_newhouse, meta={"info": (province, city)})
                yield scrapy.Request(url=esf_url, callback=self.parse_esf, meta={"info": (province, city)})
                # the two breaks below stop after the first city of the first row --
                # presumably left in for testing; remove them to crawl every city
                break
            break

    def parse_newhouse(self, response):
        province, city = response.meta.get('info')
        lis = response.xpath("//div[contains(@class, 'nl_con')]/ul/li")
        for li in lis:
            li_sect = li.xpath(".//div[@class='nlcd_name']/a/text()")
            if not li_sect:
                continue
            name = li_sect.get().strip()
            house_type = li.xpath(".//div[contains(@class, 'house_type')]/a/text()").getall()
            rooms = '/'.join([item.strip() for item in house_type if item.endswith('居')]) or '未知'
            ares = li.xpath("string(.//div[contains(@class, 'house_type')])").get()
            ares = ares.split('-')[1].strip() if '-' in ares else '未知'
            address = li.xpath(".//div[@class='address']/a/@title").get()
            address_info = li.xpath("string(.//div[@class='address'])").get()
            district_match = re.search(r'.*\[(.*)\].*', address_info)
            district = district_match.group(1) if district_match else '未知'
            sale = li.xpath(".//div[contains(@class, 'fangyuan')]/span/text()").get()
            price = li.xpath("string(.//div[@class='nhouse_price'])").get().strip()
            origin_url = li.xpath(".//div[@class='nlcd_name']/a/@href").get()
            item = NewHouseItem(name=name, rooms=rooms, ares=ares, address=address, district=district, sale=sale,
                                price=price, origin_url=origin_url, province=province, city=city)
            yield item

        next_url = response.xpath("//div[@class='page']//a[@class='next']/@href").get()
        if next_url:
            print('Next page (new houses):', response.urljoin(next_url))
            yield scrapy.Request(url=response.urljoin(next_url), callback=self.parse_newhouse,
                                 meta={"info": (province, city)})
        else:
            print("未找到下一頁新房數(shù)據(jù)")

    def parse_esf(self, response):
        province, city = response.meta.get('info')
        print(province, city)
        dls = response.xpath("//div[contains(@class, 'shop_list')]/dl")
        for dl in dls:
            name = dl.xpath(".//span[@class='tit_shop']/text()").get()
            infos = dl.xpath(".//p[@class='tel_shop']/text()").getall()
            rooms, floor, toward, ares, year = '未知', '未知', '未知', '未知', '未知'  # default to '未知' (unknown)
            for info in infos:
                if '廳' in info:
                    rooms = info.strip()
                elif '層' in info:
                    floor = info
                elif '向' in info:
                    toward = info
                elif '㎡' in info:
                    ares = info
                elif '建' in info:
                    year = info
            address = dl.xpath(".//p[@class='add_shop']/span/text()").get()
            price = dl.xpath("string(.//dd[@class='price_right']/span[1])").get()
            unit = dl.xpath("string(.//dd[@class='price_right']/span[2])").get()
            detail_url = dl.xpath(".//p[@class='title']/a/@href").get()
            origin_url = response.urljoin(detail_url)
            item = ESFHouseItem(name=name, rooms=rooms, ares=ares, address=address, toward=toward, floor=floor,
                                price=price, origin_url=origin_url, province=province, city=city, year=year, unit=unit)
            yield item
        next_url = None
        next_page_info = response.xpath("//div[@class='page_al']//p")
        for info in next_page_info:
            if info.xpath("./a/text()").get() == "下一頁":
                next_url = info.xpath("./a/@href").get()
                print(next_url)
        if next_url:
            print('Next page (second-hand houses):', response.urljoin(next_url))
            yield scrapy.Request(url=response.urljoin(next_url), callback=self.parse_esf,
                                 meta={"info": (province, city)})
        else:
            print("未找到下一頁二手房數(shù)據(jù)")

Next, a downloader middleware that sets a random User-Agent header on every request; it shows two ways of getting a user agent (a hard-coded list, or the faker library).

# -*- coding: utf-8 -*-
"""
middlewares.py
"""
import random

from faker import Factory
from scrapy import signals

f = Factory.create()

class UserAgentDownloadMiddleWare(object):
    # middleware that picks a random User-Agent for each request
    USER_AGENTS = [
        # Opera
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60",
        "Opera/8.0 (Windows NT 5.1; U; en)",
        "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
        # Firefox
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
        "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
        # Safari
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
        # chrome
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",
        # 360
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
        # Taobao browser
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
        # Liebao (Cheetah) browser
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
        # QQ browser
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
        # Sogou browser
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)",
        # Maxthon browser
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36",
        # UC browser
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36",
    ]

    def process_request(self, request, spider):
        user_agent = random.choice(self.USER_AGENTS)
        # user_agent = f.user_agent()  # alternative: generate one with the faker library (pip install faker)
        print(user_agent)
        request.headers['User-Agent'] = user_agent

The relevant settings:

DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
}

DOWNLOADER_MIDDLEWARES = {
   'fang.middlewares.UserAgentDownloadMiddleWare': 543,
}
########## scrapy-redis settings ##########
# use the scrapy-redis scheduler so requests are stored in redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# make all spider instances share the same dedup fingerprints
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300
}
# persist the scheduler state so the crawl can be paused and resumed
SCHEDULER_PERSIST = True
REDIS_HOST = '127.0.0.1'  # redis host
REDIS_PORT = 6379         # default redis port
############################################
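
With RedisPipeline enabled, scraped items are not written to a local file but pushed into a redis list (named "<spider name>:items" by default, so "soufang:items" here). A minimal consumer sketch, assuming redis runs locally and the redis-py package is installed, that drains the list into a JSON-lines file:

# Pop serialized items from redis and append them to a .jl file.
import json

import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379)
with open('fang_items.jl', 'a', encoding='utf-8') as fp:
    while True:
        _, data = r.blpop('soufang:items')   # blocks until an item is available
        item = json.loads(data)
        fp.write(json.dumps(item, ensure_ascii=False) + '\n')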

3. Running the spider

Earlier, in the spider, we defined a redis key, redis_key = "soufang:start_urls", which tells the spider which redis list to read its start URL from.

  1. Go into the spiders directory and run scrapy runspider soufang.py. The spider starts but then blocks, listening for a start URL to appear in redis.

  2. So far I have only tested the crawl on Windows, and it works. Strictly speaking, a distributed crawl should involve several machines crawling at the same time to show its real benefit; here the goal is just to demonstrate the idea. Install redis locally on Windows, start the server (redis-server.exe) and then the client (redis-cli.exe), and push a start URL from the client with lpush soufang:start_urls https://www.fang.com/SoufunFamily.htm, where soufang:start_urls is the key defined in soufang.py. Press Enter, and the spider that was blocking starts to work.
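
If you prefer Python over redis-cli, the same start URL can be pushed with the redis-py package (a sketch assuming a local redis on the default port):

# Seed the list that RedisSpider watches (the redis_key defined in soufang.py).
import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379)
r.lpush('soufang:start_urls', 'https://www.fang.com/SoufunFamily.htm')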
