When I first looked at product listings on JD.com, a lot of the information was right there in the page source, so I assumed large-scale crawling would be easier than on Taobao. JD then fooled me over and over, and the whole thing took close to six hours. Here is the GitHub address up front: https://github.com/xiaobeibei26/jingdong
A few words about the site first. Type any product category into the home-page search box and the results usually run to 100 pages, except for some rarer goods, as shown:
Now the analysis. By default, searching for a product renders only 30 goods in the results page source; dragging the right-hand scrollbar towards the bottom triggers an AJAX request that renders the remaining 30. Of the 60 goods on a page, roughly three are ads, so a page usually yields about 57 real products. Have a look at this AJAX request; it is the hard part of the crawl (a full round-trip sketch follows a couple of paragraphs below):
Looking at the request headers, my first instinct was that most of the parameters could be dropped to leave a neat, simple link.
I didn't pay enough attention, deleted a pile of parameters and fired the requests. After a long stretch of debugging, the goods that survived deduplication at database-insert time always amounted to only half of each page. On closer inspection, the links differed but the goods coming back were identical. So I went back over the AJAX request, fiddled with it for a long while, and finally discovered that every number trailing the URL is the ID of one product, and that those IDs hide inside the initial batch of goods present when the page first opens, as shown:
Compare those IDs with the parameters of the AJAX request:
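Putting the two halves together, the trick can be reproduced outside Scrapy. The sketch below is my own illustration rather than code from the repo; the browser-like User-Agent header is an assumption, and JD may still throttle or block plain HTTP clients. It reads the data-pid of the 30 server-rendered goods on display page 1, then feeds them to s_new.php as show_items to fetch the other half:

import requests
from lxml import etree

headers = {'User-Agent': 'Mozilla/5.0'}  # assumed; bare clients tend to get rejected

# First half: the server-rendered search page (only ~30 goods are in the source)
first = requests.get('https://search.jd.com/Search',
                     params={'keyword': '褲子', 'enc': 'utf-8', 'page': 1},
                     headers=headers)
pids = etree.HTML(first.text).xpath('//*[@id="J_goodsList"]/ul/li/@data-pid')
print(len(pids))  # about 30; the rest of the page is simply not there yet

# Second half: s_new.php wants the IDs of the goods already shown
second = requests.get('https://search.jd.com/s_new.php',
                      params={'keyword': '褲子', 'enc': 'utf-8', 'page': 2,
                              's': 26, 'scrolling': 'y', 'pos': 30, 'tpl': '3_L',
                              'show_items': ','.join(pids)},
                      headers=headers)
# The response is a bare list of <li> fragments; lxml wraps them in html/body
print(len(etree.HTML(second.text).xpath('/html/body/li')))  # the remaining ~30 goods

Without the right show_items, different-looking links keep returning the same goods, which is exactly the half-page symptom described above.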
Then I had to rework the crawler logic and rewrite the code, which ate another two hours. Brutal.
After that the spider could finally extract complete pages of goods in one pass. One last heads-up: JD numbers half pages, so the page parameter runs 1 and 2 for the first display page and 3 and 4 for the second, which is a little unusual. To make the mapping concrete, here is a throwaway sketch of my own (not from the repo):
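for n in range(1, 4):  # display pages 1 to 3
    print(n, '->', 2 * n - 1, 'and', 2 * n)
# 1 -> 1 and 2
# 2 -> 3 and 4
# 3 -> 5 and 6

Finally, here is the main spider program: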
# -*- coding: utf-8 -*-
import scrapy
from jingdong.items import JingdongItem


class JdSpider(scrapy.Spider):
    name = "jd"
    allowed_domains = ["www.jd.com"]
    start_urls = ['http://www.jd.com/']
    # First half of a result page, rendered server-side
    search_url1 = 'https://search.jd.com/Search?keyword={key}&enc=utf-8&page={page}'
    # Second half, fetched by AJAX; show_items carries the IDs of the goods already rendered
    search_url2 = 'https://search.jd.com/s_new.php?keyword={key}&enc=utf-8&page={page}&s=26&scrolling=y&pos=30&tpl=3_L&show_items={goods_items}'
    shop_url = 'http://mall.jd.com/index-{shop_id}.html'

    def start_requests(self):
        key = '褲子'
        for num in range(1, 100):
            page1 = str(2 * num - 1)  # display page num maps to half pages 2*num-1 ...
            page2 = str(2 * num)      # ... and 2*num
            yield scrapy.Request(url=self.search_url1.format(key=key, page=page1),
                                 callback=self.parse, dont_filter=True)
            # Fetch the same first-half URL a second time purely to harvest the
            # goods IDs that the second-half AJAX request needs.
            # dont_filter=True is mandatory here, otherwise scrapy's dupefilter
            # silently drops this repeated URL.
            yield scrapy.Request(url=self.search_url1.format(key=key, page=page1),
                                 callback=self.get_next_half,
                                 meta={'page2': page2, 'key': key}, dont_filter=True)

    def get_next_half(self, response):
        try:
            # data-pid on each <li> is the goods ID; s_new.php expects them comma-joined
            items = response.xpath('//*[@id="J_goodsList"]/ul/li/@data-pid').extract()
            key = response.meta['key']
            page2 = response.meta['page2']
            goods_items = ','.join(items)
            # dont_filter=True is needed again: search.jd.com is not covered by
            # allowed_domains ["www.jd.com"], so the offsite middleware would drop
            # this request otherwise. That is why scrapy's complaint mentions
            # allowed_domains even though the very first request is also to jd.com.
            yield scrapy.Request(url=self.search_url2.format(key=key, page=page2, goods_items=goods_items),
                                 callback=self.next_parse, dont_filter=True)
        except Exception as e:
            print('no data')

    def parse(self, response):
        all_goods = response.xpath('//div[@id="J_goodsList"]/ul/li')
        for one_good in all_goods:
            item = JingdongItem()
            try:
                data = one_good.xpath('div/div/a/em')
                item['title'] = data.xpath('string(.)').extract()[0]  # all text inside the tag
                item['comment_count'] = one_good.xpath('div/div[@class="p-commit"]/strong/a/text()').extract()[0]  # review count
                item['goods_url'] = 'http:' + one_good.xpath('div/div[4]/a/@href').extract()[0]  # goods link
                item['shops_id'] = one_good.xpath('div/div[@class="p-shop"]/@data-shopid').extract()[0]  # shop ID
                item['shop_url'] = self.shop_url.format(shop_id=item['shops_id'])
                goods_id = one_good.xpath('div/div[2]/div/ul/li[1]/a/img/@data-sku').extract()[0]
                if goods_id:
                    item['goods_id'] = goods_id
                price = one_good.xpath('div/div[3]/strong/i/text()').extract()  # price
                # Some goods show zero reviews and no price in the source; they look like
                # temporary front-page promotions (three or four per page), so skip them.
                if price:
                    item['price'] = price[0]
                    yield item
            except Exception as e:
                pass

    def next_parse(self, response):
        # The AJAX response is a bare list of <li> fragments, hence the different xpath
        all_goods = response.xpath('/html/body/li')
        for one_good in all_goods:
            item = JingdongItem()
            try:
                data = one_good.xpath('div/div/a/em')
                item['title'] = data.xpath('string(.)').extract()[0]  # all text inside the tag
                item['comment_count'] = one_good.xpath('div/div[@class="p-commit"]/strong/a/text()').extract()[0]  # review count
                item['goods_url'] = 'http:' + one_good.xpath('div/div[4]/a/@href').extract()[0]  # goods link
                item['shops_id'] = one_good.xpath('div/div[@class="p-shop"]/@data-shopid').extract()[0]  # shop ID
                item['shop_url'] = self.shop_url.format(shop_id=item['shops_id'])
                goods_id = one_good.xpath('div/div[2]/div/ul/li[1]/a/img/@data-sku').extract()[0]
                if goods_id:
                    item['goods_id'] = goods_id
                price = one_good.xpath('div/div[3]/strong/i/text()').extract()  # price
                if price:  # same promo-goods caveat as in parse()
                    item['price'] = price[0]
                    yield item
            except Exception as e:
                pass
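The spider imports JingdongItem from jingdong/items.py. The post never shows that file, but judging from the fields assigned above it presumably looks like this minimal sketch:

import scrapy

class JingdongItem(scrapy.Item):
    title = scrapy.Field()          # goods title
    comment_count = scrapy.Field()  # number of reviews
    goods_url = scrapy.Field()      # link to the goods detail page
    shops_id = scrapy.Field()       # shop ID taken from data-shopid
    shop_url = scrapy.Field()       # shop link built from shops_id
    goods_id = scrapy.Field()       # SKU ID taken from data-sku
    price = scrapy.Field()          # listed price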
The pipeline code:
import pymysql


class JingdongPipeline(object):
    # Earlier MongoDB version, kept for reference: an upsert keyed on goods_id,
    # updating if present and inserting otherwise.
    # def __init__(self):
    #     self.client = MongoClient()
    #     self.database = self.client['jingdong']
    #     self.db = self.database['jingdong_infomation']
    #
    # def process_item(self, item, spider):
    #     self.db.update({'goods_id': item['goods_id']}, dict(item), True)
    #     return item
    #
    # def close_spider(self, spider):
    #     self.client.close()

    def __init__(self):
        self.conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                                    passwd='root', db='jingdong', charset='utf8')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        try:  # some titles repeat and some items lack fields, hence the exception handling
            title = item['title']
            comment_count = item['comment_count']  # review count
            shop_url = item['shop_url']  # shop link
            price = item['price']
            goods_url = item['goods_url']
            shops_id = item['shops_id']
            goods_id = int(item['goods_id'])
            try:
                self.cursor.execute(
                    "insert into jingdong_goods"
                    "(title,comment_count,shop_url,price,goods_url,shops_id,goods_id)"
                    "values(%s,%s,%s,%s,%s,%s,%s)",
                    (title, comment_count, shop_url, price, goods_url, shops_id, goods_id))
                self.conn.commit()
            except Exception as e:
                pass  # duplicate rows land here and are skipped
        except Exception as e:
            pass
        return item
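The pipeline assumes a jingdong_goods table already exists. The post never shows its schema, but working backwards from the INSERT above, a one-off setup script could look like this sketch (the column types and sizes are my assumptions, not the author's):

import pymysql

conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                       passwd='root', db='jingdong', charset='utf8')
ddl = """
CREATE TABLE IF NOT EXISTS jingdong_goods (
    goods_id      BIGINT PRIMARY KEY,  -- SKU ID; the key also rejects duplicates on reruns
    title         VARCHAR(255),
    comment_count VARCHAR(32),         -- kept as text: JD renders counts like '2万+'
    shop_url      VARCHAR(255),
    price         DECIMAL(10,2),
    goods_url     VARCHAR(255),
    shops_id      VARCHAR(32)
) DEFAULT CHARSET=utf8
"""
with conn.cursor() as cursor:
    cursor.execute(ddl)
conn.commit()
conn.close()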
The run results:
It ran for a few minutes, a thousand records to a page here, and crawled several tens of thousands of pants in total. JD really does have a lot of pants.
And the MySQL insert operation: