Goal: practice scraping book data for a specific keyword on dangdang.com and store the scraped data in a MySQL database.
1. Create a new Scrapy project for Dangdang:
scrapy startproject dd
2. cd into the project directory:
cd dd
3. Create the Dangdang spider from the basic spider template:
scrapy genspider -t basic dd_spider dangdang.com
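This generates dd/spiders/dd_spider.py with roughly the following skeleton (the exact template varies slightly between Scrapy versions):

# -*- coding: utf-8 -*-
import scrapy

class DdSpiderSpider(scrapy.Spider):
    name = 'dd_spider'
    allowed_domains = ['dangdang.com']
    start_urls = ['http://dangdang.com/']

    def parse(self, response):
        pass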
4. Open the dd project in PyCharm.
5. Open dangdang.com, search for the target keyword, analyze the result page, and pick the fields to scrape. Define them in items.py:
# -*- coding: utf-8 -*-
import scrapy

class DdItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    now_price = scrapy.Field()
    comment_num = scrapy.Field()
    detail = scrapy.Field()
6. Open the spider file, import the item class just written, and change the start URL:
from dd.items import DdItem
Build the item in parse():
item = DdItem()
item["title"] = response.xpath("//p[@class='name']/a/@title").extract()
item["link"] = response.xpath("//p[@class='name']/a/@href").extract()
item["now_price"] = response.xpath("//p[@class='price']/span[@class='search_now_price']/text()").extract()
item["comment_num"] = response.xpath("//p/a[@class='search_comment_num']/text()").extract()
item["detail"] = response.xpath("//p[@class='detail']/text()").extract()
yield item
Define the loop that crawls the remaining result pages:
for i in range(2, 27):
    url = "http://search.dangdang.com/?key=python&act=input&page_index=" + str(i)
    # pass the method itself as the callback, not the result of calling it
    yield Request(url, callback=self.parse)
The complete spider code:
# -*- coding: utf-8 -*-
import scrapy
from dd.items import DdItem
from scrapy.http import Request

class DdSpiderSpider(scrapy.Spider):
    name = 'dd_spider'
    allowed_domains = ['dangdang.com']
    start_urls = ['http://search.dangdang.com/?key=python&act=input&page_index=1']

    def parse(self, response):
        item = DdItem()
        item["title"] = response.xpath("//p[@class='name']/a/@title").extract()
        item["link"] = response.xpath("//p[@class='name']/a/@href").extract()
        item["now_price"] = response.xpath("//p[@class='price']/span[@class='search_now_price']/text()").extract()
        item["comment_num"] = response.xpath("//p/a[@class='search_comment_num']/text()").extract()
        item["detail"] = response.xpath("//p[@class='detail']/text()").extract()
        yield item
        for i in range(2, 27):
            url = "http://search.dangdang.com/?key=python&act=input&page_index=" + str(i)
            yield Request(url, callback=self.parse)
7. In settings.py, uncomment the ITEM_PIPELINES setting and set ROBOTSTXT_OBEY to False:
ITEM_PIPELINES = {
    'dd.pipelines.DdPipeline': 300,
}
ROBOTSTXT_OBEY = False
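Optionally (my own addition, not part of the original steps): some sites serve different markup to Scrapy's default User-Agent, so setting a browser-like one in settings.py can make the XPath results more stable:

# hypothetical value; any current browser UA string will do
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'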
8. Open the pipelines file. Loop over the values of the scraped item and print them to test the scrape:
class DdPipeline(object):
    def process_item(self, item, spider):
        for i in range(0, len(item["title"])):
            title = item["title"][i]
            link = item["link"][i]
            now_price = item["now_price"][i]
            comment_num = item["comment_num"][i]
            detail = item["detail"][i]
            print(title)
            print(link)
            print(now_price)
            print(comment_num)
            print(detail)
        return item
9. Run the spider to check the result. In PyCharm's Terminal (or a macOS terminal), cd into the dd project directory and run:
scrapy crawl dd_spider --nolog
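As an alternative sanity check, Scrapy can also dump the items straight to a file with the -o flag (books.csv is just an example name):

scrapy crawl dd_spider -o books.csv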
10. The scrape works, so next store the scraped data in a MySQL database using the third-party library PyMySQL. Install it beforehand with: pip install pymysql
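Before wiring PyMySQL into the pipeline, a minimal sketch to confirm the install and the connection (credentials mirror the pipeline code in step 12; the dd database is only created in the next step, so no db is selected here):

import pymysql

# credentials below are the ones used in step 12; adjust to your setup
conn = pymysql.connect(host="127.0.0.1", user="root", passwd="654321")
cursor = conn.cursor()
cursor.execute("select version()")
print(cursor.fetchone())  # prints the MySQL server version tuple
cursor.close()
conn.close()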
11. Open a terminal and connect to MySQL, then create the dd database and switch to it:
create database dd;
use dd;
Create the books table with the fields to store: an auto-increment id, plus title, link, now_price, comment_num, and detail:
create table books(id int AUTO_INCREMENT PRIMARY KEY, title char(200), link char(100) unique, now_price int(10), comment_num char(100), detail char(255));
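Given the encoding issues noted in the takeaway at the end of this post, an optional tweak (my suggestion, not part of the original steps) is to declare the table's character set explicitly:

create table books(id int AUTO_INCREMENT PRIMARY KEY, title char(200), link char(100) unique, now_price int(10), comment_num char(100), detail char(255)) default charset=utf8;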
12. Import pymysql in pipelines.py and insert the scraped rows:
# -*- coding: utf-8 -*-
import pymysql

class DdPipeline(object):
    def process_item(self, item, spider):
        # create the connection
        conn = pymysql.connect(host="127.0.0.1", user="root", passwd="654321", db="dd")
        for i in range(0, len(item["title"])):
            title = item["title"][i]
            link = item["link"][i]
            now_price = item["now_price"][i]
            comment_num = item["comment_num"][i]
            detail = item["detail"][i]
            # build the SQL statement by string concatenation and insert the row
            sql = "insert into books(title,link,now_price,comment_num,detail) VALUES ('" + title + "','" + link + "','" + now_price + "','" + comment_num + "','" + detail + "')"
            conn.query(sql)
        # close the connection
        conn.close()
        return item
At first the data could not be written to the database correctly: running the spider raised ModuleNotFoundError: No module named 'pymysql'. (That error usually means pymysql was installed into a different Python interpreter than the one running Scrapy; I had not found a fix for it at this point.)
Fix for the failed inserts: change how the SQL statement is written, using a parameterized query:
conn = pymysql.connect(host="127.0.0.1", user="root", passwd="654321", db="dd", charset='utf8')
cursor = conn.cursor()
cursor.execute('set names utf8')    # force a utf8 connection
cursor.execute('set autocommit=1')  # enable autocommit
sql = "insert into books(title,link,now_price,comment_num,detail) VALUES (%s,%s,%s,%s,%s)"
param = (title, link, now_price, comment_num, detail)
cursor.execute(sql, param)
conn.commit()
The complete pipeline code:
# -*- coding: utf-8 -*-
import pymysql

class DdPipeline(object):
    def process_item(self, item, spider):
        # create the connection
        conn = pymysql.connect(host="127.0.0.1", user="root", passwd="654321", db="dd", charset='utf8')
        cursor = conn.cursor()
        cursor.execute('set names utf8')    # force a utf8 connection
        cursor.execute('set autocommit=1')  # enable autocommit
        for i in range(0, len(item["title"])):
            title = item["title"][i]
            link = item["link"][i]
            now_price = item["now_price"][i]
            comment_num = item["comment_num"][i]
            detail = item["detail"][i]
            sql = "insert into books(title,link,now_price,comment_num,detail) VALUES (%s,%s,%s,%s,%s)"
            param = (title, link, now_price, comment_num, detail)
            cursor.execute(sql, param)
        conn.commit()
        cursor.close()
        # close the connection
        conn.close()
        return item
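One structural improvement worth considering (a sketch, not part of the original steps): process_item runs once per scraped item, so opening a fresh connection each time is wasteful. Scrapy pipelines can open the connection once in open_spider and release it in close_spider:

# -*- coding: utf-8 -*-
import pymysql

class DdPipeline(object):
    def open_spider(self, spider):
        # one connection for the whole crawl
        self.conn = pymysql.connect(host="127.0.0.1", user="root", passwd="654321",
                                    db="dd", charset='utf8')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        for i in range(0, len(item["title"])):
            sql = "insert into books(title,link,now_price,comment_num,detail) VALUES (%s,%s,%s,%s,%s)"
            param = (item["title"][i], item["link"][i], item["now_price"][i],
                     item["comment_num"][i], item["detail"][i])
            self.cursor.execute(sql, param)
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()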
Takeaway: the problems I hit most often were encoding problems. If the character set of the incoming data does not match the character set of the table's columns, rows may fail to insert or end up stored as garbled text.
Optimizations:
1. The prices and comment counts scraped from Dangdang are strings; convert them to numbers so the columns can be sorted, e.g. with the helper below.
2. Wrap the database writes in try/except to make the code more robust (see the sketch after the helper).
import re

def getNumber(string):
    # pull the first numeric token out of a string such as "¥59.00"
    matches = re.findall(r"\d+\.?\d*", string)
    return float(matches[0]) if matches else None
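For point 2, a minimal sketch of the guarded insert (assuming the conn, cursor, sql, and param variables from the pipeline above):

try:
    cursor.execute(sql, param)
    conn.commit()
except pymysql.MySQLError as err:
    # roll back the failed row and keep crawling instead of crashing
    conn.rollback()
    print("insert failed:", err)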