Step 1: Create the Scrapy project
1. Change into your Desktop folder:
cd Desktop
2. Create the Scrapy project (named imovie, which the later imports rely on):
scrapy startproject imovie
3. Generate the spider, named movie:
cd imovie
scrapy genspider movie www.dytt8.net
4. Adjust settings.py:
Change the User-Agent:
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36'
Ignore the robots.txt protocol:
ROBOTSTXT_OBEY = False
Enable the item pipeline:
ITEM_PIPELINES = {
    'imovie.pipelines.ImoviePipeline': 300,
}
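For reference, the genspider command from step 3 writes a spider skeleton to imovie/spiders/movie.py; a rough sketch of what it generates (the exact template varies slightly between Scrapy versions):

import scrapy


class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['www.dytt8.net']
    start_urls = ['http://www.dytt8.net/']

    def parse(self, response):
        pass

Step 2 replaces the placeholder start_urls and fills in parse().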
Step 2: Initialization
1. In the spider (movie.py), fill in the site to crawl: http://****.com
   and the page to start from: http://****.com.index.html
allowed_domains = ['www.dytt8.net']
start_urls = ['https://www.dytt8.net/html/gndy/dyzz/index.html']
2. Define the structured data type for the scraped content in items.py:
import scrapy


class ImovieItem(scrapy.Item):
    title = scrapy.Field()
    date = scrapy.Field()
    url = scrapy.Field()
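A Scrapy Item is filled in like a dict, which is exactly how the spider uses it later; a quick, purely illustrative check (the values below are made up):

item = ImovieItem()
item["title"] = "Example Movie"      # fields are assigned like dict keys
item["date"] = "2019-01-01"
item["url"] = "https://www.dytt8.net/html/example.html"
print(dict(item))                    # prints the item as a plain dict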
Step 3: Write the extraction rules
Inspect the site: we need each movie's title, its date, and its detail-page URL (the URL makes deeper crawling possible later).
The rules for the listing page are (XPath):
//table
title = .//a/text()
date  = .//td[@style='padding-left:3px']/font/text()
url   = domain + .//a/@href
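Dropped into the spider's parse() method, these rules look roughly like this (the complete movie.py is listed at the end):

def parse(self, response):
    for table in response.xpath("//table"):
        item = ImovieItem()
        item["title"] = table.xpath(".//a/text()").extract_first()
        item["date"] = table.xpath(".//td[@style='padding-left:3px']/font/text()").extract_first()
        item["url"] = "https://www.dytt8.net" + table.xpath(".//a/@href").extract_first()
        yield item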
Step 4: Automatic pagination
1. Check whether a next page exists:
if response.xpath("//a[text()='下一頁']"):
2. Extract the next page's URL (XPath):
//a[text()='下一頁']/@href
3. Follow it:
yield self.make_requests_from_url(next_page)
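Put together at the end of parse(), pagination looks like the sketch below. Note that make_requests_from_url is deprecated in newer Scrapy releases (the original call still works on older versions), so this sketch uses scrapy.Request together with response.urljoin instead:

# at the end of parse(), after the items on the current page have been yielded
if response.xpath("//a[text()='下一頁']"):
    next_href = response.xpath("//a[text()='下一頁']/@href").extract_first()
    next_page = response.urljoin(next_href)    # resolve the relative link against the current page
    yield scrapy.Request(next_page, callback=self.parse)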
Step 5: Store the results in SQLite
1. import sqlite3
2. SQL to create the table:
create table if not exists movies (title text, date text, url text);
3. SQL to insert a row (executed from the pipeline with the item's fields as parameters):
cur.execute("insert into movies (title,date,url) values (?,?,?);", (item["title"], item["date"], item["url"]))
4. Verify the database (optional), parse_sqlite.py:
import sqlite3
import pandas as pd
conn = sqlite3.connect("data.sqlite")
df = pd.read_sql_query("select * from movies limit 5;", conn)
print(df)
Step 6: Run
Run the spider from inside the project directory:
scrapy crawl movie
Run output: (screenshot omitted)
All of the scraped records are saved in the database, ready for the next stage of processing.
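If you prefer launching the crawl from Python rather than the scrapy command, a small optional helper script (hypothetical name run_movie.py, placed in the project root) does the same thing:

# run_movie.py -- optional helper, not part of the generated project
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())   # load the project's settings.py
process.crawl("movie")                              # spider name defined in movie.py
process.start()                                     # blocks until the crawl finishes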
Full code:
# movie.py
# -*- coding: utf-8 -*-
import scrapy

from imovie.items import ImovieItem


class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['www.dytt8.net']
    start_urls = ['https://www.dytt8.net/html/gndy/dyzz/index.html']

    def parse(self, response):
        tables = response.xpath("//table")
        for table in tables:
            item = ImovieItem()  # a fresh item for every table
            try:
                item["title"] = table.xpath(".//a/text()").extract_first()
                item["date"] = table.xpath(".//td[@style='padding-left:3px']/font/text()").extract_first().split()[0]
                item["url"] = "https://www.dytt8.net" + table.xpath(".//a/@href").extract_first()
            except Exception:
                continue  # skip tables that don't contain a movie entry
            print(item)
            yield item
        # follow the "下一頁" (next page) link, if there is one
        if response.xpath("//a[text()='下一頁']"):
            next_page = "https://www.dytt8.net/html/gndy/dyzz/" + response.xpath("//a[text()='下一頁']/@href").extract_first()
            # make_requests_from_url is deprecated in newer Scrapy;
            # scrapy.Request(next_page, callback=self.parse) is the modern equivalent
            yield self.make_requests_from_url(next_page)
# items.py
import scrapy


class ImovieItem(scrapy.Item):
    title = scrapy.Field()
    date = scrapy.Field()
    url = scrapy.Field()
# pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import sqlite3


class ImoviePipeline(object):
    def __init__(self):
        # open (or create) the SQLite file and make sure the table exists
        self.conn = sqlite3.connect("data.sqlite")
        cur = self.conn.cursor()
        cur.execute("create table if not exists movies (title text, date text, url text);")
        cur.close()

    def process_item(self, item, spider):
        # insert one row per item and commit right away
        cur = self.conn.cursor()
        cur.execute("insert into movies (title,date,url) values (?,?,?);",
                    (item["title"], item["date"], item["url"]))
        self.conn.commit()
        print("Inserted one row")
        cur.close()
        return item
# parse_sqlite.py
import sqlite3
import pandas as pd
conn = sqlite3.connect("data.sqlite")
df = pd.read_sql_query("select * from movies;", conn)
print(df)