1. Pre-scrape analysis of the Aquaman reviews
Aquaman has hit theaters and the word of mouth has exploded. For us, that means one more movie we can scrape and analyze, nice~
Here is one sample review:
Just got out of the midnight screening. James Wan's films are consistently good: Furious 7, Saw, and The Conjuring were all excellent. The fight scenes and sound design are superb, really stunning. In short, DC wins one back ( ̄▽ ̄). It's more than a little better than Justice League (in my personal opinion). Also, Amber Heard is genuinely beautiful; Wan always casts well.
Honestly the first time I've seen a movie this awesome; the transitions and special effects are off the charts.
2. Scraping the Aquaman review data
As before, the data comes from Maoyan's comment API. For this part we bring out the big knife, Scrapy, even though under normal circumstances plain requests would be enough.
Target URL:
http://m.maoyan.com/mmdb/comments/movie/249342.json?_v_=yes&offset=15&startTime=2018-12-11%2009%3A58%3A43
Key parameters:
url: http://m.maoyan.com/mmdb/comments/movie/249342.json
offset: 15
startTime: the timestamp to page backwards from
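To see how these parameters fit together, here is a small standalone sketch (the helper name `comment_url` is my own) that rebuilds the paginated comment address from an offset and a start time:

```python
from urllib.parse import urlencode, quote

BASE = "http://m.maoyan.com/mmdb/comments/movie/249342.json"

def comment_url(offset, start_time):
    # _v_=yes plus the two paging parameters; quote (not quote_plus)
    # encodes the space in startTime as %20, matching the captured URL
    params = {"_v_": "yes", "offset": offset, "startTime": start_time}
    return BASE + "?" + urlencode(params, quote_via=quote)

print(comment_url(15, "2018-12-11 09:58:43"))
```

Running this reproduces the address captured above, with the space and colons percent-encoded.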
The Scrapy code for Maoyan is very simple; I just split it across a few .py files. Haiwang.py:
import scrapy
import json
from haiwang.items import HaiwangItem


class HaiwangSpider(scrapy.Spider):
    name = 'Haiwang'
    allowed_domains = ['m.maoyan.com']
    start_urls = ['http://m.maoyan.com/mmdb/comments/movie/249342.json?_v_=yes&offset=0&startTime=0']

    def parse(self, response):
        print(response.url)
        body_data = response.body_as_unicode()
        js_data = json.loads(body_data)
        item = HaiwangItem()
        for info in js_data["cmts"]:
            item["nickName"] = info["nickName"]
            # cityName is not present on every comment
            item["cityName"] = info["cityName"] if "cityName" in info else ""
            item["content"] = info["content"]
            item["score"] = info["score"]
            item["startTime"] = info["startTime"]
            item["approve"] = info["approve"]
            item["reply"] = info["reply"]
            item["avatarurl"] = info["avatarurl"]
            yield item
        # page backwards: the last comment's startTime seeds the next request
        yield scrapy.Request(
            "http://m.maoyan.com/mmdb/comments/movie/249342.json?_v_=yes&offset=0&startTime={}".format(item["startTime"]),
            callback=self.parse)
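The extraction logic in parse() can be exercised without Scrapy or the network. A minimal sketch, using a made-up two-comment payload that mimics the shape of the Maoyan response (the `extract` helper is hypothetical):

```python
import json

# made-up payload mimicking the shape of the Maoyan response
sample = json.dumps({"cmts": [
    {"nickName": "a", "content": "great", "score": 5,
     "startTime": "2018-12-11 09:00:00", "approve": 1, "reply": 0, "avatarurl": ""},
    {"nickName": "b", "cityName": "Beijing", "content": "nice", "score": 4.5,
     "startTime": "2018-12-11 08:59:00", "approve": 2, "reply": 1, "avatarurl": ""},
]})

def extract(body):
    """Flatten the cmts array the same way parse() does."""
    rows = []
    for info in json.loads(body)["cmts"]:
        rows.append({
            "nickName": info["nickName"],
            # cityName is optional in the API, so default to ""
            "cityName": info.get("cityName", ""),
            "startTime": info["startTime"],
        })
    return rows

rows = extract(sample)
# the last comment's startTime is what seeds the next paginated request
print(rows[-1]["startTime"])  # → 2018-12-11 08:59:00
```

Note that paging on startTime this way can return overlapping comments across pages, so deduplicating downstream is worth considering.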
settings.py
The settings need request headers configured:
DEFAULT_REQUEST_HEADERS = {
    "Referer": "http://m.maoyan.com/movie/249342/comments?_v_=yes",
    "User-Agent": "Mozilla/5.0 Chrome/63.0.3239.26 Mobile Safari/537.36",
    "X-Requested-With": "superagent",
}
A few crawl conditions also need to be set:
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
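If the fixed one-second delay proves too slow or too aggressive, Scrapy's AutoThrottle extension (the "autothrottle settings" the comment above points to) adapts the delay to server latency. A possible settings.py fragment, values chosen as an example:

```python
# settings.py fragment: adaptive politeness instead of a fixed DOWNLOAD_DELAY
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1        # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10         # ceiling when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per domain
```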
Enable the item pipeline:
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'haiwang.pipelines.HaiwangPipeline': 300,
}
items.py
Declare the fields you want to collect:
import scrapy


class HaiwangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    nickName = scrapy.Field()
    cityName = scrapy.Field()
    content = scrapy.Field()
    score = scrapy.Field()
    startTime = scrapy.Field()
    approve = scrapy.Field()
    reply = scrapy.Field()
    avatarurl = scrapy.Field()
pipelines.py
Save the data; here it is appended to a CSV file:
import os
import csv


class HaiwangPipeline(object):
    def __init__(self):
        store_file = os.path.dirname(__file__) + '/spiders/haiwang.csv'
        self.file = open(store_file, "a+", newline="", encoding="utf-8")
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        try:
            self.writer.writerow((
                item["nickName"],
                item["cityName"],
                item["content"],
                item["approve"],
                item["reply"],
                item["startTime"],
                item["avatarurl"],
                item["score"]
            ))
        except Exception as e:
            print(e.args)
        return item

    def close_spider(self, spider):
        self.file.close()
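The CSV plumbing can be checked in isolation. A small sketch with a hypothetical row, writing to an in-memory buffer instead of the pipeline's file, in the same column order the pipeline uses:

```python
import csv
import io

# simulate the pipeline's writer on an in-memory buffer instead of a file
buf = io.StringIO()
writer = csv.writer(buf)
# hypothetical row: nickName, cityName, content, approve, reply,
# startTime, avatarurl, score
writer.writerow(("user", "Beijing", "great, loved it", 3, 0,
                 "2018-12-11 09:58:43", "http://example.com/a.png", 5))

buf.seek(0)
row = next(csv.reader(buf))
print(row[2])  # → great, loved it
```

The comma inside the comment survives because csv.writer quotes the field automatically; this is why movie reviews, which are full of punctuation, round-trip safely.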
begin.py
Write a small launcher script:
from scrapy import cmdline

cmdline.execute("scrapy crawl Haiwang".split())
Done. Now just wait for the data to roll in.