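I started learning Scrapy yesterday, so today I'm testing how well it works. Most of the examples I found online use Douban pages for testing, so I decided to pick a different site and went with 天天美劇 (ttmeiju.com).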
#coding:utf-8
import json
import scrapy
from my_scrapy_project.items import DmozItem


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["ttmeiju.com"]
    start_urls = [
        "http://www.ttmeiju.com/"
    ]

    def parse(self, response):
        # Each matching <tr> is one entry in the homepage's latest-resources table
        for sel in response.xpath("//table[contains(@class,'seedtable')]/tr[contains(@class,'Scontent')]"):
            item = DmozItem()
            # extract() returns a list of strings; [0] takes the first match
            title = sel.xpath('td[2]/a/text()').extract()[0]
            link = sel.xpath('td[2]/a/@href').extract()
            download = sel.xpath('td[3]/a/@href').extract()
            item['title'] = title
            item['link'] = link
            item['download'] = download
            yield item
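- response.xpath("//table[contains(@class,'seedtable')]/tr[contains(@class,'Scontent')]") selects the latest-resources section on the 天天美劇 homepage. It means: pick every tr whose class contains Scontent inside the table whose class contains seedtable.
- sel.xpath('td[2]/a/text()').extract()[0] selects the name of the release. You can see where it lives by inspecting the element or viewing the page source.
- sel.xpath('td[3]/a/@href').extract() gives the various download links for the resource.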
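Since extract() returns a list of strings, indexing with [0] raises an IndexError when a row has no match, and the text node keeps whatever newlines and spaces surround it in the HTML. The following is a minimal sketch of one way to handle both cases. The helper name first_clean and the sample list are made up for illustration and are not part of Scrapy or of the spider above.

```python
def first_clean(values, default=""):
    """Return the first extracted string with whitespace collapsed.

    `values` is assumed to be a list of strings, shaped like the result
    of sel.xpath(...).extract(). If it is empty, return `default`.
    """
    if not values:
        return default
    # split() with no arguments drops leading/trailing whitespace and
    # splits on any run of spaces or newlines; join puts single spaces back.
    return " ".join(values[0].split())


# Hypothetical sample, shaped like the title list for one table row
sample = ["\n\u840c\u5ba0\u4e5f\u75af\u72c2 Pets Wild at Heart S01E02 HR-HDTV \u5927\u5bb6\u5b57\u5e55\u7ec4 \n \n "]

print(first_clean(sample))
print(repr(first_clean([])))
```

In the spider this would be used as title = first_clean(sel.xpath('td[2]/a/text()').extract()). Scrapy selectors also offer extract_first(), which avoids the IndexError but does not trim whitespace.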
Output:
{"download": ["http://pan.baidu.com/s/1i3CcdQd"], "link": ["http://www.ttmeiju.com/seed/38897.html"], "title": "\n\u840c\u5ba0\u4e5f\u75af\u72c2 Pets Wild at Heart S01E02 HR-HDTV \u5927\u5bb6\u5b57\u5e55\u7ec4 \n \n "},
{"download": ["http://pan.baidu.com/s/1c0rT2pi"], "link": ["http://www.ttmeiju.com/seed/38896.html"], "title": "\n\u840c\u5ba0\u4e5f\u75af\u72c2 Pet Wild at Heart S01E01 HR-HDTV \u5927\u5bb6\u5b57\u5e55\u7ec4 \n \n "},
{"download": ["http://pan.baidu.com/s/1qWp3jgo"], "link": ["http://www.ttmeiju.com/seed/38895.html"], "title": "\n\u840c\u5ba0\u4e5f\u75af\u72c2 Pet Wild At Heart S01E01 \u5927\u5bb6\u5b57\u5e55\u7ec4 \n \n "},
{"download": ["http://pan.baidu.com/s/1bnxvmFl"], "link": ["http://www.ttmeiju.com/seed/38894.html"], "title": "\n\u7231\u306e\u65c5\u9986 The Love Hotel \u5927\u5bb6\u5b57\u5e55\u7ec4 \n \n "},
The Chinese character encoding problem has shown up again... I'll sort it out tomorrow.
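A likely explanation for the \uXXXX sequences is that the JSON output escapes every non-ASCII character, which is also the default behaviour of Python's own json module. The escaped text is still valid JSON and decodes back to the original Chinese, so no data is lost. The sketch below shows the difference with plain json.dumps; the item dict is a made-up sample, and this demonstrates the escaping itself rather than Scrapy's exporter.

```python
import json

# Hypothetical item, shaped like one scraped row
item = {"title": "\u840c\u5ba0\u4e5f\u75af\u72c2 Pets Wild at Heart S01E02"}

# Default: ensure_ascii=True, so non-ASCII characters become \uXXXX escapes
escaped = json.dumps(item)

# ensure_ascii=False keeps the characters as they are
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings decode to the same data
print(json.loads(escaped) == json.loads(readable))
```

For the feed export itself, Scrapy versions that have the FEED_EXPORT_ENCODING setting should write readable characters when it is set to "utf-8" in settings.py. Whether that applies here depends on the installed Scrapy version, so treat it as something to verify.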