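spider: parses the response returned by the downloader, produces scraped items, and produces additional crawl requests.
item pipelines: process the scraped items produced by the spider in pipeline fashion: cleaning, validating, and deduplicating them, then storing the data in a database.
downloader middleware: modifies the requests or responses passing between the engine, scheduler, and downloader.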
scrapy -h (common commands): startproject, genspider, settings, crawl, list, shell
1: Create a crawler project from the template: scrapy startproject BaiduStocks
2: Write the spider: cd BaiduStocks, then scrapy genspider example example.com
3: Write the item pipeline
4: Tune the configuration
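Step 4 usually means editing the project's settings.py. A small fragment might look like the following; the setting names are real Scrapy settings, but the values and the pipeline class path are assumptions chosen for illustration, not recommendations from the notes.

```python
# settings.py (fragment) for the BaiduStocks project

# Maximum number of requests Scrapy performs concurrently (Scrapy's default is 16).
CONCURRENT_REQUESTS = 16

# Cap on concurrent requests to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Seconds to wait between consecutive requests to the same site (assumed value).
DOWNLOAD_DELAY = 0.5

# Enable an item pipeline. The class path is hypothetical; the integer (0-1000)
# sets the order, with lower numbers running first.
ITEM_PIPELINES = {
    "BaiduStocks.pipelines.StockCleanPipeline": 300,
}
```

Raising the concurrency values speeds up a crawl at the cost of more load on the target site, so they are typically tuned together with DOWNLOAD_DELAY.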
Request class, class scrapy.http.Request(), attributes and methods: .url, .method, .headers, .body, .meta, .copy()
Response class, class scrapy.http.Response(), attributes and methods: .url, .status, .headers, .body, .flags, .request, .copy()
Scrapy supports several HTML parsing approaches: Beautiful Soup, lxml, re, XPath Selector, CSS Selector.
def gen(n):
    for i in range(n):
        yield i**2
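A short sketch of how such a generator behaves when consumed, next to an eager list version for contrast. `squares_list` is a helper name introduced here for comparison only.

```python
def gen(n):
    """Yield the squares 0, 1, 4, ... one at a time."""
    for i in range(n):
        yield i ** 2


def squares_list(n):
    """Eager counterpart: builds the whole list before returning."""
    return [i ** 2 for i in range(n)]


g = gen(5)
first = next(g)  # runs the body up to the first yield
rest = list(g)   # resumes after that yield and drains the generator

print(first, rest)      # 0 [1, 4, 9, 16]
print(squares_list(5))  # [0, 1, 4, 9, 16]
```

The generator holds only one value at a time, which is the same reason a spider's parse method yields items and requests instead of returning them all in a list.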