Using the page http://quotes.toscrape.com/ as an example.
Command:
scrapy shell 'http://quotes.toscrape.com/'
In [4]: response.xpath('//*[@class="quote"]')
Out[4]:
[<Selector xpath='//*[@class="quote"]' data=u'<div class="quote" itemscope itemtype="h'>,
<Selector xpath='//*[@class="quote"]' data=u'<div class="quote" itemscope itemtype="h'>,
<Selector xpath='//*[@class="quote"]' data=u'<div class="quote" itemscope itemtype="h'>,
<Selector xpath='//*[@class="quote"]' data=u'<div class="quote" itemscope itemtype="h'>,
<Selector xpath='//*[@class="quote"]' data=u'<div class="quote" itemscope itemtype="h'>,
<Selector xpath='//*[@class="quote"]' data=u'<div class="quote" itemscope itemtype="h'>,
<Selector xpath='//*[@class="quote"]' data=u'<div class="quote" itemscope itemtype="h'>,
<Selector xpath='//*[@class="quote"]' data=u'<div class="quote" itemscope itemtype="h'>,
<Selector xpath='//*[@class="quote"]' data=u'<div class="quote" itemscope itemtype="h'>,
<Selector xpath='//*[@class="quote"]' data=u'<div class="quote" itemscope itemtype="h'>]
In [5]: quotes = response.xpath('//*[@class="quote"]')
In [6]: quote = quotes[0]
In [7]: quote
Out[7]: <Selector xpath='//*[@class="quote"]' data=u'<div class="quote" itemscope itemtype="h'>
In [8]: quote.extract()
Out[8]: u'<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n <span class="text" itemprop="text">\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d</span>\n <span>by <small class="author" itemprop="author">Albert Einstein</small>\n <a href="/author/Albert-Einstein">(about)</a>\n </span>\n <div class="tags">\n Tags:\n <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world"> \n \n <a class="tag" href="/tag/change/page/1/">change</a>\n \n <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>\n \n <a class="tag" href="/tag/thinking/page/1/">thinking</a>\n \n <a class="tag" href="/tag/world/page/1/">world</a>\n \n </div>\n </div>'
Processing a single quote:
In [9]: quote.xpath('.//*[@class="text"]')
Out[9]: [<Selector xpath='.//*[@class="text"]' data=u'<span class="text" itemprop="text">\u201cThe '>]
In [10]: quote.xpath('.//*[@class="text"]/text()').extract()
Out[10]: [u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d']
In [11]: quote.xpath('.//*[@class="text"]/text()').extract_first()
Out[11]: u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d'
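The per-quote steps above could be applied to every quote in a loop. The following is only a rough sketch of that idea: it uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors, and the PAGE string and the first_or_none helper are hypothetical, hand-written for illustration rather than taken from the real page or from Scrapy.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified stand-in for the page markup (an assumption, not the real page).
PAGE = """
<html><body>
  <div class="quote">
    <span class="text" itemprop="text">Quote one.</span>
    <small class="author" itemprop="author">Author A</small>
  </div>
  <div class="quote">
    <span class="text" itemprop="text">Quote two.</span>
    <small class="author" itemprop="author">Author B</small>
  </div>
</body></html>
"""


def first_or_none(items):
    """Hypothetical helper mimicking the extract_first() idea: first match, or None."""
    return items[0] if items else None


root = ET.fromstring(PAGE)
rows = []
for quote in root.findall(".//*[@class='quote']"):
    # A list of every match inside this quote, comparable to extract().
    texts = [el.text for el in quote.findall(".//*[@class='text']")]
    authors = [el.text for el in quote.findall(".//*[@class='author']")]
    # Keep only the first match, comparable to extract_first().
    rows.append({"text": first_or_none(texts), "author": first_or_none(authors)})

print(rows)
```

The list form is useful when several matches are expected; the first-or-None form avoids an IndexError when nothing matches.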
The examples above select by class; selecting by itemprop works as well:
In [12]: text = quote.xpath('.//*[@itemprop="text"]/text()').extract_first()
In [13]: text
Out[13]: u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d'
For a single quote selector, if the leading dot (.) is omitted, the expression returns the text of every quote on the page:
In [16]: quote.xpath('//*[@itemprop="text"]/text()').extract()
Out[16]:
[u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d',
u'\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d',
u'\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d',
u'\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d',
u"\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d",
u'\u201cTry not to become a man of success. Rather become a man of value.\u201d',
u'\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d',
u"\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d",
u"\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d",
u'\u201cA day without sunshine is like, you know, night.\u201d']
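This looks a bit magical at first, but it is standard XPath behavior: a path that starts with // is absolute and searches from the document root no matter which selector it is called on, whereas .// is relative to the current node.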
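The effect of the context node can be reproduced outside Scrapy. The sketch below is an approximation under stated assumptions: it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath and rejects absolute paths on an element, so the document-wide search is run from the root to emulate a leading //. The PAGE string is a hand-written, simplified stand-in for the real markup.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified stand-in for the page markup (an assumption).
PAGE = """<html><body>
  <div class="quote"><span itemprop="text">Quote one.</span></div>
  <div class="quote"><span itemprop="text">Quote two.</span></div>
  <div class="quote"><span itemprop="text">Quote three.</span></div>
</body></html>"""

root = ET.fromstring(PAGE)
quote = root.findall(".//*[@class='quote']")[0]

# Relative search: only descendants of this one quote are considered.
relative = [el.text for el in quote.findall(".//*[@itemprop='text']")]

# Document-wide search, started at the root to emulate what a leading //
# does in full XPath regardless of the node it is called on.
document_wide = [el.text for el in root.findall(".//*[@itemprop='text']")]

print(relative)
print(document_wide)
```

The same expression gives one result or three depending only on where the search starts, which is why the leading dot matters when working with a single quote.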
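To get the content attribute of a meta tag, the syntax is slightly different: select the attribute with @content instead of using text().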
In [20]: quote.xpath('.//*[@itemprop="keywords"]/@content').extract()
Out[20]: [u'change,deep-thoughts,thinking,world']
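The extracted content is one comma-separated string, so getting individual tags probably needs a split afterwards. A minimal sketch of that step follows, again using xml.etree.ElementTree as a stand-in for Scrapy's selectors. ElementTree has no /@content step, so the attribute is read with get(); the QUOTE string is a hand-written fragment modeled on the output above, with the meta tag self-closed so that it parses as XML.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment modeled on the quote markup (an assumption);
# the meta tag is self-closed so the snippet is well-formed XML.
QUOTE = """<div class="quote">
  <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world" />
</div>"""

quote = ET.fromstring(QUOTE)

# Locate the meta element, then read its content attribute.
meta = quote.find(".//*[@itemprop='keywords']")
content = meta.get("content") if meta is not None else ""

# Split the comma-separated string into individual tags, dropping empties.
tags = [t.strip() for t in content.split(",") if t.strip()]
print(tags)
```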