Chapter 3: Starting to Crawl
- urlparse module
1.Key:
Think First.
What data am I trying to gather? Can this be accomplished by scraping just a few predefined websites (almost always the easier option), or does my crawler need to be able to discover new websites I might not know about?
When my crawler reaches a particular website, will it immediately follow the next outbound link to a new website, or will it stick around for a while and drill down into the current website?
Are there any conditions under which I would not want to scrape a particular site? Am I interested in non-English content?
有沒有我不想采集的一些網(wǎng)站?我對(duì)非英文網(wǎng)站的內(nèi)容感興趣么十嘿?
How am I protecting myself against legal action if my web crawler catches the attention of a webmaster on one of the sites it runs across?
urlparse module
The urlparse module splits a URL into six components and returns them as a tuple; the components can also be reassembled back into a URL. Its main functions are urljoin, urlsplit, urlunsplit, and urlparse.
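The splitting and reassembling functions behave like this (in Python 3 the module was renamed urllib.parse, but the function names are unchanged):

```python
try:
    from urllib.parse import urljoin, urlsplit, urlunsplit  # Python 3
except ImportError:
    from urlparse import urljoin, urlsplit, urlunsplit      # Python 2

# urljoin resolves a (possibly relative) link against a base URL
print(urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html'))
# -> http://www.cwi.nl/%7Eguido/FAQ.html

# urlsplit breaks a URL into five parts (like urlparse, but with no
# separate params field)
parts = urlsplit('http://www.cwi.nl:80/%7Eguido/Python.html?lang=en#intro')
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# urlunsplit reassembles the parts into the original URL
print(urlunsplit(parts))
```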
urlparse function
>>> from urlparse import urlparse
>>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
>>> o
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', params='', query='', fragment='')
>>> o.scheme
'http'
>>> o.port
80
>>> o.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'
It parses the URL into six components: scheme, netloc, path, params, query, and fragment.
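The example above leaves params, query, and fragment empty; here is a URL that exercises all six fields (example.com is just an illustrative host):

```python
try:
    from urllib.parse import urlparse  # Python 3 location of the module
except ImportError:
    from urlparse import urlparse      # Python 2

# Layout: scheme://netloc/path;params?query#fragment
o = urlparse('http://example.com/path;type=a?key=value#section')
print(o.scheme)    # 'http'
print(o.netloc)    # 'example.com'
print(o.path)      # '/path'
print(o.params)    # 'type=a'
print(o.query)     # 'key=value'
print(o.fragment)  # 'section'
```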
scrapy
Scrapy uses Item objects to determine which pieces of information it should save from the pages it visits. Scrapy can save this information in a variety of formats, such as CSV, JSON, or XML files, using the following commands:
$ scrapy crawl article -o articles.csv -t csv
$ scrapy crawl article -o articles.json -t json
$ scrapy crawl article -o articles.xml -t xml
Of course, we can also define our own Item objects and write the results to whatever file or database we need; we just add the corresponding code in the spider's parse method.
Scrapy is a powerful tool for web crawling. It automatically collects all URLs and compares them against the rules you specify, ensures that every URL is unique, normalizes URLs where needed, and recurses into deeper pages.
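Scrapy handles those steps internally; a minimal stdlib sketch of the same normalize-and-deduplicate idea might look like this (the helper names normalize and should_crawl are my own, not Scrapy's API):

```python
from urllib.parse import urldefrag, urljoin

def normalize(base, link):
    """Resolve a link against its page and strip the fragment,
    so http://a/b#x and http://a/b count as the same URL."""
    absolute = urljoin(base, link)
    url, _fragment = urldefrag(absolute)
    return url

seen = set()

def should_crawl(base, link):
    """Return True only the first time a normalized URL appears."""
    url = normalize(base, link)
    if url in seen:
        return False
    seen.add(url)
    return True

print(should_crawl('http://example.com/', '/page#top'))   # first visit: True
print(should_crawl('http://example.com/', '/page#body'))  # same page: False
```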
2.Need to know:
In the [Crawling with Scrapy] section, we need to install the scrapy package. (At the time of writing, this package does not support Python 3.x or Python 2.6; it only works with Python 2.7.)
My attempt:
$ sudo pip install scrapy
Could not find function xmlCheckVersion in library libxml2. Is libxml2 installed?
Perhaps try: xcode-select --install
This means libxml2 is missing; as suggested, run from the command line:
$ xcode-select --install
This brings up the Xcode command line tools installer, which includes libxml2. After the install finished, I ran sudo pip install scrapy again and got a new error:
>>> from six.moves import xmlrpc_client as xmlrpclib
ImportError: cannot import name xmlrpc_client
Searching Stack Overflow for the cause:
- six.moves is a virtual namespace. It provides access to packages that were renamed between Python 2 and 3. As such, you shouldn't be installing anything.
- By importing from six.moves.xmlrpc_client the developer doesn't have to handle the case where it is located at xmlrpclib in Python 2, and at xmlrpc.client in Python 3. Note that these are part of the standard library.
- The mapping was added to six version 1.5.0; make sure you have that version or newer.
- Mac comes with six version 1.4.1 pre-installed in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python and this will interfere with any version you install in site-packages (which is listed last in the sys.path).
- The best work-around is to use a virtualenv and install your own version of six into that, together with whatever else you need for this project. Create a new virtualenv for new projects.
- If you absolutely have to install this at the system level, then for this specific project you'll have to remove the /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python path:
>>> import sys
>>> sys.path.remove('/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python')
- This will remove various OS X-provided packages from your path for just that run of Python; Apple installs these for their own needs.
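The rename that six.moves papers over can be seen directly without six; the same module just lives under two names:

```python
# What six.moves abstracts away: this stdlib module was renamed
# between Python 2 (xmlrpclib) and Python 3 (xmlrpc.client)
try:
    import xmlrpc.client as xmlrpc_client   # Python 3 name
except ImportError:
    import xmlrpclib as xmlrpc_client       # Python 2 name

print(xmlrpc_client.__name__)
```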
The six bundled with the Mac is too old: scrapy requires six 1.5.0 or newer. The recommended fix is to use a Python virtual environment; if you really need to change things at the system level, six has to be reinstalled.
So I first tried one of the suggested fixes:
$ sudo rm -rf /Library/Python/2.7/site-packages/six*
$ sudo rm -rf /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/six*
$ sudo pip install six
Unfortunately, the sudo rm -rf commands failed with an "Operation not permitted" error.
Digging further into the cause:
- This is because OS X El Capitan ships with six 1.4.1 installed already and when it attempts to uninstall it (because scrapy depends on six >= 1.5) it doesn't have permission to do so because System Integrity Protection doesn't allow even root to modify those directories.
- Ideally, pip should just skip uninstalling those items since they aren't installed to site-packages they are installed to a special Apple directory. However, even if pip skips uninstalling those items and installs six into site-packages we'll hit another bug where Apple puts their pre-installed stuff earlier in the sys.path than site-packages. I've talked to Apple about this and I'm not sure if they're going to do anything about it or not.
My Mac OS X version is 10.11.4. Since 10.11, the new SIP (System Integrity Protection) mechanism prevents even the root user from modifying or deleting anything under /System (it can only be done from Recovery mode).
于是捉撮,我采用另外一種方法繼續(xù)嘗試:
$ sudo pip uninstall six
$ easy_install six
This likewise fails with "Operation not permitted" (the method should work on versions before 10.11).
I then tried to solve it with a Python virtual environment, but I couldn't get it to work.
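For reference, the virtualenv route looks roughly like this (the env name scrapy-env is arbitrary; in the Python 2.7 era you needed the separate virtualenv tool, while modern Python 3 ships the equivalent venv module in the stdlib):

```shell
# Create an isolated environment so the system-wide six 1.4.1 is never
# on sys.path, then install everything the project needs into it
python3 -m venv scrapy-env        # Python 2.7 era: virtualenv scrapy-env
. scrapy-env/bin/activate
pip install six scrapy            # installs into the env; no sudo required
python -c "import six; print(six.__version__)"
```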
I also tried installing Python 2.7.11 from the official python.org site instead of using the 2.7.10 that ships with the Mac (some people reported that a self-installed Python 2.7 fixes the problem). After half a day of fiddling it still ended in failure, and I nearly broke pip's ability to install packages. The rescue was:
$ brew link python
$ brew unlink python
到最后骑素,本來想著要放棄的,Stackoverflow 上的另一個(gè)辦法讓事情有了轉(zhuǎn)機(jī):
This is a known issue on Mac OSX for Scrapy. You can refer to this link.
Basically the issue is with the PYTHONPATH in your system. To solve the issue, change the current PYTHONPATH to point to the newer or non-Mac OS X version of Python. Before running Scrapy, try:
export PYTHONPATH=/Library/Python/2.7/site-packages:$PYTHONPATH
If that worked you can change the .bashrc file permanently:
$ echo "export PYTHONPATH=/Library/Python/2.7/site-packages:$PYTHONPATH" >> ~/.bashrc
If none of this works, take a look at the link above.
Now start python from the command line and type:
>>> import scrapy
No error is raised, so scrapy can be imported.
Trying the command from the book:
$ scrapy startproject wikiSpider
The output:
New Scrapy project 'wikiSpider' created in:
/Users/randolph/PycharmProjects/Scraping/wikiSpider
You can start your first spider with:
cd wikiSpider
scrapy genspider example example.com
Success! scrapy is ready to go!
3.Correct errors in printing:
- None yet
4.Still have Questions:
- None yet