"Web Scraping with Python", Chapter 3

Chapter 3: Starting to Crawl

  • urlparse module

1. Key:

Think First.

What data am I trying to gather? Can this be accomplished by scraping just a few predefined websites (almost always the easier option), or does my crawler need to be able to discover new websites I might not know about?

When my crawler reaches a particular website, will it immediately follow the next outbound link to a new website, or will it stick around for a while and drill down into the current website?

Are there any conditions under which I would not want to scrape a particular site? Am I interested in non-English content?

How am I protecting myself against legal action if my web crawler catches the attention of a webmaster on one of the sites it runs across?


urlparse module

The urlparse module splits a URL into six components and returns them as a tuple; the components can also be reassembled back into a URL. Its main functions are urljoin, urlsplit, urlunsplit, and urlparse.

urlparse function

>>> from urlparse import urlparse
>>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
>>> o
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', params='', query='', fragment='')
>>> o.scheme
'http'
>>> o.port
80
>>> o.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'

It parses the URL into six components: scheme, netloc, path, parameters, query, and fragment.
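In Python 3 this module was renamed to urllib.parse, but the same six-part split and the companion functions still apply. A small sketch of the functions mentioned above, using the same example URL:

```python
# Python 3 equivalent of the urlparse module: urllib.parse.
from urllib.parse import urlparse, urljoin, urlsplit, urlunsplit

o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
print(o.scheme, o.netloc, o.path)  # the first three of the six components

# urljoin resolves a relative link against a base URL, which is handy
# when a crawler follows links scraped out of a page.
print(urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html'))
# → http://www.cwi.nl/%7Eguido/FAQ.html

# urlsplit/urlunsplit round-trip a URL through a 5-tuple (no params field).
parts = urlsplit('http://example.com/path?q=1#top')
print(urlunsplit(parts))
# → http://example.com/path?q=1#top
```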


Scrapy

Scrapy uses Item objects to determine which pieces of information it should save from the pages it visits. Scrapy can save this information in a variety of formats, such as CSV, JSON, or XML files, using the following commands:

$ scrapy crawl article -o articles.csv -t csv
$ scrapy crawl article -o articles.json -t json
$ scrapy crawl article -o articles.xml -t xml

Of course, we can also define our own Item objects and write the results to whatever file or database we need, simply by adding the corresponding code to the spider's parse method.
Scrapy is a powerful tool for web-crawling problems. It automatically collects all URLs and checks them against the rules you specify, ensures every URL is unique, normalizes URLs as needed, and recurses into deeper pages.
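The URL bookkeeping described above can be sketched with the standard library alone. This is only an illustration of what Scrapy handles for you, not Scrapy's API; `normalize` and `should_visit` are hypothetical helper names:

```python
# Minimal sketch of crawler bookkeeping: normalize each discovered link
# and keep a "seen" set so every page is visited at most once.
from urllib.parse import urljoin, urldefrag

def normalize(base_url, link):
    """Resolve a (possibly relative) link and strip its #fragment."""
    absolute = urljoin(base_url, link)
    url, _fragment = urldefrag(absolute)
    return url

seen = set()

def should_visit(base_url, link):
    """Return True only the first time a normalized URL is encountered."""
    url = normalize(base_url, link)
    if url in seen:
        return False
    seen.add(url)
    return True

print(should_visit('http://example.com/a/', 'b.html'))        # first visit
print(should_visit('http://example.com/a/', 'b.html#intro'))  # same page after normalization
```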


2. Need to know:

In the [Crawling with Scrapy] section:

We need to install the scrapy package. (At the time of writing, this package did not support Python 3.x or Python 2.6; it could only be used with Python 2.7.)

My attempt:

$ sudo pip install scrapy

    Could not find function xmlCheckVersion in library libxml2. Is libxml2 installed?
    Perhaps try: xcode-select --install

The message means libxml2 is missing, so from the command line I ran:

$ xcode-select --install

This brings up the installer for the Xcode command line tools, which include libxml2. With that installed, running sudo pip install scrapy again fails with a different error:

>>> from six.moves import xmlrpc_client as xmlrpclib
ImportError: cannot import name xmlrpc_client

Searching Stack Overflow for the cause:

  • six.moves is a virtual namespace. It provides access to packages that were renamed between Python 2 and 3. As such, you shouldn't be installing anything.
  • By importing from six.moves.xmlrpc_client the developer doesn't have to handle the case where it is located at xmlrpclib in Python 2, and at xmlrpc.client in Python 3. Note that these are part of the standard library.
  • The mapping was added to six version 1.5.0; make sure you have that version or newer.
  • Mac comes with six version 1.4.1 pre-installed in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python and this will interfere with any version you install in site-packages (which is listed last in the sys.path).
    The best work-around is to use a virtualenv and install your own version of six into that, together with whatever else you need for this project. Create a new virtualenv for new projects.
  • If you absolutely have to install this at the system level, then for this specific project you'll have to remove the /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python path:
>>> import sys
>>> sys.path.remove('/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python')
  • This will remove various OS X-provided packages from your path for just that run of Python; Apple installs these for their own needs.

In short: the six that ships with the Mac is too old, and Scrapy needs six 1.5.0 or newer. The recommended approach is a Python virtual environment; if you really have to make the change at the system level, six must be reinstalled.
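The virtualenv route recommended there sidesteps the whole problem, since nothing under /System needs to change. A hedged sketch, assuming a modern Python where `python3 -m venv` replaces the Python-2-era `virtualenv` tool; the environment name `scrapy-env` is illustrative:

```shell
# A virtualenv gives the project its own site-packages directory, which
# Python searches before the system paths, so Apple's bundled six 1.4.1
# can no longer shadow a newer copy. "scrapy-env" is an illustrative name;
# with Python 2.7 the equivalent command is `virtualenv scrapy-env`.
python3 -m venv scrapy-env
./scrapy-env/bin/python -c 'import sys; print(sys.prefix)'
# Packages installed with the env's own pip land inside scrapy-env, e.g.:
#   ./scrapy-env/bin/pip install six scrapy
```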
So I first tried one of the suggested fixes:

$ sudo rm -rf /Library/Python/2.7/site-packages/six*
$ sudo rm -rf /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/six*
$ sudo pip install six
$ sudo pip install six

Unfortunately, the sudo rm -rf commands failed with "Operation not permitted".
Digging further:

  • This is because OS X El Capitan ships with six 1.4.1 installed already, and when pip attempts to uninstall it (because scrapy depends on six >= 1.5) it doesn't have permission to do so, because System Integrity Protection doesn't allow even root to modify those directories.
  • Ideally, pip should just skip uninstalling those items, since they aren't installed to site-packages; they are installed to a special Apple directory. However, even if pip skips uninstalling those items and installs six into site-packages, we'll hit another bug where Apple puts their pre-installed stuff earlier in sys.path than site-packages. I've talked to Apple about this and I'm not sure if they're going to do anything about it or not.

My Mac runs OS X 10.11.4. Since version 10.11, the new System Integrity Protection (SIP) mechanism prevents even the root user from modifying or deleting anything under /System (it can only be done from Recovery mode).

So I kept trying with another approach:

$ sudo pip uninstall six
$ easy_install six

This too returned "Operation not permitted" (the method should work on versions before 10.11).

I then tried to solve it with a Python virtual environment, but my skills weren't up to it and I failed.
I also tried installing Python 2.7.11 from python.org instead of using the 2.7.10 bundled with the Mac (some people reported that a self-installed Python 2.7 solves the problem). After half a day of fiddling it still ended in failure, and I nearly left pip unable to install packages at all. The way to undo the damage was:

$ brew unlink python
$ brew link python

Just as I was about to give up, another answer on Stack Overflow turned things around:

This is a known issue on Mac OS X for Scrapy. You can refer to this link.
Basically the issue is with the PYTHONPATH in your system. To solve the issue, change the current PYTHONPATH to point at the newer, non-OS-X-provided version of Python. Before running Scrapy, try:

export PYTHONPATH=/Library/Python/2.7/site-packages:$PYTHONPATH

If that worked you can change the .bashrc file permanently:

$ echo "export PYTHONPATH=/Library/Python/2.7/site-packages:$PYTHONPATH" >> ~/.bashrc

If none of this works, take a look at the link above.
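The PYTHONPATH fix works because Python resolves an import by scanning sys.path in order and taking the first match, so putting /Library/Python/2.7/site-packages ahead of Apple's Extras directory makes the newer six win. A small stand-alone simulation of that first-match rule (the directory names and `first_match` helper are illustrative, not how the interpreter is implemented internally):

```python
# Simulate "first match on the search path wins": whichever directory
# appears first on sys.path supplies the module, shadowing later copies.
import os
import tempfile

def first_match(filename, search_path):
    """Return the first directory on search_path that contains filename."""
    for directory in search_path:
        if os.path.isfile(os.path.join(directory, filename)):
            return directory
    return None

root = tempfile.mkdtemp()
apple_dir = os.path.join(root, 'Extras')        # stands in for Apple's six 1.4.1
site_dir = os.path.join(root, 'site-packages')  # stands in for the pip-installed six
for d in (apple_dir, site_dir):
    os.makedirs(d)
    open(os.path.join(d, 'six.py'), 'w').close()

print(first_match('six.py', [apple_dir, site_dir]))  # Apple's copy shadows the new one
print(first_match('six.py', [site_dir, apple_dir]))  # the PYTHONPATH fix: site-packages first
```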

At this point, start python from the command line and type:

>>> import scrapy

No error is raised, so scrapy can now be imported.

Trying the command from the book:

$ scrapy startproject wikiSpider

The output is:

New Scrapy project 'wikiSpider' created in:
    /Users/randolph/PycharmProjects/Scraping/wikiSpider
You can start your first spider with:
    cd wikiSpider
    scrapy genspider example example.com

Success! Scrapy is ready to go!


3. Correct errors in printing:

  • None yet

4. Still have questions:

  • None yet