<strong>Let's start with an example:</strong>
<code>TABLES['urls'] = (
    "CREATE TABLE `urls` ("
    "  `index` int(11) NOT NULL AUTO_INCREMENT,"  # position in the queue
    "  `url` varchar(512) NOT NULL,"
    "  `md5` varchar(16) NOT NULL,"
    "  `status` varchar(11) NOT NULL DEFAULT 'new',"  # one of: new, downloading, finish
    "  `depth` int(11) NOT NULL,"
    "  `queue_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,"
    "  `done_time` timestamp NOT NULL DEFAULT 0 ON UPDATE CURRENT_TIMESTAMP,"
    "  PRIMARY KEY (`index`),"
    "  UNIQUE KEY `md5` (`md5`)"
    ") ENGINE=InnoDB")
</code>
<ul>
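<li><strong>Let's go through the columns one by one:</strong></li>
<li>index is an auto-increment column;</li>
<li>md5, the digest of the url, is declared UNIQUE, which guarantees that the same url is never stored twice. There is no need for a separate filter to deduplicate; the database does it directly. MySQL also builds an index for the UNIQUE key automatically, so lookups by md5 use that index; without it, every lookup would scan the whole table;</li>
<li>we need a status column to mark whether a url is new (new), already crawled (done), or currently being crawled. Otherwise, in a multi-process crawler, several processes could pick up the same url at the same time, which is not what we want;</li>
<li>depth records how many levels deep the crawler was when it found the url;</li>
<li>queue_time is the time the url was added to the queue;</li>
<li>done_time is the time crawling of the url finished.</li></ul>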
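The deduplication described above can be sketched as follows. This is a minimal illustration, not the crawler's actual code: it uses Python's built-in sqlite3 as a stand-in for MySQL so that it runs without a server, and the names url_digest, make_queue, and enqueue are hypothetical. Truncating the digest to 16 hex characters is an assumption made to fit the varchar(16) column. In MySQL the equivalent statement would be INSERT IGNORE.

```python
import hashlib
import sqlite3


def url_digest(url):
    # Assumption: keep 16 of the 32 hex characters so the value fits varchar(16).
    return hashlib.md5(url.encode("utf-8")).hexdigest()[8:24]


def make_queue():
    # In-memory SQLite table with a UNIQUE md5 column, mirroring the urls table.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE urls ("
        " idx INTEGER PRIMARY KEY AUTOINCREMENT,"
        " url TEXT NOT NULL,"
        " md5 TEXT NOT NULL UNIQUE,"
        " status TEXT NOT NULL DEFAULT 'new',"
        " depth INTEGER NOT NULL)"
    )
    return conn


def enqueue(conn, url, depth):
    """Insert a url; return True if it was new, False if the UNIQUE key rejected it."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO urls (url, md5, depth) VALUES (?, ?, ?)",
        (url, url_digest(url), depth),
    )
    conn.commit()
    # rowcount is 1 when a row was inserted and 0 when the insert was ignored.
    return cur.rowcount == 1
```

Calling enqueue twice with the same url stores it only once: the second call returns False and the table is unchanged, so the crawler needs no separate "seen" set.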
<em>It is best to declare every column NOT NULL to avoid errors.</em>
Our overall database workflow can be summarized roughly as:
<em><strong>read -> update the status, locking the urls claimed by this process -> cursor.commit -> unlock</strong></em>
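The read, update, commit, unlock sequence might look like the sketch below. It again uses sqlite3 as a runnable stand-in, where BEGIN IMMEDIATE takes the write lock before the read; with MySQL and InnoDB the usual counterpart is SELECT ... FOR UPDATE inside a transaction. The function names and the status values 'downloading' and 'done' are assumptions for illustration. Note that in the Python DB-API, commit() is a method of the connection object.

```python
import sqlite3


def open_queue(urls):
    # isolation_level=None: we issue BEGIN/COMMIT ourselves instead of implicitly.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE urls ("
        " idx INTEGER PRIMARY KEY AUTOINCREMENT,"
        " url TEXT NOT NULL UNIQUE,"
        " status TEXT NOT NULL DEFAULT 'new')"
    )
    conn.executemany("INSERT INTO urls (url) VALUES (?)", [(u,) for u in urls])
    return conn


def claim_next(conn):
    """Read one 'new' url, mark it 'downloading', commit; return the url or None."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT idx, url FROM urls WHERE status = 'new' ORDER BY idx LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE urls SET status = 'downloading' WHERE idx = ?", (row[0],)
            )
        conn.execute("COMMIT")  # committing releases the lock
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return None if row is None else row[1]


def mark_done(conn, url):
    # Called after the page has been fetched and processed.
    conn.execute("UPDATE urls SET status = 'done' WHERE url = ?", (url,))
```

Because the status change is committed before the lock is released, a second process that reads afterwards no longer sees that url as new and moves on to the next one.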