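Introduction
This article summarizes how I set up a Nutch platform, and offers some guidance for anyone getting started with Nutch.
Environment
·Operating system: Ubuntu 18.04 LTS
·Software versions: Nutch 2.2.1, Solr 4.10.3
Platform architecture
As the article title suggests, the platform can be divided into three parts: Nutch, the database, and the front end.
Nutch: the part to the left of Index in the figure. It fetches and parses web pages and calls the database to store them.
Database: stores the fetched page data. The 1.x releases are built on the Hadoop architecture and use HDFS as the underlying storage, while 2.x uses Apache Gora, which lets Nutch work with stores such as HBase, Accumulo, Cassandra, MySQL, DataFileAvroStore, and AvroStore.
Front end: Tomcat is a free, open-source web application server, and Solr is a search application.
Platform deployment
We start from a fresh Ubuntu 18.04 server. First create a folder to hold the software the platform needs; choose its location to suit your own setup. If you are not fully confident you can keep the paths in the rest of this tutorial consistent, follow the tutorial exactly as written.
lemon@ubuntu:~$ mkdir ~/download/ # create a folder for downloaded files
1. Install the JDK
step1. Download the Oracle JDK
step2. Extract it
step3. Add it to the environment variables
The detailed steps are as follows: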
lemon@ubuntu:~$ cd ~/download/
lemon@ubuntu:~/download$ wget http://download.oracle.com/otn-pub/java/jdk/8u191-b12/2787e4a523244c269598db4e85c51e0c/jdk-8u191-linux-x64.tar.gz
lemon@ubuntu:~/download$ tar vxf jdk-8u191-linux-x64.tar.gz
lemon@ubuntu:~/download$ ls # list the files in the current directory
jdk1.8.0_191 jdk-8u191-linux-x64.tar.gz
lemon@ubuntu:~/download$ sudo mv jdk1.8.0_191/ /usr/local/jdk1.8/ # move the jdk1.8.0_191 folder to /usr/local/ and rename it jdk1.8
lemon@ubuntu:~/download$ sudo vim /etc/profile # edit the environment variables
Append the following to the end of the file:
export JAVA_HOME=/usr/local/jdk1.8
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=.:${JAVA_HOME}/bin:$PATH
Save the file, then reload the environment variables so they take effect:
lemon@ubuntu:~/download$ source /etc/profile # reload the environment variables
lemon@ubuntu:~$ java -version # if the following is printed, the JDK is installed correctly
java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
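If java -version does not print the expected output, a quick check of the JAVA_HOME value set in /etc/profile can help narrow things down. The following is only a minimal sketch: check_java_home is a hypothetical helper name, not part of the JDK, and it only confirms that the directory contains an executable bin/java, not which Java version it is.

```shell
# check_java_home is a hypothetical helper (not part of the JDK).
# It reports whether the given directory looks like a JDK install,
# i.e. whether it contains an executable bin/java.
check_java_home() {
  local home="$1"
  if [ -z "$home" ]; then
    echo "JAVA_HOME is empty"
    return 1
  fi
  if [ -x "$home/bin/java" ]; then
    echo "OK: $home/bin/java is executable"
    return 0
  fi
  echo "No executable java under $home/bin"
  return 1
}

# Usage (after `source /etc/profile`):
#   check_java_home "$JAVA_HOME"
```

With the paths used in this tutorial, the call should report /usr/local/jdk1.8/bin/java as executable.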
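2. Install MySQL
step1. Install and configure MySQL
step2. Create the database and table
Because I chose to install the LAMP services when installing Ubuntu, MySQL was already installed and only needed to be configured before use.
Check whether it is installed:
lemon@ubuntu:~$ mysql # if you see the message below, MySQL is already installed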
ERROR 1045 (28000): Access denied for user 'lemon'@'localhost' (using password: NO)
If it is not installed:
lemon@ubuntu:~$ sudo apt-get install mysql-server
lemon@ubuntu:~$ sudo apt install mysql-client
lemon@ubuntu:~$ sudo apt install libmysqlclient-dev
If it is already installed:
lemon@ubuntu:~$ sudo mysql_secure_installation
Both paths lead into the MySQL setup prompts; answer them as follows:
#1
VALIDATE PASSWORD PLUGIN can be used to test passwords...
Press y|Y for Yes, any other key for No: N (do not enable the weak-password check)
#2
Please set the password for root here...
New password: (set the root password)
Re-enter new password: (enter it again)
#3
By default, a MySQL installation has an anonymous user,
allowing anyone to log into MySQL without having to have
a user account created for them...
Remove anonymous users? (Press y|Y for Yes, any other key for No) : Y (disable anonymous users)
#4
Normally, root should only be allowed to connect from
'localhost'. This ensures that someone cannot guess at
the root password from the network...
Disallow root login remotely? (Press y|Y for Yes, any other key for No) : Y (do not allow remote root login)
#5
By default, MySQL comes with a database named 'test' that
anyone can access...
Remove test database and access to it? (Press y|Y for Yes, any other key for No) : N
#6
Reloading the privilege tables will ensure that all changes
made so far will take effect immediately.
Reload privilege tables now? (Press y|Y for Yes, any other key for No) : Y (reload the privilege tables immediately)
All done!
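Next, log in to MySQL:
# With recent MySQL versions you cannot log in with a password right after installation; log in through sudo and change the authentication method
lemon@ubuntu:~$ sudo mysql -uroot -p
Enter password: (leave empty)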
mysql>
mysql> UPDATE mysql.user SET authentication_string=PASSWORD('LEMON'), plugin='mysql_native_password' WHERE user='root';
mysql> FLUSH PRIVILEGES;
mysql> exit
lemon@ubuntu:~$ sudo service mysql restart
lemon@ubuntu:~$ mysql -u root -p
Enter password: (the password set in the previous step, i.e. the value inside PASSWORD(...))
mysql> CREATE DATABASE nutch;
mysql> USE nutch
mysql> CREATE TABLE `webpage` (
`id` varchar(767) NOT NULL,
`headers` blob,
`text` mediumtext DEFAULT NULL,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(767) DEFAULT NULL,
`content` longblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(767) DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
`batchId` varchar(767) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
ROW_FORMAT=COMPRESSED
DEFAULT CHARSET=utf8mb4;
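mysql> exit
*By default, recent MySQL versions do not allow remote logins. To allow remote access, make the following changes:
lemon@ubuntu:~$ sudo vim /etc/mysql/mysql.conf.d/mysqld.cnf
# Comment out bind-address = 127.0.0.1, then restart the MySQL service
lemon@ubuntu:~$ sudo service mysql restart
After that, you can access the database from other computers with tools such as Navicat.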
3. Install Nutch
step1. Download Nutch
step2. Extract it
step3. Edit ivy.xml, gora.properties, and nutch-site.xml
step4. Build Nutch
step5. Configure the crawl
The detailed steps are as follows:
lemon@ubuntu:~$ cd ~/download/
lemon@ubuntu:~/download$ wget http://archive.apache.org/dist/nutch/2.2.1/apache-nutch-2.2.1-src.zip
lemon@ubuntu:~/download$ unzip apache-nutch-2.2.1-src.zip
# If unzip is reported as not installed, install it first: sudo apt install unzip
lemon@ubuntu:~/download$ mkdir ~/software
lemon@ubuntu:~/download$ mv apache-nutch-2.2.1 ~/software/
Edit ivy.xml (configures which database the storage layer uses):
lemon@ubuntu:~/software$ vim apache-nutch-2.2.1/ivy/ivy.xml
Uncomment the following two lines:
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />
Change
<dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/>
to
<dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>
Edit gora.properties (the database connection parameters):
lemon@ubuntu:~/software$ vim apache-nutch-2.2.1/conf/gora.properties
Comment out the default database connection settings and add the following:
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=xxxx (MySQL username)
gora.sqlstore.jdbc.password=xxxx (MySQL password)
If the database is not on this machine, replace localhost with the database host's address.
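Since a stray or still-commented line in gora.properties is an easy mistake to make, it can be useful to print the values that are actually in effect. The following is a minimal sketch: get_prop is a hypothetical helper, not something shipped with Nutch or Gora, and it handles only simple key=value lines (no line continuations or escapes).

```shell
# get_prop is a hypothetical helper: print the value of KEY from a
# Java-style .properties FILE. Lines starting with # or ! are treated
# as comments; if the key appears more than once, the last one wins.
get_prop() {
  local file="$1" key="$2"
  awk -v k="$key" '
    /^[ \t]*[#!]/ { next }                 # skip comment lines
    {
      pos = index($0, "=")
      if (pos == 0) next                   # not a key=value line
      name = substr($0, 1, pos - 1)
      gsub(/^[ \t]+|[ \t]+$/, "", name)    # trim the key
      if (name == k) value = substr($0, pos + 1)
    }
    END { if (value != "") print value }
  ' "$file"
}

# Usage:
#   get_prop ~/software/apache-nutch-2.2.1/conf/gora.properties gora.sqlstore.jdbc.url
```

If the call prints nothing, the key is either missing or only present in a commented-out line.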
Edit nutch-site.xml (configures Nutch):
lemon@ubuntu:~/software$ vim apache-nutch-2.2.1/conf/nutch-site.xml
Add the following:
<configuration>
<property>
<name>http.agent.name</name>
<value>LemonSpider</value>
</property>
<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field. This allows selecting non-English language as default one to retrieve. It is a useful setting for search engines build for certain national group.
</description>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information is available</description>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ….
</description>
</property>
<property>
<name>generate.batch.id</name>
<value>*</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-jsoup</value>
</property>
<property>
<name>http.robots.agents</name>
<value>LemonSpider,*</value>
</property>
</configuration>
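Before building, it may be worth confirming that the key properties really made it into nutch-site.xml. The sketch below is only an illustration: has_property and check_nutch_site are hypothetical helper names, the list of checked properties is my own selection from the settings above, and the check is a plain text match rather than an XML parse, so a property inside an XML comment would still count as present.

```shell
# has_property is a hypothetical helper: succeed if FILE contains a
# <name>NAME</name> element. Plain text match, not an XML parse.
has_property() {
  local file="$1" name="$2"
  grep -qF "<name>${name}</name>" "$file"
}

# check_nutch_site (also hypothetical) lists which of the properties
# used in this tutorial are missing from the given nutch-site.xml and
# returns non-zero if any are missing.
check_nutch_site() {
  local file="$1" missing=0 name
  for name in http.agent.name storage.data.store.class plugin.includes; do
    if ! has_property "$file" "$name"; then
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   check_nutch_site ~/software/apache-nutch-2.2.1/conf/nutch-site.xml
```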
Build Nutch
lemon@ubuntu:~$ cd ~/software/apache-nutch-2.2.1
lemon@ubuntu:~/software/apache-nutch-2.2.1$ ant
# The build takes a long time; stay connected to the network while it runs
Crawl configuration
lemon@ubuntu:~$ cd ~/software/apache-nutch-2.2.1/runtime/local
lemon@ubuntu:~/software/apache-nutch-2.2.1/runtime/local$ mkdir -p urls
lemon@ubuntu:~/software/apache-nutch-2.2.1/runtime/local$ echo 'http://www.apache.org/' > urls/seed.txt # set the site to crawl
lemon@ubuntu:~/software/apache-nutch-2.2.1/runtime/local$ bin/nutch crawl urls -depth 3 -topN 5 # run the crawl
-depth is the crawl depth and -topN limits each round to the top N pages; see the official documentation for the full list of parameters.
If you get an error, check carefully that you followed the steps above exactly, and that you did not delete too many or too few characters while editing the files.
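One thing that is easy to rule out early is a malformed seed file. The following is a minimal sketch, not part of Nutch: check_seeds is a hypothetical helper, and treating lines that start with # as comments is an assumption of this sketch.

```shell
# check_seeds is a hypothetical helper: print every line of a seed file
# that is neither blank, nor a # comment, nor an http:// or https:// URL,
# and return non-zero if any such line exists.
check_seeds() {
  local file="$1" bad
  bad=$(grep -vE '^[[:space:]]*(#|$)' "$file" | grep -vE '^https?://' || true)
  if [ -n "$bad" ]; then
    echo "$bad"
    return 1
  fi
  return 0
}

# Usage (from runtime/local):
#   check_seeds urls/seed.txt
```

Any line it prints is a seed without an http:// or https:// prefix, which is worth correcting before rerunning the crawl.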
Output from a successful run:
-finishing thread FetcherThread3, activeThreads=5
-finishing thread FetcherThread9, activeThreads=6
-finishing thread FetcherThread2, activeThreads=7
-finishing thread FetcherThread1, activeThreads=8
-finishing thread FetcherThread8, activeThreads=9
-finishing thread FetcherThread6, activeThreads=4
-finishing thread FetcherThread7, activeThreads=3
-finishing thread FetcherThread0, activeThreads=2
-finishing thread FetcherThread4, activeThreads=1
-finishing thread FetcherThread5, activeThreads=0
0/0 spinwaiting/active, 11 pages, 0 errors, 0.4 0 pages/s, 78 36 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: parsing all
Parsing http://accumulo.apache.org/
Parsing http://activemq.apache.org/
Parsing http://airavata.apache.org/
Parsing http://allura.apache.org/
Parsing http://ambari.apache.org/
Parsing http://www.apache.org/
Parsing http://www.apache.org/foundation/sponsorship.html
Parsing http://www.apache.org/foundation/thanks.html
Parsing http://www.apache.org/licenses/
Parsing http://www.apache.org/licenses/LICENSE-2.0
Parsing http://www.apache.org/security/
lemon@ubuntu:~/software/apache-nutch-2.2.1/runtime/local$
4. Install Tomcat
step1. Download Tomcat
step2. Extract it
step3. Start it
lemon@ubuntu:~$ cd download/
lemon@ubuntu:~/download$ wget http://archive.apache.org/dist/tomcat/tomcat-8/v8.0.33/bin/apache-tomcat-8.0.33.tar.gz
lemon@ubuntu:~/download$ tar vxf apache-tomcat-8.0.33.tar.gz
lemon@ubuntu:~/download$ mv apache-tomcat-8.0.33 ~/software/
lemon@ubuntu:~/download$ cd ~/software/apache-tomcat-8.0.33/
lemon@ubuntu:~/software/apache-tomcat-8.0.33$ bin/startup.sh
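At this point, open localhost:8080 or 127.0.0.1:8080 in a browser on the server. Computers on the same LAN can reach it at the server's IP on port 8080; for example, this server's internal IP is 114.212.167.106, so computers on the same LAN can visit 114.212.167.106:8080.
If you see the following page, Tomcat is installed:
5. Install Solr and integrate it with Tomcat
step1. Download Solr
step2. Extract it
step3. Create a solr folder under Tomcat's webapps directory
step4. Copy solr.war from solr-4.10.3/example/webapps/ into the solr folder created in step3 and extract it there
step5. Copy solr-4.10.3/example/solr into the Tomcat directory (this provides the collection1 folder), then copy schema.xml from apache-nutch-2.2.1/conf/ into collection1/conf/
step6. Edit webapps/solr/WEB-INF/web.xml under the Tomcat directory
step7. Copy the jars from solr-4.10.3/example/lib/ext/ into tomcat/webapps/solr/WEB-INF/lib/
step8. Create a classes folder under tomcat/webapps/solr/WEB-INF/ and copy log4j.properties from solr-4.10.3/example/resources into it
step9. Restart Tomcat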
lemon@ubuntu:~$ cd ~/download/
lemon@ubuntu:~/download$ wget http://archive.apache.org/dist/lucene/solr/4.10.3/solr-4.10.3.zip
lemon@ubuntu:~/download$ unzip solr-4.10.3.zip
lemon@ubuntu:~/download$ mv solr-4.10.3 ../software/
lemon@ubuntu:~/download$ cd ../software/
lemon@ubuntu:~/software$ cd apache-tomcat-8.0.33/webapps/
lemon@ubuntu:~/software/apache-tomcat-8.0.33/webapps$ mkdir solr
lemon@ubuntu:~/software/apache-tomcat-8.0.33/webapps$ cp ~/software/solr-4.10.3/example/webapps/solr.war ./solr/
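lemon@ubuntu:~/software/apache-tomcat-8.0.33/webapps$ cd solr
lemon@ubuntu:~/software/apache-tomcat-8.0.33/webapps/solr$ jar vxf solr.war
lemon@ubuntu:~/software/apache-tomcat-8.0.33/webapps/solr$ cd ..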
lemon@ubuntu:~/software/apache-tomcat-8.0.33/webapps$ cp -r ~/software/solr-4.10.3/example/solr ../
lemon@ubuntu:~/software/apache-tomcat-8.0.33/webapps$ cp ~/software/apache-nutch-2.2.1/conf/schema.xml ../solr/collection1/conf/
lemon@ubuntu:~/software/apache-tomcat-8.0.33/webapps$ vim solr/WEB-INF/web.xml
Uncomment the following and set the solr/home value:
<env-entry>
<env-entry-name>solr/home</env-entry-name>
<env-entry-value>/home/lemon/software/apache-tomcat-8.0.33/solr</env-entry-value>
<env-entry-type>java.lang.String</env-entry-type>
</env-entry>
lemon@ubuntu:~/software/apache-tomcat-8.0.33$ vim ~/software/apache-tomcat-8.0.33/solr/collection1/conf/solrconfig.xml
<!-- Data Directory
Used to specify an alternate directory to hold all index data
other than the default ./data under the Solr home. If
replication is in use, this should match the replication
configuration.
-->
<dataDir>${solr.data.dir:/home/lemon/software/apache-tomcat-8.0.33/solr/collection1/data}</dataDir>
lemon@ubuntu:~/software/apache-tomcat-8.0.33$ cp ~/software/solr-4.10.3/example/lib/ext/* ~/software/apache-tomcat-8.0.33/webapps/solr/WEB-INF/lib/
lemon@ubuntu:~/software/apache-tomcat-8.0.33$ mkdir ~/software/apache-tomcat-8.0.33/webapps/solr/WEB-INF/classes
lemon@ubuntu:~/software/apache-tomcat-8.0.33$ cp ~/software/solr-4.10.3/example/resources/log4j.properties ~/software/apache-tomcat-8.0.33/webapps/solr/WEB-INF/classes
Finally, restart Tomcat.
lemon@ubuntu:~/software/apache-tomcat-8.0.33$ bin/shutdown.sh
lemon@ubuntu:~/software/apache-tomcat-8.0.33$ bin/startup.sh
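If Solr does not come up after the restart, a missing file from one of the copy steps is a likely cause. The sketch below checks for the files placed in the steps above; check_solr_layout is a hypothetical helper, and the list of paths is simply my reading of this tutorial's layout, not an official Solr requirement list.

```shell
# check_solr_layout is a hypothetical helper: verify that the files this
# tutorial places under the Tomcat directory are present. It prints one
# "missing: ..." line per absent file and returns non-zero if any are absent.
check_solr_layout() {
  local tomcat="$1" missing=0 path
  for path in \
    webapps/solr/WEB-INF/web.xml \
    webapps/solr/WEB-INF/classes/log4j.properties \
    solr/collection1/conf/schema.xml \
    solr/collection1/conf/solrconfig.xml; do
    if [ ! -f "$tomcat/$path" ]; then
      echo "missing: $path"
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   check_solr_layout ~/software/apache-tomcat-8.0.33
```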
6. Use Solr to index the crawled data
lemon@ubuntu:~/software/apache-nutch-2.2.1$ cd ~/software/apache-nutch-2.2.1/runtime/local/
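lemon@ubuntu:~/software/apache-nutch-2.2.1/runtime/local$ bin/nutch crawl -solr http://127.0.0.1:8080/solr/ -reindex
Search interface:
Conclusion
I ran into plenty of pitfalls during this setup. This article is the complete process with those pitfalls avoided, so following it exactly should work without problems. Because this is a demonstration, I did not use the more complex HBase for storage, though I plan to try it next.
If you run into errors, check that the versions match, the paths are correct, and the edits to the configuration files are accurate.
I have collected the errors I encountered during deployment into a separate troubleshooting write-up, which I will finish soon. I hope it will be helpful.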