<b>Introduction</b>
This document describes how to get Nutch 2.X to use HBase as a storage backend for Gora. It assumes a working knowledge of configuring Nutch 1.X, because configuration in 2.X is currently more complex. Take this into consideration before going any further. We therefore strongly advise that you work through the Nutch 1.X tutorial first.
<b>Obtaining Software and Configuration</b>
- Grab the latest distribution of Nutch 2.X from here. Do NOT build the source yet. From now on we will refer to the directory where the Nutch code resides as $NUTCH_HOME.
- Download and configure HBase 0.98.8-hadoop2. You can get it here. N.B. Each version of Gora is tied to a particular version of HBase, so we suggest you use this version if possible. If you decide to use another version of HBase, do not be surprised if the stack does not work. You should also obtain the HBase documentation, bearing in mind that the current documentation may not correspond to the HBase version recommended here.
- Specify the Gora backend in $NUTCH_HOME/conf/nutch-site.xml, along with all of the other configuration options suggested in the Nutch 1.X tutorial.
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>
- Ensure the gora-hbase dependency is available in $NUTCH_HOME/ivy/ivy.xml:
<dependency org="org.apache.gora" name="gora-hbase" rev="0.6.1" conf="*->default" />
In addition, add the missing hbase-common-0.98.8-hadoop2.jar transitive dependency. Its absence is a bug in gora-hbase 0.6.1, as described here, and it has been fixed in current Gora development.
<dependency org="org.apache.hbase"
name="hbase-common" rev="0.98.8-hadoop2" conf="*->default" />
Ensure that HBaseStore is set as the default datastore in $NUTCH_HOME/conf/gora.properties. Other documentation for HBaseStore can be found here.
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
- N.B. It is worth checking and setting all your usual configuration options in $NUTCH_HOME/conf/nutch-site.xml and the other configuration files before progressing.
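The configuration edits above can be sanity-checked before building. The following is only a rough sketch: the helper name check_contains is hypothetical, and it assumes $NUTCH_HOME is exported in your shell. It looks for the literal strings from the snippets above, so a match inside an XML comment would also pass.

```shell
#!/bin/sh
# Rough pre-build check (a sketch, not part of Nutch itself).
# check_contains FILE STRING: succeeds if FILE contains the literal STRING.
check_contains() {
    if grep -qF -- "$2" "$1" 2>/dev/null; then
        echo "ok: $1"
    else
        echo "MISSING in $1: $2"
        return 1
    fi
}

# Only run the checks when NUTCH_HOME is set (assumption: you exported it).
if [ -n "${NUTCH_HOME:-}" ]; then
    status=0
    store="org.apache.gora.hbase.store.HBaseStore"
    check_contains "$NUTCH_HOME/conf/nutch-site.xml" "$store" || status=1
    check_contains "$NUTCH_HOME/ivy/ivy.xml" 'name="gora-hbase"' || status=1
    check_contains "$NUTCH_HOME/ivy/ivy.xml" 'name="hbase-common"' || status=1
    check_contains "$NUTCH_HOME/conf/gora.properties" "gora.datastore.default=$store" || status=1
    if [ "$status" -eq 0 ]; then
        echo "all expected settings found"
    else
        echo "some settings are missing; fix them before building"
    fi
fi
```

Because the check is a plain text search, it cannot tell whether a dependency line in ivy.xml is still commented out, so open the file and confirm by eye if the build later fails to find HBase classes.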
Compile Nutch from $NUTCH_HOME via:
ant runtime
Make sure HBase is started and working properly, as per the HBase quick start tutorial.
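One way to start a standalone HBase and confirm it responds is sketched below. The HBASE_HOME variable and its default path are assumptions for illustration (point it at wherever you unpacked HBase), and hbase_status is a hypothetical helper name. The start-hbase.sh script and the hbase shell launcher ship in the bin directory of the HBase distribution.

```shell
#!/bin/sh
# Assumption: HBASE_HOME points at the unpacked hbase-0.98.8-hadoop2 directory.
HBASE_HOME="${HBASE_HOME:-$HOME/hbase-0.98.8-hadoop2}"

# hbase_status: ask the running cluster for its status via the HBase shell.
hbase_status() {
    if [ ! -x "$HBASE_HOME/bin/hbase" ]; then
        echo "hbase launcher not found under $HBASE_HOME" >&2
        return 1
    fi
    # Piping a command into the shell runs it and then exits.
    echo "status" | "$HBASE_HOME/bin/hbase" shell
}

if [ -x "$HBASE_HOME/bin/start-hbase.sh" ]; then
    "$HBASE_HOME/bin/start-hbase.sh"
    hbase_status
else
    echo "HBase not found under $HBASE_HOME; set HBASE_HOME and retry"
fi
```

If the status command reports a live server, HBase is ready for Nutch to write to.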
- Create a list of seed URLs, as you would in the Nutch 1.X tutorial.
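As a minimal sketch of the seed list step, the commands below create a directory holding a plain text file with one URL per line. The directory name urls, the file name seed.txt and the example URL are assumptions chosen for illustration, so substitute your own.

```shell
#!/bin/sh
# Assumption: seeds live in ./urls/seed.txt; any directory of text files works.
SEED_DIR="${SEED_DIR:-urls}"
mkdir -p "$SEED_DIR"

# One URL per line; replace this with the sites you want to crawl.
printf '%s\n' "http://nutch.apache.org/" > "$SEED_DIR/seed.txt"

# Show what will be injected.
cat "$SEED_DIR/seed.txt"
```

The directory, not the individual file, is what gets passed to the inject command later.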
<b>Invoke Nutch</b>
You should then be able to inject URLs into HBase. Go to $NUTCH_HOME/runtime/local/bin and run:
./nutch inject /someseedDir
./nutch readdb
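The two commands can also be wrapped in a small script that works from any directory. This is a sketch under stated assumptions: $NUTCH_HOME is set, ant runtime has already completed, and the seed directory is called urls (a hypothetical name). Run without arguments, readdb prints its usage, so the sketch passes -stats to get a summary of the web table instead.

```shell
#!/bin/sh
# Assumptions: NUTCH_HOME is set, the build has run, and seeds are in ./urls.
NUTCH_BIN="${NUTCH_HOME:-.}/runtime/local/bin/nutch"
SEED_DIR="${SEED_DIR:-urls}"

if [ -x "$NUTCH_BIN" ]; then
    # Load the seed URLs into the HBase-backed web table.
    "$NUTCH_BIN" inject "$SEED_DIR"
    # Print summary statistics for what is now stored.
    "$NUTCH_BIN" readdb -stats
else
    echo "nutch launcher not found at $NUTCH_BIN; run 'ant runtime' first"
fi
```

After a successful inject, the statistics should report at least as many entries as there are valid URLs in the seed directory.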
<b>What's Next</b>
You may want to check out the documentation for the Nutch Web Application and then the Nutch REST API, as these give a comprehensive overview of ongoing work on making Nutch 2.X easier to use.