環(huán)境配置
hive-2.3.2
hbase-1.4.2
phoenix-4.14.0-HBase-1.4
spark-2.3.0-bin-hadoop2.7
想通過(guò)phoenix來(lái)統(tǒng)一hbase表的操作冠王,所以首先在phoenix的shell上創(chuàng)建測(cè)試表并添加一些數(shù)據(jù)
CREATE TABLE TEST( ID VARCHAR not null, TEXT VARCHAR CONSTRAINT PK PRIMARY KEY (ID) ) ;
upsert into TEST values('1001','test-text1');
upsert into TEST values('1002','test-text2');
查詢(xún)顯示數(shù)據(jù)如下:
0: jdbc:phoenix:master> select * from TEST;
+-------+-------------+
| ID | TEXT |
+-------+-------------+
| 1001 | test-text1 |
| 1002 | test-text2 |
+-------+-------------+
2 rows selected (0.021 seconds)
此時(shí)hbase表里的數(shù)據(jù)是存在的阳惹,因?yàn)橛行枨笤趆ive中來(lái)操作hbase表中的數(shù)據(jù),所以直接想法是在hive中創(chuàng)建一張外部表來(lái)映射到hbase對(duì)應(yīng)的表上赞辩,建表語(yǔ)句如下
CREATE EXTERNAL TABLE TEST(ID string, TEXT string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,0:TEXT") TBLPROPERTIES ("hbase.table.name" = "TEST");
創(chuàng)建后朴沿,執(zhí)行查詢(xún)語(yǔ)句但未查詢(xún)到數(shù)據(jù)
hive> show create table test;
OK
CREATE EXTERNAL TABLE `test`(
`id` int COMMENT '',
`text` string COMMENT '')
ROW FORMAT SERDE
'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
'hbase.columns.mapping'=':key,0:TEXT',
'serialization.format'='1')
TBLPROPERTIES (
'hbase.table.name'='TEST',
'transient_lastDdlTime'='1535441141')
Time taken: 0.178 seconds, Fetched: 13 row(s)
hive> select * from test;
OK
Time taken: 1.643 seconds
hive> exit;
在hbase shell中查看存儲(chǔ)的數(shù)據(jù)如下
hbase(main):001:0> scan 'TEST'
ROW COLUMN+CELL
1001 column=0:\x00\x00\x00\x00, timestamp=1535528809205, value=x
1001 column=0:\x80\x0B, timestamp=1535528809205, value=test-text1
1002 column=0:\x00\x00\x00\x00, timestamp=1535528816321, value=x
1002 column=0:\x80\x0B, timestamp=1535528816321, value=test-text2
2 row(s) in 0.2990 seconds
可以看到表項(xiàng)的列名(qualifier)都為字節(jié)數(shù)組猜谚,這應(yīng)該是phoenix進(jìn)行的轉(zhuǎn)化,于是手動(dòng)在hbase shell中put一條記錄
hbase(main):001:0> put 'TEST','1003','0:TEXT','test-text3'
0 row(s) in 0.2900 seconds
hbase(main):002:0> scan 'TEST'
ROW COLUMN+CELL
1001 column=0:\x00\x00\x00\x00, timestamp=1535528809205, value=x
1001 column=0:\x80\x0B, timestamp=1535528809205, value=test-text1
1002 column=0:\x00\x00\x00\x00, timestamp=1535528816321, value=x
1002 column=0:\x80\x0B, timestamp=1535528816321, value=test-text2
1003 column=0:TEXT, timestamp=1535529809368, value=test-text3
3 row(s) in 0.0420 seconds
在hive的shell下查詢(xún)記錄為:
hive> select * from test;
OK
1003 test-text3
Time taken: 1.764 seconds, Fetched: 1 row(s)
所以應(yīng)該是需要hive讀取phoenix處理過(guò)的表的相關(guān)的lib
參考https://phoenix.apache.org/hive_storage_handler.html中的描述
在hive-env.sh增加一個(gè)環(huán)境變量,如下
export HIVE_AUX_JARS_PATH=/opt/hive-aux
/opt/hive-aux中存放的是phoenix相關(guān)的lib,選擇兩個(gè)使用到的jar包放入
[root@a9ae631a3a34 hive-aux]# ll
total 183776
-rw-r--r-- 1 root root 107956190 Aug 29 05:51 phoenix-4.14.0-HBase-1.4-client.jar
-rw-r--r-- 1 root root 80217117 Aug 29 04:28 phoenix-4.14.0-HBase-1.4-hive.jar
在hive shell中重新創(chuàng)建外部表,測(cè)試語(yǔ)句如下:
> drop table test;
OK
Time taken: 0.193 seconds
hive> create external table TEST (
> id string,
> text string
> )
> STORED BY 'org.apache.phoenix.hive.PhoenixStorageHandler'
> TBLPROPERTIES (
> "phoenix.table.name" = "TEST",
> "phoenix.zookeeper.quorum" = "master",
> "phoenix.zookeeper.znode.parent" = "/hbase",
> "phoenix.zookeeper.client.port" = "2181",
> "phoenix.rowkeys" = "id",
> "phoenix.column.mapping" = "id:ID,text:TEXT"
> );
OK
Time taken: 2.763 seconds
hive> select * from test;
OK
1001 test-text1
1002 test-text2
Time taken: 1.486 seconds, Fetched: 2 row(s)
這個(gè)數(shù)據(jù)是和phoenix接口中的數(shù)據(jù)是一致的赌渣,但是后面收到在hbase中put的一條數(shù)據(jù)并沒(méi)查詢(xún)到魏铅,這應(yīng)該也驗(yàn)證了不同的接口向hbase中寫(xiě)入的數(shù)據(jù)是有差異的,目前看到的差異是
1.phoenix api中寫(xiě)入的列名轉(zhuǎn)換為byte array
2.hbase shell中put寫(xiě)入的列名是按照string處理的
3.hive處理hbase的列名是按照string處理的
下面我想通過(guò)spark sql來(lái)操作剛才在hbase中創(chuàng)建的TEST表項(xiàng),即環(huán)境描述為
hbase>phoenix>hive>spark-sql這樣的關(guān)系
使用如下命令啟動(dòng)spark-sql shell
/opt/spark-2.3.0-bin-hadoop2.7/bin/spark-sql --master=spark://master:7077 --total-executor-cores=2 --executor-memory=512m --jars=/opt/hive-aux/*
查詢(xún)結(jié)果如下
> select * from testdb.test;
18/08/29 08:30:31 WARN RpcControllerFactory: Cannot load configured "hbase.rpc.controllerfactory.class" (org.apache.hadoop.hbase.ipc.controller.ClientRpcControllerFactory) from hbase-site.xml, falling back to use default RpcControllerFactory
1001 test-text1
1002 test-text2
Time taken: 11.923 seconds, Fetched 2 row(s)
這個(gè)結(jié)果是與hive shell中的查詢(xún)結(jié)果是一致的
注意:
不要把phoenix-4.14.0-HBase-1.4-client.jar,phoenix-4.14.0-HBase-1.4-hive.jar這兩個(gè)lib直接拷貝到$SPARK_HOME/jars下面,否則會(huì)有一些lib沖突導(dǎo)致的java.lang.NoSuchMethodError異常
補(bǔ)充:
如果hbase表使用多個(gè)字段作為主鍵的表锡垄,映射關(guān)系可參考如下建表語(yǔ)句
#phoenix中建表語(yǔ)句
CREATE TABLE TEST_COMBPK(
ID1 VARCHAR NOT NULL,
ID2 VARCHAR NOT NULL,
NUM DOUBLE,
TEXT VARCHAR
CONSTRAINT PK PRIMARY KEY (ID1, ID2)
);
#hive中建表語(yǔ)句
create external table TEST_COMBPK (
id1 string,
id2 string,
num double,
text string
)
STORED BY 'org.apache.phoenix.hive.PhoenixStorageHandler'
TBLPROPERTIES (
"phoenix.table.name" = "TEST_COMBPK",
"phoenix.zookeeper.quorum" = "master",
"phoenix.zookeeper.znode.parent" = "/hbase",
"phoenix.zookeeper.client.port" = "2181",
"phoenix.rowkeys" = "id1,id2",
"phoenix.column.mapping" = "id1:ID1,id2:ID2,num:NUM,text:TEXT"
);
#不使用phoenix的外部表映射,這種方式并不能讀取從phoenix api寫(xiě)入的數(shù)據(jù)
CREATE EXTERNAL TABLE TEST_COMBPK(ID STRUCT<ID1:STRING, ID2:STRING>, NUM DOUBLE, TEXT STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,0:NUM,0:TEXT"")
TBLPROPERTIES ("hbase.table.name" = "TEST_COMBPK");