Installation
Choosing where to store the metadata
By default Hive uses the embedded Derby database, which cannot serve multiple users at the same time and is mostly used for testing.
For production, store the metadata in a MySQL database.
HDFS warehouse directories
Create the warehouse directory
hadoop fs -mkdir -p /user/hive/warehouse
Grant write permissions
hadoop fs -chmod a+w /user/hive/warehouse
hadoop fs -chmod a+w /tmp
Installing Hive
As the hadoop user, extract the Hive archive into the /home/hadoop/apps directory
tar -zxvf apache-hive-1.2.2-bin.tar.gz -C /home/hadoop/apps
Switch to the root user
Create a symlink
ln -s /home/hadoop/apps/apache-hive-1.2.2-bin /usr/local/hive
Change the owner
chown -R hadoop:hadoop /usr/local/hive
Add environment variables
vim /etc/profile
Append the following:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:${HIVE_HOME}/bin
Apply the changes
source /etc/profile
Switch back to the hadoop user
Edit $HIVE_HOME/conf/hive-site.xml (create the file if it does not exist):
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://192.168.183.101:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive123</value>
    <description>password to use against metastore database</description>
  </property>
</configuration>
Note: change the IP address and password to match your environment.
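The configuration above assumes a MySQL instance on 192.168.183.101 with a hive account, and it also requires the MySQL JDBC driver jar (mysql-connector-java) to be placed in $HIVE_HOME/lib. A minimal sketch of the MySQL-side preparation, assuming a MySQL 5.x server and an administrative account (user name, password, and the latin1 charset are taken from this document; adjust to your environment):
-- run in the MySQL client on the metastore host
CREATE DATABASE IF NOT EXISTS hive CHARACTER SET latin1;  -- latin1 avoids the charset problems noted at the end of this document
CREATE USER 'hive'@'%' IDENTIFIED BY 'hive123';
GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'%';
FLUSH PRIVILEGES;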
Starting Hive
/usr/local/hive/bin/hive
[hadoop@hadoop4 bin]$ hive
Logging initialized using configuration in jar:file:/home/hadoop/apps/apache-hive-1.2.2-bin/lib/hive-common-1.2.2.jar!/hive-log4j.properties
hive>
Hive started successfully.
To start Hive with debug logs printed to the console:
hive -hiveconf hive.root.logger=DEBUG,console
Hive operations
- Show databases
show databases;
hive> show databases;
OK
default
Time taken: 0.02 seconds, Fetched: 1 row(s)
- Create a database
create database mytestDB;
create database IF NOT EXISTS mytestDB;
hadoop fs -ls /user/hive/warehouse
A mytestdb.db directory is added.
- Use a database
use mytestdb;
- Show tables
show tables;
- Create a table
Create a user table, user_info
Fields: user id, area id, age, occupation
create table user_info(
user_id string,
area_id string,
age int,
occupation string
)
row format delimited fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
After creation, a matching directory is created in HDFS:
/user/hive/warehouse/mytestdb.db/user_info
- Drop a table
drop table user_info;
The user_info table's HDFS directory is removed along with it.
Creating an internal (managed) table
- Create a student information table in database rel
create table student_info(
student_id string comment '學(xué)號',
name string comment '姓名',
age int comment '年齡',
origin string comment '地域'
)
comment '學(xué)生信息表'
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
- Use load to load local data into student_info
load data local inpath '/home/hadoop/apps/hive_test_data/student_info_data.txt' into table student_info;
Check the table's HDFS path; student_info_data.txt has been added:
Found 1 items
-rwxrwxrwx 3 hadoop supergroup 341 2018-01-26 10:34 /user/hive/warehouse/mydb.db/student_info/student_info_data.txt
- Query students whose origin is 11
hive> select * from student_info where origin='11';
OK
1 xiaoming 20 11
6 zhangsan 20 11
7 lisi 19 11
Time taken: 0.473 seconds, Fetched: 3 row(s)
hive>
- Use load to load an HDFS file into student_info
Upload the file to HDFS
hadoop fs -put student_info_data.txt /
Load in append mode
load data inpath '/student_info_data.txt' into table student_info;
The table's HDFS directory now contains two files:
[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/mydb.db/student_info
Found 2 items
-rwxrwxrwx 3 hadoop supergroup 341 2018-01-26 10:34 /user/hive/warehouse/mydb.db/student_info/student_info_data.txt
-rwxrwxrwx 3 hadoop supergroup 341 2018-01-26 10:39 /user/hive/warehouse/mydb.db/student_info/student_info_data_copy_1.txt
Note that /student_info_data.txt in HDFS is moved (not copied) into the table's directory /user/hive/warehouse/mydb.db/student_info.
Load in overwrite mode
load data inpath '/student_info_data.txt' overwrite into table student_info;
This replaces the existing data.
Data types
- Create an employee table: employee
Fields: user id, salary, cities worked in, social-insurance payments (pension, medical), benefits (meal allowance (float), regularized or not (boolean), commercial insurance (float))
create table employee(
user_id string,
salary int,
worked_citys array<string>,
social_security map<string,float>,
welfare struct<meal_allowance:float,if_regular:boolean,commercial_insurance:float>
)
row format delimited fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
stored as textfile;
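Given these delimiters, each line of the data file holds tab-separated fields, comma-separated collection items, and colon-separated map keys. A sample row, reconstructed from the query output below (<TAB> stands for a literal tab character):
zhangsan<TAB>10800<TAB>beijing,shanghai<TAB>養(yǎng)老:1000.0,醫(yī)療:600.0<TAB>2000.0,true,500.0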
- Load local data into employee
load data local inpath '/home/hadoop/apps/hive_test_data/employee_data.txt' into table employee;
hive> select * from employee;
OK
zhangsan 10800 ["beijing","shanghai"] {"養(yǎng)老":1000.0,"醫(yī)療":600.0} {"meal_allowance":2000.0,"if_regular":true,"commercial_insurance":500.0}
lisi 20000 ["beijing","nanjing"] {"養(yǎng)老":2000.0,"醫(yī)療":1200.0} {"meal_allowance":2000.0,"if_regular":false,"commercial_insurance":500.0}
wangwu 17000 ["shanghai","nanjing"] {"養(yǎng)老":1800.0,"醫(yī)療":1100.0} {"meal_allowance":2000.0,"if_regular":true,"commercial_insurance":500.0}
- Query the id, salary, first city worked in, pension payment, and meal allowance of regularized employees
hive> select user_id,
> salary,
> worked_citys[0],
> social_security['養(yǎng)老'],
> welfare.meal_allowance
> from employee
> where welfare.if_regular=true;
OK
zhangsan 10800 beijing 1000.0 2000.0
wangwu 17000 shanghai 1800.0 2000.0
Creating an external table
The HDFS path can be created in advance
hadoop fs -mkdir -p /user/hive/warehouse/data/student_school_info
If it is not created in advance, it is created automatically from the path given when the external table is created.
- Create an external student enrollment table
Fields:
student id, name, institute id, major id, enrollment year
HDFS data path: /user/hive/warehouse/data/student_school_info
create external table rel.student_school_info(
student_id string,
name string,
institute_id string,
major_id string,
school_year string
)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile
location '/user/hive/warehouse/data/student_school_info';
- Upload the local data file to HDFS
hadoop fs -put /home/hadoop/apps/hive_test_data/student_school_info_external_data.txt /user/hive/warehouse/data/student_school_info/
Query
select * from student_school_info;
Creating an internal partitioned table
Create a student enrollment table
Fields: student id, name, institute id
Partition column: major id
create table student_school_info_partition_maj(
student_id string,
name string,
institute_id string
)
partitioned by(major_id string)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
- Use insert into to copy the records whose major is bigdata from student_school_info into the partitioned table student_school_info_partition_maj
insert into table student_school_info_partition_maj partition(major_id ='bigdata')
select t1.student_id,t1.name,t1.institute_id
from student_school_info t1
where t1.major_id = 'bigdata';
View the partitions
show partitions student_school_info_partition_maj;
Check the HDFS path; a major_id=bigdata directory has been added
hadoop fs -ls /user/hive/warehouse/rel.db/student_school_info_partition_maj/
Drop a partition
alter table student_school_info_partition_maj drop partition (major_id='bigdata');
Check the partitioned table; the data has been removed.
Loading data with dynamic partitioning
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table student_school_info_partition_maj partition(major_id)
select t1.student_id,t1.name,t1.institute_id,t1.major_id
from student_school_info t1;
- View the partitions
show partitions student_school_info_partition_maj;
hive> show partitions student_school_info_partition_maj;
OK
major_id=bigdata
major_id=computer
major_id=software
Time taken: 0.114 seconds, Fetched: 3 row(s)
- Check the HDFS path
[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/rel.db/student_school_info_partition_maj
Found 3 items
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:30 /user/hive/warehouse/rel.db/student_school_info_partition_maj/major_id=bigdata
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:30 /user/hive/warehouse/rel.db/student_school_info_partition_maj/major_id=computer
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:30 /user/hive/warehouse/rel.db/student_school_info_partition_maj/major_id=software
Three directories are added, one per partition, each holding that partition's data.
Creating an external partitioned table
Create a student enrollment table
Fields: student id, name, institute id
Partition column: major id
create external table rel.student_school_info_external_partition_maj(
student_id string,
name string,
institute_id string
)
partitioned by(major_id string)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile
location '/user/hive/warehouse/data/student_school_info_external_partition_maj';
- Dynamic-partition insert
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table student_school_info_external_partition_maj partition(major_id)
select t1.student_id,t1.name,t1.institute_id,t1.major_id
from student_school_info t1;
- Dropping an internal partitioned table removes both the table and its HDFS data
drop table student_school_info_partition_maj;
[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/rel.db/
Found 1 items
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:23 /user/hive/warehouse/rel.db/student_school_info_partition
- Drop an external partitioned table
hive> drop table student_school_info_external_partition_maj;
OK
Time taken: 0.63 seconds
hive> show tables;
OK
student_school_info
student_school_info_partition
Time taken: 0.027 seconds, Fetched: 2 row(s)
Check the files in HDFS: the data is still there
[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/data/
Found 2 items
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:06 /user/hive/warehouse/data/student_school_info
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:47 /user/hive/warehouse/data/student_school_info_external_partition_maj
[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/data/student_school_info_external_partition_maj
Found 3 items
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:47 /user/hive/warehouse/data/student_school_info_external_partition_maj/major_id=bigdata
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:47 /user/hive/warehouse/data/student_school_info_external_partition_maj/major_id=computer
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:47 /user/hive/warehouse/data/student_school_info_external_partition_maj/major_id=software
[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/data/student_school_info_external_partition_maj/major_id=software
Found 1 items
-rwxrwxrwx 3 hadoop supergroup 46 2018-01-26 11:47 /user/hive/warehouse/data/student_school_info_external_partition_maj/major_id=software/000000_0
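Because the data files survive the drop, the external table can be recreated later over the same location. A minimal sketch, assuming the create external table statement above is re-run unchanged: msck repair table scans the table's location and re-registers the partition directories in the metastore.
msck repair table rel.student_school_info_external_partition_maj;
show partitions student_school_info_external_partition_maj;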
Creating tables with LIKE and AS; renaming tables; adding, changing, and dropping columns
Use the like keyword to create a new table with exactly the same structure as an existing table
create table student_info2 like student_info;
Use the as keyword to create a table whose columns match a query result, inserting the query result into the new table at the same time
create table student_info3 as select * from student_info;
A table with only the student_id and name columns:
create table student_info4 as select student_id,name from student_info;
Rename student_info4 to student_id_name
alter table student_info4 rename to student_id_name;
Add a gender column to student_info3. New columns are appended after all existing columns and before any partition columns. Rows already present in the data files have no value
for the new column, so queries return NULL for it. Separate multiple new columns with commas.
alter table student_info3 add columns (gender string comment '性別');
Dropping or changing columns
To change columns, redefine every column that stays and give the new definition for the columns being changed
alter table student_info3 replace columns(student_id string,name string,age int,origin string,gender2 int);
To drop columns, redefine every column that stays and leave out the columns to drop
alter table student_info3 replace columns(student_id string,name string,age int,origin string);
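A quick way to verify the resulting layout (output depends on your table):
describe student_info3;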
Creating a bucketed table
Rows are assigned to buckets by the hash of the chosen column.
Create a bucketed student information table
Fields: student id, name, age, origin
Bucketing: by student_id into 4 buckets, sorted by student_id ascending within each bucket
create table rel.student_info_bucket(
student_id string,
name string,
age int,
origin string
)
clustered by (student_id) sorted by (student_id asc) into 4 buckets
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
- Insert data
set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;
insert overwrite table student_info_bucket
select student_id,name,age,origin
from student_info
cluster by(student_id);
Do not use load to populate a bucketed table: load only copies the file into the table's storage directory; Hive does not analyze the data and split it by the bucketing
column, so the whole file lands in the table unbucketed. Populate bucketed tables with insert ... select from another table.
The create statement only defines the table's model; each bucketed insert needs the following:
Set hive.enforce.bucketing = true in the current session to declare that this is a bucketing operation.
Set the number of reducers equal to the number of buckets (set mapreduce.job.reduces=4;) so that exactly one file is produced per bucket.
If the table declares sorting on the bucketing column, the query feeding the insert must sort the rows itself before they reach the buckets; the bucketed table does not sort bucket contents for you.
When the sort column and the bucketing column are the same, use cluster by (col); cluster by sorts ascending within buckets. For descending order within buckets,
combine distribute by (col) sort by (col desc), as shown below.
set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;
insert overwrite table student_info_bucket
select student_id,name,age,origin
from student_info
distribute by (student_id) sort by (student_id desc);
Exporting data
- Use insert to export student_info to a local path
insert overwrite local directory '/home/hadoop/apps/hive_test_data/export_data' row format delimited fields terminated by '\t' select * from student_info;
- A common way to export data locally
hive -e "select * from rel.student_info" > ./student_info_data.txt
The default column separator in the output is '\t'.
Joins
- join / inner join
The two tables are matched on id, and only rows whose id values are equal are returned. join and inner join produce the same result.
select * from a join b on a.id=b.id;
which is equivalent to
select * from a inner join b on a.id=b.id;
- full outer join / full join
Matched on id; all rows from both tables are returned, matched where possible
select * from a full join b on a.id=b.id;
- left join
Every row of the left table is kept; right-table columns are NULL where there is no match
select * from a left join b on a.id=b.id;
- right join
Every row of the right table is kept; left-table columns are NULL where there is no match
select * from a right join b on a.id=b.id;
- left semi join
A left semi join implements IN/EXISTS query semantics, returning the left-table rows that have a match; right-table columns may appear only in the on clause, not in the select list.
Hive does not support the IN ... EXISTS subquery form of relational databases, and for now it has no right semi join.
select a.id, a.name from a where a.id in (select b.id from b);
In Hive this corresponds to:
select a.id,a.name from a left semi join b on a.id = b.id;
- map side join
A map-side join ships the small table to every map task via the distributed cache and performs the join on the map side; the data no longer has to be copied to a reduce
stage to be joined, which cuts the cost of moving data between nodes. It is also one way to mitigate data skew in multi-table joins. In a multi-table join where one table
is large and the others are small, the join is converted into a map-only job; the run log prints "Number of reduce tasks is set to 0 since there's no reduce operator".
select /*+ mapjoin(b) */ a.id, a.name from a join b on a.id = b.id;
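Since Hive 0.11 the map join can also be triggered automatically instead of with the hint. A sketch using settings that exist in Hive 1.2 (the size threshold shown is the default, in bytes):
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;
select a.id, a.name from a join b on a.id = b.id;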
Hive built-in functions
- case when
Syntax 1: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
If a equals b, return c; if a equals d, return e; otherwise return f
hive> select case 1 when 2 then 'two' when 1 then 'one' else 'zero' end;
one
Syntax 2: CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END
If a is TRUE, return b; if c is TRUE, return d; otherwise return e
hive> select case when 1=2 then 'two' when 1=1 then 'one' else 'zero' end;
one
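As a practical sketch against the student_info table above (the age boundary is made up for illustration):
select student_id,
       case when age < 20 then 'under_20' else '20_and_over' end as age_group
from student_info;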
Custom UDFs
When Hive's built-in functions cannot cover your processing needs, you can write a user-defined function (UDF).
A UDF operates on a single row and produces a single row as its output.
Steps:
- Write a Java class that extends UDF and overloads the evaluate method
- Package it as a jar and upload it to the server
- When using it, add the jar to Hive's classpath
hive> add jar /home/hadoop/apps/hive_test_data/HiveUdfPro-1.0-SNAPSHOT.jar;
- Create a temporary function bound to the compiled Java class
hive> create temporary function age_partition as 'cn.hadoop.udf.AgePartitionFunction';
- The custom function can then be used in HQL
select gender,
age_partition(age),
max(core) max_core
from rel.user_core_info
group by gender,
age_partition(age);
Problems encountered while installing and using Hive
- Creating a table fails
Cause: a MySQL character-set problem; the hive database in MySQL must use the latin1 character set
- Dropping a table hangs
Cause: the same character-set issue. The hive database was created while MySQL used utf-8 and was only switched to latin1 with a command afterwards;
the fix is to drop the hive database in MySQL, recreate it, and set the character set correctly.
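A minimal sketch of the MySQL-side fix (assumption: the metastore holds nothing worth keeping, since the database is dropped):
-- in the MySQL client
drop database hive;
create database hive character set latin1;
-- or, to switch an existing metastore database in place:
alter database hive character set latin1;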