快樂(lè)大數(shù)據(jù)第5次課 hive(1)工作原理Hive的執(zhí)行入口是Driver独榴,執(zhí)行的SQL語(yǔ)句首先提交到Drive驅(qū)動(dòng)系宜,然后調(diào)用compiler解釋驅(qū)動(dòng),最終解釋成MapReduce任務(wù)去執(zhí)行。無(wú)法實(shí)現(xiàn)實(shí)時(shí)更新像屋,只能向現(xiàn)有的表中追加數(shù)據(jù)资铡。(2)數(shù)據(jù)模型分區(qū)減少不必要的全表數(shù)據(jù)掃描电禀。 對(duì)表使用分區(qū),將表數(shù)據(jù)按照某個(gè)或某些字段劃分笤休。分區(qū)在HDFS的表現(xiàn)為表路徑下的不同文件目錄尖飞。 為了避免使用分區(qū)產(chǎn)生過(guò)多的小文件,建議只對(duì)離散的字段進(jìn)行分區(qū)店雅。如日期政基,地域,類(lèi)型闹啦。 每一個(gè)表或者分區(qū)腋么,Hive可以進(jìn)一步組織成桶,桶是更為細(xì)顆粒度的數(shù)據(jù)范圍劃分hashcode(col_value) %nums_bucket 使用分桶的原因 獲取更高的查詢處理效率亥揖,使得取樣更有效(3)常用文件格式 默認(rèn)的數(shù)據(jù)格式有TEXTDFILE珊擂。SequenceFile圣勒。列式存儲(chǔ)格式。(4)常規(guī)數(shù)據(jù)類(lèi)型:整數(shù)摧扇,浮點(diǎn)圣贸,布爾,字符串扛稽,時(shí)間戳 (5)復(fù)雜數(shù)據(jù)類(lèi)型:Array吁峻,Map,Struct實(shí)戰(zhàn)mysql 5.7.19的安裝1 檢查是否有已經(jīng)存在的mysql并且刪除rpm -qa | grep mysqlyum -y remove mysql-libs-5.1.66-2.el6_3.x86_64find / -name "*mysql*" | xargs rm -rf2.賦予mysql和mysql用戶組groupadd mysqluseradd mysql -g mysqlcd /usr/local3.上傳mysql-5.7.19在张,并且修改屬主和權(quán)限r(nóng)z tar -zxvf mysql-5.7.19-linux-glibc2.12-x86_64.tar.gzrm mysql-5.7.19-linux-glibc2.12-x86_64.tar.gzmv mysql-5.7.19-linux-glibc2.12-x86_64 mysql#建立data文件夾用含,mysql的啟動(dòng)需要datadir 否組mysql無(wú)法啟動(dòng)mkdir /usr/local/mysql/datacd /usr/local/chown -R mysql:mysql mysql/chmod -R 755 mysql/mkdir /var/lib/mysql/#建立sock通訊文件,并賦予mysql的權(quán)限在/var/lib/下 cd /var/lib/chown mysql:mysql mysql/chmod 777 mysql/cd /var/lib/mysql/touch mysql.sockchown mysql:mysql mysql.sockchmod 777 mysql.sock4.編輯配置文件帮匾,以便mysql的啟動(dòng) cd /etc touch my.cnf vim /etc/my.cnf配置文件如下[mysql]#設(shè)置mysql客戶端默認(rèn)字符集default-character-set=utf8socket=/var/lib/mysql/mysql.sock[mysqld]skip-name-resolve#設(shè)置3306端口port=3306socket=/var/lib/mysql/mysql.sock#設(shè)置mysql的安裝目錄basedir=/usr/local/mysqldatadir=/usr/local/mysql/data#最大了連接數(shù)max_connections=200#服務(wù)端使用的字符集啄骇,默認(rèn)為8比特編碼的latin1字符集character-set-server=utf8#創(chuàng)建新表時(shí)將使用的默認(rèn)存儲(chǔ)引擎default-storage-engine=INNODBlower_case_table_names=1max_allowed_packet=16M5.mysql的啟動(dòng)/usr/local/mysql/bin/mysqld --initialize --user=mysql --datadir=/usr/local/mysql/data --basedir=/usr/local/mysql記住最后一行的臨時(shí)密碼/usr/local/mysql/support-files/mysql.server start#添加軟連接,方便啟動(dòng)ln -s /usr/local/mysql/support-files/mysql.server /etc/init.d/mysql#把mysql的操作放入到系統(tǒng)路徑echo "export PATH=$PATH:/usr/local/mysql/bin" >> /etc/profilesource /etc/profileservice mysql restart#把mysql客戶端放到啟動(dòng)目錄ln -s /usr/local/mysql/bin/mysql /usr/binmysql -u root -p 輸入剛才的臨時(shí)密碼6.mysql的設(shè)置msql>alter user 'root'@'localhost' identified by '123456'; mysql>use mysql; msyql>update user set user.Host='%' where user.User='root'; GRANT ALL PRIVILEGES ON *.* TO 'hadoop'@'%' IDENTIFIED BY 'hadoop' WITH GRANT OPTION;(hadoop用戶再訪問(wèn)時(shí)瘟斜,將以用戶名是hadoop缸夹,密碼是hadoop的形式訪問(wèn)。輸入命令mysql -uhadoop -phadoop) mysql> create user 'hive'@'%' identified by 'hive123'; mysql> GRANT ALL PRIVILEGES ON *.* TO 'hive'@'%' IDENTIFIED BY 'hive123' WITH GRANT OPTION; #和后面的hive.xml的連接信息需要對(duì)應(yīng)螺句。 mysql>flush privileges; mysql>show databases; mysql>quit允許3306端口 #iptables -I INPUT -p tcp -m state --state NEW -m tcp --dport 3306 -j ACCEPT #iptables -L -n #service iptables save(當(dāng)虛擬機(jī)重新啟動(dòng)后 需要啟動(dòng)mysql 則 先用命令(在root用戶下):service mysql start 再用命令:mysql -u root -p123456 或者mysql -uhadoop -phadoop) Hive的數(shù)據(jù)倉(cāng)庫(kù)配置創(chuàng)建數(shù)據(jù)倉(cāng)庫(kù)虽惭,用戶具有讀和寫(xiě)的權(quán)限在node02上hadoop fs -mkdir -p /user/hive/warehousehadoop fs -chmod a+w /user/hive/warehousehadoop fs -mkdir -p /temphadoop 
fs -chmod a+w /tempcd ~/appsrztar -zxvf apache-hive-1.2.2-bin.tar.gz hive-1.2.2su rootln -s /home/hadoop/apps/hive-1.2.2 /usr/local/hivechown -R hadoop:hadoop /usr/local/hive添加環(huán)境變量(已經(jīng)添加)cd /home/hadoop/apps/hive-1.2.2/confrz hive.xml(注意自己電腦的mysql的連接地址)cd /usr/local/hive/bin/hiverz 上傳mysql的jar包(先在root用戶下 啟動(dòng)service mysql start)su hadoop 啟動(dòng)hivecd /usr/local/hive/binhive#內(nèi)部表的創(chuàng)建和使用hive>show databases; use default; show tables; create table user_info(user_id string,area_id string,age int,occupation string)row format delimited fields terminated by '\t' lines terminated by '\n'stored as textfile; show tables; desc user_info; create database rel; use rel; show tables; create table student_info(student_id string comment '學(xué)號(hào)',name string comment '姓名',age int comment '年齡',origin string comment '地域')comment '學(xué)生信息表'row format delimited fields terminated by '\t' lines terminated by '\n'stored as textfile; #等于創(chuàng)建了一個(gè)表頭新開(kāi)一個(gè)窗口在Xshelll中新開(kāi)一個(gè)node02的窗口 hadoop fs -ls /user/hive/warehouse/ hadoop fs -ls /user/hive/warehouse/rel.dbmkdir /home/hadoop/apps/hive_test_data將課程的文件都移動(dòng)到 hive_test_data的下面蛇尚。cat student_info_data.txthadoop fs -ls /user/hive/warehouse/在hive下執(zhí)行l(wèi)oad data local inpath '/home/hadoop/apps/hive_test_data/student_info_data.txt' into table student_info; #上傳表的內(nèi)容回看一下hdfs下的路徑:切換到node02的新開(kāi)窗口 hadoop fs -ls /user/hive/warehouse/rel.db/student_info芽唇;再在hive下執(zhí)行select * from student_info;先傳到根目錄下,然后再執(zhí)行(如果沒(méi)有特殊說(shuō)明取劫,即在新開(kāi)窗口的命令行下)在/home/hadoop/apps/hive_test_data下 上傳到根目錄下 hadoop fs -put student_info_data.txt /再回到hive下#使用load將hdfs文件加載到student_info表中l(wèi)oad data inpath '/student_info_data.txt' into table student_info;回到窗口下hadoop fs -ls /user/hive/warehouse/rel.db/student_info披摄;(會(huì)看到拷貝了一份里面的數(shù)據(jù)文本,名字是student_info_data_copy_1.txt)回到hive下load data inpath '/student_info_data.txt' overwrite into table student_info;(先刪除調(diào)以前重名的勇凭,再拷貝新的相同名字的數(shù)據(jù)文件)再回到窗口下hadoop fs -ls /user/hive/warehouse/rel.db/student_info;回到hive下create table rel.employee(user_id string,salary int,worked_citys array,social_security map,welfare struct)
row format delimited fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
stored as textfile;
load data local inpath '/home/hadoop/apps/hive_test_data/employee_data.txt' into table employee;
select * from employee;
# check the result on HDFS
hadoop fs -ls /user/hive/warehouse/rel.db/employee
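The employee table uses all three complex types. A short sketch of how individual elements are addressed (the map key and struct field name are illustrative only, following the assumed schema above):
select user_id,
worked_citys[0],                -- first element of the array
social_security['pension'],     -- value stored under one map key
welfare.meal_allowance          -- one field of the struct
from employee;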
# Creating and using external tables
create external table rel.student_school_info(
student_id string,
name string,
institute_id string,
major_id string,
school_year string
)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile
location '/user/hive/warehouse/data/student_school_info';
Upload the local data file to HDFS:
hadoop fs -put /home/hadoop/apps/hive_test_data/student_school_info_external_data.txt /user/hive/warehouse/data/student_school_info/
Note: there is no need to create the path in advance with hadoop fs -mkdir; if it does not exist, it is created automatically from the location specified when the external table is created.
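The practical difference from an internal table shows up when the table is dropped. A hedged sketch of what that looks like (do not actually run it here, since student_school_info is still needed in the steps below):
drop table rel.student_school_info;    -- for an external table this removes only the metadata
hadoop fs -ls /user/hive/warehouse/data/student_school_info/    -- the data files are still on HDFS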
#創(chuàng)建內(nèi)部分區(qū)表(靜態(tài)分區(qū))
create table student_school_info_partition(
student_id string,
name string,
institute_id string,
major_id string
)
partitioned by(school_year string)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
Use insert into to load the 2017 enrollment records from the student_school_info table into the student_school_info_partition partitioned table:
insert into table student_school_info_partition partition(school_year='2017')
select t1.student_id,t1.name,t1.institute_id,t1.major_id
from student_school_info t1
where t1.school_year=2017;
View the partitions:
show partitions student_school_info_partition;
select * from student_school_info_partition;
In the hadoop shell window, check the loaded data:
hadoop fs -ls /user/hive/warehouse/rel.db/student_school_info_partition/    (a school_year=2017 directory has been added)
Drop the partition:
alter table student_school_info_partition drop partition (school_year='2017');
Check the partitions again:
show partitions student_school_info_partition;
Then look at the corresponding data on HDFS:
hadoop fs -ls /user/hive/warehouse/rel.db/student_school_info_partition/
For an internal table, the partition directory and its data are gone as well.
#創(chuàng)建內(nèi)部分區(qū)表(動(dòng)態(tài)分區(qū))
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table student_school_info_partition partition(school_year)
select t1.student_id,t1.name,t1.institute_id,t1.major_id,t1.school_year
from student_school_info t1;
show partitions student_school_info_partition;
select * from student_school_info_partition where school_year="2017";
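Two related session limits matter when a dynamic insert would create many partitions. A hedged aside; the values shown are the usual Hive defaults, not settings from the original notes:
set hive.exec.max.dynamic.partitions=1000;         -- total dynamic partitions allowed per statement
set hive.exec.max.dynamic.partitions.pernode=100;  -- dynamic partitions allowed per mapper/reducer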
#創(chuàng)建外部分區(qū)表(先創(chuàng)建外部表灌砖,然后再創(chuàng)建分區(qū))
create external table rel.student_school_info_external_partition(
student_id string,
name string,
institute_id string,
major_id string
)
partitioned by(school_year string)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile
location '/user/hive/warehouse/data/student_school_info_external_partition';
手動(dòng)在hdfs下:創(chuàng)建分區(qū)表的字段 hadoop fs -mkdir /user/hive/warehouse/data/student_school_info_external_partition/school_year=2017
接著在hive下需要手動(dòng)添加表結(jié)構(gòu)才可以在hive下識(shí)別(直接不能被識(shí)別)璧函;show partitions student_school_info_external_partition;
alter table student_school_info_external_partition add partition(school_year='2017');
show partitions student_school_info_external_partition;    (the partition now appears)
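An alternative worth knowing (not part of the original notes): when partition directories already follow the school_year=... naming convention under the table location, Hive can discover them in one step instead of adding each partition by hand:
msck repair table student_school_info_external_partition;   -- scans the table location and adds any partitions found there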
現(xiàn)在添加數(shù)據(jù)
回到hdf下
cd /home/hadoop/apps/hive_test_data/
hadoop fs -put student_school_external_partition_data.txt /user/hive/warehouse/data/student_school_info_external_partition/school_year=2017
Back in hive:
select * from student_school_info_external_partition where school_year="2017";
Drop the external partition:
alter table student_school_info_external_partition drop partition(school_year='2017');
再驗(yàn)證下
show partitions student_school_info_external_partition;
However, on HDFS, hadoop fs -ls /user/hive/warehouse/data/student_school_info_external_partition/ shows that the directory still exists and the data file is still there.
#*************** Create tables with LIKE / AS; rename tables; add, modify, drop columns *************
1. Use the like keyword to create a new table whose structure is identical to an existing table:
create table student_info2 like student_info;
2. 根據(jù)已經(jīng)存在的表艺谆,使用as關(guān)鍵字,創(chuàng)建一個(gè)與查詢結(jié)果字段一致的表拜英,同時(shí)將查詢結(jié)果數(shù)據(jù)插入到新表
create table student_info3 as select * from student_info;
A table with only the student_id and name columns:
create table student_info4 as select student_id,name from student_info;
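The heading above also mentions renaming tables and adding, changing, and dropping columns; those commands are not in the original notes, so here is a brief sketch reusing the tables just created (illustrative only):
alter table student_info2 rename to student_info2_renamed;
alter table student_info3 add columns (gender string comment 'added column');
alter table student_info3 change column gender sex string;
-- Hive has no direct "drop column"; redefine the column list instead:
alter table student_info3 replace columns (student_id string, name string, age int, origin string);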
#分桶數(shù)據(jù)表
***************創(chuàng)建分桶表*************
1. 按照指定字段取它的hash散列值分桶
創(chuàng)建學(xué)生入學(xué)信息分桶表
字段信息:學(xué)號(hào)静汤、姓名、學(xué)院ID居凶、專業(yè)ID
分桶字段:學(xué)號(hào)虫给,4個(gè)桶,桶內(nèi)按照學(xué)號(hào)升序排列
create table rel.student_info_bucket(
student_id string,
name string,
age int,
origin string
)
clustered by (student_id) sorted by (student_id asc) into 4 buckets
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
2. Insert data into the student_info_bucket bucketed table
set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;
insert overwrite table student_info_bucket
select student_id,name,age,origin
from student_info
cluster by(student_id);
Check the bucket files on HDFS (there should be four, one per bucket):
hadoop fs -ls /user/hive/warehouse/rel.db/student_info_bucket
Bucketed tables are generally not loaded with load: load only copies the file into the table's storage directory; Hive does not analyze the data and split it by the bucketing field during load, so the whole file ends up in the table without being bucketed. Data is normally inserted into a bucketed table with insert ... select from another table.
Creating a bucketed table only defines the table model; each insert needs the following:
In every session that performs a bucketed insert, set hive.enforce.bucketing = true; to declare that this is a bucketed operation.
Set the number of reducers equal to the number of buckets (set mapreduce.job.reduces=4;) so that one file is produced per bucket.
If the table is defined as sorted by the bucketing field, the data must be sorted by that field in the query that feeds the insert; the bucketed table does not sort the data inside each bucket by itself.
When the sort field and the bucketing field are the same, use cluster by (col); cluster by sorts ascending within each bucket by default. For descending order within buckets, use the combination distribute by (col) sort by (col desc) instead.
For example:
set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;
insert overwrite table student_info_bucket
select student_id,name,age,origin
from student_info
distribute by (student_id) sort by (student_id desc);
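The data-model section at the top noted that bucketing makes sampling more efficient; a short sketch of bucket sampling on this table (not run in the original notes):
-- read only the first of the 4 buckets, i.e. roughly a 1/4 sample
select * from student_info_bucket tablesample(bucket 1 out of 4 on student_id);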
#導(dǎo)出數(shù)據(jù)
使用insert將student_info表數(shù)據(jù)導(dǎo)出到本地指定路徑
insert overwrite local directory '/home/hadoop/apps/hive_test_data/export_data'
row format delimited fields terminated by '\t' select * from student_info;
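A related variant (a sketch, not in the original notes): dropping the local keyword writes the export to an HDFS directory instead of the local filesystem; the output path below is only an example:
insert overwrite directory '/student_info_export'
row format delimited fields terminated by '\t' select * from student_info;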
導(dǎo)出數(shù)據(jù)到本地的第二種方法
在另一個(gè)窗口下 hive -e"select * from rel.student_info"> ./student_info_data.txt
ll
#join關(guān)聯(lián)
create table rel.a(
id int,
name string
)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
create table rel.b(
id int,
name string
)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
load data local inpath '/home/hadoop/apps/hive_test_data/a_join_data' into table a;
load data local inpath '/home/hadoop/apps/hive_test_data/b_join_data' into table b;
****join or inner join
The two tables are joined on id and only rows whose id values match are returned. join and inner join produce the same result.
select * from a join b on a.id=b.id;
Equivalent to:
select * from a inner join b on a.id=b.id;
****full outer join or full join
The two tables are joined on id and all rows from both tables are returned.
select * from a full outer join b on a.id=b.id;
(執(zhí)行時(shí)報(bào)錯(cuò),在另一個(gè)窗口中執(zhí)行 直接加大reduce數(shù)到2000积蜻,成功
set mapred.reduce.tasks = 2000;
set mapreduce.reduce.memory.mb=16384;
set mapreduce.reduce.java.opts=-Xmx16384m;)
****left join
In a left join, all rows from the left table are kept; right-table columns are NULL where there is no match.
select * from a left join b on a.id=b.id;
****right join
In a right join, all rows from the right table are kept; left-table columns are NULL where there is no match.
select * from a right join b on a.id=b.id;
****left semi join
左半連接實(shí)現(xiàn)了類(lèi)似IN/EXISTS的查詢語(yǔ)義,輸出符合條件的左表內(nèi)容丙笋。
hive不支持in …exists這種關(guān)系型數(shù)據(jù)庫(kù)中的子查詢結(jié)構(gòu)谢澈,hive暫時(shí)不支持右半連接煌贴。
例如:
select a.id, a.name from a where a.id in (select b.id from b);
使用Hive對(duì)應(yīng)于如下語(yǔ)句:
select a.id,a.name from a left semi join b on a.id = b.id;
****map side join
Uses the distributed cache to load the small table into each map task, so the join is completed on the map side and the data does not have to be shuffled to a reduce phase, which cuts the network transfer between nodes. This is one way to mitigate data skew in multi-table joins. In a multi-table join where only one table is large and the others are small, the join is converted into a map-only job, and the run log shows "Number of reduce tasks is set to 0 since there's no reduce operator".
For example:
select /*+ mapjoin(b) */ a.id, a.name from a join b on a.id = b.id;
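Besides the explicit hint, Hive can also convert such a join to a map-side join automatically when the small table is below a size threshold. A hedged sketch of the session settings involved (the threshold shown is the common default, not a value from the original notes):
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;   -- small-table threshold in bytes
select a.id, a.name from a join b on a.id = b.id;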
#內(nèi)置函數(shù)
創(chuàng)建用戶評(píng)分表
create table rel.user_core_info(
user_id string,
age int,
gender string,
core int
)
row format delimited fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
load data local inpath '/home/hadoop/apps/hive_test_data/user_core.txt' into table rel.user_core_info;
1. Conditional function: case when
Syntax 1: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
Meaning: if a equals b, return c; if a equals d, return e; otherwise return f.
For example:
hive> select case 1 when 2 then 'two' when 1 then 'one' else 'zero' end;
one
語(yǔ)法2:CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END
說(shuō)明:如果a為T(mén)RUE,則返回b嫩实;如果c為T(mén)RUE刽辙,則返回d;否則返回e
例如:
hive> select case when 1=2 then 'two' when 1=1 then 'one' else 'zero' end;
one
查詢用戶評(píng)分表甲献,每個(gè)年齡段的最大評(píng)分值
select gender,
case when age<=20 then 'p0' when age>20 and age<=50 then 'p1' when age>=50 then 'p3' else 'p0' end,
max(core) max_core
from rel.user_core_info
group by gender,
case when age<=20 then 'p0' when age>20 and age<=50 then 'p1' when age>=50 then 'p3' else 'p0' end;
# User-defined functions (UDF)
When Hive's built-in functions cannot meet the needs of your business logic, consider a user-defined function (UDF).
A UDF operates on a single row and produces a single row as its output.
Steps:
1. Develop a Java class that extends UDF and overloads the evaluate method.
2. Package it as a jar and upload it to the server.
3. When using it, add the jar to Hive's classpath:
hive>add jar /home/hadoop/apps/hive_test_data/HiveUdfPro-1.0-SNAPSHOT.jar;
4. 創(chuàng)建臨時(shí)函數(shù)與開(kāi)發(fā)好的java class關(guān)聯(lián)
hive>create temporary function age_partition as 'cn.chinahadoop.udf.AgePartitionFunction';
5. The custom function can now be used in HQL:
select gender,
age_partition(age),
max(core) max_core
from rel.user_core_info
group by gender,
age_partition(age);