大表關(guān)聯(lián)小表偶翅,把小表自動(dòng)加載到內(nèi)存中默勾,需要確認(rèn)以下配置為true,相當(dāng)于寫了一個(gè)mapjoin
set hive.auto.convert.join = true;
hive.mapjoin.smalltable.filesize 默認(rèn)值是25mb
實(shí)例:
insert overwrite table ${dm_bas}.day_user_play_update
select
a.datess,
a.device_id,
COALESCE(c.play_type_id,'01') as play_type_id,
from
${dm_bas}.day_user_play a
LEFT JOIN
${dim}.cms_play_code c ON a.url_first=c.play_code;
查看url占比前十的數(shù)據(jù)
select url_first,count(1) as num from dm_bas.day_user_play group by url_first order by num desc limit 10;
http://hztmedia.jxa.bcs.ottcn.com 301002483
http://ltcucdn.hvs.fj.chinamobile.com 12086326
http://otttv.bj.chinamobile.com 9189452
http://117.169.120.98 7309174
http://127.0.0.1:8090 6819233
http://gslbserv.itv.cmvideo.cn 5980901
http://218.207.213.107:80 5509762
http://111.40.205.11 4412248
http://39.134.115.221 3914061
http://120.210.193.151:8006 3636822
基表總數(shù)據(jù)量為489268186聚谁,第一個(gè)url占了61.5%
select count(1) as num from dm_bas.day_user_play;
當(dāng)day_user_play和cms_play_code做關(guān)聯(lián)的時(shí)候排序第一個(gè)url最終會(huì)分發(fā)到一個(gè)reduce中去母剥,所有的任務(wù)要等待這個(gè)reduce完成才會(huì)繼續(xù),把小的表加入內(nèi)存形导,可以配置這個(gè)參數(shù)环疼,是hive自動(dòng)根據(jù)sql,選擇使用common join或者map join
map join并不會(huì)涉及reduce操作朵耕。map端join的優(yōu)勢(shì)就是在于沒有shuffle
關(guān)聯(lián)參考
--是否自動(dòng)轉(zhuǎn)換為mapjoin
set hive.auto.convert.join = true;
--小表的最大文件大小炫隶,默認(rèn)為25000000,即25M
set hive.mapjoin.smalltable.filesize = 25000000;
--是否將多個(gè)mapjoin合并為一個(gè)
set hive.auto.convert.join.noconditionaltask = true;
--多個(gè)mapjoin轉(zhuǎn)換為1個(gè)時(shí)憔披,所有小表的文件大小總和的最大值等限。
set hive.auto.convert.join.noconditionaltask.size = 10000000;
參考:http://blog.csdn.net/yycdaizi/article/details/50158573
https://m.aliyun.com/yunqi/articles/59635