1.問題:考慮下面三個sql
sql1:
select *
from (select log.*,a.batch_order_no,a.pm_code,a.bank_code,a.pay_status
from pay_recommend_metric.recommend_log log
left join dw_transform_paycenter.all_order_info a
on (log.orderno=a.order_no and log.dt='2017-09-17' and a.dt='2017-09-17')
where a.order_no is not null and log.orderno is not null
) t1 left join pay_recommend.pay_type pt
on (t1.pm_code=pt.pm_code and t1.bank_code=pt.bank_code)
sql2:
select *
from (select log.*,a.batch_order_no,a.pm_code,a.bank_code,a.pay_status
from pay_recommend_metric.recommend_log log
left join dw_transform_paycenter.all_order_info a
on (log.orderno=a.order_no)
where log.dt='2017-09-17' and a.dt='2017-09-17' and a.order_no is not null and log.orderno is not null
) t1 left join pay_recommend.pay_type pt
on (t1.pm_code=pt.pm_code and t1.bank_code=pt.bank_code)
select *
from (select log.*,a.batch_order_no,a.pm_code,a.bank_code,a.pay_status
from (
select *
from pay_recommend_metric.recommend_log
where dt='2017-09-17' and order_no is not null
) log
left join (
select *
from dw_transform_paycenter.all_order_info
where dt='2017-09-17' and order_no is not null
)a
on (log.orderno=a.order_no)
) t1 left join pay_recommend.pay_type pt
on (t1.pm_code=pt.pm_code and t1.bank_code=pt.bank_code)
分析:這三個sql區(qū)別就在于log表和a表的過濾條件是在where字句上還是on字句上速那。
sql1的結(jié)果:
a表和log表的所有數(shù)據(jù)做連接半抱,只是在a表的dt!='2017-09-17'記錄上a表上所有字段都為空
sql2的結(jié)果:
a表和log表的所有數(shù)據(jù)先做連接拱燃,然后過濾出兩個表只在'2017-09-17'分區(qū)上的數(shù)據(jù)
sql3的結(jié)果:
a表和log表分別查詢出'2017-09-17'分區(qū)上數(shù)據(jù)神郊,兩個小數(shù)據(jù)集做關(guān)聯(lián)讥裤,結(jié)果和sql2的相同
總結(jié):
對于a join b蛔添,無論過濾條件放到on中還是where中痰催,結(jié)果相同。
對于a left join b迎瞧,過濾條件在on中:忽略a表的過濾條件陨囊,b表的過濾條件只會導(dǎo)致不滿足條件的記錄在b表的所有字段上都為空;過濾條件在where中:過濾掉所有不滿足條件的記錄夹攒。
把握核心:先做join蜘醋,left join時on中條件無過濾記錄作用,后做where條件過濾記錄
sql3的執(zhí)行效率最高咏尝,推薦使用压语;sql2的效率較低,因?yàn)樾枰黻P(guān)聯(lián)编检;sql1有bug胎食,不建議直接在on中使用過濾條件
但是注意不要將where過濾條件放到on中,除非你很了解SQL執(zhí)行后的結(jié)果允懂;另外不要將on連接條件放到where中厕怜,hive并不會像mysq那樣做連接優(yōu)化,這樣會導(dǎo)致不可控的情況