HIVE,增量更新數(shù)據(jù)

生產(chǎn)環(huán)境應(yīng)用場(chǎng)景描述

每天有很多事實(shí)表需要增量同步,在HIVE沒有開啟事務(wù)模式的條件下,需要全表重新寫入HDFS中作烟,這在需要巨大的IO時(shí)間開銷振湾,每天的增量數(shù)據(jù)占總數(shù)據(jù)的比列很小球涛,這種方式顯得非常低效劣针。
現(xiàn)在的想法是,考慮利用 INSERT OVER TABLE 語(yǔ)句對(duì)分區(qū)表進(jìn)行指定分區(qū)覆蓋插入亿扁,來實(shí)現(xiàn)增量更新的效果捺典。
1,首先創(chuàng)建測(cè)試數(shù)據(jù)集
  • a从祝、全量數(shù)據(jù)集
CREATE TABLE `ods.employees_all`( `employee_id` double,
`first_name` string,
`last_name` string,
`email` string,
`phone_number` string,
`hire_date` timestamp,
`job_id` string,
`salary` double,
`commission_pct` double,
`manager_id` double,
`department_id` int)
partitioned by (ds string) 
stored as orcfile;

INSERT INTO ods.employees_all (employee_id, first_name, last_name, email, phone_number, hire_date, job_id, salary, commission_pct, manager_id, department_id, ds) VALUES(1, 'Eleni', 'Zlotkey', 'EZLOTKEY', '011.44.1344.429018', '2001-01-29 00:00:00', 'SA_MAN', 10500.0, 0.2, 100.0, 80.0, '2001');
INSERT INTO ods.employees_all (employee_id, first_name, last_name, email, phone_number, hire_date, job_id, salary, commission_pct, manager_id, department_id, ds) VALUES(2, 'Mattea', 'Marvins', 'MMARVINS', '011.44.1346.329268', '2001-01-24 00:00:00', 'SA_REP', 7200.0, 0.1, 147.0, 80.0, '2001');
INSERT INTO ods.employees_all (employee_id, first_name, last_name, email, phone_number, hire_date, job_id, salary, commission_pct, manager_id, department_id, ds) VALUES(3, 'David', 'Lee', 'DLEE', '011.44.1346.529268', '2002-02-23 00:00:00', 'SA_REP', 6800.0, 0.1, 147.0, 80.0, '2002');
INSERT INTO ods.employees_all (employee_id, first_name, last_name, email, phone_number, hire_date, job_id, salary, commission_pct, manager_id, department_id, ds) VALUES(4, 'Sundar', 'Ande', 'SANDE', '011.44.1346.629268', '2002-03-24 00:00:00', 'SA_REP', 6400.0, 0.1, 147.0, 80.0, '2002');
INSERT INTO ods.employees_all (employee_id, first_name, last_name, email, phone_number, hire_date, job_id, salary, commission_pct, manager_id, department_id, ds) VALUES(5, 'Amit', 'Banda', 'ABANDA', '011.44.1346.729268', '2003-04-21 00:00:00', 'SA_REP', 6200.0, 0.1, 147.0, 80.0, '2003');
INSERT INTO ods.employees_all (employee_id, first_name, last_name, email, phone_number, hire_date, job_id, salary, commission_pct, manager_id, department_id, ds) VALUES(6, 'shabi', 'shabi', 'shabi', '590.423.4569', '2003-06-25 00:00:00.0', 'IT_PROG', 4800.0, 0.0, 103.0, 60.0, '2003');
INSERT INTO ods.employees_all (employee_id, first_name, last_name, email, phone_number, hire_date, job_id, salary, commission_pct, manager_id, department_id, ds) VALUES(7, 'chunhuo', 'chunhuo', 'chunhuo', '590.423.4560', '2004-02-05 00:00:00.0', 'IT_PROG', 4800.0, 0.0, 103.0, 60.0, '2004');
INSERT INTO ods.employees_all (employee_id, first_name, last_name, email, phone_number, hire_date, job_id, salary, commission_pct, manager_id, department_id, ds) VALUES(8, 'Payam', 'Kaufling', 'PKAUFLIN', '650.123.3234', '2004-05-01 00:00:00.0', 'ST_MAN', 7900.0, 0.0, 100.0, 50.0, '2004');
INSERT INTO ods.employees_all (employee_id, first_name, last_name, email, phone_number, hire_date, job_id, salary, commission_pct, manager_id, department_id, ds) VALUES(9, 'David', 'Bernstein', 'DBERNSTE', '011.44.1344.345268', '2005-03-24 00:00:00.0', 'SA_REP', 9500.0, 0.25, 145.0, 80.0, '2005');
INSERT INTO ods.employees_all (employee_id, first_name, last_name, email, phone_number, hire_date, job_id, salary, commission_pct, manager_id, department_id, ds) VALUES(10, 'Ellen', 'Abel', 'EABEL', '011.44.1644.429267', '2005-05-11 00:00:00.0', 'SA_REP', 11000.0, 0.3, 149.0, 80.0, '2005');
INSERT INTO ods.employees_all (employee_id, first_name, last_name, email, phone_number, hire_date, job_id, salary, commission_pct, manager_id, department_id, ds) VALUES(11, 'Britney', 'Everett', 'BEVERETT', '650.501.2876', '2006-03-03 00:00:00.0', 'SH_CLERK', 3900.0, 0.0, 123.0, 50.0, '2006');
INSERT INTO ods.employees_all (employee_id, first_name, last_name, email, phone_number, hire_date, job_id, salary, commission_pct, manager_id, department_id, ds) VALUES(12, 'Samuel', 'McCain', 'SMCCAIN', '650.501.3876', '2006-07-01 00:00:00.0', 'SH_CLERK', 3200.0, 0.0, 123.0, 50.0, '2006');

  • b襟己、增量數(shù)據(jù)集
CREATE TABLE `ods.employees_all`( `employee_id` double,
`first_name` string,
`last_name` string,
`email` string,
`phone_number` string,
`hire_date` timestamp,
`job_id` string,
`salary` double,
`commission_pct` double,
`manager_id` double,
`department_id` int)
partitioned by (ds string) 
stored as orcfile;

INSERT INTO ods.employees_tmp (employee_id, first_name, last_name, email, phone_number, hire_date, job_id, salary, commission_pct, manager_id, department_id, ds) VALUES(4, 'Sundar_update', 'Sundar_update', 'Sundar_update', '011.44.1346.629268', '2002-03-24 00:00:00', 'SA_REP', 6400.0, 0.1, 147.0, 80.0, '2002');
INSERT INTO ods.employees_tmp (employee_id, first_name, last_name, email, phone_number, hire_date, job_id, salary, commission_pct, manager_id, department_id, ds) VALUES(6, 'shabi_update', 'shabi_update', 'shabi_update', '590.423.4569', '2003-06-25 00:00:00.0', 'IT_PROG', 4800.0, 0.0, 103.0, 60.0, '2003');
INSERT INTO ods.employees_tmp (employee_id, first_name, last_name, email, phone_number, hire_date, job_id, salary, commission_pct, manager_id, department_id, ds) VALUES(8, 'Payam_update', 'Payam_update', 'Payam_update', '650.123.3234', '2004-05-01 00:00:00.0', 'ST_MAN', 7900.0, 0.0, 100.0, 50.0, '2004');
INSERT INTO ods.employees_tmp (employee_id, first_name, last_name, email, phone_number, hire_date, job_id, salary, commission_pct, manager_id, department_id, ds) VALUES(13, 'David_incremental', 'David_incremental', 'David_incremental', '011.44.1344.345268', '2007-03-24 00:00:00.0', 'SA_REP', 9500.0, 0.25, 145.0, 80.0, '2007');
INSERT INTO ods.employees_tmp (employee_id, first_name, last_name, email, phone_number, hire_date, job_id, salary, commission_pct, manager_id, department_id, ds) VALUES(14, 'Ellen_incremental', 'Ellen_incremental', 'Ellen_incremental', '011.44.1644.429267', '2007-05-11 00:00:00.0', 'SA_REP', 11000.0, 0.3, 149.0, 80.0, '2007');
INSERT INTO ods.employees_tmp (employee_id, first_name, last_name, email, phone_number, hire_date, job_id, salary, commission_pct, manager_id, department_id, ds) VALUES(15, 'Britney_incremental', 'Britney_incremental', 'Britney_incremental', '650.501.2876', '2008-03-03 00:00:00.0', 'SH_CLERK', 3900.0, 0.0, 123.0, 50.0, '2008');
INSERT INTO ods.employees_tmp (employee_id, first_name, last_name, email, phone_number, hire_date, job_id, salary, commission_pct, manager_id, department_id, ds) VALUES(16, 'Samuel_incremental', 'Samuel_incremental', 'Samuel_incremental', '650.501.3876', '2008-07-01 00:00:00.0', 'SH_CLERK', 3200.0, 0.0, 123.0, 50.0, '2008');
2,預(yù)覽測(cè)試數(shù)據(jù)
  • 全量


    全量數(shù)據(jù)
  • 增量


    增量數(shù)據(jù)
3牍陌,查看全量表HDFS分區(qū)的目錄情況
全量表HDFS分區(qū)目錄
4擎浴,用INSER OVERWRITE TABLE語(yǔ)句定向更新目標(biāo)分區(qū)

通過預(yù)覽表數(shù)據(jù),我知道全量表當(dāng)前有12條數(shù)據(jù)毒涧,6個(gè)分區(qū)贮预;增量表有7條數(shù)據(jù),5個(gè)分區(qū)契讲,其中2007和2008是新增分區(qū)仿吞,2002,2003,2004是更新分區(qū)。更新語(yǔ)句如下:

INSERT OVERWRITE TABLE ods.employees_all PARTITION(ds) 
SELECT t1.* FROM (
SELECT a.* 
FROM ods.employees_all a 
LEFT join ods.employees_tmp b ON a.employee_id = b.employee_id 
WHERE b.employee_id IS NULL 
  AND EXISTS (SELECT 1 FROM ods.employees_tmp c WHERE a.ds=c.ds )
UNION ALL 
SELECT * FROM ods.employees_tmp 
) t1

其中表t1的結(jié)果集為:


t1表數(shù)據(jù)集捡偏,各數(shù)據(jù)行變化或來源圖中已做詳細(xì)標(biāo)注

這里需要解釋一下的被動(dòng)更新行唤冈,他的意思是這些行本身并不在新增數(shù)據(jù)集中,但因?yàn)槠浞謪^(qū)與新增數(shù)據(jù)集中的某些行的分區(qū)相同霹琼,因此也被命中以便覆蓋全量數(shù)據(jù)集中的目標(biāo)分區(qū)务傲。

由t1數(shù)據(jù)集中有共有10條數(shù)據(jù)凉当,其中新增4條枣申,更新3條,被動(dòng)更新3條看杭,因此如果INSERT OVERWRITE TABLE語(yǔ)句執(zhí)行成功后忠藤,ods.employees_all中應(yīng)該有16條數(shù)據(jù)。

以下是更新后的全量數(shù)據(jù)集:


更新后的全量數(shù)據(jù)集

數(shù)據(jù)如預(yù)期的完全一致楼雹,說明INSERT OVERWRITE TABLE語(yǔ)句確實(shí)是分區(qū)表做增量更新的最優(yōu)選擇模孩,這種更新方式邏輯清晰簡(jiǎn)單,實(shí)現(xiàn)方式優(yōu)雅贮缅,絕對(duì)是不二之選榨咐。

5,更新后全量表HDFS分區(qū)的目錄情況
更新后全量表HDFS分區(qū)的目錄情況

可以看到谴供,在HDFS目錄中块茁,語(yǔ)句執(zhí)行成功后,自動(dòng)創(chuàng)建了新增的4條記錄所對(duì)應(yīng)的分區(qū)。

6数焊,總結(jié)

之前一直苦于在HIVE不開啟事務(wù)的模式下永淌,怎么做增量更新。
剛接觸hive佩耳,思維還停留在傳統(tǒng)的關(guān)系型數(shù)據(jù)庫(kù)中遂蛀,對(duì)不能DELETE,UPDATE操作的數(shù)倉(cāng)方式非常不適應(yīng)干厚,對(duì)這種無法更新數(shù)據(jù)行的數(shù)倉(cāng)深感蛋疼(認(rèn)識(shí)還處于低級(jí)水平所致)李滴。經(jīng)過這番探索后發(fā)現(xiàn),其實(shí)hive遠(yuǎn)比自己想的要強(qiáng)大的多萍诱,一般常規(guī)性的問題悬嗓,前人早已給出解決方案,自己的困惑完全是來自于低級(jí)的無知裕坊。

上面的這個(gè)例子中包竹,如果有數(shù)據(jù)庫(kù)日志更新表或原表中有一個(gè)可用的update_time時(shí)間戳,t1數(shù)據(jù)集其實(shí)可以在sqoop中的query選項(xiàng)使用籍凝,導(dǎo)入HDFS后生成一個(gè)臨時(shí)表周瞎,然后直接用這個(gè)臨時(shí)表insert overwrite table到目標(biāo)全量表。

完饵蒂。

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
  • 序言:七十年代末声诸,一起剝皮案震驚了整個(gè)濱河市,隨后出現(xiàn)的幾起案子退盯,更是在濱河造成了極大的恐慌彼乌,老刑警劉巖,帶你破解...
    沈念sama閱讀 211,817評(píng)論 6 492
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件渊迁,死亡現(xiàn)場(chǎng)離奇詭異慰照,居然都是意外死亡,警方通過查閱死者的電腦和手機(jī)琉朽,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 90,329評(píng)論 3 385
  • 文/潘曉璐 我一進(jìn)店門毒租,熙熙樓的掌柜王于貴愁眉苦臉地迎上來,“玉大人箱叁,你說我怎么就攤上這事墅垮。” “怎么了耕漱?”我有些...
    開封第一講書人閱讀 157,354評(píng)論 0 348
  • 文/不壞的土叔 我叫張陵算色,是天一觀的道長(zhǎng)。 經(jīng)常有香客問我螟够,道長(zhǎng)灾梦,這世上最難降的妖魔是什么? 我笑而不...
    開封第一講書人閱讀 56,498評(píng)論 1 284
  • 正文 為了忘掉前任,我火速辦了婚禮斥废,結(jié)果婚禮上椒楣,老公的妹妹穿的比我還像新娘。我一直安慰自己牡肉,他們只是感情好捧灰,可當(dāng)我...
    茶點(diǎn)故事閱讀 65,600評(píng)論 6 386
  • 文/花漫 我一把揭開白布。 她就那樣靜靜地躺著统锤,像睡著了一般毛俏。 火紅的嫁衣襯著肌膚如雪。 梳的紋絲不亂的頭發(fā)上饲窿,一...
    開封第一講書人閱讀 49,829評(píng)論 1 290
  • 那天煌寇,我揣著相機(jī)與錄音,去河邊找鬼逾雄。 笑死阀溶,一個(gè)胖子當(dāng)著我的面吹牛,可吹牛的內(nèi)容都是我干的鸦泳。 我是一名探鬼主播银锻,決...
    沈念sama閱讀 38,979評(píng)論 3 408
  • 文/蒼蘭香墨 我猛地睜開眼,長(zhǎng)吁一口氣:“原來是場(chǎng)噩夢(mèng)啊……” “哼做鹰!你這毒婦竟也來了击纬?” 一聲冷哼從身側(cè)響起,我...
    開封第一講書人閱讀 37,722評(píng)論 0 266
  • 序言:老撾萬(wàn)榮一對(duì)情侶失蹤钾麸,失蹤者是張志新(化名)和其女友劉穎更振,沒想到半個(gè)月后,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體饭尝,經(jīng)...
    沈念sama閱讀 44,189評(píng)論 1 303
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡肯腕,尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 36,519評(píng)論 2 327
  • 正文 我和宋清朗相戀三年,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了芋肠。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片乎芳。...
    茶點(diǎn)故事閱讀 38,654評(píng)論 1 340
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡遵蚜,死狀恐怖帖池,靈堂內(nèi)的尸體忽然破棺而出,到底是詐尸還是另有隱情吭净,我是刑警寧澤睡汹,帶...
    沈念sama閱讀 34,329評(píng)論 4 330
  • 正文 年R本政府宣布,位于F島的核電站寂殉,受9級(jí)特大地震影響囚巴,放射性物質(zhì)發(fā)生泄漏。R本人自食惡果不足惜,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 39,940評(píng)論 3 313
  • 文/蒙蒙 一彤叉、第九天 我趴在偏房一處隱蔽的房頂上張望庶柿。 院中可真熱鬧,春花似錦秽浇、人聲如沸浮庐。這莊子的主人今日做“春日...
    開封第一講書人閱讀 30,762評(píng)論 0 21
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽(yáng)审残。三九已至,卻和暖如春斑举,著一層夾襖步出監(jiān)牢的瞬間搅轿,已是汗流浹背。 一陣腳步聲響...
    開封第一講書人閱讀 31,993評(píng)論 1 266
  • 我被黑心中介騙來泰國(guó)打工富玷, 沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留璧坟,地道東北人。 一個(gè)月前我還...
    沈念sama閱讀 46,382評(píng)論 2 360
  • 正文 我出身青樓赎懦,卻偏偏與公主長(zhǎng)得像沸柔,于是被迫代替她去往敵國(guó)和親。 傳聞我的和親對(duì)象是個(gè)殘疾皇子铲敛,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 43,543評(píng)論 2 349

推薦閱讀更多精彩內(nèi)容