hive增量抽取方案

一乍赫、使用sqoop從mysql中抽取數(shù)據(jù)到hive匆篓,查看sqoop官方文檔暂吉,有如下兩種方案:
7.2.9. Incremental Imports
Sqoop provides an incremental import mode which can be used to retrieve only rows newer than some previously-imported set of rows.

The following arguments control incremental imports:

Table 5. Incremental import arguments:

Argument    Description
--check-column (col)    Specifies the column to be examined when determining which rows to import. (the column should not be of type CHAR/NCHAR/VARCHAR/VARNCHAR/ LONGVARCHAR/LONGNVARCHAR)
--incremental (mode)    Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.
--last-value (value)    Specifies the maximum value of the check column from the previous import.

Sqoop supports two types of incremental imports: append and lastmodified. You can use the --incremental argument to specify the type of incremental import to perform.

You should specify append mode when importing a table where new rows are continually being added with increasing row id values. You specify the column containing the row’s id with --check-column. Sqoop imports rows where the check column has a value greater than the one specified with --last-value.

An alternate table update strategy supported by Sqoop is called lastmodified mode. You should use this when rows of the source table may be updated, and each such update will set the value of a last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported.

At the end of an incremental import, the value which should be specified as --last-value for a subsequent import is printed to the screen. When running a subsequent import, you should specify --last-value in this way to ensure you import only the new or updated data. This is handled automatically by creating an incremental import as a saved job, which is the preferred mechanism for performing a recurring incremental import. See the section on saved jobs later in this document for more information.

示例:

簡(jiǎn)單說(shuō)胖秒,sqoop支持兩種增量MySql導(dǎo)入到hive的模式借笙,
一種是 append业稼,即通過(guò)指定一個(gè)遞增的列,比如:
--incremental append --check-column id --last-value 0
導(dǎo)入id>0的數(shù)
另一種是可以根據(jù)時(shí)間戳的模式叫 lastmodified ,比如:
--incremental lastmodified --check-column createTime --last-value '2012-02-01 11:0:00'
就是只導(dǎo)入createTime 比'2012-02-01 11:0:00'更大的數(shù)據(jù)。

--check-column (col)        --檢查的列
--incremental (mode)        --所選模式吩抓,append或者lastmodified 
--last-value (value)          -- 最后一次的值
```
本次采用的是時(shí)間戳方案疹娶,每天導(dǎo)入全量數(shù)據(jù),在hive中抽取最新的數(shù)據(jù)



#####二、hive增量SQL實(shí)現(xiàn)锹安。

數(shù)據(jù):data.txt
```
1,mary,18,2017-06-26 10:00:00
2,lucy,29,2017-06-26 10:00:00
3,jack,18,2017-06-26 10:00:00
4,nick,25,2017-06-26 10:00:00
4,nick,18,2017-06-27 10:00:00
5,tom,26,2017-06-26 10:00:00
5,tom,26,2017-06-27 12:00:00
```
1. 建表語(yǔ)句:
```
create table mytable(id int,name string,age int,createTime string) partitioned by (dt string) row format delimited fields terminated by ',';
```

2. 導(dǎo)入數(shù)據(jù):
```
load data local inpath '/home/ubuntu/data.txt' into table mytable partition(dt='20170626');
```

3. 查看數(shù)據(jù)
```
hive> select * from mytable where dt='20170626';
OK
1   mary    18  2017-06-26 10:00:00 20170626
2   lucy    29  2017-06-26 10:00:00 20170626
3   jack    18  2017-06-26 10:00:00 20170626
4   nick    25  2017-06-26 10:00:00 20170626
4   nick    18  2017-06-27 10:00:00 20170626
5   tom 26  2017-06-26 10:00:00 20170626
5   tom 26  2017-06-27 12:00:00 20170626
Time taken: 0.364 seconds, Fetched: 7 row(s)
hive> 
```
4. 我們發(fā)現(xiàn)20170626中有27號(hào)的增量數(shù)據(jù),所以應(yīng)該將數(shù)據(jù)更新到20160627分區(qū)超升,保留最新的數(shù)據(jù)
(注:hive中刪除分區(qū)方法:alter table mytable drop partition(dt='20170627'))
#####查詢語(yǔ)句如下:
```
SELECT id, name, age, createTime
FROM (SELECT id, name, age, createTime, row_number() OVER (PARTITION BY id ORDER BY createTime DESC) AS rn
    FROM mytable
    ) t
WHERE t.rn = 1;
```
其中:
```
select id,name,age,createTime,row_number() over (partition by id order by createTime DESC) AS rn from mytable
```
使用的hive的窗口函數(shù)row_number()盈滴,該函數(shù)作用是將原表按partition后面的字段分區(qū)后,并且按照createTime字段降序排列后,對(duì)分組內(nèi)部的行記錄進(jìn)行標(biāo)記行號(hào)咬展,分別從1-n順序標(biāo)號(hào)济瓢,
該句的查詢結(jié)果如下:
```
Total MapReduce CPU Time Spent: 2 seconds 250 msec
OK
1   mary    18  2017-06-26 10:00:00 1
2   lucy    29  2017-06-26 10:00:00 1
3   jack    18  2017-06-26 10:00:00 1
4   nick    18  2017-06-27 10:00:00 1
4   nick    25  2017-06-26 10:00:00 2
5   tom 26  2017-06-27 12:00:00 1
5   tom 26  2017-06-26 10:00:00 2
Time taken: 24.823 seconds, Fetched: 7 row(s)
hive> 
```

因此我們很容易得出20170627號(hào)有效的最新數(shù)據(jù)為行號(hào)rn為1的數(shù)據(jù)
#####三箕宙、更新數(shù)據(jù)
最后將數(shù)據(jù)更新到20170627分區(qū),SQL如下
```
INSERT INTO TABLE mytable PARTITION(dt='20170627') 
SELECT id, name, age, createTime
FROM (SELECT id, name, age, createTime, row_number() OVER (PARTITION BY id ORDER BY createTime DESC) AS rn
    FROM mytable
    ) t
WHERE t.rn = 1;
```
查看數(shù)據(jù)
```
    > select * from mytable where dt='20170627';
OK
1   mary    18  2017-06-26 10:00:00 20170627
2   lucy    29  2017-06-26 10:00:00 20170627
3   jack    18  2017-06-26 10:00:00 20170627
4   nick    18  2017-06-27 10:00:00 20170627
5   tom 26  2017-06-27 12:00:00 20170627
Time taken: 0.121 seconds, Fetched: 5 row(s)
hive> 

```
對(duì)比后發(fā)現(xiàn)柬帕,數(shù)據(jù)確實(shí)是最新的哟忍。




最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
  • 序言:七十年代末,一起剝皮案震驚了整個(gè)濱河市陷寝,隨后出現(xiàn)的幾起案子锅很,更是在濱河造成了極大的恐慌,老刑警劉巖凤跑,帶你破解...
    沈念sama閱讀 221,273評(píng)論 6 515
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件爆安,死亡現(xiàn)場(chǎng)離奇詭異,居然都是意外死亡仔引,警方通過(guò)查閱死者的電腦和手機(jī)扔仓,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 94,349評(píng)論 3 398
  • 文/潘曉璐 我一進(jìn)店門,熙熙樓的掌柜王于貴愁眉苦臉地迎上來(lái)咖耘,“玉大人翘簇,你說(shuō)我怎么就攤上這事《梗” “怎么了版保?”我有些...
    開(kāi)封第一講書人閱讀 167,709評(píng)論 0 360
  • 文/不壞的土叔 我叫張陵,是天一觀的道長(zhǎng)义桂。 經(jīng)常有香客問(wèn)我找筝,道長(zhǎng),這世上最難降的妖魔是什么慷吊? 我笑而不...
    開(kāi)封第一講書人閱讀 59,520評(píng)論 1 296
  • 正文 為了忘掉前任,我火速辦了婚禮曹抬,結(jié)果婚禮上溉瓶,老公的妹妹穿的比我還像新娘。我一直安慰自己,他們只是感情好堰酿,可當(dāng)我...
    茶點(diǎn)故事閱讀 68,515評(píng)論 6 397
  • 文/花漫 我一把揭開(kāi)白布疾宏。 她就那樣靜靜地躺著,像睡著了一般触创。 火紅的嫁衣襯著肌膚如雪坎藐。 梳的紋絲不亂的頭發(fā)上,一...
    開(kāi)封第一講書人閱讀 52,158評(píng)論 1 308
  • 那天哼绑,我揣著相機(jī)與錄音岩馍,去河邊找鬼。 笑死抖韩,一個(gè)胖子當(dāng)著我的面吹牛蛀恩,可吹牛的內(nèi)容都是我干的。 我是一名探鬼主播茂浮,決...
    沈念sama閱讀 40,755評(píng)論 3 421
  • 文/蒼蘭香墨 我猛地睜開(kāi)眼双谆,長(zhǎng)吁一口氣:“原來(lái)是場(chǎng)噩夢(mèng)啊……” “哼!你這毒婦竟也來(lái)了席揽?” 一聲冷哼從身側(cè)響起顽馋,我...
    開(kāi)封第一講書人閱讀 39,660評(píng)論 0 276
  • 序言:老撾萬(wàn)榮一對(duì)情侶失蹤,失蹤者是張志新(化名)和其女友劉穎幌羞,沒(méi)想到半個(gè)月后趣避,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體,經(jīng)...
    沈念sama閱讀 46,203評(píng)論 1 319
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡新翎,尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 38,287評(píng)論 3 340
  • 正文 我和宋清朗相戀三年程帕,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片地啰。...
    茶點(diǎn)故事閱讀 40,427評(píng)論 1 352
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡愁拭,死狀恐怖,靈堂內(nèi)的尸體忽然破棺而出亏吝,到底是詐尸還是另有隱情岭埠,我是刑警寧澤,帶...
    沈念sama閱讀 36,122評(píng)論 5 349
  • 正文 年R本政府宣布蔚鸥,位于F島的核電站惜论,受9級(jí)特大地震影響,放射性物質(zhì)發(fā)生泄漏止喷。R本人自食惡果不足惜馆类,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 41,801評(píng)論 3 333
  • 文/蒙蒙 一、第九天 我趴在偏房一處隱蔽的房頂上張望弹谁。 院中可真熱鬧乾巧,春花似錦句喜、人聲如沸。這莊子的主人今日做“春日...
    開(kāi)封第一講書人閱讀 32,272評(píng)論 0 23
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽(yáng)。三九已至旷太,卻和暖如春展懈,著一層夾襖步出監(jiān)牢的瞬間,已是汗流浹背供璧。 一陣腳步聲響...
    開(kāi)封第一講書人閱讀 33,393評(píng)論 1 272
  • 我被黑心中介騙來(lái)泰國(guó)打工存崖, 沒(méi)想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留,地道東北人嗜傅。 一個(gè)月前我還...
    沈念sama閱讀 48,808評(píng)論 3 376
  • 正文 我出身青樓金句,卻偏偏與公主長(zhǎng)得像,于是被迫代替她去往敵國(guó)和親吕嘀。 傳聞我的和親對(duì)象是個(gè)殘疾皇子违寞,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 45,440評(píng)論 2 359

推薦閱讀更多精彩內(nèi)容