Common Hive Commands: A Brief Introduction to MSCK REPAIR TABLE

At work I have noticed that many colleagues don't even know the basic Hive commands, so I'm planning to write a series summarizing the commonly used ones. The first command covered is MSCK REPAIR TABLE.

What does the MSCK REPAIR TABLE command do?

MSCK REPAIR TABLE is mainly used to solve the problem where data written into a Hive partitioned table's HDFS directory via hdfs dfs -put or the HDFS API cannot be queried from Hive.

Hive has a service called the metastore, which stores metadata such as database names, table names, and table partitions. When data is not loaded through Hive statements such as INSERT, many partitions never make it into the metastore, and if a lot of partitions are involved, adding them one by one with ALTER TABLE table_name ADD PARTITION is tedious. This is where MSCK REPAIR TABLE comes in: when you run it, Hive scans the table's directories on HDFS and writes any partitions that are missing from the metastore into the metastore.
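A minimal sketch contrasting the two approaches, using the repair_test table from the example below (the partition value and HDFS location are taken from that example and are illustrative):

-- Manual registration: one statement per missing partition, which gets tedious at scale.
ALTER TABLE repair_test ADD PARTITION (par='partition_2')
LOCATION '/user/hive/warehouse/test.db/repair_test/par=partition_2';

-- Bulk registration: Hive scans the table location on HDFS and registers every
-- directory following the par=<value> convention that is missing from the metastore.
MSCK REPAIR TABLE repair_test;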

Example

First we create a partitioned table, insert one row into one of its partitions, and then look at the partition information.

CREATE TABLE repair_test (col_a STRING) PARTITIONED BY (par STRING);
INSERT INTO TABLE repair_test PARTITION(par="partition_1") VALUES ("test");
SHOW PARTITIONS repair_test;



Listing the partitions gives the following result:

0: jdbc:hive2://localhost:10000> show partitions repair_test;
INFO  : Compiling command(queryId=hive_20180810175151_5260f52e-10bb-4589-ad48-31ba72a81c21): show partitions repair_test
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:partition, type:string, comment:from deserializer)], properties:null)
INFO  : Completed compiling command(queryId=hive_20180810175151_5260f52e-10bb-4589-ad48-31ba72a81c21); Time taken: 0.029 seconds
INFO  : Executing command(queryId=hive_20180810175151_5260f52e-10bb-4589-ad48-31ba72a81c21): show partitions repair_test
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hive_20180810175151_5260f52e-10bb-4589-ad48-31ba72a81c21); Time taken: 0.017 seconds
INFO  : OK
+------------------+--+
|    partition     |
+------------------+--+
| par=partition_1  |
+------------------+--+
1 row selected (0.073 seconds)
0: jdbc:hive2://localhost:10000> 



Next, we manually create data for a new partition with the HDFS put command:

[ericsson@h3cnamenode1 pcc]$ echo "123123" > test.txt
[ericsson@h3cnamenode1 pcc]$ hdfs dfs -mkdir -p /user/hive/warehouse/test.db/repair_test/par=partition_2/
[ericsson@h3cnamenode1 pcc]$ hdfs dfs -put -f test.txt /user/hive/warehouse/test.db/repair_test/par=partition_2/
[ericsson@h3cnamenode1 pcc]$ hdfs dfs -ls -R /user/hive/warehouse/test.db/repair_test
drwxrwxrwt   - ericsson hive          0 2018-08-10 17:46 /user/hive/warehouse/test.db/repair_test/par=partition_1
drwxrwxrwt   - ericsson hive          0 2018-08-10 17:46 /user/hive/warehouse/test.db/repair_test/par=partition_1/.hive-staging_hive_2018-08-10_17-45-59_029_1594310228554990949-1
drwxrwxrwt   - ericsson hive          0 2018-08-10 17:46 /user/hive/warehouse/test.db/repair_test/par=partition_1/.hive-staging_hive_2018-08-10_17-45-59_029_1594310228554990949-1/-ext-10000
-rwxrwxrwt   3 ericsson hive          5 2018-08-10 17:46 /user/hive/warehouse/test.db/repair_test/par=partition_1/000000_0
drwxr-xr-x   - ericsson hive          0 2018-08-10 17:57 /user/hive/warehouse/test.db/repair_test/par=partition_2
-rw-r--r--   3 ericsson hive          7 2018-08-10 17:57 /user/hive/warehouse/test.db/repair_test/par=partition_2/test.txt
[ericsson@h3cnamenode1 pcc]$ 


If we now list the partitions again, we can see that partition_2 has not been added to Hive:

0: jdbc:hive2://localhost:10000> show partitions repair_test;
INFO  : Compiling command(queryId=hive_20180810175959_e7cefe8c-57b5-486c-8e03-b1201dac4d79): show partitions repair_test
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:partition, type:string, comment:from deserializer)], properties:null)
INFO  : Completed compiling command(queryId=hive_20180810175959_e7cefe8c-57b5-486c-8e03-b1201dac4d79); Time taken: 0.029 seconds
INFO  : Executing command(queryId=hive_20180810175959_e7cefe8c-57b5-486c-8e03-b1201dac4d79): show partitions repair_test
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hive_20180810175959_e7cefe8c-57b5-486c-8e03-b1201dac4d79); Time taken: 0.02 seconds
INFO  : OK
+------------------+--+
|    partition     |
+------------------+--+
| par=partition_1  |
+------------------+--+
1 row selected (0.079 seconds)
0: jdbc:hive2://localhost:10000>



運(yùn)行MSCK REPAIR TABLE 命令后再查詢分區(qū)信息,可以看到通過put命令放入的分區(qū)已經(jīng)可以查詢了

0: jdbc:hive2://localhost:10000> MSCK REPAIR TABLE repair_test;
INFO  : Compiling command(queryId=hive_20180810180000_7099daf2-6fde-44dd-8938-d2a02589358f): MSCK REPAIR TABLE repair_test
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=hive_20180810180000_7099daf2-6fde-44dd-8938-d2a02589358f); Time taken: 0.004 seconds
INFO  : Executing command(queryId=hive_20180810180000_7099daf2-6fde-44dd-8938-d2a02589358f): MSCK REPAIR TABLE repair_test
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hive_20180810180000_7099daf2-6fde-44dd-8938-d2a02589358f); Time taken: 0.138 seconds
INFO  : OK
No rows affected (0.154 seconds)
0: jdbc:hive2://localhost:10000> show partitions repair_test;
INFO  : Compiling command(queryId=hive_20180810180000_ff711820-6f41-4d5d-9fee-b6e1cdbe1e25): show partitions repair_test
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:partition, type:string, comment:from deserializer)], properties:null)
INFO  : Completed compiling command(queryId=hive_20180810180000_ff711820-6f41-4d5d-9fee-b6e1cdbe1e25); Time taken: 0.045 seconds
INFO  : Executing command(queryId=hive_20180810180000_ff711820-6f41-4d5d-9fee-b6e1cdbe1e25): show partitions repair_test
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hive_20180810180000_ff711820-6f41-4d5d-9fee-b6e1cdbe1e25); Time taken: 0.016 seconds
INFO  : OK
+------------------+--+
|    partition     |
+------------------+--+
| par=partition_1  |
| par=partition_2  |
+------------------+--+
2 rows selected (0.088 seconds)
0: jdbc:hive2://localhost:10000> select * from repair_test;
INFO  : Compiling command(queryId=hive_20180810180101_1225075e-43c8-4a49-b8ef-a12f72544a38): select * from repair_test
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:repair_test.col_a, type:string, comment:null), FieldSchema(name:repair_test.par, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20180810180101_1225075e-43c8-4a49-b8ef-a12f72544a38); Time taken: 0.059 seconds
INFO  : Executing command(queryId=hive_20180810180101_1225075e-43c8-4a49-b8ef-a12f72544a38): select * from repair_test
INFO  : Completed executing command(queryId=hive_20180810180101_1225075e-43c8-4a49-b8ef-a12f72544a38); Time taken: 0.001 seconds
INFO  : OK
+--------------------+------------------+--+
| repair_test.col_a  | repair_test.par  |
+--------------------+------------------+--+
| test               | partition_1      |
| 123123             | partition_2      |
+--------------------+------------------+--+
2 rows selected (0.121 seconds)
0: jdbc:hive2://localhost:10000>






后續(xù)

后面發(fā)生了更有意思的事情抗碰。大致情況是很多人以為alter table drop partition只能刪除一個(gè)分區(qū)的數(shù)據(jù),結(jié)果用hdfs dfs -rmr 刪除hive分區(qū)表的hdfs文件绽乔。這就導(dǎo)致了一個(gè)問題hdfs上的文件雖然刪除了弧蝇,但是hive metastore中的原信息沒有刪除。如果用show parttions table_name 這些分區(qū)信息還在折砸,需要把這些分區(qū)原信息清除看疗。

后來我想看看MSCK REPAIR TABLE這個(gè)命令能否刪除已經(jīng)不存在hdfs上的表分區(qū)信息,發(fā)現(xiàn)不行鞍爱,我去jira查了下,發(fā)現(xiàn)Fix Version/s: 3.0.0, 2.4.0, 3.1.0 這幾個(gè)版本的hive才支持這個(gè)功能专酗。但由于我們的hive版本是1.1.0-cdh5.11.0睹逃, 這個(gè)方法無法使用。

For reference, the relevant section of the official documentation:

Recover Partitions (MSCK REPAIR TABLE)

Hive stores a list of partitions for each table in its metastore. If, however, new partitions are directly added to HDFS (say by using hadoop fs -put command) or removed from HDFS, the metastore (and hence Hive) will not be aware of these changes to partition information unless the user runs ALTER TABLE table_name ADD/DROP PARTITION commands on each of the newly added or removed partitions, respectively.
However, users can run a metastore check command with the repair table option:
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];
which will update metadata about partitions to the Hive metastore for partitions for which such metadata doesn't already exist. The default option for MSC command is ADD PARTITIONS. With this option, it will add any partitions that exist on HDFS but not in metastore to the metastore. The DROP PARTITIONS option will remove the partition information from metastore, that is already removed from HDFS. The SYNC PARTITIONS option is equivalent to calling both ADD and DROP PARTITIONS. See HIVE-874 and HIVE-17824 for more details. When there is a large number of untracked partitions, there is a provision to run MSCK REPAIR TABLE batch wise to avoid OOME (Out of Memory Error). By giving the configured batch size for the property hive.msck.repair.batch.size it can run in the batches internally. The default value of the property is zero, it means it will execute all the partitions at once. MSCK command without the REPAIR option can be used to find details about metadata mismatch metastore.
The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is:
ALTER TABLE table_name RECOVER PARTITIONS;
Starting with Hive 1.3, MSCK will throw exceptions if directories with disallowed characters in partition values are found on HDFS. Use hive.msck.path.validation setting on the client to alter this behavior; "skip" will simply skip the directories. "ignore" will try to create partitions anyway (old behavior). This may or may not work.
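As a minimal sketch of the newer syntax described above (available from Hive 2.4.0 / 3.0.0 onwards; the table name and batch size are illustrative):

-- Add partitions that exist on HDFS but not in the metastore (the default behaviour).
MSCK REPAIR TABLE repair_test ADD PARTITIONS;

-- Drop metastore entries whose directories no longer exist on HDFS.
MSCK REPAIR TABLE repair_test DROP PARTITIONS;

-- Do both in one pass.
MSCK REPAIR TABLE repair_test SYNC PARTITIONS;

-- Process untracked partitions in batches of 1000 to avoid running out of memory;
-- the default of 0 means all partitions are handled at once.
SET hive.msck.repair.batch.size=1000;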

HIVE-17824 是關(guān)于hive msck repair 增加清理metastore中已經(jīng)不在hdfs上的分區(qū)信息
