工作中發(fā)現(xiàn)很多同事連基礎(chǔ)的hive命令都不知道醇份,所以準(zhǔn)備寫一個(gè)系列把hive一些常用的命令進(jìn)行一個(gè)總結(jié)稼锅。第一個(gè)講的命令是MSCK REPAIR TABLE
。
MSCK REPAIR TABLE 命令是做啥的
MSCK REPAIR TABLE
命令主要是用來解決通過hdfs dfs -put或者h(yuǎn)dfs api寫入hive分區(qū)表的數(shù)據(jù)在hive中無法被查詢到的問題僚纷。
我們知道hive有個(gè)服務(wù)叫metastore矩距,這個(gè)服務(wù)主要是存儲(chǔ)一些元數(shù)據(jù)信息,比如數(shù)據(jù)庫名怖竭,表名或者表的分區(qū)等等信息锥债。如果不是通過hive的insert等插入語句,很多分區(qū)信息在metastore中是沒有的痊臭,如果插入分區(qū)數(shù)據(jù)量很多的話哮肚,你用 ALTER TABLE table_name ADD PARTITION
一個(gè)個(gè)分區(qū)添加十分麻煩。這時(shí)候MSCK REPAIR TABLE
就派上用場了广匙。只需要運(yùn)行MSCK REPAIR TABLE
命令允趟,hive就會(huì)去檢測這個(gè)表在hdfs上的文件,把沒有寫入metastore的分區(qū)信息寫入metastore鸦致。
例子
我們先創(chuàng)建一個(gè)分區(qū)表潮剪,然后往其中的一個(gè)分區(qū)插入一條數(shù)據(jù),在查看分區(qū)信息
CREATE TABLE repair_test (col_a STRING) PARTITIONED BY (par STRING);
INSERT INTO TABLE repair_test PARTITION(par="partition_1") VALUES ("test");
SHOW PARTITIONS repair_test;
查看分區(qū)信息的結(jié)果如下
0: jdbc:hive2://localhost:10000> show partitions repair_test;
INFO : Compiling command(queryId=hive_20180810175151_5260f52e-10bb-4589-ad48-31ba72a81c21): show partitions repair_test
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:partition, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=hive_20180810175151_5260f52e-10bb-4589-ad48-31ba72a81c21); Time taken: 0.029 seconds
INFO : Executing command(queryId=hive_20180810175151_5260f52e-10bb-4589-ad48-31ba72a81c21): show partitions repair_test
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hive_20180810175151_5260f52e-10bb-4589-ad48-31ba72a81c21); Time taken: 0.017 seconds
INFO : OK
+------------------+--+
| partition |
+------------------+--+
| par=partition_1 |
+------------------+--+
1 row selected (0.073 seconds)
0: jdbc:hive2://localhost:10000>
然后我們通過hdfs的put命令手動(dòng)創(chuàng)建一個(gè)數(shù)據(jù)
[ericsson@h3cnamenode1 pcc]$ echo "123123" > test.txt
[ericsson@h3cnamenode1 pcc]$ hdfs dfs -mkdir -p /user/hive/warehouse/test.db/repair_test/par=partition_2/
[ericsson@h3cnamenode1 pcc]$ hdfs dfs -put -f test.txt /user/hive/warehouse/test.db/repair_test/par=partition_2/
[ericsson@h3cnamenode1 pcc]$ hdfs dfs -ls -R /user/hive/warehouse/test.db/repair_test
drwxrwxrwt - ericsson hive 0 2018-08-10 17:46 /user/hive/warehouse/test.db/repair_test/par=partition_1
drwxrwxrwt - ericsson hive 0 2018-08-10 17:46 /user/hive/warehouse/test.db/repair_test/par=partition_1/.hive-staging_hive_2018-08-10_17-45-59_029_1594310228554990949-1
drwxrwxrwt - ericsson hive 0 2018-08-10 17:46 /user/hive/warehouse/test.db/repair_test/par=partition_1/.hive-staging_hive_2018-08-10_17-45-59_029_1594310228554990949-1/-ext-10000
-rwxrwxrwt 3 ericsson hive 5 2018-08-10 17:46 /user/hive/warehouse/test.db/repair_test/par=partition_1/000000_0
drwxr-xr-x - ericsson hive 0 2018-08-10 17:57 /user/hive/warehouse/test.db/repair_test/par=partition_2
-rw-r--r-- 3 ericsson hive 7 2018-08-10 17:57 /user/hive/warehouse/test.db/repair_test/par=partition_2/test.txt
[ericsson@h3cnamenode1 pcc]$
這時(shí)候我們查詢分區(qū)信息分唾,發(fā)現(xiàn)partition_2這個(gè)分區(qū)并沒有加入到hive中
0: jdbc:hive2://localhost:10000> show partitions repair_test;
INFO : Compiling command(queryId=hive_20180810175959_e7cefe8c-57b5-486c-8e03-b1201dac4d79): show partitions repair_test
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:partition, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=hive_20180810175959_e7cefe8c-57b5-486c-8e03-b1201dac4d79); Time taken: 0.029 seconds
INFO : Executing command(queryId=hive_20180810175959_e7cefe8c-57b5-486c-8e03-b1201dac4d79): show partitions repair_test
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hive_20180810175959_e7cefe8c-57b5-486c-8e03-b1201dac4d79); Time taken: 0.02 seconds
INFO : OK
+------------------+--+
| partition |
+------------------+--+
| par=partition_1 |
+------------------+--+
1 row selected (0.079 seconds)
0: jdbc:hive2://localhost:10000>
運(yùn)行MSCK REPAIR TABLE 命令后再查詢分區(qū)信息,可以看到通過put命令放入的分區(qū)已經(jīng)可以查詢了
0: jdbc:hive2://localhost:10000> MSCK REPAIR TABLE repair_test;
INFO : Compiling command(queryId=hive_20180810180000_7099daf2-6fde-44dd-8938-d2a02589358f): MSCK REPAIR TABLE repair_test
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=hive_20180810180000_7099daf2-6fde-44dd-8938-d2a02589358f); Time taken: 0.004 seconds
INFO : Executing command(queryId=hive_20180810180000_7099daf2-6fde-44dd-8938-d2a02589358f): MSCK REPAIR TABLE repair_test
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hive_20180810180000_7099daf2-6fde-44dd-8938-d2a02589358f); Time taken: 0.138 seconds
INFO : OK
No rows affected (0.154 seconds)
0: jdbc:hive2://localhost:10000> show partitions repair_test;
INFO : Compiling command(queryId=hive_20180810180000_ff711820-6f41-4d5d-9fee-b6e1cdbe1e25): show partitions repair_test
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:partition, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=hive_20180810180000_ff711820-6f41-4d5d-9fee-b6e1cdbe1e25); Time taken: 0.045 seconds
INFO : Executing command(queryId=hive_20180810180000_ff711820-6f41-4d5d-9fee-b6e1cdbe1e25): show partitions repair_test
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hive_20180810180000_ff711820-6f41-4d5d-9fee-b6e1cdbe1e25); Time taken: 0.016 seconds
INFO : OK
+------------------+--+
| partition |
+------------------+--+
| par=partition_1 |
| par=partition_2 |
+------------------+--+
2 rows selected (0.088 seconds)
0: jdbc:hive2://localhost:10000> select * from repair_test;
INFO : Compiling command(queryId=hive_20180810180101_1225075e-43c8-4a49-b8ef-a12f72544a38): select * from repair_test
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:repair_test.col_a, type:string, comment:null), FieldSchema(name:repair_test.par, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20180810180101_1225075e-43c8-4a49-b8ef-a12f72544a38); Time taken: 0.059 seconds
INFO : Executing command(queryId=hive_20180810180101_1225075e-43c8-4a49-b8ef-a12f72544a38): select * from repair_test
INFO : Completed executing command(queryId=hive_20180810180101_1225075e-43c8-4a49-b8ef-a12f72544a38); Time taken: 0.001 seconds
INFO : OK
+--------------------+------------------+--+
| repair_test.col_a | repair_test.par |
+--------------------+------------------+--+
| test | partition_1 |
| 123123 | partition_2 |
+--------------------+------------------+--+
2 rows selected (0.121 seconds)
0: jdbc:hive2://localhost:10000>
后續(xù)
后面發(fā)生了更有意思的事情抗碰。大致情況是很多人以為alter table drop partition只能刪除一個(gè)分區(qū)的數(shù)據(jù),結(jié)果用hdfs dfs -rmr 刪除hive分區(qū)表的hdfs文件绽乔。這就導(dǎo)致了一個(gè)問題hdfs上的文件雖然刪除了弧蝇,但是hive metastore中的原信息沒有刪除。如果用show parttions table_name 這些分區(qū)信息還在折砸,需要把這些分區(qū)原信息清除看疗。
后來我想看看MSCK REPAIR TABLE這個(gè)命令能否刪除已經(jīng)不存在hdfs上的表分區(qū)信息,發(fā)現(xiàn)不行鞍爱,我去jira查了下,發(fā)現(xiàn)Fix Version/s: 3.0.0, 2.4.0, 3.1.0 這幾個(gè)版本的hive才支持這個(gè)功能专酗。但由于我們的hive版本是1.1.0-cdh5.11.0睹逃, 這個(gè)方法無法使用。
附上官網(wǎng)的鏈接
Recover Partitions (MSCK REPAIR TABLE)
Recover Partitions (MSCK REPAIR TABLE)
Hive stores a list of partitions for each table in its metastore. If, however, new partitions are directly added to HDFS (say by using
hadoop fs -put
command) or removed from HDFS, the metastore (and hence Hive) will not be aware of these changes to partition information unless the user runsALTER TABLE table_name ADD/DROP PARTITION
commands on each of the newly added or removed partitions, respectively.
However, users can run a metastore check command with the repair table option:
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];
which will update metadata about partitions to the Hive metastore for partitions for which such metadata doesn't already exist. The default option for MSC command is ADD PARTITIONS. With this option, it will add any partitions that exist on HDFS but not in metastore to the metastore. The DROP PARTITIONS option will remove the partition information from metastore, that is already removed from HDFS. The SYNC PARTITIONS option is equivalent to calling both ADD and DROP PARTITIONS. See HIVE-874 and HIVE-17824 for more details. When there is a large number of untracked partitions, there is a provision to run MSCK REPAIR TABLE batch wise to avoid OOME (Out of Memory Error). By giving the configured batch size for the property hive.msck.repair.batch.size it can run in the batches internally. The default value of the property is zero, it means it will execute all the partitions at once. MSCK command without the REPAIR option can be used to find details about metadata mismatch metastore.
The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is:
ALTER TABLE table_name RECOVER PARTITIONS;
Starting with Hive 1.3, MSCK will throw exceptions if directories with disallowed characters in partition values are found on HDFS. Use hive.msck.path.validation setting on the client to alter this behavior; "skip" will simply skip the directories. "ignore" will try to create partitions anyway (old behavior). This may or may not work.
HIVE-17824 是關(guān)于hive msck repair 增加清理metastore中已經(jīng)不在hdfs上的分區(qū)信息