Last time I covered MSCK REPAIR TABLE, a commonly used Hive command. This time let's look at Hive's ANALYZE TABLE
command, and in a later post I'll cover Impala's COMPUTE STATS
command. All of these commands gather statistics about tables, which are used to speed up queries.
What Hive's ANALYZE TABLE command does
Statistics such as the number of rows of a table or partition and the histograms of a particular interesting column are important in many ways. One of the key use cases of statistics is query optimization. Statistics serve as the input to the cost functions of the optimizer so that it can compare different plans and choose among them. Statistics may sometimes meet the purpose of the users' queries. Users can quickly get the answers for some of their queries by only querying stored statistics rather than firing long-running execution plans. Some examples are getting the quantile of the users' age distribution, the top 10 apps that are used by people, and the number of distinct sessions.
In short, its main purpose is query optimization: the collected statistics feed the optimizer's cost functions and make queries faster.
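To sketch how the statistics actually get used: `hive.stats.autogather` and `hive.compute.query.using.stats` are standard Hive settings, but the exact behavior depends on your Hive version, so treat this as a sketch rather than a guaranteed recipe.

```sql
-- hive.stats.autogather (on by default) collects basic stats during
-- INSERT OVERWRITE / INSERT INTO; data loaded by other means (LOAD DATA,
-- files dropped directly into HDFS) needs an explicit ANALYZE TABLE.
SET hive.stats.autogather=true;

-- With accurate stats and this flag enabled, simple aggregates such as
-- COUNT(*), MIN and MAX can be answered straight from the metastore,
-- without launching a MapReduce/Tez job.
SET hive.compute.query.using.stats=true;
SELECT COUNT(*) FROM dw_wy_srvcc_customized_drilldown_table_daily;
```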
Hive's main statistics
1. Table and partition statistics:
- Number of rows
- Number of files
- Size in bytes
2. Column statistics: added in Hive 0.10.0
3. Top K statistics
Syntax
ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], ...)] COMPUTE STATISTICS [FOR COLUMNS col1, ...] [NOSCAN];
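A few common invocations of this syntax, using a hypothetical table `t` partitioned by `dt` (all names here are placeholders, not from the examples below):

```sql
-- Full scan: collects numRows and rawDataSize as well as file counts/sizes.
ANALYZE TABLE t PARTITION(dt='20190228') COMPUTE STATISTICS;

-- NOSCAN: only gathers numFiles and totalSize, so it returns quickly
-- but leaves row counts untouched.
ANALYZE TABLE t PARTITION(dt='20190228') COMPUTE STATISTICS NOSCAN;

-- Column statistics (min/max, distinct count, null count, ...)
-- for selected columns.
ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS col1, col2;
```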
Example 1: table- and partition-level analysis
# run the statement
ANALYZE TABLE dw_wy_srvcc_customized_drilldown_table_daily PARTITION(date_time='20190228') COMPUTE STATISTICS;
# view the result
DESCRIBE extended dw_wy_srvcc_customized_drilldown_table_daily PARTITION(date_time='20190228');
The result shows numRows=12552, totalSize=1564884, rawDataSize=1552332:
Partition(values:[20190228], dbName:coordinator_test, tableName:dw_wy_srvcc_customized_drilldown_table_daily, createTime:1551344149, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:level, type:bigint, comment:null), FieldSchema(name:pi_name, type:string, comment:null), FieldSchema(name:dimension_type, type:string, comment:null), FieldSchema(name:demension_item, type:string, comment:null), FieldSchema(name:numerator_value, type:double, comment:null), FieldSchema(name:denominator_value, type:double, comment:null), FieldSchema(name:pi_value, type:double, comment:null), FieldSchema(name:degrad_weight, type:double, comment:null), FieldSchema(name:date_time, type:string, comment:null)], location:hdfs://dlbdn3:8020/user/hive/warehouse/coordinator_test.db/dw_wy_srvcc_customized_drilldown_table_daily/date_time=20190228, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, field.delim=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), parameters:{numFiles=20, COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1551408396, numRows=12552, totalSize=1564884, rawDataSize=1552332})
Example 2: column-level analysis
# run the statement
analyze table dw_wy_srvcc_customized_drilldown_table_daily compute statistics for columns level;
# view the result
describe formatted dw_wy_srvcc_customized_drilldown_table_daily level;
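Column statistics can also be computed per partition rather than for the whole table. Support for this (and for viewing the result with DESCRIBE FORMATTED on a single partition) varies by Hive version, so treat the snippet below as a sketch:

```sql
-- Compute column stats for just one partition of the example table:
ANALYZE TABLE dw_wy_srvcc_customized_drilldown_table_daily
  PARTITION(date_time='20190228')
  COMPUTE STATISTICS FOR COLUMNS level;
```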