1. Background
Hive is a data warehouse tool built on top of Hadoop for data extraction, transformation, and loading (ETL); it provides a mechanism for storing, querying, and analyzing large-scale data held in Hadoop. Hive maps structured data files to database tables and offers SQL-style querying, translating SQL statements into MapReduce jobs for execution. Its main advantage is the low learning curve: SQL-like statements let you run MapReduce computations quickly, without writing dedicated MapReduce applications, which makes Hive a good fit for statistical analysis over a data warehouse.
2. Goal
The goal of this article is to trace through Hive's code starting from one simple statement, DESC TABLE, and to explain how Hive compiles and executes it, so that readers come away with an overall picture of Hive.
3. A First Look at the Apache Hive Code
3.1 The Entry Point
/hive/cli/src/java/org/apache/hadoop/hive/cli/CliDriver.java
public static void main(String[] args) throws Exception {
  int ret = new CliDriver().run(args);
  System.exit(ret);
}
The main function is the entry point: it creates a CliDriver object and calls its run function with the command-line arguments.
The subsequent call path is as follows and will not be elaborated here:
a. /hive/cli/src/java/org/apache/hadoop/hive/cli/CliDriver.java: run()
b. /hive/cli/src/java/org/apache/hadoop/hive/cli/CliDriver.java: executeDriver()
c. /hive/cli/src/java/org/apache/hadoop/hive/cli/CliDriver.java: processLine()
d. /hive/cli/src/java/org/apache/hadoop/hive/cli/CliDriver.java: processCmd()
e. /hive/cli/src/java/org/apache/hadoop/hive/cli/CliDriver.java: processLocalCmd()
f. /hive/ql/src/java/org/apache/hadoop/hive/ql/Driver.java: run()
g. /hive/ql/src/java/org/apache/hadoop/hive/ql/Driver.java: runInternal()
h. /hive/ql/src/java/org/apache/hadoop/hive/ql/Driver.java: compileInternal()
i. /hive/ql/src/java/org/apache/hadoop/hive/ql/Driver.java: compile()
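For orientation, the core of Driver.compile() boils down to roughly the following (an abridged sketch based on the Hive source of this era; error handling, hooks, and locking are omitted):

ParseDriver pd = new ParseDriver();
ASTNode tree = pd.parse(command, ctx);
tree = ParseUtils.findRootNonNullToken(tree);
// pick a semantic analyzer based on the root token of the tree
BaseSemanticAnalyzer sem = SemanticAnalyzerFactory.get(conf, tree);
sem.analyze(tree, ctx);
sem.validate();
// a QueryPlan is then built from the analyzer's root tasks and handed to the executor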
3.2 The Compilation Process
a. Generating the abstract syntax tree: this step turns a SQL statement such as describe table into a syntax tree. The process is fairly involved and rests on compiler theory, so it is not expanded here. Roughly, it checks the statement for syntax errors and identifies the entities the SQL operates on, together with the operation applied to each entity.
ParseDriver pd = new ParseDriver();
ASTNode tree = pd.parse(command, ctx);
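To see what the parser produced, ASTNode has a dump() helper that prints the tree. For a statement like DESCRIBE FORMATTED system.availability the result is approximately the following (a sketch; the exact shape and root-node wrapping vary by Hive version):

System.out.println(tree.dump());
// approximate output:
// (TOK_DESCTABLE
//   (TOK_TABTYPE (TOK_TABNAME system availability))
//   KW_FORMATTED)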
b. Semantic analysis
sem.analyze(tree, ctx);
Since desc is a DDL operation, the class that actually performs the semantic analysis is /hive/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java:analyzeDescribeTable(). Let's look at this function in detail:
private void analyzeDescribeTable(ASTNode ast) throws SemanticException {
  ASTNode tableTypeExpr = (ASTNode) ast.getChild(0);
  // resolve the table's fully qualified name, e.g. system.availability
  String qualifiedName =
      QualifiedNameUtil.getFullyQualifiedName((ASTNode) tableTypeExpr.getChild(0));
  // verify against the metastore that the table exists
  String tableName =
      QualifiedNameUtil.getTableName(db, (ASTNode) (tableTypeExpr.getChild(0)));
  // verify against the metastore that the database exists
  String dbName =
      QualifiedNameUtil.getDBName(db, (ASTNode) (tableTypeExpr.getChild(0)));
  // verify against the metastore that the partition exists
  Map<String, String> partSpec =
      QualifiedNameUtil.getPartitionSpec(db, tableTypeExpr, tableName);
  // a describe can target a specific column to inspect how that column was defined
  String colPath = QualifiedNameUtil.getColPath(
      db, tableTypeExpr, (ASTNode) tableTypeExpr.getChild(0), qualifiedName, partSpec);

  // if database is not the one currently using
  // validate database
  if (dbName != null) {
    validateDatabase(dbName);
  }
  if (partSpec != null) {
    validateTable(tableName, partSpec);
  }

  // create a DescTableDesc object that tells the describe task what to do;
  // it supports three output modes: formatted, extended, and pretty
  DescTableDesc descTblDesc = new DescTableDesc(
      ctx.getResFile(), tableName, partSpec, colPath);

  boolean showColStats = false;
  if (ast.getChildCount() == 2) {
    int descOptions = ast.getChild(1).getType();
    descTblDesc.setFormatted(descOptions == HiveParser.KW_FORMATTED);
    descTblDesc.setExt(descOptions == HiveParser.KW_EXTENDED);
    descTblDesc.setPretty(descOptions == HiveParser.KW_PRETTY);
    // in case of "DESCRIBE FORMATTED tablename column_name" statement, colPath
    // will contain tablename.column_name. If column_name is not specified
    // colPath will be equal to tableName. This is how we can differentiate
    // if we are describing a table or column
    if (!colPath.equalsIgnoreCase(tableName) && descTblDesc.isFormatted()) {
      showColStats = true;
    }
  }

  inputs.add(new ReadEntity(getTable(tableName)));

  // create the task; here it is a DDLTask
  Task<? extends Serializable> ddlTask = TaskFactory.get(new DDLWork(getInputs(), getOutputs(),
      descTblDesc), conf);
  rootTasks.add(ddlTask);
  String schema = DescTableDesc.getSchema(showColStats);
  setFetchTask(createFetchTask(schema));
  LOG.info("analyzeDescribeTable done");
}
The semantic analysis phase works like this: based on the ASTNode, Hive distinguishes the kind of operation to perform (DDL, EXPLAIN, DML, and so on) and handles each kind differently. DDL operations such as this one need no MapReduce job; the data is simply fetched from the metastore and returned. Finally, the corresponding Task is generated.
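The dispatch by operation kind happens in SemanticAnalyzerFactory; abridged, it looks roughly like this (a sketch: the real switch covers many more token types, and a whole group of DDL tokens maps to DDLSemanticAnalyzer):

public static BaseSemanticAnalyzer get(HiveConf conf, ASTNode tree) throws SemanticException {
  switch (tree.getType()) {
    case HiveParser.TOK_EXPLAIN:
      return new ExplainSemanticAnalyzer(conf);
    case HiveParser.TOK_DESCTABLE:   // our DESC TABLE lands here, with the other DDL tokens
      return new DDLSemanticAnalyzer(conf);
    default:                         // ordinary queries, DML, etc.
      return new SemanticAnalyzer(conf);
  }
}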
3.3 The Task Execution Process
After the DDLTask has been generated, the task is run by invoking its execute() function: /hive/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java:execute(). For a describe table operation, this in turn calls the describeTable() function.
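Inside execute(), the dispatch on the work descriptor looks roughly like this (an abridged sketch; the real method probes the DDLWork for each kind of DDL descriptor in the same pattern):

// in DDLTask.execute(), abridged
DescTableDesc descTbl = work.getDescTblDesc();
if (descTbl != null) {
  return describeTable(db, descTbl);
}

The describeTable() function itself: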
private int describeTable(Hive db, DescTableDesc descTbl) throws HiveException {
  String colPath = descTbl.getColumnPath();
  String tableName = descTbl.getTableName();

  // describe the table - populate the output stream
  Table tbl = db.getTable(tableName, false);
  Partition part = null;
  DataOutputStream outStream = null;
  try {
    Path resFile = new Path(descTbl.getResFile());
    if (tbl == null) {
      FileSystem fs = resFile.getFileSystem(conf);
      outStream = fs.create(resFile);
      outStream.close();
      outStream = null;
      throw new HiveException(ErrorMsg.INVALID_TABLE, tableName);
    }
    if (descTbl.getPartSpec() != null) {
      part = db.getPartition(tbl, descTbl.getPartSpec(), false);
      if (part == null) {
        FileSystem fs = resFile.getFileSystem(conf);
        outStream = fs.create(resFile);
        outStream.close();
        outStream = null;
        throw new HiveException(ErrorMsg.INVALID_PARTITION,
            StringUtils.join(descTbl.getPartSpec().keySet(), ','), tableName);
      }
      tbl = part.getTable();
    }
  } catch (IOException e) {
    throw new HiveException(e, ErrorMsg.GENERIC_ERROR, tableName);
  } finally {
    IOUtils.closeStream(outStream);
  }

  try {
    LOG.info("DDLTask: got data for " + tbl.getTableName());
    Path resFile = new Path(descTbl.getResFile());
    FileSystem fs = resFile.getFileSystem(conf);
    outStream = fs.create(resFile);

    List<FieldSchema> cols = null;
    List<ColumnStatisticsObj> colStats = null;
    if (colPath.equals(tableName)) {
      cols = (part == null || tbl.getTableType() == TableType.VIRTUAL_VIEW) ?
          tbl.getCols() : part.getCols();
      if (!descTbl.isFormatted()) {
        cols.addAll(tbl.getPartCols());
      }
    } else {
      Deserializer deserializer = tbl.getDeserializer(true);
      if (deserializer instanceof AbstractSerDe) {
        String errorMsgs = ((AbstractSerDe) deserializer).getConfigurationErrors();
        if (errorMsgs != null && !errorMsgs.isEmpty()) {
          throw new SQLException(errorMsgs);
        }
      }
      cols = Hive.getFieldsFromDeserializer(colPath, deserializer);
      if (descTbl.isFormatted()) {
        // when column name is specified in describe table DDL, colPath
        // will be table_name.column_name
        String colName = colPath.split("\\.")[1];
        String[] dbTab = Utilities.getDbTableName(tableName);
        List<String> colNames = new ArrayList<String>();
        colNames.add(colName.toLowerCase());
        if (null == part) {
          colStats = db.getTableColumnStatistics(dbTab[0].toLowerCase(), dbTab[1].toLowerCase(), colNames);
        } else {
          List<String> partitions = new ArrayList<String>();
          partitions.add(part.getName());
          colStats = db.getPartitionColumnStatistics(dbTab[0].toLowerCase(), dbTab[1].toLowerCase(), partitions, colNames).get(part.getName());
        }
      }
    }
    fixDecimalColumnTypeName(cols);
    // In case the query is served by HiveServer2, don't pad it with spaces,
    // as HiveServer2 output is consumed by JDBC/ODBC clients.
    boolean isOutputPadded = !SessionState.get().isHiveServerQuery();
    formatter.describeTable(outStream, colPath, tableName, tbl, part,
        cols, descTbl.isFormatted(), descTbl.isExt(),
        descTbl.isPretty(), isOutputPadded, colStats);
    LOG.info("DDLTask: written data for " + tbl.getTableName());
    outStream.close();
    outStream = null;
  } catch (SQLException e) {
    throw new HiveException(e, ErrorMsg.GENERIC_ERROR, tableName);
  } catch (IOException e) {
    throw new HiveException(e, ErrorMsg.GENERIC_ERROR, tableName);
  } finally {
    IOUtils.closeStream(outStream);
  }

  return 0;
}
This code reads the table-creation metadata from the metastore by tableName and writes it out according to the schema format. Readers who want the details can study it further.
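The "schema" mentioned above is the string returned by DescTableDesc.getSchema(), which names the result columns and their types, separated by '#'. In this era of Hive it is approximately the following (a paraphrase; check DescTableDesc.java for the exact constants):

// in DescTableDesc, approximately
public static String getSchema(boolean colStats) {
  if (colStats) {
    return colStatsSchema;  // a wider schema that also carries column statistics
  }
  return "col_name,data_type,comment#string:string:string";
}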
4. Conclusion
This article walked through how the describe table operation is compiled and executed in Hive. As we saw, describe table is fairly simple: it fetches the table's metadata from the metastore and outputs it. Tracing the code for this one operation also yields some general conclusions:
a. Hive distinguishes different kinds of operations, such as DDL, EXPLAIN, and DML, each with its own compilation and execution path.
b. DDL operations require no MapReduce job; the data is read directly from the metastore.