Recently, while running Hive insert/select statements, I ran into the following types of exceptions:
# Exception 1:
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow
at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.write(OrcOutputFormat.java:81)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:753)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
at org.apache.hadoop.hive.ql.exec.LimitOperator.process(LimitOperator.java:54)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:235)
... 7 more
# Exception 2:
Failed with exception java.io.IOException:java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.orc.OrcStruct cannot be cast to org.apache.hadoop.io.BinaryComparable
Taking the errors above as a starting point, this post analyzes their causes and other Hive exceptions related to storage formats. The main contents are:
1. Root cause analysis and fixes
2. FAQ
1. Root cause analysis and fixes
1.1 Analysis of exception 1
# Exception 1:
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow
at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.write(OrcOutputFormat.java:81)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:753)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
at org.apache.hadoop.hive.ql.exec.LimitOperator.process(LimitOperator.java:54)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:235)
... 7 more
This exception is thrown during the insert overwrite phase, that is, when the rows produced by the select are written into the target table. The stack trace clearly shows OrcOutputFormat and java.lang.ClassCastException, so the error occurs while the reduce task persists the final result (writes it to the HDFS file system). First, we need to understand the overall flow of this persistence, as shown in the figure below:
Read path: the InputFormat splits the input stream (InputStream) into records (<key, value>), and the Deserializer parses the records (<key, value>) into column objects.
Write path: the Serializer converts the column objects into records (<key, value>), and the OutputFormat formats the records (<key, value>) into the output stream (OutputStream).
The figure above depicts how data is loaded into memory and how it is persisted. The OrcOutputFormat in the exception message tells us the error happens during persistence. As the figure shows, the Serializer's output is exactly the OutputFormat's input. The next step is to find out which SerDe, InputFormat, and OutputFormat the target table uses, via the following command:
desc formatted $table
The result is as follows:
# desc formatted $table
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
As shown above, the target table's SerDe is LazySimpleSerDe, while its Input/OutputFormat are the ORC ones. This gives us the cause of exception 1:
Cause of exception 1: LazySimpleSerDe's serialize produces a Text object, while OrcOutputFormat only accepts OrcSerdeRow, hence the ClassCastException.
Here is the source of OrcOutputFormat's write method:
public class OrcOutputFormat extends ... {
@Override
public void write(Writable row) throws IOException {
// Throws ClassCastException here when the incoming row is not an OrcSerdeRow.
OrcSerdeRow serdeRow = (OrcSerdeRow) row;
if (writer == null) {
options.inspector(serdeRow.getInspector());
writer = OrcFile.createWriter(path, options);
}
writer.addRow(serdeRow.getRow());
}
。。盗蟆。
}
Once the cause is found, the fix is simple: change the table's file format to ORC, as shown below:
ALTER TABLE $table SET FILEFORMAT ORC;
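To confirm that the fix took effect, re-check the table's storage information; the SerDe and Input/OutputFormat should now all be ORC-related classes. For a partitioned table, the ALTER above changes the table-level metadata, while existing partitions may keep their old storage settings and may need the same change. A minimal sketch (the partition column dt and its value are purely illustrative):
# Verify the storage information again
desc formatted $table
# Existing partitions of a partitioned table may need the same change (hypothetical partition column dt)
ALTER TABLE $table PARTITION (dt='20160101') SET FILEFORMAT ORC;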
1.2 Analysis of exception 2
# Exception 2:
Failed with exception java.io.IOException:java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.orc.OrcStruct cannot be cast to org.apache.hadoop.io.BinaryComparable
With the analysis of exception 1 in hand, the cause of this one is easy to pin down. It occurs in the read phase: OrcInputFormat produces OrcStruct objects, which are passed as input to LazySimpleSerDe's deserialize method, and the type cast inside deserialize throws this exception. Here is the source of LazySimpleSerDe's doDeserialize method:
@Override
public Object doDeserialize(Writable field) throws SerDeException {
if (byteArrayRef == null) {
byteArrayRef = new ByteArrayRef();
}
// The cast below is where exception 2 is thrown: OrcStruct cannot be cast to BinaryComparable
BinaryComparable b = (BinaryComparable) field;
byteArrayRef.setData(b.getBytes());
cachedLazyStruct.init(byteArrayRef, 0, b.getLength());
lastOperationSerialize = false;
lastOperationDeserialize = true;
return cachedLazyStruct;
}
The figure below shows the read flow when TEXTFILE is the storage format:
Now, after replacing TextInputFormat with OrcInputFormat:
Summary: the root cause of both exceptions is a mismatch between the SerDe and the InputFormat/OutputFormat, usually because these three settings were not specified correctly when the table was created.
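One simple way to avoid this class of problems is to let STORED AS set all three options together when the table is created, rather than spelling them out by hand. A minimal sketch, with made-up table and column names:
CREATE TABLE demo_orc (
  id   BIGINT,
  name STRING
)
STORED AS ORC;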
2. FAQ
1. What is the difference between STORED AS ORC and STORED AS INPUTFORMAT ... OUTPUTFORMAT ...?
When we use STORED AS ORC, the following three settings are implicitly specified:
- SERDE: org.apache.hadoop.hive.ql.io.orc.OrcSerde
- INPUTFORMAT: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
- OUTPUTFORMAT: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
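In other words, STORED AS ORC is roughly shorthand for declaring all three settings explicitly. A sketch of the equivalent explicit form (table and column names are illustrative):
CREATE TABLE demo_orc (id BIGINT, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';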
When we explicitly specify only STORED AS INPUTFORMAT/OUTPUTFORMAT:
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
then the SERDE is not specified and the default SerDe is used, which can be checked in the Hive CLI with the following command:
set hive.default.serde;
hive.default.serde=org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
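This is exactly how the mismatch from section 1 can sneak in: with the SERDE omitted, the table silently combines the default LazySimpleSerDe with the ORC input/output formats. A sketch of such a problematic definition (the table name is made up):
CREATE TABLE broken_orc (id BIGINT)
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
# desc formatted broken_orc should then show LazySimpleSerDe together with the ORC formats, as in section 1.1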
Also, if hive.default.fileformat is configured in hive-site.xml, then a table created without a STORED AS clause will use the file format specified by hive.default.fileformat.
<property>
  <name>hive.default.fileformat</name>
  <value>ORC</value>
</property>
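As with the default SerDe, the file format actually in effect can be checked from the Hive CLI:
set hive.default.fileformat;
# e.g. hive.default.fileformat=ORC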