0003 - How to Use LZO Compression in CDH

Fayson's GitHub: https://github.com/fayson/cdhproject
Recommended: follow the WeChat public account "Hadoop實操" (ID: gh_c4c535955d0f).

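1. Problem Description

CDH does not support LZO compression out of the box. An additional Parcel must be downloaded before Hadoop components such as HDFS, Hive, and Spark can use the LZO codec.

For details, see: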
Configuring Services to Use the GPL Extras Parcel
Installing the GPL Extras Parcel

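First, without any extra configuration, I try to generate an LZO file and read it back. We create two tables in Hive, test_table and test_table2: test_table is a plain text table, and test_table2 is meant to hold LZO-compressed data. The statements are as follows: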

create external table test_table
(
s1 string,
s2 string
)
row format delimited fields terminated by '#'
location '/lilei/test_table';

insert into test_table values('1','a'),('2','b');

create external table test_table2
(
s1 string,
s2 string
)
row format delimited fields terminated by '#'
location '/lilei/test_table2';

Connect to Hive through beeline and run the statements above.

Query test_table to check the inserted data.

Then insert the data from test_table into test_table2, with the output files set to LZO compression:

set mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec;
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;

insert overwrite table test_table2 select * from test_table;

The statement fails in Hive with the following error:

Error:Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)

The YARN web UI on port 8088 shows that the job failed because the LZO codec could not be found:

Compression codec com.hadoop.compression.lzo.LzoCodec was not found.

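2. Solution

On the Parcel page in Cloudera Manager, configure the repository URL of the LZO Parcel.

Note: if the cluster cannot reach the public internet, download the Parcel in advance and publish it through httpd.

Download -> Distribute -> Activate

Add the LZO codecs to the HDFS compression codec configuration: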

com.hadoop.compression.lzo.LzoCodec
com.hadoop.compression.lzo.LzopCodec

Save the changes, deploy the client configuration, and restart the whole cluster.

Wait for the restart to complete.
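Once the client configuration has been deployed, it can be useful to confirm that both codec classes actually reached the client side. The following is a minimal sketch, not part of the original procedure. It assumes that the Cloudera Manager codec setting is written to the `io.compression.codecs` property of the client `core-site.xml`; the helper name `missing_lzo_codecs` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Codec class names taken from the configuration step above.
LZO_CODECS = (
    "com.hadoop.compression.lzo.LzoCodec",
    "com.hadoop.compression.lzo.LzopCodec",
)


def missing_lzo_codecs(core_site_xml):
    """Return the LZO codec classes that are absent from io.compression.codecs.

    Assumption: the codec list is a comma-separated value of the
    io.compression.codecs property in core-site.xml.
    """
    root = ET.fromstring(core_site_xml)
    configured = set()
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "io.compression.codecs":
            value = prop.findtext("value", "") or ""
            configured = {c.strip() for c in value.split(",") if c.strip()}
    return [c for c in LZO_CODECS if c not in configured]
```

To use it, read the deployed client file (commonly /etc/hadoop/conf/core-site.xml, which is also an assumption about the installation) and pass its contents to the function. An empty list means both codecs are listed.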

Insert the data into test_table2 again, with the output set to LZO compression:

set mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec;
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;

insert overwrite table test_table2 select * from test_table;

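This time the insert succeeds.

2.1. Hive Verification

First, confirm that the files under test_table2 are in LZO format, for example by listing the table directory /lilei/test_table2 in HDFS and checking the file extension.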
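The file extension indicates which codec wrote a file. As a small illustration, the sketch below maps an output file name to the LZO codec class it implies. It assumes the default extensions of the two hadoop-lzo codecs (`.lzo_deflate` for LzoCodec, which matches the 000000_0.lzo_deflate file read in section 2.2, and `.lzo` for LzopCodec); the helper name `lzo_codec_for` is hypothetical.

```python
# Assumed default file extensions of the two hadoop-lzo codecs.
EXTENSION_TO_CODEC = {
    ".lzo_deflate": "com.hadoop.compression.lzo.LzoCodec",
    ".lzo": "com.hadoop.compression.lzo.LzopCodec",
}


def lzo_codec_for(filename):
    """Return the LZO codec class implied by the file extension, or None."""
    for extension, codec in EXTENSION_TO_CODEC.items():
        if filename.endswith(extension):
            return codec
    return None
```

A file name without either extension returns None, which suggests that the compression settings were not in effect when the file was written.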

Then query the table from Hive beeline.

Hive runs correctly on the LZO-compressed files.

2.2. Spark SQL Verification

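// Read the LZO-compressed output file directly and count its lines
val textFile = sc.textFile("hdfs://ip-172-31-8-141:8020/lilei/test_table2/000000_0.lzo_deflate")

textFile.count()

// Query the same table through Spark SQL and print the rows
sqlContext.sql("select * from test_table2").show()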

Spark SQL runs correctly on the LZO-compressed files.

Original article. Reposting is welcome; please credit the WeChat public account Hadoop實操 as the source.

Last edited: 2018.12.08 22:21:35
