Software versions
MySQL: 5.7
Hadoop: 3.1.3
Flink: 1.12.2
Hudi: 0.9.0
Hive: 2.3.7
1. Create the MySQL table and enable the binlog
create table users(
id bigint auto_increment primary key,
name varchar(20) null,
birthday timestamp default CURRENT_TIMESTAMP not null,
ts timestamp default CURRENT_TIMESTAMP not null
);
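The step above shows only the table DDL, so here is a minimal sketch for checking from a MySQL session that the binlog is actually on. It assumes a standard MySQL 5.7 server; how the settings are applied (typically `log-bin`, `binlog_format=ROW` and `server-id` in the server configuration file, followed by a restart) depends on the installation and is not taken from the original setup.

```sql
-- Should report ON. In MySQL 5.7 log_bin is read-only at runtime,
-- so it has to be enabled in the server configuration before startup.
SHOW VARIABLES LIKE 'log_bin';

-- The mysql-cdc connector reads row-level change events, so this should be ROW.
SHOW VARIABLES LIKE 'binlog_format';

-- Should be a non-zero value when binary logging is enabled.
SHOW VARIABLES LIKE 'server_id';
```

If `log_bin` reports OFF, the CDC source will not be able to stream changes after its initial snapshot.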
2. Install Hadoop
(1) Extract the Hadoop package: tar -zxvf hadoop-3.1.3.tar.gz
(2) Configure the environment variables
export HADOOP_HOME=/Users/xxx/hadoop/hadoop-3.1.3
export HADOOP_COMMON_HOME=$HADOOP_HOME
export PATH=$HADOOP_HOME/bin:$PATH
# Add the Hadoop classpath
export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
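3. Download and install Flink
(1) Download the Flink package from the Flink website: https://flink.apache.org/downloads.html
(2) Extract it: tar -zxvf flink-1.12.2-bin-scala_2.11.tgz
(3) Configure Flink (vim conf/flink-conf.yaml) and enable checkpointing (the flink-cdc job needs checkpointing enabled in order to generate Hudi commits and commit the data)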
state.backend: filesystem
execution.checkpointing.interval: 10000
state.checkpoints.dir: file:///Users/xxx/flink/flink-1.12.2/hudi/flink-checkpoints
state.savepoints.dir: file:///Users/xxx/flink/flink-1.12.2/hudi/flink-savepoints
(4) Configure Flink (vim conf/flink-conf.yaml) to increase the number of slots
taskmanager.numberOfTaskSlots: 4
Then edit the workers file (vim conf/workers) so that it contains:
localhost
localhost
localhost
localhost
(5) Start Flink: bin/start-cluster.sh
4. Build Hudi and copy the jar
(1) Download the Hudi source: git clone https://github.com/apache/hudi.git
(2) Switch to the 0.9.0 release branch: git checkout release-0.9.0
(3) Build: mvn clean package -DskipTests
(4) When the build finishes, the jar (hudi-flink-bundle_2.11-0.9.0.jar) is generated under packaging/hudi-flink-bundle/target. Copy it into Flink's lib directory:
cp hudi-flink-bundle_2.11-0.9.0.jar ~/flink/lib
5. Copy the other required jars into flink/lib
(1) flink-sql-connector-mysql-cdc-1.2.0.jar: used to connect to MySQL
(2) aws-java-sdk-bundle-1.11.874.jar / hadoop-aws-3.1.3.jar: used to connect to AWS S3
6. Start sql-client
1. bin/sql-client.sh embedded
2. Create the MySQL mapping table
create table mysql_users(
id bigint primary key not enforced,
name string,
birthday timestamp(3),
ts timestamp(3)
) with (
'connector' = 'mysql-cdc',
'hostname' = '127.0.0.1',
'port' = '3306',
'username' = 'root',
'password' = '123456',
'database-name' = 'test_cdc',
'table-name' = 'users'
);
3. Create the Hudi mapping table
create table hudi_users(
id bigint primary key not enforced,
name string,
birthday timestamp(3),
ts timestamp(3),
`partition` varchar(20)
) partitioned by (`partition`) with (
'connector' = 'hudi',
'table.type' = 'COPY_ON_WRITE',
'path' = 's3a://xxx/yyy/hudi_users',
'read.streaming.enabled' = 'true',
'read.streaming.check-interval' = '1'
);
4. Create the job
insert into hudi_users select *, date_format(birthday, 'yyyyMMdd') from mysql_users;
Check whether data has been generated on S3.
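Besides looking at the S3 path, the written data can also be read back through the Hudi mapping table. A minimal sketch, assuming it is run in the same sql-client session where hudi_users was defined:

```sql
-- hudi_users was declared with 'read.streaming.enabled' = 'true',
-- so this runs as a continuous query and keeps returning newly committed rows.
select id, name, birthday, ts, `partition` from hudi_users;
```

Rows only appear after a checkpoint has completed, because that is when the writer commits.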
7. Create an external table in Hive
1. Connect to Hive through beeline
!connect jdbc:hive2://ELB-DEV-Presto-hs2-s0000e2c5-06a22927ec8bb2f6.elb.us-east-1.amazonaws.com:10000/default;auth=noSasl
2. Create the external table
CREATE EXTERNAL TABLE `hudi_user_mor`(
`_hoodie_commit_time` string,
`_hoodie_commit_seqno` string,
`_hoodie_record_key` string,
`_hoodie_partition_path` string,
`_hoodie_file_name` string,
`id` bigint,
`name` string,
`birthday` bigint,
`ts` bigint)
PARTITIONED BY (
`partition` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3a://xxx/yyy/hudi_users';
Add the partition:
alter table hudi_user_mor add if not exists partition(`partition`='par1') location 's3a://fw-itf/DFMOD-c34db792/target_table/par1';
8. Query the data through Presto
1. Start the Presto CLI
./presto-cli-0.248-executable.jar --server ELB-DEV-Presto-master-s0000eca1-efaff1be86b6ffa3.elb.us-east-1.amazonaws.com:9106 --catalog db
2. Query the data
select * from hudi_user_mor where partition = 'par1' limit 5;
9. Test the synchronization
Run insert, update, and delete statements in MySQL, then query from Hive or Presto; the changes show up in near real time.
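A minimal sketch of such a test, run against the test_cdc database in MySQL. The row values ('alice', 'alice2') are hypothetical and only serve as an illustration:

```sql
-- Insert: birthday and ts fall back to their CURRENT_TIMESTAMP defaults.
insert into users (name) values ('alice');

-- Update: the change is captured from the binlog and upserted by primary key.
update users set name = 'alice2' where name = 'alice';

-- Delete: the row should disappear from the Hudi table after the next commit.
delete from users where name = 'alice2';
```

Because the Hudi partition is derived from birthday with date_format(birthday, 'yyyyMMdd'), a row inserted this way lands in the partition for the current date, so that partition has to be registered in Hive before the row is visible there.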