jdk1.8 + Hadoop2.7.3 + Spark2.2.0 + Scala2.11.8
Hadoop tar.gz packages from 2.7 onward are 64-bit.
1 Before cloning
1.1 Install VMware and CentOS 7
Choose host-only as the network connection mode.
For CentOS 7, choose the Infrastructure Server installation profile.
1.2 Set the hostname and the network configuration; each clone needs its own values afterwards
On the host machine, run ifconfig (/sbin/ifconfig) and note the inet addr of the vmnet1 virtual adapter (the one backing VMware's host-only mode); here it is 192.168.176.1.
The plan is one master:
192.168.176.100 master
and two slaves:
192.168.176.101 slave1
192.168.176.102 slave2
hostnamectl set-hostname master
systemctl stop firewalld
systemctl disable firewalld
vi /etc/sysconfig/network-scripts/ifcfg-ens33 // the interface name ("ens33") may differ
TYPE=Ethernet
IPADDR=192.168.176.100
NETMASK=255.255.255.0
GATEWAY=192.168.176.1
PEERDNS=no
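Depending on the installer defaults, the generated ifcfg file may still contain BOOTPROTO=dhcp and ONBOOT=no; for the static address above to come up at boot, the following two lines are usually needed as well (an assumption about your generated file):
BOOTPROTO=static
ONBOOT=yes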
vi /etc/sysconfig/network
NETWORKING=yes
GATEWAY=192.168.176.1
vi /etc/resolv.conf
nameserver 192.168.1.1
service network restart
The host machine can now be pinged; use sftp to upload the installation files and ssh to work on master, slave1 and slave2.
1.3 Edit /etc/hosts
vi /etc/hosts
192.168.176.100 master
192.168.176.101 slave1
192.168.176.102 slave2
1.4 Unpack the JDK, Hadoop, Spark, Scala, ...
cd /usr/local
tar -zxvf ... // it depends
Edit /etc/profile:
vim /etc/profile
JAVA_HOME=/usr/java/jdk1.8.0_144
JRE_HOME=$JAVA_HOME/jre
DERBY_HOME=$JAVA_HOME/db
PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
CLASSPATH=:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
export JAVA_HOME JRE_HOME DERBY_HOME PATH CLASSPATH
export HADOOP_HOME=/usr/local/hadoop-2.7.3
export SCALA_HOME=/usr/local/scala-2.11.8
export SPARK_HOME=/usr/local/spark-2.2.0-bin-hadoop2.7
export HIVE_HOME=/usr/local/apache-hive-2.3.0-bin
export HBASE_HOME=/usr/local/hbase-2.0.0-alpha-1
export ZOOKEEPER_HOME=/usr/local/zookeeper-3.4.10
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SCALA_HOME/bin:$SPARK_HOME/bin:$HIVE_HOME/bin:$HBASE_HOME/bin:$ZOOKEEPER_HOME/bin
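After saving, reload the profile so the variables take effect in the current shell:
source /etc/profile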
1.5 Configure Hadoop
mkdir tmp hdfs hdfs/data hdfs/name // run inside /usr/local/hadoop-2.7.3; these are the paths used in core-site.xml and hdfs-site.xml below
Edit hadoop-env.sh, core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml and slaves.
The default settings can be consulted in:
core-default.xml
hdfs-default.xml
mapred-default.xml
yarn-default.xml
cd /usr/local/hadoop-2.7.3/etc/hadoop
vi hadoop-env.sh
// change JAVA_HOME to /usr/java/jdk1.8.0_144
vi core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop-2.7.3/tmp</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
</configuration>
vi hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop-2.7.3/hdfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop-2.7.3/hdfs/data</value>
  </property>
</configuration>
vi yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
// cp mapred-site.xml.template mapred-site.xml
vi mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
vi slaves
slave1
slave2
1.6 Create a non-root user hadoop
useradd hadoop
passwd hadoop
// give the hadoop user ownership of the files
chown -R hadoop:hadoop ./hadoop-2.7.3
1.7 Switch the system time zone from CST to UTC:
cp -af /usr/share/zoneinfo/UTC /etc/localtime
date
1.8 Configure Spark
cd /usr/local/spark-2.2.0-bin-hadoop2.7/conf
vi slaves
slave1
slave2
vi spark-env.sh
# spark setting
export JAVA_HOME=/usr/java/jdk1.8.0_144
export SCALA_HOME=/usr/local/scala-2.11.8
export SPARK_MASTER_IP=master
export SPARK_WORKER_MEMORY=8g
export SPARK_WORKER_CORES=4
export HADOOP_CONF_DIR=/usr/local/hadoop-2.7.3/etc/hadoop
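Once the slaves exist and passwordless ssh is in place (section 2), the standalone cluster can be started from master with Spark's own script; the full path is used because only $SPARK_HOME/bin (not sbin) is on the PATH set in 1.4:
/usr/local/spark-2.2.0-bin-hadoop2.7/sbin/start-all.sh
// master web UI (default port)
http://192.168.176.100:8080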
2 After cloning
Cloning produces slave1 and slave2; change the hostname and the network configuration on each of them.
2.1 Passwordless ssh between master, slave1 and slave2 for the root (or hadoop) user
// run in the user's home directory (~)
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
As root (or hadoop), append each machine's id_rsa.pub to the authorized_keys on the other two machines, then test the connections with ssh slave1, ssh slave2, ssh master (one way to do this is sketched below).
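One way to distribute the keys is ssh-copy-id, which appends the local public key to the remote authorized_keys (shown here as run from master; repeat from slave1 and slave2 for the other hosts):
ssh-copy-id root@slave1
ssh-copy-id root@slave2
// then verify that no password is requested
ssh slave1
ssh slave2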
3 Miscellaneous
3.1 Getting VMware's three networking modes online with a Linux or Windows host
3.2 Common commands
jps
start-dfs.sh
start-yarn.sh
start-all.sh
hadoop-daemon.sh start namenode
hadoop-daemon.sh start datanode
netstat -ntlp
hadoop dfsadmin -report | more
hadoop
// web ui
http://192.168.176.100:50070
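Before the very first start-dfs.sh, format the NameNode once on master; doing this more than once causes the DataNode startup problem listed in section 5:
hdfs namenode -format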
3.3 Hadoop FileSystem
hadoop fs -cat URI [URI …]
hadoop fs -cp URI [URI …] <dest>
hadoop fs -copyFromLocal <localsrc> URI // like put, except the source must be a local file
hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst> // like get, except the destination must be a local file
hadoop fs -du URI [URI …]
hadoop fs -dus <args>
hadoop fs -get <from> <to>
hadoop fs -put
hadoop fs -ls <args>
hadoop fs -lsr <args> // recursive ls
hadoop fs -mkdir <paths> // creates one level at a time; use -p for parent directories
hadoop fs -mv URI [URI …] <dest> // move files from source to destination
hadoop fs -rm
hadoop fs -rmr // recursive rm
Differences between hadoop dfs, hadoop fs and hdfs dfs:
hadoop fs: the most general; works with any file system.
hadoop dfs and hdfs dfs: operate only on HDFS (including transfers between HDFS and the local FS); the former is deprecated, so hdfs dfs is generally used.
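A quick round trip with the commands above (the paths and file name are only for illustration):
hadoop fs -mkdir -p /user/hadoop/input
hadoop fs -put dual.txt /user/hadoop/input
hadoop fs -ls /user/hadoop/input
hadoop fs -cat /user/hadoop/input/dual.txt
hadoop fs -get /user/hadoop/input/dual.txt ./dual.copy.txt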
4. Hive deployment
Before installing Hive, install MySQL to serve as the metastore database. Hive's default metastore is the embedded Derby, but it is limited to a single session, so MySQL is used instead. MySQL is deployed on the master node, and the Hive server side is installed on master as well.
Metadata is data about data: it describes the properties of data and supports functions such as recording storage locations, history, resource lookup and file records. It can be thought of as an electronic catalogue; building that catalogue means describing and collecting the content or characteristics of the data so that it can later be retrieved.
4.1 Hive environment variables (see 1.4)
4.2 Configure Hive
Rename two files under $HIVE_HOME/conf/:
mv hive-default.xml.template hive-site.xml
mv hive-env.sh.template hive-env.sh
vim hive-env.sh
Set HADOOP_HOME: uncomment the HADOOP_HOME line (remove the leading #) and point it at /usr/local/hadoop-2.7.3.
vim hive-site.xml // (settings pieced together from various references)
hive.metastore.schema.verification // set to false
// create a tmp directory under the Hive installation directory
Replace every ${system:java.io.tmpdir} with that tmp directory
Replace ${system:user.name} with the user name, here root
Configure the JDBC connection to MySQL:
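The relevant hive-site.xml properties are the standard javax.jdo ones; the values below assume the hive user created in 4.3 and a metastore database named hive (the database name is an assumption), and the MySQL Connector/J jar (downloaded separately) must be placed in Hive's lib directory:
javax.jdo.option.ConnectionURL = jdbc:mysql://master:3306/hive?createDatabaseIfNotExist=true
javax.jdo.option.ConnectionDriverName = com.mysql.jdbc.Driver
javax.jdo.option.ConnectionUserName = hive
javax.jdo.option.ConnectionPassword = 12345
cp mysql-connector-java-*.jar /usr/local/apache-hive-2.3.0-bin/lib/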
Start the Hive Metastore Server process.
nohup keeps the process running after the session ends and writes its output to a nohup.out log file in the current directory, which can be checked later.
& runs the command in the background.
hive --service metastore &
// nohup is recommended so the process does not stop when the session ends
nohup hive --service metastore &
Hive needs a one-time schema initialization before first use (*):
schematool -dbType mysql -initSchema
4.3 Installing MySQL; the first method is recommended
4.3.1 Linux-Generic
Download MySQL Community Server from the official site, choosing Linux-Generic as the operating system.
https://www.bilibili.com/video/av6147498/?from=search&seid=673467972510968006
http://blog.csdn.net/u013980127/article/details/52261400
This is the method used here.
- Installation
Check whether any existing MySQL packages are installed; remove them if found.
rpm -qa | grep mysql
Download Linux - Generic (glibc 2.12) (x86, 64-bit), Compressed TAR Archive from the official site.
Just unpack it:
tar -xvf mysql-5.7.19-linux-glibc2.12-i686.tar.gz
- Check whether the mysql group and user exist; if not, create mysql:mysql
cat /etc/group | grep mysql
cat /etc/passwd | grep mysql
groupadd mysql
useradd -r -g mysql mysql
- Adjust the resource limits configuration
sudo vim /etc/security/limits.conf
mysql hard nofile 65535
mysql soft nofile 65535
- Initialize and start an instance
vim /etc/my.cnf
[mysqld]
port=3306
socket=/tmp/mysql.sock
user=mysql
datadir=...
...
Initialize the data directory, adjusting the arguments to match your layout:
cd to the top directory of the MySQL installation
bin/mysql_install_db --user=mysql --basedir=/usr/local/mysql/ --datadir=/usr/local/mysql/data/
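The server also has to be running before the first login; with the generic tarball, one way (paths are assumptions matching the layout above) is to install the bundled init script, which also makes the /etc/init.d/mysqld restart and chkconfig commands further down work:
cp support-files/mysql.server /etc/init.d/mysqld
/etc/init.d/mysqld start
// or start it directly in the background:
// bin/mysqld_safe --defaults-file=/etc/my.cnf --user=mysql &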
- Set the root user's password to 12345
For the first login, use the generated initial password:
cat /root/.mysql_secret
Connect to the MySQL instance, entering the password from /root/.mysql_secret:
mysql -uroot -p // mysql is on PATH here (see below)
Add mysql to the PATH:
export PATH=$PATH:/usr/local/mysql/bin
Once logged in, change the password:
SET PASSWORD = PASSWORD('12345');
flush privileges;
Subsequent logins use the new password:
mysql -uroot -p // password 12345
- Next, grant remote access
use mysql;
update user set host = '%' where user = 'root';
Restart the service for the change to take effect:
/etc/init.d/mysqld restart
- Create a hive user with password 12345 for host master, to be used by Hive
mysql>CREATE USER 'hive' IDENTIFIED BY '12345';
mysql>GRANT ALL PRIVILEGES ON *.* TO 'hive'@'master' WITH GRANT OPTION;
mysql>flush privileges;
Connect with:
mysql -h master -uhive -p
- Enable start on boot
sudo chkconfig mysql on
4.3.2 Yum Repository
wget http://repo.mysql.com/mysql57-community-release-el7-11.noarch.rpm
// or download the rpm from https://dev.mysql.com/downloads/repo/yum/
rpm -ivh mysql57-community-release-el7-11.noarch.rpm
yum install mysql-server
4.4 Hive support in Spark SQL
According to the official docs, it is enough to copy hive-site.xml and core-site.xml into $SPARK_HOME/conf, and to push them to the slave machines with scp as well (a sketch of the copy follows the quoted documentation below).
Spark SQL also supports reading and writing data stored in Apache Hive. However, since Hive has a large number of dependencies, these dependencies are not included in the default Spark distribution. If Hive dependencies can be found on the classpath, Spark will load them automatically. Note that these Hive dependencies must also be present on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries (SerDes) in order to access data stored in Hive.
Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) file in conf/.
When working with Hive, one must instantiate SparkSession with Hive support, including connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions. Users who do not have an existing Hive deployment can still enable Hive support. When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started. Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of database in warehouse. You may need to grant write privilege to the user who starts the Spark application.
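A sketch of that copy with the paths used in this setup (hdfs-site.xml is optional, per the quote above):
cp /usr/local/apache-hive-2.3.0-bin/conf/hive-site.xml /usr/local/spark-2.2.0-bin-hadoop2.7/conf/
cp /usr/local/hadoop-2.7.3/etc/hadoop/core-site.xml /usr/local/hadoop-2.7.3/etc/hadoop/hdfs-site.xml /usr/local/spark-2.2.0-bin-hadoop2.7/conf/
scp /usr/local/spark-2.2.0-bin-hadoop2.7/conf/{hive-site.xml,core-site.xml,hdfs-site.xml} root@slave1:/usr/local/spark-2.2.0-bin-hadoop2.7/conf/
scp /usr/local/spark-2.2.0-bin-hadoop2.7/conf/{hive-site.xml,core-site.xml,hdfs-site.xml} root@slave2:/usr/local/spark-2.2.0-bin-hadoop2.7/conf/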
4.5 Hive operations
The four ways of loading data into Hive:
http://blog.csdn.net/lifuxiangcaohui/article/details/40588929
Partitions: in Hive, each partition of a table corresponds to a directory under the table's directory, and all of a partition's data is stored in that directory. For example, if table wyp is partitioned by dt and city, the partition dt=20131218, city=BJ corresponds to the directory /user/hive/warehouse/wyp/dt=20131218/city=BJ, and all data belonging to that partition lives there.
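As a sketch (the table name, columns and file path are made up for illustration), a partitioned table and a load into one specific partition:
create table wyp (id int, name string) partitioned by (dt string, city string) row format delimited fields terminated by '\t';
load data local inpath '/home/hadoop/wyp_20131218_BJ.txt' into table wyp partition (dt='20131218', city='BJ');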
UDF (User-Defined Function): user-defined functions for processing data. A UDF can be used directly in a select statement to format the query result before it is output. A custom UDF extends org.apache.hadoop.hive.ql.exec.UDF and implements an evaluate method; evaluate supports overloading.
For Spark, user-defined aggregate functions extend org.apache.spark.sql.expressions.UserDefinedAggregateFunction.
Create a test table named dual in Hive:
create table dual (dummy string);
// exit hive and return to bash
echo 'X' > dual.txt
// back in hive
load data local inpath '/home/hadoop/dual.txt' overwrite into table dual;
Hive regular expressions
Handling JSON-formatted data in Hive
5. Common problems
After formatting HDFS several times, the DataNode fails to start.
A MapReduce job gets stuck at "running job".
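For the first problem, the usual cause is a clusterID mismatch between the NameNode and the DataNodes after a reformat; a common fix (destructive, it wipes HDFS data) with the directory layout from 1.5 is:
stop-dfs.sh
// on master and both slaves, clear the old metadata and data
rm -rf /usr/local/hadoop-2.7.3/hdfs/name/* /usr/local/hadoop-2.7.3/hdfs/data/* /usr/local/hadoop-2.7.3/tmp/*
// then format once on master and restart
hdfs namenode -format
start-dfs.sh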