0. How an Ordinary Distributed File System Is Designed
- Files are stored as multiple replicas, with each entire file kept on a single machine.

Drawbacks:
- No matter how large a file is, it lives on a single node, which makes parallel processing difficult; that node can become a network bottleneck, so large-scale data processing is hard.
- Storage load is hard to balance, and the utilization of each node is low.
1. HDFS Overview and Design Goals

What is HDFS
- Hadoop implements a distributed file system, the Hadoop Distributed File System, HDFS for short.
- It originated from Google's GFS paper, published in 2003; HDFS is an open-source clone of GFS.

Design goals of HDFS
- A very large distributed file system.
- Runs on ordinary, inexpensive commodity hardware.
- Easy to scale out, while providing users with a file system service of decent performance.
2. HDFS Architecture
[Figure: HDFS architecture (hdfs架构.jpg)]
- HDFS is a Master (NameNode/NN) / Slave (DataNode/DN) architecture: one Master serving multiple Slaves.
- A file is split into multiple Blocks (by size, according to the configured block size), and the blocks are stored across multiple nodes; see the fsck sketch at the end of this section.

NameNode:
- Responds to client requests.
- Manages the metadata: file names, replication factors, and which DataNodes hold each Block.
- Manages the replication of blocks.

DataNode:
- Stores the data blocks (Blocks) that make up users' files.
- Periodically sends heartbeats to the NameNode, reporting itself, all of its block information, and its health status.
- "A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software." (one NameNode, many DataNodes)
- "The architecture does not preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the case." (in production, run the NameNode and DataNodes on separate machines)
- replication factor: the number of replicas kept for each block.
- "All blocks in a file except the last block are the same size."
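To make the block splitting concrete: with the default block size of 128 MB (dfs.blocksize, Hadoop 2.x and later), a 300 MB file is stored as two 128 MB blocks plus one 44 MB block; only the last block may be smaller. How a particular file was split can be inspected with fsck. A minimal sketch (the path is illustrative; substitute any file that exists in your HDFS):

```shell
# Print the blocks of one file, their sizes, and the DataNodes holding them
$ hdfs fsck /hdfsapi/test/b.txt -files -blocks -locations
```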
3. HDFS Replication

Creation and configuration
[Figure: HDFS replication (hdfs副本机制.jpg)]

Replica placement policy
- Racks and machines are chosen semi-randomly, preferring different racks and machines for the replicas of a block. With the default policy and a replication factor of 3, the first replica goes on the writer's own node (or a random node), the second on a node in a different rack, and the third on a different node in that same remote rack, so the failure of a single rack cannot lose every copy. A setrep example follows below.
[Figure: HDFS replica placement policy (hdfs副本存放策略.jpg)]
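The default replication factor is 3 (dfs.replication), and it can also be changed per file after the fact. A minimal sketch using the -setrep command from the shell list in section 5 (the path is illustrative):

```shell
# Set one file's replication factor to 3 and wait (-w) until
# the DataNodes have finished re-replicating its blocks
$ hdfs dfs -setrep -w 3 /hdfsapi/test/b.txt
```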
4. Setting Up an HDFS Environment (Pseudo-Distributed)

JDK installation
- Extract: tar -zxvf jdk-7u79-linux-x64.tar.gz -C ~/app
- Add it to the system environment variables in ~/.bash_profile:
  - export JAVA_HOME=/home/hadoop/app/jdk1.7.0_79
  - export PATH=$JAVA_HOME/bin:$PATH
- Apply the environment variables: source ~/.bash_profile
- Verify that Java is installed: java -version

Hadoop installation
- Upload the Hadoop tarball to the Linux server with an FTP tool.
- Extract it with tar -zxvf into the /usr/local/hadoop directory.
Edit /usr/local/hadoop/etc/hadoop/core-site.xml:

```xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
```

Edit /usr/local/hadoop/etc/hadoop/hdfs-site.xml (replication is set to 1 because a pseudo-distributed cluster has only a single DataNode):

```xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
```
SSH setup
- Install the SSH client with yum: yum install ssh
- Enable passwordless login on the machine:

```shell
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
```
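Before starting HDFS it is worth confirming that passwordless login actually works (the standard single-node setup guide does the same check):

```shell
$ ssh localhost
```

If this still prompts for a password, the HDFS start scripts will prompt repeatedly as well.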
Initialize HDFS

```shell
$ bin/hdfs namenode -format
```

Start HDFS

```shell
$ sbin/start-dfs.sh
```

Verify that the daemons started

```shell
$ jps
2356 NameNode
2695 SecondaryNameNode
12314 Jps
2477 DataNode
```

- Seeing NameNode, DataNode, and SecondaryNameNode in the output means the startup succeeded.
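Besides jps, the NameNode serves a web UI that shows cluster status and lets you browse the file system: port 9870 in Hadoop 3.x (50070 in 2.x), e.g. http://localhost:9870.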
5. HDFS Shell

HDFS shell command list

```shell
hdfs dfs -appendToFile <localsrc> ... <dst>
hdfs dfs -cat [-ignoreCrc] <src> ...                 # print file contents
hdfs dfs -checksum <src> ...
hdfs dfs -chgrp [-R] GROUP PATH...
hdfs dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...
hdfs dfs -chown [-R] [OWNER][:[GROUP]] PATH...
hdfs dfs -copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst>
hdfs dfs -copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>
hdfs dfs -count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] [-e] <path> ...
hdfs dfs -cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>      # copy
hdfs dfs -createSnapshot <snapshotDir> [<snapshotName>]
hdfs dfs -deleteSnapshot <snapshotDir> <snapshotName>
hdfs dfs -df [-h] [<path> ...]
hdfs dfs -du [-s] [-h] [-v] [-x] <path> ...
hdfs dfs -expunge
hdfs dfs -find <path> ... <expression> ...
hdfs dfs -get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>
hdfs dfs -getfacl [-R] <path>
hdfs dfs -getfattr [-R] {-n name | -d} [-e en] <path>
hdfs dfs -getmerge [-nl] [-skip-empty-file] <src> <localdst>
hdfs dfs -head <file>
hdfs dfs -help [cmd ...]
hdfs dfs -ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] [<path> ...]  # list files
hdfs dfs -mkdir [-p] <path> ...                      # create a directory
hdfs dfs -moveFromLocal <localsrc> ... <dst>
hdfs dfs -moveToLocal <src> <localdst>
hdfs dfs -mv <src> ... <dst>
hdfs dfs -put [-f] [-p] [-l] [-d] <localsrc> ... <dst>       # upload
hdfs dfs -renameSnapshot <snapshotDir> <oldName> <newName>
hdfs dfs -rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...
hdfs dfs -rmdir [--ignore-fail-on-non-empty] <dir> ...
hdfs dfs -setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]
hdfs dfs -setfattr {-n name [-v value] | -x name} <path>
hdfs dfs -setrep [-R] [-w] <rep> <path> ...
hdfs dfs -stat [format] <path> ...
hdfs dfs -tail [-f] <file>
hdfs dfs -test -[defsz] <path>
hdfs dfs -text [-ignoreCrc] <src> ...                # print file contents (decoded)
hdfs dfs -touchz <path> ...
hdfs dfs -truncate [-w] <length> <path> ...
hdfs dfs -usage [cmd ...]
```
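Most day-to-day work combines a handful of these commands. A minimal sketch (paths and file names are illustrative):

```shell
$ hdfs dfs -mkdir -p /user/hadoop/input            # create a directory tree
$ hdfs dfs -put hello.txt /user/hadoop/input       # upload a local file
$ hdfs dfs -ls /user/hadoop/input                  # list the directory
$ hdfs dfs -cat /user/hadoop/input/hello.txt       # print the file's contents
$ hdfs dfs -get /user/hadoop/input/hello.txt .     # download it back
```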
6. Operating HDFS with the Java API

Configuring a Windows machine to reach HDFS on a Linux VM
- Download Hadoop and extract it.
- Edit etc/hadoop/hadoop-env.cmd under the Hadoop directory and set JAVA_HOME:

```
@rem The java implementation to use. Required.
set JAVA_HOME=D:\software\jdk8u131
```

- Configure the HADOOP_HOME environment variable.
- Download winutils, the Hadoop helper binaries for Windows.
- Pick the matching version and copy it into the hadoop/bin directory.
- To check the setup, run hadoop version in a cmd window; it should print the version information.

Setting up a Hadoop development environment on Windows with Maven

POM configuration

```xml
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <hadoop.version>3.1.0</hadoop.version>
    <junit.version>5.2.0</junit.version>
</properties>

<dependencies>
    <!-- Hadoop dependency -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <!-- Unit-testing dependency -->
    <dependency>
        <groupId>org.junit.jupiter</groupId>
        <artifactId>junit-jupiter-api</artifactId>
        <version>${junit.version}</version>
        <scope>test</scope>
    </dependency>
    <!-- logback, optional -->
    <dependency>
        <groupId>com.typesafe.play</groupId>
        <artifactId>play-logback_2.12</artifactId>
        <version>2.6.15</version>
    </dependency>
</dependencies>

<repositories>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
</repositories>
```

Copy the Hadoop configuration files from the Linux machine into the project's /resources directory
- core-site.xml
- hdfs-site.xml

JUnit test cases

```java
import org.apache.commons.lang.ArrayUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.*;
import java.net.URI;

/**
 * Hadoop HDFS Java API operations.
 */
class HDFSApplicationTests {

    private static final Logger _LOGGER = LoggerFactory.getLogger(HDFSApplicationTests.class);

    private FileSystem fileSystem = null;
    private Configuration configuration = null;

    /**
     * <p>Initialize the HDFS connection.</p>
     */
    @BeforeEach
    void setUp() throws Exception {
        _LOGGER.info("HDFSApplicationTests#setUp");
        configuration = new Configuration();
        // The hostname must resolve; add it to the hosts file if necessary.
        URI uri = new URI("hdfs://ko-hadoop:8020");
        fileSystem = FileSystem.get(uri, configuration, "K.O");
    }

    /**
     * <p>Release the HDFS connection.</p>
     */
    @AfterEach
    void tearDown() throws Exception {
        configuration = null;
        fileSystem = null;
        _LOGGER.info("HDFSApplicationTests#tearDown");
    }

    /**
     * <p>Create a directory.</p>
     */
    @Test
    void mkdir() throws IOException {
        fileSystem.mkdirs(new Path("/hdfsapi/test"));
    }

    /**
     * <p>Create a file.</p>
     * <p>Make sure the DataNode port is reachable.</p>
     */
    @Test
    void createFile() throws IOException {
        FSDataOutputStream outputStream = fileSystem.create(new Path("/hdfsapi/test/a.txt"));
        outputStream.write("Hello Hadoop!".getBytes());
        outputStream.flush();
        outputStream.close();
    }

    /**
     * <p>Read the contents of a file stored in HDFS.</p>
     */
    @Test
    void cat() throws Exception {
        FSDataInputStream inputStream = fileSystem.open(new Path("/hdfsapi/test/a.txt"));
        byte[] b = new byte[1024];
        int len = inputStream.read(b);
        inputStream.close();
        _LOGGER.info("HDFSApplicationTests#cat: {}", new String(b, 0, len));
    }

    /**
     * <p>Rename a file.</p>
     */
    @Test
    void rename() throws Exception {
        Path oldPath = new Path("/hdfsapi/test/a.txt");
        Path newPath = new Path("/hdfsapi/test/b.txt");
        boolean success = fileSystem.rename(oldPath, newPath);
        assert success;
    }

    /**
     * <p>Upload a local file to the HDFS server.</p>
     */
    @Test
    void copyFromLocalFile() throws Exception {
        Path localPath = new Path("D:/tmp/hadoop.txt");
        Path hdfsPath = new Path("/hdfsapi/test");
        fileSystem.copyFromLocalFile(localPath, hdfsPath);
    }

    /**
     * <p>Upload a local file to the HDFS server, with a progress indicator.</p>
     */
    @Test
    void copyFromLocalFileWithProgress() throws Exception {
        InputStream in = new BufferedInputStream(
                new FileInputStream(
                        new File("D:/install/VMware-workstation-full-14.1.1-7528167.exe")));

        FSDataOutputStream out = fileSystem.create(new Path("/hdfsapi/test/vmware-14.exe"),
                new Progressable() {
                    public void progress() {
                        System.out.print("-"); // progress indicator
                    }
                });

        IOUtils.copyBytes(in, out, 4096);
        IOUtils.closeStream(in);
        IOUtils.closeStream(out);
    }

    /**
     * <p>Download a file from HDFS.</p>
     */
    @Test
    void copyToLocalFile() throws Exception {
        Path hdfsPath = new Path("/hdfsapi/test/b.txt");
        Path localPath = new Path("D:/tmp/h.txt");
        fileSystem.copyToLocalFile(hdfsPath, localPath);
    }

    /**
     * <p>List the files under an HDFS path.</p>
     */
    @Test
    void listFiles() throws Exception {
        FileStatus[] fileStatuses = fileSystem.listStatus(new Path("/hdfsapi/test"));
        if (ArrayUtils.isNotEmpty(fileStatuses)) {
            for (FileStatus fileStatus : fileStatuses) {
                // 1. Directory or regular file?
                String isDir = fileStatus.isDirectory() ? "directory" : "file";
                // 2. Number of replicas.
                short replication = fileStatus.getReplication();
                // 3. File size in bytes.
                long len = fileStatus.getLen();
                // 4. Full path.
                String path = fileStatus.getPath().toString();
                // 5. Print everything.
                _LOGGER.info("directory or file? {}", isDir);
                _LOGGER.info("{} replica(s).", replication);
                _LOGGER.info("file size: {} bytes", len);
                _LOGGER.info("full path: {}", path);
            }
        }
    }

    /**
     * <p>Delete an HDFS path recursively.</p>
     */
    @Test
    void delete() throws Exception {
        boolean success = fileSystem.delete(new Path("/hdfsapi/test"), true);
        assert success;
    }
}
```
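With core-site.xml and hdfs-site.xml copied into /resources as above, the suite can be run from the project root with mvn test, or test by test from the IDE.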
7. HDFS Read and Write Flow
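In brief, as described in the HDFS architecture guide:
- Read: the client asks the NameNode for the locations of the file's blocks, then streams each block directly from a (preferably nearby) DataNode holding a replica; file data never flows through the NameNode.
- Write: the client asks the NameNode to allocate each block; the DataNodes chosen for its replicas form a pipeline, the client streams the data to the first DataNode, which forwards it to the second, and so on, with acknowledgements returning along the pipeline. When the file is closed, the NameNode commits the metadata.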
8. Strengths and Weaknesses of HDFS
- Strengths of HDFS
  - Data redundancy and hardware fault tolerance
  - Suited to streaming data access
  - Suited to handling large files
  - Runs on inexpensive commodity machines
- Weaknesses of HDFS
  - Not suited to low-latency data access
  - Not suited to storing large numbers of small files