I. Background
1. The cloud host runs Linux, with Hadoop set up in pseudo-distributed mode.
Public IP: 139.198.18.xxx
Internal IP: 192.168.137.2
Hostname: hadoop001
2. The local core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop001:9001</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>hdfs://hadoop001:9001/hadoop/tmp</value>
  </property>
</configuration>
3. The local hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
4. The hosts file on the cloud host:
[hadoop@hadoop001 ~]$ cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
# hostname loopback address
192.168.137.2 hadoop001
The cloud host maps its internal IP to the hostname hadoop001.
5. The local hosts file:
139.198.18.XXX hadoop001
Locally, the public IP has been mapped to the hostname hadoop001.
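To double-check that this local mapping is in effect, a quick resolution check can be run on the development machine (a minimal sketch I am adding for illustration; it is not part of the original setup):

import java.net.InetAddress

object ResolveCheck {
  def main(args: Array[String]): Unit = {
    // Should print the public IP configured in the local hosts file (139.198.18.XXX)
    println(InetAddress.getByName("hadoop001").getHostAddress)
  }
}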
II. Symptoms
1. HDFS starts on the cloud host, jps shows all processes running normally, and operating on HDFS files through the shell works fine.
2. The web UI on port 50070 is also reachable from a browser.
3. On the local machine, operating on the remote HDFS through the Java API works; the URI hostname resolves to the public IP via the local hosts file. The code:
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val uri = new URI("hdfs://hadoop001:9001")
val fs = FileSystem.get(uri, conf)
// Recursively list files under /data (a metadata-only operation)
val listfiles = fs.listFiles(new Path("/data"), true)
while (listfiles.hasNext) {
  val nextfile = listfiles.next()
  println("get file path:" + nextfile.getPath.toString)
}
------------------------------ Output ---------------------------------
get file path:hdfs://hadoop001:9001/data/infos.txt
4. On the local machine, reading the same file from HDFS with Spark SQL and converting it to a DataFrame:
import org.apache.spark.sql.SparkSession

object SparkSQLApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SparkSQLApp").master("local[2]").getOrCreate()
    // The path is resolved against fs.defaultFS (hdfs://hadoop001:9001), so this read goes through a DataNode
    val info = spark.sparkContext.textFile("/data/infos.txt")
    import spark.implicits._
    val infoDF = info.map(_.split(",")).map(x => Info(x(0).toInt, x(1), x(2).toInt)).toDF()
    infoDF.show()
    spark.stop()
  }
  case class Info(id: Int, name: String, age: Int)
}
The job fails with the following errors:
....
....
....
19/02/23 16:07:00 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
19/02/23 16:07:00 INFO HadoopRDD: Input split: hdfs://hadoop001:9001/data/infos.txt:0+17
19/02/23 16:07:21 WARN BlockReaderFactory: I/O error constructing remote block reader.
java.net.ConnectException: Connection timed out: no further information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
.....
....
19/02/23 16:07:21 INFO DFSClient: Could not obtain BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 from any node: java.io.IOException: No live nodes contain block BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 after checking nodes = [DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK]], ignoredNodes = null No live nodes contain current block Block locations: DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK] Dead nodes: DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK]. Will get new block locations from namenode and retry...
19/02/23 16:07:21 WARN DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 272.617680460432 msec.
19/02/23 16:07:42 WARN BlockReaderFactory: I/O error constructing remote block reader.
java.net.ConnectException: Connection timed out: no further information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
...
...
19/02/23 16:07:42 WARN DFSClient: Failed to connect to /192.168.137.2:50010 for block, add to deadNodes and continue. java.net.ConnectException: Connection timed out: no further information
java.net.ConnectException: Connection timed out: no further information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3499)
...
...
19/02/23 16:08:12 WARN DFSClient: Failed to connect to /192.168.137.2:50010 for block, add to deadNodes and continue. java.net.ConnectException: Connection timed out: no further information
java.net.ConnectException: Connection timed out: no further information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
...
...
19/02/23 16:08:12 INFO DFSClient: Could not obtain BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 from any node: java.io.IOException: No live nodes contain block BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 after checking nodes = [DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK]], ignoredNodes = null No live nodes contain current block Block locations: DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK] Dead nodes: DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK]. Will get new block locations from namenode and retry...
19/02/23 16:08:12 WARN DFSClient: DFS chooseDataNode: got # 3 IOException, will wait for 11918.913311370841 msec.
19/02/23 16:08:45 WARN BlockReaderFactory: I/O error constructing remote block reader.
java.net.ConnectException: Connection timed out: no further information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
...
...
19/02/23 16:08:45 WARN DFSClient: Could not obtain block: BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 file=/data/infos.txt No live nodes contain current block Block locations: DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK] Dead nodes: DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK]. Throwing a BlockMissingException
19/02/23 16:08:45 WARN DFSClient: Could not obtain block: BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 file=/data/infos.txt No live nodes contain current block Block locations: DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK] Dead nodes: DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK]. Throwing a BlockMissingException
19/02/23 16:08:45 WARN DFSClient: DFS Read
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 file=/data/infos.txt
at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1001)
...
...
19/02/23 16:08:45 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 file=/data/infos.txt
at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1001)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:648)
...
...
19/02/23 16:08:45 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
19/02/23 16:08:45 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
19/02/23 16:08:45 INFO TaskSchedulerImpl: Cancelling stage 0
19/02/23 16:08:45 INFO DAGScheduler: ResultStage 0 (show at SparkSQLApp.scala:30) failed in 105.618 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 file=/data/infos.txt
at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1001)
...
...
Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 file=/data/infos.txt
at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1001)
...
...
III. Analysis
1. Shell operations on the cloud host work, which rules out problems with the cluster setup or with daemons not running.
2. No firewall is enabled on the cloud host itself, so a host-level firewall is not the issue.
3. The cloud server's firewall has the DataNode data-transfer port (50010 by default) open.
4. I set up another VM on the same LAN as the local machine, and HDFS on that VM can be operated from the local machine without any problem, which all but confirms the internal/public network split is the cause.
5. Further reading shows that directory and file names in HDFS are metadata kept on the NameNode, and operating on them does not require talking to a DataNode. Since creating directories and files works, the client can reach the remote NameNode; the problem is most likely the connection between the local client and the remote DataNode. The sketch below illustrates the distinction.
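A minimal sketch of that distinction, assuming the same URI as above (my illustration, not from the original post): mkdirs and listStatus only need NameNode RPC, while reading actual bytes opens a connection to a DataNode on port 50010.

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object NameNodeVsDataNode {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new URI("hdfs://hadoop001:9001"), new Configuration())

    // Metadata-only operations: NameNode RPC, no DataNode involved -> these succeed
    println(fs.mkdirs(new Path("/tmp/connectivity-check")))
    fs.listStatus(new Path("/data")).foreach(s => println(s.getPath))

    // Reading bytes requires connecting to a DataNode on port 50010 -> this is what times out
    val in = fs.open(new Path("/data/infos.txt"))
    println(in.read())
    in.close()
  }
}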
四孽江、問(wèn)題猜想
由于本地測(cè)試和云主機(jī)不在一個(gè)局域網(wǎng)讶坯,hadoop配置文件是以內(nèi)網(wǎng)ip作為機(jī)器間通信的ip。在這種情況下,我們能夠訪問(wèn)到namenode機(jī)器岗屏,namenode會(huì)給我們數(shù)據(jù)所在機(jī)器的ip地址供我們?cè)L問(wèn)數(shù)據(jù)傳輸服務(wù)辆琅,但是當(dāng)寫數(shù)據(jù)的時(shí)候,NameNode 和DataNode 是通過(guò)內(nèi)網(wǎng)通信的这刷,返回的是datanode內(nèi)網(wǎng)的ip,我們無(wú)法根據(jù)該IP訪問(wèn)datanode服務(wù)器婉烟。
我們來(lái)看一下其中一部分報(bào)錯(cuò)信息:
19/02/23 16:07:21 WARN BlockReaderFactory: I/O error constructing remote block reader.
java.net.ConnectException: Connection timed out: no further information
...
19/02/23 16:07:42 WARN DFSClient: Failed to connect to /192.168.137.2:50010 for block, add to deadNodes and continue....
The errors show the client cannot connect to 192.168.137.2:50010, which is the DataNode's address; from outside the LAN the DataNode is only reachable at 139.198.18.XXX:50010.
To let the development machine reach HDFS, we can access it by hostname and have the NameNode return the DataNode's hostname instead of its IP.
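A quick way to confirm this from the development machine (a sketch added for illustration, not part of the original troubleshooting; the port and addresses come from the log above):

import java.net.{InetSocketAddress, Socket}
import scala.util.Try

object PortProbe {
  // Returns true if a TCP connection to host:port succeeds within the timeout
  def reachable(host: String, port: Int, timeoutMs: Int = 5000): Boolean =
    Try {
      val s = new Socket()
      try s.connect(new InetSocketAddress(host, port), timeoutMs) finally s.close()
    }.isSuccess

  def main(args: Array[String]): Unit = {
    println(s"hadoop001:50010     -> ${reachable("hadoop001", 50010)}")     // resolves to the public IP locally
    println(s"192.168.137.2:50010 -> ${reachable("192.168.137.2", 50010)}") // internal IP, unreachable from outside
  }
}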
五定鸟、問(wèn)題解決
1.嘗試一:
在開發(fā)機(jī)器的hosts文件中配置datanode對(duì)應(yīng)的外網(wǎng)ip和域名(上文已經(jīng)配置)而涉,并且在與hdfs交互的程序中添加如下代碼:
val conf = new Configuration()
conf.set("dfs.client.use.datanode.hostname", "true")
Same error as before.
2. Attempt 2:
val spark = SparkSession
  .builder()
  .appName("SparkSQLApp")
  .master("local[2]")
  .config("dfs.client.use.datanode.hostname", "true")
  .getOrCreate()
Same error as before.
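A likely reason attempts 1 and 2 fail (my assumption, not verified in the original post) is that the hand-built Configuration in attempt 1 is not the one Spark's Hadoop input format uses, and SparkSession.config only forwards a property into the Hadoop Configuration when it carries the spark.hadoop. prefix. Two ways the setting should reach the HDFS client, sketched below:

import org.apache.spark.sql.SparkSession

// Option A: properties prefixed with spark.hadoop. are copied into Spark's Hadoop Configuration
val spark = SparkSession
  .builder()
  .appName("SparkSQLApp")
  .master("local[2]")
  .config("spark.hadoop.dfs.client.use.datanode.hostname", "true")
  .getOrCreate()

// Option B: set the property directly on the Hadoop Configuration Spark actually uses,
// before any RDD reading from HDFS is created
spark.sparkContext.hadoopConfiguration.set("dfs.client.use.datanode.hostname", "true")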
3. Attempt 3:
Add the following to hdfs-site.xml:
<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>
The job now runs successfully.
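Note that this is the client-side hdfs-site.xml: the HDFS client only picks it up if the file is on the application's classpath (for example src/main/resources in a typical sbt/Maven project; the post does not say where it lives, so that location is an assumption). A quick check that the file is visible to the client:

object ClasspathCheck {
  def main(args: Array[String]): Unit = {
    // Prints the URL of hdfs-site.xml if it is on the classpath, or null if it is not
    println(getClass.getClassLoader.getResource("hdfs-site.xml"))
  }
}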
The material I consulted also recommends adding the dfs.datanode.use.datanode.hostname property to hdfs-site.xml, so that DataNode-to-DataNode communication also goes through hostnames:
<property>
  <name>dfs.datanode.use.datanode.hostname</name>
  <value>true</value>
</property>
This makes changing internal IPs much simpler and makes data exchange between particular DataNodes easier. The trade-off is that the whole cluster stops working if DNS resolution fails, so DNS (or the hosts files) has to be reliable.
Summary: switch from the default IP-based addressing to hostname-based addressing.
六客叉、參考資料
https://blog.csdn.net/vaf714/article/details/82996860
https://www.cnblogs.com/krcys/p/9146329.html
https://blog.csdn.net/dominic_tiger/article/details/71773656
https://rainerpeter.wordpress.com/2014/02/12/connect-to-hdfs-running-in-ec2-using-public-ip-addresses/