A customer using our Spark offering (they run the Spark Thrift Server) reported that after about a day of use it starts throwing errors. Restarting the Spark Thrift Server makes the error go away for a while, but that only treats the symptom, not the cause; the underlying problem still had to be dug out and fixed.
The error message was as follows:

- Since the error in the log came from HDFS's `BlockManager`, I went to that class in the HDFS source and looked at its `chooseTarget4NewBlock` method:
```java
/**
 * Choose target datanodes for creating a new block.
 *
 * @throws IOException
 *           if the number of targets < minimum replication.
 * @see BlockPlacementPolicy#chooseTarget(String, int, Node,
 *      Set, long, List, BlockStoragePolicy)
 */
public DatanodeStorageInfo[] chooseTarget4NewBlock(final String src,
    final int numOfReplicas, final Node client,
    final Set<Node> excludedNodes,
    final long blocksize,
    final List<String> favoredNodes,
    final byte storagePolicyID) throws IOException {
  List<DatanodeDescriptor> favoredDatanodeDescriptors =
      getDatanodeDescriptors(favoredNodes);
  final BlockStoragePolicy storagePolicy = storagePolicySuite.getPolicy(storagePolicyID);
  final DatanodeStorageInfo[] targets = blockplacement.chooseTarget(src,
      numOfReplicas, client, excludedNodes, blocksize,
      favoredDatanodeDescriptors, storagePolicy);
  if (targets.length < minReplication) {
    throw new IOException("File " + src + " could only be replicated to "
        + targets.length + " nodes instead of minReplication (="
        + minReplication + "). There are "
        + getDatanodeManager().getNetworkTopology().getNumOfLeaves()
        + " datanode(s) running and "
        + (excludedNodes == null? "no": excludedNodes.size())
        + " node(s) are excluded in this operation.");
  }
  return targets;
}
```
- From this method it is clear that the failure is an HDFS block placement problem: the NameNode could not find enough DataNodes with usable space to place the new block on, so it threw the `IOException` above. I therefore ran `hdfs dfsadmin -report` and found that two machines had severely low `DFS Remaining` and `DFS Remaining%`. (The DataNode logs also show hints of the same problem; a small capacity-check sketch follows the list below.)
- Ways to fix the problem:
1. Add more DataNode nodes to the cluster.
2. Add disks to the DataNodes that are short on space. (It is also possible that the machines do have enough disk space, but the disks were never successfully mounted into HDFS; see the configuration sketch at the end of this post.)
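
To make the diagnosis step above repeatable, here is a minimal sketch that reads the same per-DataNode numbers that `hdfs dfsadmin -report` prints, using the standard HDFS client API. The class name `DfsRemainingCheck` and the 10% warning threshold are my own illustrative choices, not part of the original troubleshooting session.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class DfsRemainingCheck {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from the core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    if (!(fs instanceof DistributedFileSystem)) {
      throw new IllegalStateException("Not an HDFS filesystem: " + fs.getUri());
    }
    DistributedFileSystem dfs = (DistributedFileSystem) fs;

    // Same per-datanode capacity numbers that `hdfs dfsadmin -report` prints.
    for (DatanodeInfo dn : dfs.getDataNodeStats()) {
      long remainingGb = dn.getRemaining() / (1024L * 1024 * 1024);
      float remainingPct = dn.getRemainingPercent();
      // 10% is an arbitrary warning threshold for this sketch.
      String warn = (remainingPct < 10.0f) ? "  <-- low on DFS Remaining" : "";
      System.out.printf("%-30s DFS Remaining: %d GB (%.2f%%)%s%n",
          dn.getHostName(), remainingGb, remainingPct, warn);
    }
  }
}
```

Run it with the cluster's `core-site.xml`/`hdfs-site.xml` on the classpath (for example via `hadoop jar`); any DataNode flagged as low on `DFS Remaining` is a candidate for the fixes listed above.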
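
On fix 2, "not successfully mounted into HDFS" usually means the new disk is mounted at the OS level but is not listed in the DataNode's `dfs.datanode.data.dir`, so HDFS never writes blocks to it. Below is a hedged `hdfs-site.xml` sketch; the `/data1/hdfs/data` and `/data2/hdfs/data` paths are made-up examples, and the DataNode needs to be restarted (with the new directory owned by the HDFS user) before the extra capacity shows up in `hdfs dfsadmin -report`.

```xml
<!-- hdfs-site.xml on the affected DataNode: every mount point that should
     store HDFS blocks must appear in dfs.datanode.data.dir, otherwise the
     newly added disk is never used by HDFS. -->
<property>
  <name>dfs.datanode.data.dir</name>
  <!-- /data2/hdfs/data lives on the newly mounted disk (example path) -->
  <value>/data1/hdfs/data,/data2/hdfs/data</value>
</property>
```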