RDD的存儲機制:
-
其數(shù)據(jù)分布存儲在多臺機器上烦周,都是以block的形式存儲在服務(wù)器上。每個Executor都會啟動一個BlockManagerSlave怎顾,并且管理一部分block;block的元數(shù)據(jù)保存在Driver的BlockManagerMaster上读慎,當(dāng)BlockManagerSlave產(chǎn)生block之后就會向BlockManagerMaster注冊該block。RDD的partion是一個邏輯數(shù)據(jù)塊槐雾,對應(yīng)相應(yīng)的物理數(shù)據(jù)塊block贪壳。具體流程如下:
RDD數(shù)據(jù)存儲機制
1.分區(qū)列表性
/**
* Implemented by subclasses to return the set of partitions in this RDD. This method will only
* be called once, so it is safe to implement a time-consuming computation in it.
*
* The partitions in this array must satisfy the following property:
* `rdd.partitions.zipWithIndex.forall { case (partition, index) => partition.index == index }`
*/
protected def getPartitions: Array [Partition]
2.依賴其他RDD,血緣關(guān)系
/**
* Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
* be called once, so it is safe to implement a time-consuming computation in it.
*/
protected def getDependencies: Seq [Dependency[_]]= deps
3.每個分區(qū)都有一個函數(shù)計算
/**
* :: DeveloperApi ::
* Implemented by subclasses to compute a given partition.
*/
@DeveloperApi
def compute(split: Partition, context: TaskContext): Iterator[T]
4.key-value的分區(qū)器
/** Optionally overridden by subclasses to specify how they are partitioned. */
@transient val partitioner: Option[Partitioner] = None
5.每個分區(qū)都有一個分區(qū)位置列表
/**
* Optionally overridden by subclasses to specify placement preferences.
* 可選的蚜退,輸入?yún)?shù)是 split分片,返回的是一組優(yōu)先的節(jié)點位置彪笼。例如hdfs的block位置
*/
protected def getPreferredLocations(split: Partition): Seq[String] = Nil