- Official documentation introduction
One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations.
When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x).
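A minimal sketch of this behavior (the application name, master setting, and input path are placeholders, not from the original text): the first action materializes the RDD and caches the partitions it computes, and later actions reuse those in-memory partitions instead of recomputing from the source.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PersistExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("persist-example").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Hypothetical input path, used only for illustration.
    val lines = sc.textFile("data/input.txt")
    val words = lines.flatMap(_.split("\\s+")).cache() // mark for in-memory reuse

    // The first action computes the RDD and stores its partitions in memory.
    println(words.count())
    // Later actions on the same RDD reuse the cached partitions.
    println(words.filter(_.startsWith("spark")).count())

    sc.stop()
  }
}
```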
Caching is a key tool for iterative algorithms and fast interactive use.
You can mark an RDD to be persisted using the persist() or cache() methods on it.
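As a brief illustration of the two methods (the function and RDD names here are hypothetical): on the RDD API, cache() persists with the default MEMORY_ONLY storage level, while persist() lets you choose a storage level explicitly. Note that a given RDD can only be assigned one storage level.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// `words` and `pairs` stand for any two RDDs the application has built.
def markForReuse(words: RDD[String], pairs: RDD[(String, Int)]): Unit = {
  // cache() persists with the default MEMORY_ONLY storage level.
  words.cache()

  // persist() accepts an explicit StorageLevel, e.g. spilling partitions
  // to disk when they do not fit in memory.
  pairs.persist(StorageLevel.MEMORY_AND_DISK)
}
```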
- cache()
2.1 Source code