makeRdd的創(chuàng)建
RDD的創(chuàng)建
rdd的創(chuàng)建方式大致分為3種:從集合中創(chuàng)建rdd,從外部存儲(chǔ)恨搓,從其他rdd創(chuàng)建
從集合中創(chuàng)建
分為parallelize和makeRdd的2種方式痹换,異同點(diǎn)在于makeRdd還可以指定數(shù)據(jù)的分區(qū)位置
val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8))
val rdd = sc.makeRdd(Array(1,2,3,4,5,6,7,8))
val seq = List((1, List("slave01")),| (2, List("slave02")))
val guigu3 = sc.makeRDD(seq)
//scala>guigu3.preferredLocations(guigu3.partitions(1))
//res26: Seq[String] = List(slave02)
TransFormation
def map[U:classTag](f: T => U)
一對(duì)一轉(zhuǎn)換
def filter(f: T=>Boolean): RDD[T]
傳入一個(gè)Boolean的方法蝗羊,過濾數(shù)據(jù)
def flatmap[U:ClassTag](f: T => TraversableOnce): RDD[U]
一對(duì)多 并且將多壓平
def mapPartition[U:ClassTag](f:Iterator[T] => Iterator[U],preservesPartitioning: Boolean = false):RDD[U]
對(duì)于一個(gè)分區(qū)的數(shù)據(jù)執(zhí)行一個(gè)函數(shù),性能比map要高
def mapPartitionsWithIndex[U:ClassTag](f:(Int,Iterator[T])=>Iterator[U],preservesPartitioning:Boolean = false):RDD[U]
指定某一個(gè)分區(qū)
def sample(withReplacement:Boolean,fraction:Double,seed: Long = Utils.random.nextLong):RDD[T]
有放回和無放回抽樣
def union(other: RDD[T]): RDD[T]
聯(lián)合一個(gè)RDD统翩,返回組合的RDD
def intersection(other: RDD[T]):RDD[T]
rdd求交集
def distinct():RDD[T]
去重
def partitionBy(partitioner: Partitioner):RDD[(k,v)]
用提供的分區(qū)器分區(qū)
def reduceByKey(func:(v,v) => v):RDD[(k,v)]
根據(jù)Key進(jìn)行聚合 預(yù)聚合
def groupByKey(partitioner:Partitioner):RDD[(k,Iterable[v])]
將key相同的value聚合在一起
def combinByKey[C]( createCombiner:V => C, mergeValue:(C,V) => C, mergeCombiners:(C,C) => C, numPartitions: Int):RDD[(K,C)]
def aggregateByKey[U:ClassTag](zeroValue:U,partitioner:Partitioner)(seqOp:(U,V)=>U,combOp:(U,U)=>U):RDD[(k,u)]
是CombineByKey的簡(jiǎn)化版仙蚜,可以通過zeroValue直接提供一個(gè)初始值
def foldBykey(zeroValue:V,partitioner:Partitioner)(func:(v,v)=>v):RDD(K,V))
該函數(shù)為aggeregateByKey的簡(jiǎn)化版,seqOp和combOp一樣,相同
def sortByKey
根據(jù)Key來進(jìn)行排序厂汗,如果key目前不支持排序委粉,需求with Order接口,實(shí)現(xiàn)compare方法娶桦,告訴spark key的大小判定
def sortBy[K](f:(T)=>K,ascending:Boolean=true,numPartitons:Int = this.partitions.length):RDD[(K,V)]
根據(jù)f函數(shù)提供可以排序的key
def join[W](other:RDD[(K,W)],partition:Partitioner):RDD[K,(V,W)]
連接二個(gè)RDD的數(shù)據(jù)
def cogroup[W](other:RDD[(K,W)],partitioner:Partitioner):RDD[K,(Iterable[V],Iterable[W]))]
分別將相同key的數(shù)據(jù)聚集在一起
def cartesian[U:ClassTag](other:RDD[U]):RDD[(T,U)]
做笛卡爾積 n*m
def pipe(command:String):RDD[]String
執(zhí)行外部腳本
def coalesce(numPartitions:Int,shuffle:Boolean = false,partitionCoalescer:Option[PartitionCoalescer] = Option.empty)(impliciatord:Order[T] = null): RDD[T]
縮減分區(qū)數(shù)贾节,用于大數(shù)據(jù)集過濾后。提高小數(shù)據(jù)集的執(zhí)行效率
def repartition(numPartitions:Int)(implicit ord: Ordering[T] = null): RDD[T]
重新分區(qū)
def glom():RDD[Array[T]]
def mapValues[U](f: V => U):RDD[(K,U)]
對(duì)于kv結(jié)構(gòu)RDD,只處理value
def subtracte(other: RDD[T]): RDD[T]
去掉和other重復(fù)的元素