groupByKey
groupByKey([numTasks])是數(shù)據(jù)分組操作他宛,在一個(gè)由(K,V)對組成的數(shù)據(jù)集上調(diào)用椅文,返回一個(gè)(K,Seq[V])對的數(shù)據(jù)集饱狂。
val rdd0 = sc.parallelize(Array((1,1), (1,2) , (1,3) , (2,1) , (2,2) , (2,3)), 3)
val rdd1 = rdd0.groupByKey()
rdd1.collect
res0: Array[(Int, Iterable[Int])] = Array((1,ArrayBuffer(1, 2, 3)), (2,ArrayBuffer(1, 2, 3)))
union
union(otherDataset)是數(shù)據(jù)合并廷没,返回一個(gè)新的數(shù)據(jù)集钥弯,由原數(shù)據(jù)集和otherDataset聯(lián)合而成壹罚。
val rdd1 = sc.parallelize(1 to 9, 3)
val rdd2 = rdd1.map(x => x * 2)
rdd2.collect
val rdd3 = rdd2.filter(x => x > 10)
rdd3.collect
val rdd4 = rdd1.union(rdd3)
rdd4.collect
res: Array[Int] = Array(1,2,3,4,5,6,7,8,9,12,14,16,18)
join
join(otherDataset, [numTasks])是連接操作,將輸入數(shù)據(jù)集(K,V)和另外一個(gè)數(shù)據(jù)集(K,W)進(jìn)行Join寿羞, 得到(K, (V,W))猖凛;該操作是對于相同K的V和W集合進(jìn)行笛卡爾積 操作,也即V和W的所有組合绪穆;
val rdd0 = sc.parallelize((1,1),(1,2),(1,3),(2,1),(2,2),(2,3), 3)
val rdd5 = rdd0.join(rdd0)
rdd5.collect
res:?Array[(Int, (Int, Int))] = Array((1,(1,1)), (1,(1,2)), (1,(1,3)), (1,(2,1)), (1,(2,2)), (1,(2,3)), (1,(3,1)), (1,(3,2)), (1,(3,3)), (2,(1,1)), (2,(1,2)), (2,(1,3)), (2,(2,1)), (2,(2,2)), (2,(2,3)), (2,(3,1)), (2,(3,2)), (2,(3,3)))
cogroup
cogroup(otherDataset, [numTasks])是將輸入數(shù)據(jù)集(K, V)和另外一個(gè)數(shù)據(jù)集(K, W)進(jìn)行cogroup辨泳,得到一個(gè)格式為(K, Seq[V], Seq[W])的數(shù)據(jù)集虱岂。
val rdd6 = rdd0.cogroup(rdd0)
rdd6.collect
res: Array[(Int, (Iterable[Int], Iterable[Int]))] = Array((1,(ArrayBuffer(1, 2, 3),ArrayBuffer(1, 2, 3))), (2,(ArrayBuffer(1, 2, 3),ArrayBuffer(1, 2, 3))))