This section introduces the following transformation operators:
map
flatMap
mapPartitions
mapPartitionsWithIndex
filter
sample
union
intersection
sortBy
sortByKey
groupByKey
reduceByKey
distinct
coalesce
repartition
(1) map, mapPartitions, mapPartitionsWithIndex
- map operates on one record at a time
val arr = Array("Tom","Bob","Tony","Jerry")
//distribute the 4 records across 2 partitions
val rdd = sc.parallelize(arr,2)
/*
 * simulate writing each element of the RDD to a database
 */
rdd.map(x => {
println("opening database connection...")
println("writing to database...")
println("closing database connection...")
}).count()
Result:
opening database connection...
writing to database...
closing database connection...
opening database connection...
writing to database...
closing database connection...
opening database connection...
writing to database...
closing database connection...
opening database connection...
writing to database...
closing database connection...
Each of the four records opens and closes its own connection.
- mapPartitions operates on one partition at a time
import scala.collection.mutable.ListBuffer
val arr = Array("Tom","Bob","Tony","Jerry")
//distribute the 4 records across 2 partitions
val rdd = sc.parallelize(arr,2)
/*
 * Writing the data of an RDD to a database is usually implemented with the mapPartitions operator
 */
rdd.mapPartitions(x => {
println("opening database connection...")
val list = new ListBuffer[String]()
while(x.hasNext) {
// simulate writing to the database
list += x.next() + " written to database"
}
// simulate executing the SQL statement as a batch insert
list.iterator
}).foreach(println)
Result:
opening database connection...
Tom written to database
Bob written to database
opening database connection...
Tony written to database
Jerry written to database
Only one connection is opened per partition (two in total), instead of one per record.
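When nothing needs to be returned and the goal is only the write itself, the same per-partition pattern is usually expressed with the foreachPartition action instead of mapPartitions. A minimal sketch, reusing the rdd above, with the database calls still simulated by println:
rdd.foreachPartition(iter => {
  println("opening database connection...")
  // simulate a batch write of the whole partition
  iter.foreach(x => println(x + " written to database"))
  println("closing database connection...")
})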
- mapPartitionsWithIndex operates on one partition at a time and also passes in the index of that partition
val dataArr = Array("Tom01","Tom02","Tom03"
,"Tom04","Tom05","Tom06"
,"Tom07","Tom08","Tom09"
,"Tom10","Tom11","Tom12")
val rdd = sc.parallelize(dataArr, 3)
val result = rdd.mapPartitionsWithIndex((index,x) => {
val list = ListBuffer[String]()
while (x.hasNext) {
list += "partition:"+ index + " content:" + x.next
}
list.iterator
})
println("分區(qū)數(shù)量:" + result.partitions.size)
val resultArr = result.collect()
for(x <- resultArr){
println(x)
}
Result:
number of partitions: 3
partition:0 content:Tom01
partition:0 content:Tom02
partition:0 content:Tom03
partition:0 content:Tom04
partition:1 content:Tom05
partition:1 content:Tom06
partition:1 content:Tom07
partition:1 content:Tom08
partition:2 content:Tom09
partition:2 content:Tom10
partition:2 content:Tom11
partition:2 content:Tom12
(2) flatMap
val conf = new SparkConf().setAppName("FlatMapTest").setMaster("local")
val sc = new SparkContext(conf)
val data = Array("hello hadoop","hello hive", "hello spark")
val rdd = sc.makeRDD(data)
rdd.flatMap(_.split(" ")).foreach(println)
/*
Result:
hello
hadoop
hello
hive
hello
spark
*/
rdd.map(_.split(" ")).foreach(println)
/*
[Ljava.lang.String;@3c986196
[Ljava.lang.String;@113116a6
[Ljava.lang.String;@542d75a6
*/
The difference between map and flatMap:
map: one input record produces exactly one output record
flatMap: one input record may produce zero, one, or many output records
The following Scala session shows the difference and the relationship between the map, flatMap, and flatten functions:
scala> val arr = Array("hello hadoop","hello hive","hello spark")
arr: Array[String] = Array(hello hadoop, hello hive, hello spark)
scala> val map = arr.map(_.split(" "))
map: Array[Array[String]] = Array(Array(hello, hadoop), Array(hello, hive), Array(hello, spark))
scala> map.flatten
res1: Array[String] = Array(hello, hadoop, hello, hive, hello, spark)
scala> arr.flatMap(_.split(" "))
res2: Array[String] = Array(hello, hadoop, hello, hive, hello, spark)
(3) filter: filtering
val rdd = sc.makeRDD(Array("hello","hello","hello","world"))
// filter keeps only the records for which the predicate returns true
rdd.filter(!_.contains("hello")).foreach(println)
Result:
world
(4) sample: random sampling
sample(withReplacement: Boolean, fraction: Double, seed: Long)
withReplacement: whether to sample with replacement
true means an element that has already been drawn can be drawn again
false means an element that has already been drawn will not be drawn again
fraction: the fraction of the data to sample
seed: the random seed for the sampling algorithm; different values produce different samples; it can be set manually and defaults to a random Long
val rdd = sc.makeRDD(Array(
"hello1","hello2","hello3","hello4","hello5","hello6",
"world1","world2","world3","world4"
))
rdd.sample(false, 0.3).foreach(println)
Result: in theory about 30% of the data is drawn, but with a small data set the actual fraction can deviate noticeably
hello1
hello3
world3
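The other two parameters can also be set explicitly. A minimal sketch, reusing the rdd above, that samples with replacement and fixes the seed so the draw is reproducible across runs:
// withReplacement = true: the same element may be drawn more than once
// seed = 42L: fixing the seed makes repeated runs return the same sample
rdd.sample(true, 0.5, 42L).foreach(println)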
(5) union: logically merges two RDDs (the partitions are concatenated; duplicates are not removed)
val rdd1 =sc.makeRDD(1 to 3)
val rdd2 = sc.parallelize(4 until 6)
rdd1.union(rdd2).foreach {println}
Result:
1
2
3
4
5
(6) intersection: returns the elements that appear in both RDDs
val rdd1 =sc.makeRDD(1 to 3)
val rdd2 = sc.parallelize(2 to 5)
rdd1.intersection(rdd2).foreach(println)
Result:
2
3
(7) sortBy and sortByKey
- sortBy: sort by a field specified manually
val arr = Array(
Tuple3(190,100,"Jed"),
Tuple3(100,202,"Tom"),
Tuple3(90,111,"Tony")
)
val rdd = sc.parallelize(arr)
rdd.sortBy(_._1).foreach(println)
/* sorted by the first element
(90,111,Tony)
(100,202,Tom)
(190,100,Jed)
*/
rdd.sortBy(_._2, false).foreach(println)
/* sorted by the second element, in descending order
(100,202,Tom)
(90,111,Tony)
(190,100,Jed)
*/
rdd.sortBy(_._3).foreach(println)
/* sorted by the third element
(190,100,Jed)
(100,202,Tom)
(90,111,Tony)
*/
- sortByKey: sort by key
val rdd = sc.makeRDD(Array(
(5,"Tom"),(10,"Jed"),(3,"Tony"),(2,"Jack")
))
rdd.sortByKey().foreach(println)
Result:
(2,Jack)
(3,Tony)
(5,Tom)
(10,Jed)
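sortByKey also takes an ascending flag (default true). A minimal sketch, reusing the rdd above, that sorts the keys in descending order:
// collect() brings the sorted data to the driver so the print order is guaranteed
rdd.sortByKey(false).collect().foreach(println)
/*
(10,Jed)
(5,Tom)
(3,Tony)
(2,Jack)
*/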
(8) groupByKey and reduceByKey
val rdd = sc.makeRDD(Array(
("Tom",1),("Tom",1),("Tony",1),("Tony",1)
))
rdd.groupByKey().foreach(println)
/*
(Tom,CompactBuffer(1, 1))
(Tony,CompactBuffer(1, 1))
*/
rdd.reduceByKey(_+_).foreach(println)
/*
(Tom,2)
(Tony,2)
*/
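The two operators differ mainly in how they shuffle: reduceByKey pre-aggregates the values inside each partition before the shuffle (a map-side combine), while groupByKey ships every value across the network and only then lets you aggregate. A minimal sketch of the groupByKey equivalent of the sum above:
rdd.groupByKey().mapValues(_.sum).foreach(println)
/*
(Tom,2)
(Tony,2)
*/
// same result as reduceByKey(_+_), but every 1 is shuffled before being summed,
// so reduceByKey is generally preferred for simple aggregations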
(9) distinct: removes duplicate records
val rdd = sc.makeRDD(Array(
"hello",
"hello",
"hello",
"world"
))
rdd.distinct().foreach {println}
/*
hello
world
*/
// distinct = map + reduceByKey + map
val distinctRDD = rdd
.map {(_,1)}
.reduceByKey(_+_)
.map(_._1)
distinctRDD.foreach {println}
/*
hello
world
*/
(10) coalesce, repartition: change the number of partitions of an RDD
- coalesce
/*
 * The second argument of coalesce controls whether a shuffle happens:
 * false: no shuffle is performed
 * true: a shuffle is performed
 * If the new number of partitions is larger than the current one, it must be set to true, otherwise the number of partitions stays unchanged
 * When the number of partitions is increased, the data of the original partitions is redistributed randomly across the new partitions
 * The default is false
 */
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer

object CoalesceTest {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("CoalesceTest").setMaster("local")
val sc = new SparkContext(conf)
val arr = Array(
"partition:0 content:Tom01",
"partition:0 content:Tom02",
"partition:0 content:Tom03",
"partition:0 content:Tom04",
"partition:1 content:Tom05",
"partition:1 content:Tom06",
"partition:1 content:Tom07",
"partition:1 content:Tom08",
"partition:2 content:Tom09",
"partition:2 content:Tom10",
"partition:2 content:Tom11",
"partition:2 content:Tom12")
val rdd = sc.parallelize(arr, 3)
val coalesceRdd = rdd.coalesce(6,true)
val results = coalesceRdd.mapPartitionsWithIndex((index,x) => {
val list = ListBuffer[String]()
while (x.hasNext) {
list += "partition:"+ index + " content:[" + x.next + "]"
}
list.iterator
})
println("分區(qū)數(shù)量:" + results.partitions.size)
results.foreach(println)
/*
number of partitions: 6
partition:0 content:[partition:1 content:Tom07]
partition:0 content:[partition:2 content:Tom10]
partition:1 content:[partition:0 content:Tom01]
partition:1 content:[partition:1 content:Tom08]
partition:1 content:[partition:2 content:Tom11]
partition:2 content:[partition:0 content:Tom02]
partition:2 content:[partition:2 content:Tom12]
partition:3 content:[partition:0 content:Tom03]
partition:4 content:[partition:0 content:Tom04]
partition:4 content:[partition:1 content:Tom05]
partition:5 content:[partition:1 content:Tom06]
partition:5 content:[partition:2 content:Tom09]
*/
// Increasing the number of partitions always requires a shuffle; with shuffle set to false the request has no effect and the partitioning stays the same
val coalesceRdd2 = rdd.coalesce(6,false)
val results2 = coalesceRdd2.mapPartitionsWithIndex((index,x) => {
val list = ListBuffer[String]()
while (x.hasNext) {
list += "partition:"+ index + " content:[" + x.next + "]"
}
list.iterator
})
println("分區(qū)數(shù)量:" + results2.partitions.size)
results2.foreach(println)
/*
number of partitions: 3
partition:0 content:[partition:0 content:Tom01]
partition:0 content:[partition:0 content:Tom02]
partition:0 content:[partition:0 content:Tom03]
partition:0 content:[partition:0 content:Tom04]
partition:1 content:[partition:1 content:Tom05]
partition:1 content:[partition:1 content:Tom06]
partition:1 content:[partition:1 content:Tom07]
partition:1 content:[partition:1 content:Tom08]
partition:2 content:[partition:2 content:Tom09]
partition:2 content:[partition:2 content:Tom10]
partition:2 content:[partition:2 content:Tom11]
partition:2 content:[partition:2 content:Tom12]
*/
val coalesceRdd3 = rdd.coalesce(2,true)
val results3 = coalesceRdd3.mapPartitionsWithIndex((index,x) => {
val list = ListBuffer[String]()
while (x.hasNext) {
list += "partition:"+ index + " content:[" + x.next + "]"
}
list.iterator
})
println("分區(qū)數(shù)量:" + results3.partitions.size)
results3.foreach(println)
/*
number of partitions: 2
partition:0 content:[partition:0 content:Tom01]
partition:0 content:[partition:0 content:Tom03]
partition:0 content:[partition:1 content:Tom05]
partition:0 content:[partition:1 content:Tom07]
partition:0 content:[partition:2 content:Tom09]
partition:0 content:[partition:2 content:Tom11]
partition:1 content:[partition:0 content:Tom02]
partition:1 content:[partition:0 content:Tom04]
partition:1 content:[partition:1 content:Tom06]
partition:1 content:[partition:1 content:Tom08]
partition:1 content:[partition:2 content:Tom10]
partition:1 content:[partition:2 content:Tom12]
*/
val coalesceRdd4 = rdd.coalesce(2,false)
val results4 = coalesceRdd4.mapPartitionsWithIndex((index,x) => {
val list = ListBuffer[String]()
while (x.hasNext) {
list += "partition:"+ index + " content:[" + x.next + "]"
}
list.iterator
})
println("分區(qū)數(shù)量:" + results4.partitions.size)
results4.foreach(println)
/*
number of partitions: 2
partition:0 content:[partition:0 content:Tom01]
partition:0 content:[partition:0 content:Tom02]
partition:0 content:[partition:0 content:Tom03]
partition:0 content:[partition:0 content:Tom04]
partition:1 content:[partition:1 content:Tom05]
partition:1 content:[partition:1 content:Tom06]
partition:1 content:[partition:1 content:Tom07]
partition:1 content:[partition:1 content:Tom08]
partition:1 content:[partition:2 content:Tom09]
partition:1 content:[partition:2 content:Tom10]
partition:1 content:[partition:2 content:Tom11]
partition:1 content:[partition:2 content:Tom12]
*/
}
}
- repartition
repartition(n) is equivalent to coalesce(n, shuffle = true)
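A minimal sketch, reusing the rdd from the CoalesceTest example above, showing that repartition always shuffles the data into the requested number of partitions:
val repartitionRdd = rdd.repartition(6)
println("number of partitions: " + repartitionRdd.partitions.size)
// prints: number of partitions: 6
// repartition(6) behaves exactly like the coalesce(6, true) case shown above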
- partitionBy: repartition the RDD with a custom partitioner
package com.aura.transformations
import org.apache.spark.{Partitioner, SparkConf, SparkContext}
/**
* Author: Jed
* Description: a custom partitioning rule
* Date: Created on 2018/1/12
*/
class MyPartition extends Partitioner {
// the total number of partitions is 2
override def numPartitions: Int = 2
// custom partitioning rule: keys with an even hashCode go to partition 0, the rest to partition 1
override def getPartition(key: Any): Int = {
if(key.hashCode() % 2 == 0) {
0
}else {
1
}
}
}
object PartitionByTest {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("PartitionByTest").setMaster("local")
val sc = new SparkContext(conf)
val arr = Array((1,1),(2,2),(3,3),(4,4),(5,5),(6,6),(7,7),(8,8),(9,9))
val rdd = sc.makeRDD(arr,3)
println("分區(qū)數(shù)量:" + rdd.partitions.length)
rdd.foreachPartition(x => {
println("*******")
while(x.hasNext) {
println(x.next())
}
})
/*
number of partitions: 3
*******
(1,1)
(2,2)
(3,3)
*******
(4,4)
(5,5)
(6,6)
*******
(7,7)
(8,8)
(9,9)
*/
val partitionRDD = rdd.partitionBy(new MyPartition)
println("分區(qū)數(shù)量:" + partitionRDD.partitions.length)
partitionRDD.foreachPartition(x => {
println("*******")
while(x.hasNext) {
println(x.next())
}
})
/*
number of partitions: 2
*******
(2,2)
(4,4)
(6,6)
(8,8)
*******
(1,1)
(3,3)
(5,5)
(7,7)
(9,9)
*/
}
}
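For this kind of even/odd split a custom partitioner is not strictly necessary: Spark's built-in HashPartitioner distributes keys by key.hashCode modulo the number of partitions and produces the same layout here. A minimal sketch, reusing the rdd from PartitionByTest:
import org.apache.spark.HashPartitioner

val hashPartitionRDD = rdd.partitionBy(new HashPartitioner(2))
println("number of partitions: " + hashPartitionRDD.partitions.length)
// for the Int keys 1..9, hashCode == value, so even keys land in partition 0
// and odd keys in partition 1, just like MyPartition above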