Common Spark Transformations (Part 1)


This post introduces the following transformation operators:
map
flatMap
mapPartitions
mapPartitionsWithIndex
filter
sample
union
intersection
sortBy
sortByKey
groupByKey
reduceByKey
distinct
coalesce
repartition


(1) map, mapPartitions, mapPartitionsWithIndex

  1. map operates on one record at a time
val arr = Array("Tom","Bob","Tony","Jerry")

// distribute the 4 records across 2 partitions
val rdd = sc.parallelize(arr,2)

/*
 * Simulate writing each RDD element to a database
 */
rdd.map(x => {
  println("opening database connection...")
  println("writing to database...")
  println("closing database connection...")
  println()
}).count()

Output:

opening database connection...
writing to database...
closing database connection...

opening database connection...
writing to database...
closing database connection...

opening database connection...
writing to database...
closing database connection...

opening database connection...
writing to database...
closing database connection...
  2. mapPartitions operates on one partition at a time
val arr = Array("Tom","Bob","Tony","Jerry")
// distribute the 4 records across 2 partitions
val rdd = sc.parallelize(arr,2)

import scala.collection.mutable.ListBuffer

/*
 * Writing an RDD's data to a database is almost always done with mapPartitions,
 * so that one connection can serve an entire partition
 */
rdd.mapPartitions(x => {
  println("opening database connection...")
  val list = new ListBuffer[String]()
  while(x.hasNext) {
    // simulate writing to the database
    list += x.next() + ": written to database"
  }
  // simulate executing the SQL, inserting in one batch
  list.iterator
}).foreach(println)

Output:

opening database connection...
Tom: written to database
Bob: written to database
opening database connection...
Tony: written to database
Jerry: written to database
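The difference can be sketched in plain Scala without Spark: treat the RDD as a sequence of partitions and count how many connections each style would open (the partition layout below mirrors the example above and is an illustrative assumption):

```scala
// partitions of the example RDD: 2 partitions of 2 records each
val partitions = Seq(Seq("Tom", "Bob"), Seq("Tony", "Jerry"))

// map-style: work is done per record, so one connection per record
val perRecord = partitions.flatMap(_.map(_ => 1)).sum

// mapPartitions-style: work is done per partition, so one connection per partition
val perPartition = partitions.map(_ => 1).sum

println(s"map: $perRecord connections, mapPartitions: $perPartition connections")
```

This is why expensive per-partition setup (database connections, HTTP clients) belongs in mapPartitions rather than map.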
  3. mapPartitionsWithIndex operates per partition and also receives the partition index
import scala.collection.mutable.ListBuffer
val dataArr = Array("Tom01","Tom02","Tom03"
                  ,"Tom04","Tom05","Tom06"
                  ,"Tom07","Tom08","Tom09"
                  ,"Tom10","Tom11","Tom12")
val rdd = sc.parallelize(dataArr, 3)
val result = rdd.mapPartitionsWithIndex((index,x) => {
    val list = ListBuffer[String]()
    while (x.hasNext) {
      list += "partition:"+ index + " content:" + x.next
    }
    list.iterator
})
println("number of partitions: " + result.partitions.size)
val resultArr = result.collect()
for(x <- resultArr){
  println(x)
}

Output:

number of partitions: 3
partition:0 content:Tom01
partition:0 content:Tom02
partition:0 content:Tom03
partition:0 content:Tom04
partition:1 content:Tom05
partition:1 content:Tom06
partition:1 content:Tom07
partition:1 content:Tom08
partition:2 content:Tom09
partition:2 content:Tom10
partition:2 content:Tom11
partition:2 content:Tom12

(2) flatMap

val conf = new SparkConf().setAppName("FlatMapTest").setMaster("local")
val sc = new SparkContext(conf)
val data = Array("hello hadoop","hello hive", "hello spark")
val rdd = sc.makeRDD(data)

rdd.flatMap(_.split(" ")).foreach(println)
/*
Output:
hello
hadoop
hello
hive
hello
spark
*/
rdd.map(_.split(" ")).foreach(println)
/*
[Ljava.lang.String;@3c986196
[Ljava.lang.String;@113116a6
[Ljava.lang.String;@542d75a6
*/

The difference between map and flatMap:
map: each input record produces exactly one output record
flatMap: each input record may produce zero or more output records

The following Scala REPL session shows how map, flatMap, and flatten are related:

scala> val arr = Array("hello hadoop","hello hive","hello spark")
arr: Array[String] = Array(hello hadoop, hello hive, hello spark)

scala> val map = arr.map(_.split(" "))
map: Array[Array[String]] = Array(Array(hello, hadoop), Array(hello, hive), Array(hello, spark))

scala> map.flatten
res1: Array[String] = Array(hello, hadoop, hello, hive, hello, spark)

scala> arr.flatMap(_.split(" "))
res2: Array[String] = Array(hello, hadoop, hello, hive, hello, spark)
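The session above can be stated as a small law that holds for plain Scala collections as well as for RDDs: flatMap is map followed by flatten.

```scala
// law check: xs.flatMap(f) gives the same elements as xs.map(f).flatten
val arr = Array("hello hadoop", "hello hive", "hello spark")
val viaFlatMap = arr.flatMap(_.split(" "))
val viaMapThenFlatten = arr.map(_.split(" ")).flatten
assert(viaFlatMap.sameElements(viaMapThenFlatten))
```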

(3) filter: keep the records that match a predicate

val rdd = sc.makeRDD(Array("hello","hello","hello","world"))
// filter keeps the records for which the predicate returns true
rdd.filter(!_.contains("hello")).foreach(println)

Output:
world

(4) sample: random sampling

sample(withReplacement: Boolean, fraction: Double, seed: Long)  

withReplacement: whether to sample with replacement
    true: an element that has been drawn can be drawn again
    false: an element that has been drawn will not be drawn again
fraction: the expected fraction of the data to sample
seed: seed for the sampling's random number generator; different seeds give different samples. It can be set manually and defaults to a random Long.
val rdd = sc.makeRDD(Array(
  "hello1","hello2","hello3","hello4","hello5","hello6",
  "world1","world2","world3","world4"
))
rdd.sample(false, 0.3).foreach(println)

Output: in theory about 30% of the records are drawn, but on a small dataset the actual fraction can deviate noticeably

hello1
hello3
world3

(5) union: logically merge two RDDs (without deduplication)

val rdd1 = sc.makeRDD(1 to 3)
val rdd2 = sc.parallelize(4 until 6)
rdd1.union(rdd2).foreach(println)

Output:

1
2
3
4
5

(6) intersection: the intersection of two RDDs

val rdd1 = sc.makeRDD(1 to 3)
val rdd2 = sc.parallelize(2 to 5)

rdd1.intersection(rdd2).foreach(println)

Output:
2
3

(7) sortBy and sortByKey

  1. sortBy: sort by a field you specify
val arr = Array(
        Tuple3(190,100,"Jed"),
        Tuple3(100,202,"Tom"),
        Tuple3(90,111,"Tony")
    )
val rdd = sc.parallelize(arr)
rdd.sortBy(_._1).foreach(println)
/* sorted by the first element (ascending by default)
   (90,111,Tony)
   (100,202,Tom)
   (190,100,Jed)
 */

rdd.sortBy(_._2, false).foreach(println)
/* sorted by the second element, descending
   (100,202,Tom)
   (90,111,Tony)
   (190,100,Jed)
 */

rdd.sortBy(_._3).foreach(println)
/* sorted by the third element
   (190,100,Jed)
   (100,202,Tom)
   (90,111,Tony)
 */

  2. sortByKey: sort by key
val rdd = sc.makeRDD(Array(
      (5,"Tom"),(10,"Jed"),(3,"Tony"),(2,"Jack")
    ))
rdd.sortByKey().foreach(println)

Output:

(2,Jack)
(3,Tony)
(5,Tom)
(10,Jed)

(8) groupByKey and reduceByKey

val rdd = sc.makeRDD(Array(
      ("Tom",1),("Tom",1),("Tony",1),("Tony",1)
    ))

rdd.groupByKey().foreach(println)
/*
(Tom,CompactBuffer(1, 1))
(Tony,CompactBuffer(1, 1))
*/

rdd.reduceByKey(_+_).foreach(println)
/*
(Tom,2)
(Tony,2)
*/
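What the two operators compute can be sketched with plain Scala collections; note that in Spark, reduceByKey also combines values map-side before the shuffle, so it usually moves far less data than groupByKey:

```scala
val pairs = Seq(("Tom", 1), ("Tom", 1), ("Tony", 1), ("Tony", 1))

// groupByKey: collect every value per key (all values cross the shuffle)
val grouped = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }

// reduceByKey: fold the values per key with the supplied function
val reduced = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).reduce(_ + _)) }
// reduced contains Tom -> 2 and Tony -> 2
```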

(9) distinct: remove duplicate records

val rdd = sc.makeRDD(Array(
      "hello",
      "hello",
      "hello",
      "world"
    ))

rdd.distinct().foreach(println)
/*
hello
world
*/

// distinct = map + reduceByKey + map
val distinctRDD = rdd
  .map {(_,1)}
  .reduceByKey(_+_)
  .map(_._1)
distinctRDD.foreach(println)
/*
hello
world
*/
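The decomposition can be checked with plain Scala collections: pairing each element with 1, reducing per key, and keeping the keys yields exactly the distinct elements.

```scala
val data = Seq("hello", "hello", "hello", "world")
// map + reduce-per-key + keep keys, mirroring map + reduceByKey + map
val viaReduce = data.map((_, 1)).groupBy(_._1).map { case (k, vs) => (k, vs.size) }.keys.toSet
assert(viaReduce == data.distinct.toSet)
```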

(10) coalesce, repartition: change the number of partitions of an RDD

  1. coalesce
/*
 * coalesce(numPartitions, shuffle)
 * shuffle = false: no shuffle is performed
 * shuffle = true: a shuffle is performed
 * If the target partition count is larger than the current one, shuffle must be true; otherwise the partition count stays unchanged
 * When increasing partitions, the records of the original partitions are redistributed across the new partitions
 * shuffle defaults to false
 */
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable.ListBuffer

object CoalesceTest {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CoalesceTest").setMaster("local")
    val sc = new SparkContext(conf)
    val arr = Array(
      "partition:0 content:Tom01",
      "partition:0 content:Tom02",
      "partition:0 content:Tom03",
      "partition:0 content:Tom04",
      "partition:1 content:Tom05",
      "partition:1 content:Tom06",
      "partition:1 content:Tom07",
      "partition:1 content:Tom08",
      "partition:2 content:Tom09",
      "partition:2 content:Tom10",
      "partition:2 content:Tom11",
      "partition:2 content:Tom12")

    val rdd = sc.parallelize(arr, 3)

    val coalesceRdd = rdd.coalesce(6,true)

    val results = coalesceRdd.mapPartitionsWithIndex((index,x) => {
      val list = ListBuffer[String]()
      while (x.hasNext) {
        list += "partition:"+ index + " content:[" + x.next + "]"
      }
      list.iterator
    })

    println("number of partitions: " + results.partitions.size)
    results.foreach(println)
    /*
    number of partitions: 6
    partition:0 content:[partition:1 content:Tom07]
    partition:0 content:[partition:2 content:Tom10]
    partition:1 content:[partition:0 content:Tom01]
    partition:1 content:[partition:1 content:Tom08]
    partition:1 content:[partition:2 content:Tom11]
    partition:2 content:[partition:0 content:Tom02]
    partition:2 content:[partition:2 content:Tom12]
    partition:3 content:[partition:0 content:Tom03]
    partition:4 content:[partition:0 content:Tom04]
    partition:4 content:[partition:1 content:Tom05]
    partition:5 content:[partition:1 content:Tom06]
    partition:5 content:[partition:2 content:Tom09]
    */

    // increasing the partition count always requires a shuffle; with shuffle = false the partitioning stays unchanged
    val coalesceRdd2 = rdd.coalesce(6,false)
    val results2 = coalesceRdd2.mapPartitionsWithIndex((index,x) => {
      val list = ListBuffer[String]()
      while (x.hasNext) {
        list += "partition:"+ index + " content:[" + x.next + "]"
      }
      list.iterator
    })

    println("number of partitions: " + results2.partitions.size)
    results2.foreach(println)
    /*
    number of partitions: 3
    partition:0 content:[partition:0 content:Tom01]
    partition:0 content:[partition:0 content:Tom02]
    partition:0 content:[partition:0 content:Tom03]
    partition:0 content:[partition:0 content:Tom04]
    partition:1 content:[partition:1 content:Tom05]
    partition:1 content:[partition:1 content:Tom06]
    partition:1 content:[partition:1 content:Tom07]
    partition:1 content:[partition:1 content:Tom08]
    partition:2 content:[partition:2 content:Tom09]
    partition:2 content:[partition:2 content:Tom10]
    partition:2 content:[partition:2 content:Tom11]
    partition:2 content:[partition:2 content:Tom12]
    */

    val coalesceRdd3 = rdd.coalesce(2,true)
    val results3 = coalesceRdd3.mapPartitionsWithIndex((index,x) => {
      val list = ListBuffer[String]()
      while (x.hasNext) {
        list += "partition:"+ index + " content:[" + x.next + "]"
      }
      list.iterator
    })

    println("number of partitions: " + results3.partitions.size)
    results3.foreach(println)
    /*
    number of partitions: 2
    partition:0 content:[partition:0 content:Tom01]
    partition:0 content:[partition:0 content:Tom03]
    partition:0 content:[partition:1 content:Tom05]
    partition:0 content:[partition:1 content:Tom07]
    partition:0 content:[partition:2 content:Tom09]
    partition:0 content:[partition:2 content:Tom11]
    partition:1 content:[partition:0 content:Tom02]
    partition:1 content:[partition:0 content:Tom04]
    partition:1 content:[partition:1 content:Tom06]
    partition:1 content:[partition:1 content:Tom08]
    partition:1 content:[partition:2 content:Tom10]
    partition:1 content:[partition:2 content:Tom12]
    */

    val coalesceRdd4 = rdd.coalesce(2,false)
    val results4 = coalesceRdd4.mapPartitionsWithIndex((index,x) => {
      val list = ListBuffer[String]()
      while (x.hasNext) {
        list += "partition:"+ index + " content:[" + x.next + "]"
      }
      list.iterator
    })

    println("number of partitions: " + results4.partitions.size)
    results4.foreach(println)
    /*
    number of partitions: 2
    partition:0 content:[partition:0 content:Tom01]
    partition:0 content:[partition:0 content:Tom02]
    partition:0 content:[partition:0 content:Tom03]
    partition:0 content:[partition:0 content:Tom04]
    partition:1 content:[partition:1 content:Tom05]
    partition:1 content:[partition:1 content:Tom06]
    partition:1 content:[partition:1 content:Tom07]
    partition:1 content:[partition:1 content:Tom08]
    partition:1 content:[partition:2 content:Tom09]
    partition:1 content:[partition:2 content:Tom10]
    partition:1 content:[partition:2 content:Tom11]
    partition:1 content:[partition:2 content:Tom12]
    */
  }

}



  2. repartition
repartition(n) is equivalent to coalesce(n, shuffle = true)
  3. partitionBy: repartition with a custom partitioner
package com.aura.transformations

import org.apache.spark.{Partitioner, SparkConf, SparkContext}

/**
  * Author: Jed
  * Description: a custom partitioning rule
  * Date: created 2018/1/12
  */
class MyPartition extends Partitioner {

  // the number of partitions is 2
  override def numPartitions: Int = 2

  // custom rule: even hash codes go to partition 0, odd ones to partition 1
  override def getPartition(key: Any): Int = {
    if(key.hashCode() % 2 == 0) {
      0
    } else {
      1
    }
  }
}

object PartitionByTest {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PartitionByTest").setMaster("local")
    val sc = new SparkContext(conf)

    val arr = Array((1,1),(2,2),(3,3),(4,4),(5,5),(6,6),(7,7),(8,8),(9,9))
    val rdd = sc.makeRDD(arr,3)
    println("number of partitions: " + rdd.partitions.length)
    rdd.foreachPartition(x => {
      println("*******")
      while(x.hasNext) {
        println(x.next())
      }
    })
    /*
    number of partitions: 3
    *******
    (1,1)
    (2,2)
    (3,3)
    *******
    (4,4)
    (5,5)
    (6,6)
    *******
    (7,7)
    (8,8)
    (9,9)
     */

    val partitionRDD = rdd.partitionBy(new MyPartition)
    println("number of partitions: " + partitionRDD.partitions.length)
    partitionRDD.foreachPartition(x => {
      println("*******")
      while(x.hasNext) {
        println(x.next())
      }
    })
    /*
    number of partitions: 2
    *******
    (2,2)
    (4,4)
    (6,6)
    (8,8)
    *******
    (1,1)
    (3,3)
    (5,5)
    (7,7)
    (9,9)
     */
  }

}
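The routing rule in MyPartition can be checked without Spark: for Int keys, hashCode is the value itself, so even keys land in partition 0 and odd keys in partition 1, matching the output above.

```scala
// plain-Scala check of MyPartition's getPartition rule
def getPartition(key: Any): Int = if (key.hashCode() % 2 == 0) 0 else 1

val byPartition = (1 to 9).groupBy(getPartition)
println(byPartition(0))  // the even keys 2, 4, 6, 8
println(byPartition(1))  // the odd keys 1, 3, 5, 7, 9
```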