1. Common Functions
- takeWhile
# takeWhile takes elements from the start of the collection while they satisfy the predicate, stopping at the first element that does not
val s1 = List(1,2,3,4,10,20,30,40,5,6,7,8,50,60,70,80)
val r1 = s1.takeWhile( _ < 10)
r1: List[Int] = List(1, 2, 3, 4)
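For comparison, filter scans the whole list, while takeWhile stops at the first failing element:
val r2 = s1.filter( _ < 10)
r2: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8)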
- drop (on Iterator and other collections)
val it = List.range(0, 10, 2).map {i => i.toString}
it.drop(1).zip(it.dropRight(1))
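Each element gets paired with its predecessor; the result, sketched:
// it                              = List(0, 2, 4, 6, 8)   (elements are Strings)
// it.drop(1).zip(it.dropRight(1)) = List((2,0), (4,2), (6,4), (8,6))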
- List
# Adding elements to a List
it3 :+ (1000,2000)    # append to the end
(1000,2000) :: it3    # prepend to the head (:: is right-associative, so the new element goes on the left)
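A quick REPL check (it3 is assumed here to be a small list of tuples):
scala> val it3 = List((1,2), (3,4))
it3: List[(Int, Int)] = List((1,2), (3,4))
scala> it3 :+ (1000,2000)
res0: List[(Int, Int)] = List((1,2), (3,4), (1000,2000))
scala> (1000,2000) :: it3
res1: List[(Int, Int)] = List((1000,2000), (1,2), (3,4))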
- reduceByKey
# reduce does not combine values in the order produced by the preceding map; groupBy / groupByKey can be used instead when order matters
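A minimal sketch (assuming a running SparkContext sc): reduceByKey merges the values of each key in no guaranteed order, so the merge function should be commutative and associative; groupByKey keeps all values so they can be ordered explicitly:
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3), ("a", 4)))
val sums   = pairs.reduceByKey(_ + _)                       // order-insensitive aggregation
val sorted = pairs.groupByKey().mapValues(_.toList.sorted)  // collect values, then order explicitly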
- Differences: cogroup | join | groupByKey
Join() returns a dataset of [key, leftValue, rightValue], where [key, leftValue] comes from one dataset and [key, rightValue] from the other.
CoGroup() returns a dataset of [key, leftValues, rightValues], where the [key, leftValue] entries from one dataset are grouped together into [key, leftValues], the [key, rightValue] entries from the other dataset are grouped into [key, rightValues], and the two grouped entries are combined into [key, leftValues, rightValues].
GroupByKey() returns a dataset of [key, values], where the [key, value] entries of a single dataset are grouped together.
Join(), GroupByKey() and CoGroup() all depend on Partition(). Both input datasets should be partitioned by the same key and into the same number of shards; otherwise a relatively costly partitioning will be performed.
A join is implemented as a cogroup followed by a flatMap (see the diagram in the referenced article and the sketch below).
Source: join(otherRDD, numPartitions)
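A minimal sketch of the three operations (assuming a SparkContext sc; outputs written schematically, Spark actually prints CompactBuffer for grouped values):
val left  = sc.parallelize(Seq((1, "a"), (1, "b"), (2, "c")))
val right = sc.parallelize(Seq((1, "x"), (3, "y")))
left.join(right).collect()     // Array((1,(a,x)), (1,(b,x)))                          -- only matching keys, one row per pair
left.cogroup(right).collect()  // Array((1,([a, b],[x])), (2,([c],[])), (3,([],[y])))  -- every key, both value groups
left.groupByKey().collect()    // Array((1,[a, b]), (2,[c]))                           -- single dataset, values grouped per key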
- broadcast
# After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that v is not shipped to the nodes more than once.
# In addition, the object v should not be modified after it is broadcast, in order to ensure that all nodes get the same value of the broadcast variable
# (e.g. if the variable is shipped to a new node later).
# In other words: broadcast variables let the programmer cache a read-only variable on each machine, rather than shipping a copy with every task.
# They can be used to give every node a copy of a large input dataset efficiently.
# Spark also tries to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
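A minimal usage sketch (with a hypothetical lookup table): reference .value inside the closure instead of closing over the raw local variable, so the data is shipped to each executor only once:
scala> val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two", 3 -> "three"))
scala> sc.parallelize(Seq(1, 2, 3)).map(i => lookup.value.getOrElse(i, "unknown")).collect()
res1: Array[String] = Array(one, two, three)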
- foldLeft
Accumulates from the left.
0 is the initial value (numbers is a List[Int]); m acts as the accumulator.
Watch the execution directly:
scala> val numbers = (1 to 10).toList
numbers: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> numbers.foldLeft(0) { (m: Int, n: Int) => println("m: " + m + " n: " + n); m + n }
m: 0 n: 1
m: 1 n: 2
m: 3 n: 3
m: 6 n: 4
m: 10 n: 5
m: 15 n: 6
m: 21 n: 7
m: 28 n: 8
m: 36 n: 9
m: 45 n: 10
res0: Int = 55
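Because the accumulator is threaded from left to right, foldLeft with a non-commutative operation (here, building a list by prepending) reverses the input; a quick sketch:
scala> numbers.foldLeft(List.empty[Int])((acc, n) => n :: acc)
res1: List[Int] = List(10, 9, 8, 7, 6, 5, 4, 3, 2, 1)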
- Option
scala> val myMap: Map[String, (String, Boolean)] = Map("key1" -> ("value", true))
myMap: Map[String,(String, Boolean)] = Map(key1 -> (value,true))
scala> val vs = myMap.get("key1")
vs: Option[(String, Boolean)] = Some((value,true))
# The value above is wrapped in a tuple; ways to extract the tuple's contents are shown below
# Method 1:
val (v2, s2) = vs match {
case Some((v,s)) => (v, s)
case _ => ("null", "null")
}
# Method 2:
# If the Option is empty, map is simply not executed, but the call chained after map (getOrElse) still runs:
val (v2, s2) = vs.map { case (s, b) => (s, b.toString)}.getOrElse((null, null))
# val (v2, s2) = vs.map { case (s, b) => (s, b.toString)}.getOrElse(("null", "null"))
# Note: in Method 2, null is not an actual String value, so s2 cannot later call String methods. The following examples on null and its type conversions help clarify this.
# null cannot call toString, but None can
scala> null.toString
java.lang.NullPointerException
scala> None.toString
res42: String = None
# The type of null, and how it behaves:
scala> "null"
res38: String = null
scala> null
res39: Null = null
scala> null.asInstanceOf[String]
res40: String = null
scala> Array("a",null).mkString(",")
res41: String = a,null
- Option[Boolean]
scala> val myMap: Map[String, Boolean] = Map("key1" -> true)
myMap: Map[String,Boolean] = Map(key1 -> true)
scala> val myMap2 = myMap + ("k2" -> false)
// Note the difference in the return types below
scala> myMap2.get("k8").map(_.toString).getOrElse(null)
res160: String = null
scala> myMap2.get("k8").getOrElse(null)
res161: Any = null
- HashMap
scala> import scala.collection.mutable
import scala.collection.mutable
scala> val map1 = mutable.HashMap[String, String]()
map1: scala.collection.mutable.HashMap[String,String] = Map()
scala> map1.put("a1","aa1")
res104: Option[String] = None
scala> map1
res105: scala.collection.mutable.HashMap[String,String] = Map(a1 -> aa1)
scala> map1("a2") = "aa2"
scala> map1
res108: scala.collection.mutable.HashMap[String,String] = Map(a1 -> aa1, a2 -> aa2)
- immutable.Map
// myMap is immutable, i.e. a Map that cannot be modified in place; elements cannot be added to it
scala> val myMap = Map("k1" -> true)
myMap: scala.collection.immutable.Map[String,Boolean] = Map(k1 -> true)
// but an immutable Map can be combined with other entries via +, which returns a new Map
scala> val myMap2 = myMap + ("k2" -> false)
myMap2: scala.collection.immutable.Map[String,Boolean] = Map(k1 -> true, k2 -> false)
scala> myMap2.get("k8").isEmpty
res147: Boolean = true
- sortBy | sortByKey | top
Reference: Spark: sortBy和sortByKey函數(shù)詳解
// sortBy
// create a small RDD locally to test the function
scala> val data = List(3,1,90,3,5,12)
data: List[Int] = List(3, 1, 90, 3, 5, 12)
scala> val rdd = sc.parallelize(data)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:14
scala> rdd.collect
res0: Array[Int] = Array(3, 1, 90, 3, 5, 12)
scala> rdd.sortBy(x => x).collect
res1: Array[Int] = Array(1, 3, 3, 5, 12, 90)
scala> rdd.sortBy(x => x, false).collect
res3: Array[Int] = Array(90, 12, 5, 3, 3, 1)
scala> val result = rdd.sortBy(x => x, false)
result: org.apache.spark.rdd.RDD[Int] = MappedRDD[23] at sortBy at <console>:16
// the default number of partitions (6 here)
scala> result.partitions.size
res9: Int = 6
// the number of partitions can also be set explicitly
scala> val result = rdd.sortBy(x => x, false, 1)
result: org.apache.spark.rdd.RDD[Int] = MappedRDD[26] at sortBy at <console>:16
scala> result.partitions.size
res10: Int = 1
// sortByKey
scala> val a = sc.parallelize(List("wyp", "iteblog", "com", "397090770", "test"), 2)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[84] at parallelize at <console>:25
scala> val b = sc.parallelize(1 to a.count.toInt, 2)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[86] at parallelize at <console>:27
scala> b.collect
res60: Array[Int] = Array(1, 2, 3, 4, 5)
scala> val c = b.zip(a)
c: org.apache.spark.rdd.RDD[(Int, String)] = ZippedPartitionsRDD2[92] at zip at <console>:29
scala> a.zip(b).sortByKey().collect
res61: Array[(String, Int)] = Array((397090770,4), (com,3), (iteblog,2), (test,5), (wyp,1))
// top returns the top N elements in descending key order; note that top does not need a prior sortBy, it sorts internally
scala> c.top(3)
res63: Array[(Int, String)] = Array((5,test), (4,397090770), (3,com))
// sortByKey sorts in ascending order by default
scala> c.sortByKey().collect
res64: Array[(Int, String)] = Array((1,wyp), (2,iteblog), (3,com), (4,397090770), (5,test))
scala> c.sortByKey(false).collect
res66: Array[(Int, String)] = Array((5,test), (4,397090770), (3,com), (2,iteblog), (1,wyp))
// Note: for an RDD of (k, v) pairs, top cannot be used when v contains an Array (no implicit Ordering is defined for it), but a plain numeric value is fine
scala> val rdd2 = sc.parallelize(List((10, ("a", Array(1,2))), (9, ("b", Array(3,5))), (1, ("c", Array(6,0)))))
rdd2: org.apache.spark.rdd.RDD[(Int, (String, Array[Int]))] = ParallelCollectionRDD[134] at parallelize at <console>:26
scala> rdd2.top(1)
<console>:29: error: No implicit Ordering defined for (Int, (String, Array[Int])).
rdd2.top(1)
// but a plain numeric value is fine
scala> val rdd2 = sc.parallelize(List((10, ("a", 11)), (9, ("b", 10)), (100, ("c", 20))))
rdd2: org.apache.spark.rdd.RDD[(Int, (String, Int))] = ParallelCollectionRDD[138] at parallelize at <console>:26
scala> rdd2.top(2)
res209: Array[(Int, (String, Int))] = Array((100,(c,20)), (10,(a,11)))
- Array.grouped
// split an array into fixed-size groups:
scala> val a = (1 to 9).toArray
a: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
scala> a.grouped(3).toArray
res178: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9))
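When the length is not a multiple of the group size, the last group is simply shorter:
scala> (1 to 10).toArray.grouped(3).toArray
res179: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9), Array(10))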
- zip
// Before (repeated string splitting and manual indexing):
reduceByKey{case ((s1, c1), (s2, c2)) =>
  val n1 = s1.split("\t")(0).toLong + s2.split("\t")(0).toLong
  val n2 = s1.split("\t")(1).toLong + s2.split("\t")(1).toLong
  val n3 = s1.split("\t")(2).toLong + s2.split("\t")(2).toLong
  val n4 = s1.split("\t")(3).toLong + s2.split("\t")(3).toLong
  val n5 = s1.split("\t")(4).toLong + s2.split("\t")(4).toLong
  val statusTrueNumStr = Array(n1, n2, n3, n4, n5).mkString("\t")
  val count = c1 + c2
  (statusTrueNumStr, count)
}
// After switching to zip:
val rddLastOneWeek2 = rddLastOneWeek.map{case (_, bigVersion, arrStatusTrueNum, isStable, count) =>
  ((bigVersion, isStable), (arrStatusTrueNum, count))
}.reduceByKey{case ((arr1, count1), (arr2, count2)) =>
  val arr = arr1.zip(arr2).map{case (x, y) => x + y}
  val count = count1 + count2
  (arr, count)
}
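The core trick in isolation, as a minimal local sketch (no Spark needed): zip pairs the two arrays element by element, then map sums each pair:
scala> val arr1 = Array(1L, 2L, 3L)
scala> val arr2 = Array(10L, 20L, 30L)
scala> arr1.zip(arr2).map { case (x, y) => x + y }
res0: Array[Long] = Array(11, 22, 33)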
- zipWithIndex
scala> val l = List(1, 2, 3, 4)
l: List[Int] = List(1, 2, 3, 4)
scala> l.zipWithIndex
res22: List[(Int, Int)] = List((1,0), (2,1), (3,2), (4,3))
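A common use is filtering by position; a small sketch keeping only the even-indexed elements:
scala> l.zipWithIndex.collect { case (x, i) if i % 2 == 0 => x }
res23: List[Int] = List(1, 3)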