This article gives a detailed introduction to Spark broadcast variables and is well worth reading:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-broadcast.html
Only its Example section is excerpted here.
Let’s start with an introductory example to check out how to use broadcast variables and build your initial understanding.
You’re going to use a static mapping of interesting projects with their websites, i.e. Map[String, String]
that the tasks, i.e. closures (anonymous functions) in transformations, use.
scala> val pws = Map("Apache Spark" -> "http://spark.apache.org/", "Scala" -> "http://www.scala-lang.org/")
pws: scala.collection.immutable.Map[String,String] = Map(Apache Spark -> http://spark.apache.org/, Scala -> http://www.scala-lang.org/)
scala> val websites = sc.parallelize(Seq("Apache Spark", "Scala")).map(pws).collect
...
websites: Array[String] = Array(http://spark.apache.org/, http://www.scala-lang.org/)
It works, but is very inefficient: the pws map is serialized and sent over the wire to executors with every task, while it could have been there already. If more tasks needed the pws map, you could improve their performance by minimizing the number of bytes sent over the network for task execution.
Enter broadcast variables.
val pwsB = sc.broadcast(pws)
val websites = sc.parallelize(Seq("Apache Spark", "Scala")).map(pwsB.value).collect
// websites: Array[String] = Array(http://spark.apache.org/, http://www.scala-lang.org/)
Semantically, the two computations - with and without the broadcast value - are exactly the same, but the broadcast-based one wins performance-wise when more executors are spawned to execute many tasks that use the pws map.
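A broadcast value can also be released explicitly once it is no longer needed. Below is a minimal sketch of the full lifecycle, assuming a live SparkContext named sc (for example inside spark-shell); it is an illustration under those assumptions, not part of the excerpted article.

```scala
// Sketch of the broadcast lifecycle; assumes a running SparkContext `sc`
// (e.g. inside spark-shell).
val pws = Map("Apache Spark" -> "http://spark.apache.org/",
              "Scala"        -> "http://www.scala-lang.org/")

// Ship the map to executors once; it is cached there as a read-only value.
val pwsB = sc.broadcast(pws)

// Tasks read it through .value instead of capturing `pws` in the closure.
val websites = sc.parallelize(Seq("Apache Spark", "Scala")).map(pwsB.value).collect()

// Drop the cached copies on executors; the driver keeps the value, so it can
// be re-shipped lazily if a later task reads it again.
pwsB.unpersist()

// Or remove the value everywhere once it is truly no longer needed;
// reading pwsB.value after destroy() fails.
pwsB.destroy()
```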
Summary
From this article we learn that an ordinary variable defined in the driver can also be passed to different tasks, but only by shipping a separate copy with each task. To improve performance, we define a broadcast variable instead: a single read-only copy is created on each machine and shared by all the tasks running there.