- Why tuning matters
Because of the in-memory nature of most Spark computations,
Spark programs can be bottlenecked by any resource in the cluster:
CPU, network bandwidth, or memory.
Most often, if the data fits in memory, the bottleneck is network bandwidth;
but sometimes you also need to do some tuning, such as storing RDDs
in serialized form, to decrease memory usage.
- Data serialization
2.1 Why serialization is needed
(1) Data sent over the network must be serialized
(2) Inter-process communication also requires serialization
2.2 Example
2.2.1 The Writable interface
2.2.1.1 Javadoc
A serializable object which implements a simple, efficient,
serialization protocol.
2.2.1.2 Code
public interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}
2.2.1.3 Notes
Any class that implements Writable can be serialized, which is why Hadoop recommends using IntWritable rather than a plain int.
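As a JDK-only sketch of the idea (Hadoop's real interface lives in org.apache.hadoop.io; it is redeclared here so the snippet compiles on its own, and MyIntWritable is a made-up IntWritable-style wrapper):

```java
import java.io.*;

// Redeclaration of the Writable contract shown above, so this file
// compiles without Hadoop on the classpath.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// A minimal IntWritable-style wrapper: serializes as exactly 4 bytes,
// with no class metadata in the stream.
class MyIntWritable implements Writable {
    private int value;

    MyIntWritable() {}
    MyIntWritable(int value) { this.value = value; }

    public int get() { return value; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(value);                 // fixed-size, 4 bytes
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        value = in.readInt();
    }
}

public class WritableDemo {
    public static void main(String[] args) throws IOException {
        // serialize into an in-memory buffer
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new MyIntWritable(42).write(new DataOutputStream(buf));
        System.out.println("bytes: " + buf.size());   // 4

        // deserialize back
        MyIntWritable back = new MyIntWritable();
        back.readFields(new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray())));
        System.out.println("value: " + back.get());   // 42
    }
}
```

The compactness is the point: the write/readFields pair encodes only the payload, which is what makes Writable cheap on the wire.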
2.3 Where serialization is used in Spark
(1) External variables referenced inside operators (e.g. broadcast variables) must travel over the network, so they must be serialized; a better serialization format produces smaller output, which reduces both the network-transfer overhead and the memory footprint when the data is kept in memory
(2) cache
(3) shuffle
2.4 Serialization in detail (from the official tuning guide)
Serialization plays an important role
in the performance of any distributed application.
Formats that are slow to serialize objects into,
or consume a large number of bytes,
will greatly slow down the computation.
Spark provides two serialization libraries:
Java serialization (advantage: works with any class):
By default, Spark serializes objects using Java's ObjectOutputStream framework,
and can work with any class you create that
implements java.io.Serializable.
// Javadoc of java.io.ObjectOutputStream
An ObjectOutputStream writes primitive data types
and graphs of Java objects to an OutputStream.
//java.io.Serializable
public interface Serializable {
}
java.io.Serializable is a marker interface:
any class that implements it can be serialized.
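A minimal round-trip through ObjectOutputStream/ObjectInputStream (Point is a made-up example class; implementing the empty marker interface is all it takes to opt in):

```java
import java.io.*;

// A plain value class; implementing java.io.Serializable is the only
// requirement for default Java serialization.
class Point implements Serializable {
    private static final long serialVersionUID = 1L;
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
}

public class JavaSerDemo {
    public static void main(String[] args) throws Exception {
        // serialize: object graph -> bytes
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(buf)) {
            oos.writeObject(new Point(3, 4));
        }

        // deserialize: bytes -> object graph
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(buf.toByteArray()))) {
            Point p = (Point) ois.readObject();
            System.out.println(p.x + "," + p.y);   // 3,4
        }
    }
}
```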
You can also control the performance of your serialization
more closely by extending java.io.Externalizable
(though this is rarely used in practice).
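A sketch of what that manual control looks like (Temperature is a hypothetical class; note that Externalizable requires a public no-arg constructor, and you write and read every field yourself):

```java
import java.io.*;

// With Externalizable you take over serialization of the fields:
// only what writeExternal emits goes into the stream.
class Temperature implements Externalizable {
    private double celsius;

    public Temperature() {}                  // public no-arg ctor is required
    public Temperature(double c) { this.celsius = c; }
    public double get() { return celsius; }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeDouble(celsius);            // raw field only, no field names
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        celsius = in.readDouble();
    }
}

public class ExternalizableDemo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(buf)) {
            oos.writeObject(new Temperature(36.6));
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(buf.toByteArray()))) {
            Temperature t = (Temperature) ois.readObject();
            System.out.println(t.get());     // 36.6
        }
    }
}
```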
Java serialization is flexible but often quite slow,
and leads to large serialized formats for many classes.
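To make the "large serialized formats" point concrete, this sketch serializes a single boxed Integer with default Java serialization; the stream carries the full class name, serialVersionUID, and field metadata, so the payload is many times larger than the 4 raw bytes of the int itself (the exact size may vary by JDK):

```java
import java.io.*;

public class SizeDemo {
    public static void main(String[] args) throws IOException {
        // Default Java serialization of a single boxed Integer.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(buf)) {
            oos.writeObject(Integer.valueOf(42));
        }
        // Class metadata dwarfs the 4 bytes of payload.
        System.out.println("serialized size: " + buf.size() + " bytes");
    }
}
```

This per-object metadata overhead is exactly what a compact serializer such as Kryo avoids.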
Kryo serialization (advantages: fast and compact):
Spark can also use the Kryo library (version 4)
to serialize objects more quickly.
Kryo is significantly faster and more compact than Java serialization (often as much as 10x),
but it does not support all Serializable types and requires you
to register the classes you'll use in the program
in advance for best performance.
You can switch to using Kryo by initializing your job with a SparkConf and
calling conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").
val sparkConf = new SparkConf().
  setMaster("local[2]").
  setAppName("SparkContextApp").
  set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
This setting configures the serializer used not only for shuffling data
between worker nodes but also when serializing RDDs to disk.
The only reason Kryo is not the default is
the custom registration requirement,
but we recommend trying it in any network-intensive application.
Since Spark 2.0.0, Spark internally uses the Kryo serializer
when shuffling RDDs of simple types,
arrays of simple types, or string type.
Finally, if you don't register your custom classes,
Kryo will still work, but it will have to store
the full class name with each object, which is wasteful.
2.5 Using Kryo in a program
// define the class to be registered ("abc" is a placeholder name)
class abc

val sparkConf = new SparkConf().
  setMaster("local[2]").
  setAppName("SparkContextApp").
  set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").
  registerKryoClasses(Array(classOf[abc]))
2.6 Concrete example