Spark can create distributed datasets from any file stored in the Hadoop distributed filesystem (HDFS) or other storage systems supported by the Hadoop APIs (including your local filesystem, Amazon S3, Cassandra, Hive, HBase, etc.).
Spark supports text files, SequenceFiles, Avro, Parquet, and any other Hadoop InputFormat.
-
Spark Core
Resilient Distributed Datasets (RDDs): A Fault-Tolerant Abstraction for In-Memory Cluster Computing. RDDs represent a collection of items distributed across many compute nodes that can be manipulated in parallel.
In Spark all work is expressed as either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result. Under the hood, Spark automatically distributes the data contained in RDDs.
Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster.
Create RDDs in two ways: by loading an external dataset, or by distributing a collection of objects (e.g., a list or set) in the driver program, e.g., loading a text file as an RDD of strings using SparkContext.textFile().
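As a minimal sketch of both creation paths (the local[*] master and the README.md path are placeholders, not from the original notes):

    import org.apache.spark.{SparkConf, SparkContext}

    object CreateRDDs {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("create-rdds").setMaster("local[*]"))

        // 1) Distribute an in-memory collection of objects from the driver program.
        val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

        // 2) Load an external dataset, e.g. a text file, as an RDD of strings.
        //    "README.md" is a placeholder path.
        val lines = sc.textFile("README.md")

        println(numbers.count())  // 5
        println(lines.count())    // number of lines in the file

        sc.stop()
      }
    }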
Often we do some initial ETL (extract, transform, and load) to get our data into a key/value format. Key/value RDDs expose new operations (e.g., counting up reviews for each product, grouping together data with the same key, and grouping together two different RDDs).
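A hedged sketch of such an ETL step; the input layout ("productId,userId,rating") and the reviews.csv name are assumptions for illustration, and sc is an existing SparkContext as in the Spark shell:

    // Parse raw lines into (productId, 1) pairs and count reviews per product.
    val reviewCountsByProduct = sc.textFile("reviews.csv")
      .map(line => line.split(","))
      .map(fields => (fields(0), 1))   // key/value pair: (productId, 1)
      .reduceByKey(_ + _)              // count reviews for each product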
Advanced feature: partitioning lets users control the layout of pair RDDs across nodes (used, for example, by the PageRank algorithm); this can bring a significant speedup.
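A brief sketch of that control using partitionBy with a HashPartitioner; the links pair RDD and the choice of 100 partitions are assumptions:

    import org.apache.spark.HashPartitioner

    // Assumes an existing pair RDD `links` of (pageId, outlinks), as in PageRank.
    val partitionedLinks = links
      .partitionBy(new HashPartitioner(100))  // co-locate all values for a key in one partition
      .persist()                              // keep the partitioned data around for reuse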
-
Pair RDDs
Spark provides special operations on RDDs containing key/value pairs. Pair RDDs are a useful building block in many programs, as they expose operations that allow you to act on each key in parallel or regroup data across the network. For example, pair RDDs have a reduceByKey() method that can aggregate data separately for each key, and a join() method that can merge two RDDs together by grouping elements with the same key.
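A small sketch of both operations (the sample data is made up; sc is an existing SparkContext as in the Spark shell):

    val ratings = sc.parallelize(Seq(("apple", 4), ("pear", 5), ("apple", 2)))
    val prices  = sc.parallelize(Seq(("apple", 1.20), ("pear", 0.80)))

    // Aggregate the values separately for each key.
    val ratingSums = ratings.reduceByKey(_ + _)   // ("apple", 6), ("pear", 5)

    // Merge two RDDs by grouping elements with the same key.
    val joined = ratingSums.join(prices)          // ("apple", (6, 1.2)), ("pear", (5, 0.8))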
There are a number of ways to get pair RDDs in Spark.
Pair RDDs are also still RDDs (of Tuple2 objects in Java/Scala or of Python tuples).
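Two common ways, as a sketch (the lines RDD and scores data are carried over from the earlier hypothetical examples):

    // From a regular RDD, e.g. keying each line by its first word:
    val pairs = lines.map(line => (line.split(" ")(0), line))

    // Or directly from an in-memory collection of tuples:
    val scores = sc.parallelize(Seq(("alice", 10), ("bob", 7)))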
Access only the value part of our pair RDD: Spark provides the mapValues(func) function, which is the same as map{case (x, y) => (x, func(y))}.
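For instance, a tiny sketch reusing the hypothetical scores pair RDD from above:

    // Apply a function to each value, leaving the keys (and any partitioner) untouched.
    val bumped = scores.mapValues(v => v + 1)   // ("alice", 11), ("bob", 8)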
Spark's "distributed reduce" transformations operate on RDDs of key/value pairs.
Aggregations
We have looked at the fold(), combine(), and reduce() actions on basic RDDs; pair RDDs have analogous per-key operations. Unlike those base-RDD actions, the per-key operations return RDDs and are therefore transformations rather than actions.
reduceByKey() is quite similar to reduce(); both take a function and use it to combine values. reduceByKey() runs several parallel reduce operations, one for each key in the dataset, where each operation combines values that have the same key.
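The classic word count illustrates this; a sketch assuming the lines RDD loaded earlier:

    // One parallel reduce per distinct word.
    val wordCounts = lines
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)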
foldByKey() is quite similar to fold(); both use a zero value of the same type as the data in our RDD and a combination function.
reduceByKey() and foldByKey() will automatically perform combining locally on each machine before computing global totals for each key.
combineByKey() is the most general of the per-key aggregation functions. Most of the other per-key combiners are implemented using it. For each element: if the key is new in that partition, createCombiner() creates the initial accumulator; if the key has already been seen in that partition, mergeValue() merges the value into the accumulator; mergeCombiners() then merges accumulators from different partitions.
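A sketch of the classic per-key average with combineByKey(); the nums pair RDD is an assumption for illustration:

    // Assumes a pair RDD `nums` of (String, Int), e.g. sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 5))).
    val sumCount = nums.combineByKey(
      (v: Int) => (v, 1),                                            // createCombiner: first value for a key in a partition
      (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),         // mergeValue: fold another value into the accumulator
      (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)   // mergeCombiners: merge accumulators across partitions
    )
    val averages = sumCount.mapValues { case (sum, count) => sum.toDouble / count }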
In any case, using one of the specialized aggregation functions in Spark can be much faster than the naive approach of grouping our data and then reducing it.
Tuning the level of parallelism
Every RDD has a fixed number of partitions that determine the degree of parallelism to use when executing operations on the RDD.
When performing aggregations or grouping operations, we can ask Spark to use a specific number of partitions. Spark will always try to infer a sensible default value based on the size of your cluster, but in some cases you will want to tune the level of parallelism for better performance.
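For example, a sketch where the data and the choice of 10 partitions are arbitrary:

    val data = sc.parallelize(Seq(("a", 3), ("b", 4), ("a", 1)))
    val defaultParallelism = data.reduceByKey(_ + _)       // uses Spark's default number of partitions
    val customParallelism  = data.reduceByKey(_ + _, 10)   // explicitly request 10 partitions for the result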
-