What is Apache Spark?
Apache Spark is a cluster computing platform designed to be fast and general-purpose.
On the speed side, Spark extends the popular MapReduce model to efficiently support more types of computations, including interactive queries and stream processing. One of the main features Spark offers for speed is the ability to run computations in memory, but the system is also more efficient than MapReduce for complex applications running on disk.
The Spark project contains multiple closely integrated components. At its core, Spark is a “computational engine” that is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines, or a computing cluster. Because the core engine of Spark is both fast and general-purpose, it powers multiple higher-level components specialized for various workloads, such as SQL or machine learning. These components are designed to interoperate closely, letting you combine them like libraries in a software project.
We should understand the characteristics and benefits of this tight integration. For the details, see *Learning Spark*.
Each of Spark’s components
Spark Core
Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, and interacting with storage systems. Spark Core is also home to the API that defines resilient distributed datasets (RDDs). It provides many APIs for building and manipulating these collections.
Spark SQL
Spark SQL is Spark's package for working with structured data. It allows querying data via SQL, as well as the Apache Hive variant of SQL (HQL), and it lets you combine SQL queries with complex analytics.
Spark Streaming
Spark Streaming is a Spark component that enables processing of live streams of data. It provides an API for manipulating data streams that closely matches Spark Core's RDD API. This makes it easy to move between applications that manipulate data stored in memory, on disk, or arriving in real time. Spark Streaming provides the same degree of fault tolerance, throughput, and scalability as Spark Core.
MLlib
MLlib is a library of common machine learning functionality. It provides many types of machine learning algorithms, including classification, regression, clustering, and collaborative filtering. It also provides some lower-level ML primitives, including a gradient descent optimization algorithm. All of these methods are designed to scale out across a cluster, which is quite remarkable.
GraphX
GraphX is a library for manipulating graphs and performing graph-parallel computation. It extends the Spark RDD API, allowing the user to create a directed graph with arbitrary properties attached to each vertex and edge. GraphX also provides various operators for manipulating graphs and a library of common graph algorithms.
Cluster Managers
Under the hood, Spark is designed to scale efficiently from one to many thousands of compute nodes. How does it achieve this while maximizing flexibility? By running over a variety of cluster managers. Spark can run on Hadoop YARN, Apache Mesos, and a simple cluster manager included in Spark itself called the Standalone Scheduler. This topic is still some way ahead, so we will write about it later; for now it is enough to know the concept exists.
For the specifics, see the official Spark documentation.