To download all of the session videos and slides (PPT), follow the WeChat official account (bigdata_summit) and click the "Video Download" menu.
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets
by Jules Damji, Databricks
video, slide
Of all the things that delight developers, none is more attractive than a set of APIs that makes them productive, is easy to use, and is intuitive and expressive. Apache Spark offers such APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing, so you can operate on large data sets in languages such as Scala, Java, Python, and R and do distributed big data processing at scale. In this talk, I will explore the evolution of the three sets of APIs available in Apache Spark 2.x: RDDs, DataFrames, and Datasets. In particular, I will emphasize three takeaways: 1) why and when you should use each API, as a matter of best practice; 2) the performance and optimization benefits of each; and 3) the scenarios in which to use DataFrames and Datasets instead of RDDs for your distributed big data processing. Through simple notebook demonstrations with API code examples, you’ll learn how to process big data using RDDs, DataFrames, and Datasets and how to interoperate among them. (This talk is a spoken version of the blog post, updated with the latest developments in the Apache Spark 2.x DataFrame/Dataset and Spark SQL APIs: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html) Session hashtag: #EUdev12
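For readers who want to see the interoperability in code before watching the talk, here is a minimal Scala sketch; the Person case class and sample rows are illustrative, not taken from the talk's notebooks:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative case class; the talk's notebooks use their own sample data.
case class Person(name: String, age: Int)

object ThreeApisDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("three-apis").master("local[*]").getOrCreate()
    import spark.implicits._

    // Start with an RDD of case-class objects.
    val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 31), Person("Bob", 25)))

    // RDD -> Dataset (typed) and RDD -> DataFrame (untyped rows).
    val ds = spark.createDataset(rdd)
    val df = rdd.toDF()

    // DataFrame -> Dataset by supplying the type, and back to an RDD if needed.
    val dsFromDf = df.as[Person]
    val backToRdd = dsFromDf.rdd

    ds.filter(_.age > 30).show()   // compile-time typed API
    df.filter($"age" > 30).show()  // untyped, Catalyst-optimized expressions

    spark.stop()
  }
}
```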
An Adaptive Execution Engine For Apache Spark SQL
by Carson Wang, Intel
video, slide
Catalyst, the optimizer in Spark SQL, provides an open interface for rule-based optimization at the planning stage. However, static (rule-based) optimization does not take the runtime data distribution into account. A technology called Adaptive Execution, introduced in Spark 2.0, aims to close this gap, but it is still at an early stage. We enhanced the existing Adaptive Execution feature, focusing on adjusting the execution plan at runtime based on the intermediate outputs of each stage: setting partition numbers for joins and aggregations, avoiding unnecessary data shuffling and disk I/O, handling data skew, and even optimizing the join order as a cost-based optimizer (CBO) would. In our benchmark comparisons, this feature saves a huge amount of manual effort in tuning parameters such as the shuffle partition number, which is error-prone and misleading. In this talk, we will present the new adaptive execution framework, its task scheduling, failover and retry mechanism, runtime plan switching, and more. Finally, we will share our experience benchmarking TPCx-BB at the 100-300 TB scale on a bare-metal Spark cluster with hundreds of nodes. Session hashtag: #EUdev4
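The talk describes Intel's enhanced engine, which does not ship in stock Spark 2.x; as a point of reference, the sketch below only shows the baseline knobs stock Spark 2.x exposes for static shuffle-partition tuning and its experimental adaptive execution (values are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// A sketch of the stock Spark 2.x settings that adaptive execution replaces or builds on.
// The richer runtime re-optimization described in the talk comes from Intel's patches,
// not from these switches alone.
val spark = SparkSession.builder
  .appName("adaptive-execution-demo")
  // Static tuning: one fixed shuffle partition count for every join and aggregation.
  .config("spark.sql.shuffle.partitions", "2000")
  // Experimental adaptive execution in Spark 2.x: choose post-shuffle partition counts
  // at runtime from the actual map-output sizes instead of the static value above.
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "67108864") // 64 MB
  .getOrCreate()
```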
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Methodologies
by Luca Canali, CERN
video, slide
This talk is about methods and tools for troubleshooting Spark workloads at scale and is aimed at developers, administrators and performance practitioners. You will find examples illustrating the importance of using the right tools and right methodologies for measuring and understanding performance, in particular highlighting the importance of using data and root cause analysis to understand and improve the performance of Spark applications. The talk has a strong focus on practical examples and on tools for collecting data relevant for performance analysis. This includes tools for collecting Spark metrics and tools for collecting OS metrics. Among others, the talk will cover sparkMeasure, a tool developed by the author to collect Spark task metrics and SQL metrics data, tools for analysing I/O and network workloads, tools for analysing CPU usage and memory bandwidth, and tools for profiling CPU usage and for Flame Graph visualization. Session hashtag: #EUdev2
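As a taste of the measurement workflow, here is a minimal sparkMeasure sketch, assuming the spark-measure package is on the classpath and using the StageMetrics API as documented in the project's README; the workload itself is a placeholder:

```scala
import org.apache.spark.sql.SparkSession
import ch.cern.sparkmeasure.StageMetrics

object SparkMeasureDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("sparkmeasure-demo").getOrCreate()

    // Wrap a workload to collect aggregated stage/task metrics (executor run time,
    // CPU time, shuffle and I/O metrics, ...) and print a summary report.
    val stageMetrics = StageMetrics(spark)
    stageMetrics.runAndMeasure {
      spark.range(0, 100000000L).selectExpr("sum(id)").show()
    }

    spark.stop()
  }
}
```

Submitting with `--packages` pointing at the spark-measure artifact is one way to put it on the classpath.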
Dr. Elephant: Achieving Quicker, Easier, and Cost-Effective Big Data Analytics
by Akshay Rai, LinkedIn
video, slide
Is your job running slower than usual? Do you want to make sense of the thousands of Hadoop and Spark metrics? Do you want to monitor the performance of your flow, get alerts, and have it auto-tuned? These are common questions every Hadoop user asks, but there is no single solution that addresses them. At LinkedIn we faced many such issues and built a simple self-serve tool for Hadoop users called Dr. Elephant. Dr. Elephant, which is already open sourced, is a performance monitoring and tuning tool for Hadoop and Spark. It tries to improve developer productivity and cluster efficiency by making it easier to tune jobs. Since it was open sourced, it has been adopted by multiple organizations and has attracted a lot of interest in the Hadoop and Spark community. In this talk, we will discuss Dr. Elephant and outline our efforts to expand its scope into a comprehensive monitoring, debugging and tuning tool for Hadoop and Spark applications. We will talk about how Dr. Elephant performs exception analysis, gives clear and specific tuning suggestions, tracks metrics, and monitors their historical trends. Open source: https://github.com/linkedin/dr-elephant Session hashtag: #EUdev9
Extending Apache Spark SQL Data Source APIs with Join Push Down
by Ioana Delaney, IBM
video, slide
When Spark applications operate on distributed data coming from disparate data sources, they often have to directly query data sources external to Spark such as backing relational databases or data warehouses. For that, Spark provides Data Source APIs, which are a pluggable mechanism for accessing structured data through Spark SQL. Data Source APIs are tightly integrated with the Spark Optimizer. They provide optimizations such as filter push down to the external data source and column pruning. While these optimizations significantly speed up Spark query execution, depending on the data source, they only provide a subset of the functionality that can be pushed down and executed at the data source. As part of our ongoing project to provide a generic data source push down API, this presentation will show our work related to join push down. An example is star-schema join, which can be simply viewed as filters applied to the fact table. Today, Spark Optimizer recognizes star-schema joins based on heuristics and executes star-joins using efficient left-deep trees. An alternative execution proposed by this work is to push down the star-join to the external data source in order to take advantage of multi-column indexes defined on the fact tables, and other star-join optimization techniques implemented by the relational data source. Session hashtag: #EUdev7
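For context, the push-down hooks that exist in the Data Source API today look roughly like the Scala sketch below (class names other than the Spark traits are illustrative); the join push down proposed in the talk would be an extension beyond these interfaces:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// A toy relation showing the hooks Spark already offers: column pruning and
// filter push down. Join push down has no stable hook here yet.
class ToyRelation(override val sqlContext: SQLContext)
    extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType =
    StructType(Seq(StructField("id", IntegerType), StructField("name", StringType)))

  // Spark passes only the columns the query needs and the filters it can push down.
  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // A real connector would translate `filters` into the external system's query
    // language (e.g. SQL WHERE clauses) and fetch only `requiredColumns`.
    sqlContext.sparkContext.emptyRDD[Row]
  }
}
```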
Extending Apache Spark's Ingestion: Building Your Own Java Data Source
by Jean Georges Perrin, Oplo
video, slide
Apache Spark is a wonderful platform for running your analytics jobs. It has great ingestion features for CSV, Hive, JDBC, and so on; however, you may have your own data sources or formats you want to use. One solution is to convert your data into a CSV or JSON file and then ask Spark to ingest it through its built-in tools. However, for better performance, we will explore how to build a data source, in Java, that extends Spark’s ingestion capabilities. We will first understand how Spark handles ingestion, then walk through the development of this data source plug-in. Targeted audience: software and data engineers who need to expand Spark’s ingestion capability. Key takeaways: requirements, needs & architecture – 15%; building the required tool set in Java – 85%. Session hashtag: #EUdev6
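The talk builds the plug-in in Java; for brevity, here is a compact Scala sketch of the same DataSource V1 interfaces a Java implementation would also use, with a toy in-memory relation standing in for a real external system:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext, SparkSession}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// The plug-in entry point Spark instantiates by class name. A Java version implements
// the same interfaces; only the syntax differs.
class ToyProvider extends RelationProvider {
  override def createRelation(ctx: SQLContext, parameters: Map[String, String]): BaseRelation =
    new BaseRelation with TableScan {
      override val sqlContext: SQLContext = ctx
      override val schema: StructType = StructType(Seq(StructField("line", StringType)))
      // A real source would read from the custom system using the caller's `parameters`.
      override def buildScan(): RDD[Row] =
        ctx.sparkContext.parallelize(Seq(Row("hello"), Row("world")))
    }
}

object CustomSourceDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("custom-source").master("local[*]").getOrCreate()
    // Spark looks the provider up by class name and plans the scan through it.
    val df = spark.read.format(classOf[ToyProvider].getName).load()
    df.show()
    spark.stop()
  }
}
```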
Fire in the Sky: An Introduction to Monitoring Apache Spark in the Cloud
by Michael McCune, Red Hat
video, slide
Writing intelligent cloud native applications is hard enough when things go well, but what happens when there are performance and debugging issues that arise during production? Inspecting the logs is a good start, but what if the logs don’t show the whole picture? Now you have to go deeper, examining the live performance metrics that are generated by Spark, or even deploying specialized microservices to monitor and act upon that data. Spark provides several built-in sinks for exposing metrics data about the internal state of its executors and drivers, but getting at that information when your cluster is in the cloud can be a time consuming and arduous process. In this presentation, Michael McCune will walk through the options available for gaining access to the metrics data even when a Spark cluster lives in a cloud native containerized environment. Attendees will see demonstrations of techniques that will help them to integrate a full-fledged metrics story into their deployments. Michael will also discuss the pain points and challenges around publishing this data outside of the cloud and explain how to overcome them. In this talk you will learn about: deploying metrics sinks as microservices, common configuration options, and accessing metrics data through a variety of mechanisms. Session hashtag: #EUdev11
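One of the "variety of mechanisms" for getting at metrics data is a custom SparkListener; the sketch below is a generic illustration of that approach (not necessarily the talk's demo), printing task metrics that a real deployment would forward to a monitoring microservice:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
import org.apache.spark.sql.SparkSession

// Receive task-level metrics programmatically as tasks finish. The built-in sinks
// (console, CSV, Graphite, ...) configured via metrics.properties are the other route.
object ListenerMetricsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("metrics-demo").master("local[*]").getOrCreate()

    spark.sparkContext.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val m = taskEnd.taskMetrics
        if (m != null) {
          // Forward these to your monitoring service instead of printing.
          println(s"stage=${taskEnd.stageId} runTimeMs=${m.executorRunTime} " +
            s"shuffleReadBytes=${m.shuffleReadMetrics.totalBytesRead}")
        }
      }
    })

    spark.range(0, 1000000L).selectExpr("sum(id)").show()
    spark.stop()
  }
}
```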
From Pipelines to Refineries: Building Complex Data Applications with Apache Spark
by Tim Hunter, Databricks
video, slide
Big data tools are challenging to combine into a larger application: ironically, big data applications themselves do not tend to scale very well. These issues of integration and data management are only magnified by increasingly large volumes of data. Apache Spark provides strong building blocks for batch processes, streams and ad-hoc interactive analysis. However, users face challenges when putting together a single coherent pipeline that could involve hundreds of transformation steps, especially when confronted by the need of rapid iterations. This talk explores these issues through the lens of functional programming. It presents an experimental framework that provides full-pipeline guarantees by introducing more laziness to Apache Spark. This framework allows transformations to be seamlessly composed and alleviates common issues, thanks to whole program checks, auto-caching, and aggressive computation parallelization and reuse. Session hashtag: #EUdev1
Lessons From the Field: Applying Best Practices to Your Apache Spark Applications
by Silvio Fiorito, Databricks
video, slide
Apache Spark is an excellent tool to accelerate your analytics, whether you’re doing ETL, Machine Learning, or Data Warehousing. However, to really make the most of Spark it pays to understand best practices for data storage, file formats, and query optimization. This talk will cover best practices I’ve applied over years in the field helping customers write Spark applications, as well as identifying what patterns make sense for your use case. Session hashtag: #EUdev5
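As one concrete illustration of the storage and file-format practices the talk covers, the following Scala sketch writes partitioned Parquet and reads it back with partition pruning and predicate push down; the path and column names are made up for the example:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: a common best practice is columnar files partitioned by a
// frequently filtered column, so queries prune partitions and push down predicates.
object StorageBestPracticeDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("storage-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val events = Seq(("2017-10-25", "click", 3), ("2017-10-26", "view", 7))
      .toDF("event_date", "event_type", "count")

    // Partitioned, compressed Parquet: a good default for analytic workloads.
    events.write
      .partitionBy("event_date")
      .mode("overwrite")
      .parquet("/tmp/events_parquet")   // illustrative path

    // Filtering on the partition column scans only matching directories;
    // filters on other columns are pushed down to the Parquet reader.
    val oneDay = spark.read.parquet("/tmp/events_parquet")
      .filter($"event_date" === "2017-10-26" && $"count" > 5)
    oneDay.explain()   // inspect PartitionFilters / PushedFilters in the plan
    spark.stop()
  }
}
```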
Optimal Strategies for Large-Scale Batch ETL Jobs
by Emma Tang, Neustar
video, slide
The ad tech industry processes large volumes of pixel and server-to-server data for each online user’s click, impression, and conversion. At Neustar, we process 10+ billion events per day, and all of our events are fed through a number of Spark ETL batch jobs. Many of our Spark jobs process over 100 terabytes of data per run, and each job runs to completion in around 3.5 hours. This means we needed to optimize our jobs in specific ways to achieve massive parallelization while keeping the memory usage (and cost) as low as possible. Our talk is focused on strategies for dealing with extremely large data. We will talk about the things we learned and the mistakes we made, including:
- Optimizing memory usage using Ganglia
- Optimizing partition counts for different types of stages and effective joins
- Counterintuitive strategies for materializing data to maximize efficiency
- Spark default settings specific to large-scale jobs, and how they matter
- Running Spark on Amazon EMR with more than 3200 cores
- Reviewing the different types of errors and stack traces that occur in large-scale jobs, and how to read and handle them
- Dealing with a large amount of map output status when 100k partitions join with 100k partitions
- Preventing serialization buffer overflow as well as map output status buffer overflow, which can easily happen when data is extremely large
- Using partitioners effectively to combine stages and minimize shuffle (see the sketch after this list)
Session hashtag: #EUdev3
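The last item, using partitioners to combine stages and minimize shuffle, can be illustrated with a small Scala sketch (toy data sizes; the keys and partition count are placeholders):

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

// Pre-partition both sides of a large join with the same partitioner so the join
// itself becomes a narrow dependency and adds no extra shuffle stage.
object CoPartitionedJoinDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("etl-join-demo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val numPartitions = 8              // in production this would be far larger
    val partitioner = new HashPartitioner(numPartitions)

    // Shuffle each side once, cache the partitioned form, and reuse it across joins.
    val impressions = sc.parallelize(1 to 100000).map(id => (id % 1000, "impression"))
      .partitionBy(partitioner).cache()
    val conversions = sc.parallelize(1 to 10000).map(id => (id % 1000, "conversion"))
      .partitionBy(partitioner).cache()

    // Because both RDDs share the partitioner, this join needs no additional shuffle.
    val joined = impressions.join(conversions)
    println(joined.count())
    spark.stop()
  }
}
```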
Storage Engine Considerations for Your Apache Spark Applications
by Mladen Kovacevic, Cloudera
video, slide
You have the perfect use case for your Spark applications, whether it be batch processing or super-fast near-real-time streaming. Now, where to store your valuable data? In this talk we take a look at four storage options: HDFS, HBase, Solr and Kudu. With so many to choose from, which will fit your use case? What considerations should be taken into account? What are the pros and cons, what are the similarities and differences, and how do they fit in with your Spark application? Learn the answers to these questions and more with a look at design patterns and techniques, and sample code you can integrate into your application immediately. Walk away with the confidence to propose the right architecture for your use cases and the development know-how to implement and deliver it with success. Session hashtag: #EUdev10
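To give a flavour of the integration code the talk promises, here is a hedged Scala sketch showing two of the four options side by side: plain files on HDFS via the built-in Parquet reader, and Kudu via the kudu-spark connector (assumed to be on the classpath; the master address, table name and paths are placeholders). HBase and Solr have their own Spark connectors with a similar read/write pattern.

```scala
import org.apache.spark.sql.SparkSession

object StorageOptionsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("storage-options").getOrCreate()

    // HDFS: immutable, scan-oriented storage, e.g. Parquet files.
    val hdfsDf = spark.read.parquet("hdfs:///data/events/2017/10/")

    // Kudu: mutable, low-latency tables that still support fast scans from Spark.
    val kuduDf = spark.read
      .format("org.apache.kudu.spark.kudu")
      .option("kudu.master", "kudu-master.example.com:7051")  // placeholder address
      .option("kudu.table", "events")                         // placeholder table
      .load()

    hdfsDf.createOrReplaceTempView("events_hdfs")
    kuduDf.createOrReplaceTempView("events_kudu")
    spark.sql("SELECT count(*) FROM events_kudu").show()
    spark.stop()
  }
}
```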
Supporting Highly Multitenant Spark Notebook Workloads: Best Practices and Useful Patches
by Brad Kaiser, IBM
video, slide
Notebooks: they enable our users, but they can cripple our clusters. Let’s fix that. Notebooks have soared in popularity at companies world-wide because they provide an easy, user-friendly way of accessing the cluster-computing power of Spark. But the more users you have hitting a cluster, the harder it is to manage the cluster resources as big, long-running jobs start to starve out small, short-running jobs. While you could have users spin up EMR-style clusters, this reduces the ability to take advantage of the collaborative nature of notebooks. It also quickly becomes expensive as clusters sit idle for long periods of time waiting on single users. What we want is fair, efficient resource utilization on a large single cluster for a large number of users. In this talk we’ll discuss dynamic allocation and the best practices for configuring the current version of Spark as-is to help solve this problem. We’ll also present new improvements we’ve made to address this use case. These include: decommissioning executors without losing cached data, proactively shutting down executors to prevent starvation, and improving the start times of new executors. Session hashtag: #EUdev8
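For reference, a baseline "configure Spark as-is" setup for shared notebook clusters relies on dynamic allocation plus the external shuffle service; the values below are illustrative, and the decommissioning and proactive shut-down improvements described in the talk are patches on top of this, not settings:

```scala
import org.apache.spark.sql.SparkSession

// Baseline dynamic allocation for a shared cluster. The external shuffle service
// must also be running on the cluster nodes for executors to be released safely.
val spark = SparkSession.builder
  .appName("shared-notebooks")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")            // keep shuffle files when executors go away
  .config("spark.dynamicAllocation.minExecutors", "1")
  .config("spark.dynamicAllocation.maxExecutors", "50")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")        // release idle executors quickly
  .config("spark.dynamicAllocation.cachedExecutorIdleTimeout", "30m")  // keep ones holding cached data longer
  .getOrCreate()
```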