Apache Parquet: How to Be a Hero with the open source columnar data format on the Google and Amaz...

Get All the Benefits of Apache Parquet File Format for Google Cloud, Amazon Athena and Redshift?Spectrum

You have read about?Google Cloud?(BigQuery, Dataproc…),?Amazon Redshift Spectrum?and?AWS Athena. You are looking to take advantage of one or two. However, before you jump into the deep end you will want to familiarize yourself with the opportunities of leveraging?Apache Parquet?file format instead of regular Text, CSV or TSV files. The parquet format is a columnar storage format which allows systems, like Amazon Athena, the ability to query information as columnar data vs a flat file like CSV

If you are not thinking about how to optimize for these new query service models, you can be throwing money out the window.

What Is Apache?Parquet?

Apache?Parquet?format is a columnar storage format with?the following characteristics:

Apache Parquet is column-oriented and designed to bring efficient columnar storage of data compared to row based files like CSV

Apache Parquet is built from the ground up with complex nested data structures in mind

Apache Parquet is built to support very efficient compression and encoding schemes

Apache Parquet allows to lower storage costs for data files and maximizes the effectiveness of querying data with serverless technologies like Amazon Athena, Redshift Spectrum and Google Dataproc.

Apache Parquet is a self-describing data format which embeds the schema, or structure, within the data itself. This results a file that is optimized for query performance and minimizing I/O. Parquet also supports very efficient compression and encoding schemes. The great thing is that it is licensed under the Apache software foundation and available to any project.

Parquet and The Rise of Cloud Warehouses & Interactive Query?Services

The rise interactive query services like AWS Athena and Amazon Redshift Spectrum make it easy using standard SQL to analyze data in storage systems like Amazon S3. Also, data warehouses like Google BigQuery and the Google Dataproc platform can leverage different formats for data ingest.

However, the data format you select can have significant implications on performance and cost, especially if you are looking at machine learning, AI or other complex operations. We will walk you through a few examples of those considerations.

Parquet vs?CSV

CSV is simple and ubiquitous. Many tools like Excel, Google Sheets and a host of others can generate CSV files. You can even create them with your favorite text editing tool. We all love CSV files, but everything has a cost, even your love of CSV files, especially if CSV is your default format for data processing pipelines.

AWS Athena and AWS Redshift Spectrum charge you by the amount of data scanned per query. (Many other services also charge based on data queried so this is not unique to AWS)

Google and Amazon charge you for the amount of data stored on GS/S3

Google Dataproc?charges?are time-based

Defaulting to the use of CSV will have both technical and financial outcomes (not in a good way). You will learn to love Apache Parquet just as much as your trusty CSV.

Example: A 1 TB CSV?File

The following demonstrates the efficiency and effectiveness of using a Parquet file vs CSV.

By converting your CSV data to Parquet’s columnar format, compressing and partitioning it, you save money and reap the rewards of better performance. The following table compares the savings created by converting data into Parquet vs CSV.

Think about this: If over the course of a year you stuck with the uncompressed 1 TB CSV files as a foundation of your queries costs would be?$2000 USD. Using Parquet files your total cost would be?$3.65?USD. I know you love your CSV files, but do you love them THAT much?

Also, if time is money your analysts can be spending close to 5 minutes waiting for a query to complete simply because you use raw CSV. If you are paying someone $150 an hour and they are doing this once a day for a year then they spent about 30 hours simply waiting for a query to complete. That is roughly about?$4500?in unproductive “wait” time. Total wait time for the Apache Parquet user? About 42 mins or?$100.

Example 2: Parquet, CSV and Your Redshift Data Warehouse

Amazon Redshift Spectrum enables you to run Amazon Redshift SQL queries against data in?Amazon S3. This can be an effective strategy for teams that want to partition data where some of it is resident within Redshift and other data is resident on S3. For example, let’s assume you have about 4 TB of data in a?historical_purchase?table in Redshift. Since it is not accessed frequently, offloading it to S3 makes sense. This will free up that space in Redshift while still providing your team access via Spectrum. Now, the big question becomes what format are you storing that 4 TB?historical_purchase?table in? CSV? How about using Parquet?

Our?historical_purchase?table has 4 equally sized columns, stored in Amazon S3 in three files; uncompressed CSV, gzip CSV and Parquet.

Uncompressed CSV File

The uncompressed CSV file has a total size of 4 TB. Running a query to get data from a single column of the table requires Redshift Spectrum to scan the entire file 4 TB. As result this query would cost?$20.

GZIP CSV File

If you compress your CSV file using GZIP, the file size is reduced to?1GB.Great savings! However, Redshift Spectrum still has to scan the entire file. The good news is your CSV file is four times smaller than the uncompressed one so you pay one-fourth of what you did before. This query would cost?$5.

Parquet File

If you compress your file and convert it to Apache Parquet you end up with 1 TB of data in S3. However, because Parquet is columnar, Redshift Spectrum can read only the column that is relevant for the query being run. It only needs to scan just 1/4 the data. This query would only cost?$1.25.

If you are running this query once a day for a year, using uncompressed CSV files would cost?$7300.?Even compressed CSV queries would cost over?$1800. However, using the Apache Parquet file format it would cost about?$460.?Still in love with your CSV file?

Summary

The trend toward “serverless”, interactive query services and pre-built data processing suites is rapidly progressing. It is providing new opportunities for teams to go faster with lower investments.?Athena?and?Spectrum?make it easy to analyze data in Amazon S3 using standard SQL. Also, Google supports loading Parquet files into?BigQuery?and Dataproc.

When you only pay for the queries that you run, or resources like CPU and storage, it is important to look at optimizing the data those systems are relying on.

By the way, we have launched a zero admin data processing framework for Amazon Redshift Spectrum and Amazon Athena which includes automated database/table creation, Parquet file conversion, partitioning and more. See announcement for details:

Amazon Redshift Spectrum Automated — 60 Second Setup, Zero Administration And Automatic…

Announcing fully-managed support of zero administration Amazon Redshift Spectrum data pipeline service.blog.openbridge.com

AWS Athena Automated — 60 Second Setup, Zero Administration And Automatic Optimization

We are excited to announce the release of our zero administration AWS Athena data pipeline service.blog.openbridge.com

Also, take a look at our post about AWS Redshift Spectrum and AWS Athena. Using Apache Parquet can benefit both!

How is AWS Redshift Spectrum different than AWS Athena?

This question has come up a few times and most of the discussion in centered around the technical difference. Rather…blog.openbridge.com

Did we miss anything? Do you have any questions about how to transform your CSV to Apache Parquet? If you want help to streamline your data to Google Cloud, AWS Athena, AWS Redshift Spectrum or other data technologies, feel free to leave a comment or contact us at?hello@openbridge.com. You can also visit us at?https://www.openbridge.comto learn how we are helping other companies with their data efforts.

?著作權歸作者所有,轉載或內容合作請聯(lián)系作者

人面猴
序言：七十年代末雄可，一起剝皮案震驚了整個濱河市，隨后出現(xiàn)的幾起案子缠犀，更是在濱河造成了極大的恐慌数苫，老刑警劉巖，帶你破解...
沈念sama閱讀 222,104評論 6贊 515
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件辨液，死亡現(xiàn)場離奇詭異虐急，居然都是意外死亡，警方通過查閱死者的電腦和手機滔迈，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 94,816評論 3贊 399
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進店門止吁，熙熙樓的掌柜王于貴愁眉苦臉地迎上來，“玉大人燎悍，你說我怎么就攤上這事敬惦。” “怎么了谈山？”我有些...
開封第一講書人閱讀 168,697評論 0贊 360
道士緝兇錄：失蹤的賣姜人
文/不壞的土叔我叫張陵俄删，是天一觀的道長。經(jīng)常有香客問我奏路，道長畴椰，這世上最難降的妖魔是什么？我笑而不...
開封第一講書人閱讀 59,836評論 1贊 298
?港島之戀（遺憾婚禮）
正文為了忘掉前任思劳，我火速辦了婚禮迅矛，結果婚禮上，老公的妹妹穿的比我還像新娘潜叛。我一直安慰自己秽褒，他們只是感情好壶硅，可當我...
茶點故事閱讀 68,851評論 6贊 397
惡毒庶女頂嫁案：這布局不是一般人想出來的
文/花漫我一把揭開白布。她就那樣靜靜地躺著销斟，像睡著了一般庐椒。火紅的嫁衣襯著肌膚如雪。梳的紋絲不亂的頭發(fā)上蚂踊，一...
開封第一講書人閱讀 52,441評論 1贊 310
城市分裂傳說
那天约谈，我揣著相機與錄音，去河邊找鬼犁钟。笑死棱诱，一個胖子當著我的面吹牛，可吹牛的內容都是我干的涝动。我是一名探鬼主播迈勋，決...
沈念sama閱讀 40,992評論 3贊 421
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開眼，長吁一口氣：“原來是場噩夢啊……” “哼醋粟！你這毒婦竟也來了靡菇？” 一聲冷哼從身側響起，我...
開封第一講書人閱讀 39,899評論 0贊 276
萬榮殺人案實錄
序言：老撾萬榮一對情侶失蹤米愿，失蹤者是張志新（化名）和其女友劉穎厦凤，沒想到半個月后，有當?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體育苟，經(jīng)...
沈念sama閱讀 46,457評論 1贊 318
?護林員之死
正文獨居荒郊野嶺守林人離奇死亡较鼓，尸身上長有42處帶血的膿包…… 初始之章·張勛以下內容為張勛視角年9月15日...
茶點故事閱讀 38,529評論 3贊 341
?白月光啟示錄
正文我和宋清朗相戀三年，在試婚紗的時候發(fā)現(xiàn)自己被綠了宙搬。大學時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片笨腥。...
茶點故事閱讀 40,664評論 1贊 352
活死人
序言：一個原本活蹦亂跳的男人離奇死亡，死狀恐怖勇垛，靈堂內的尸體忽然破棺而出脖母，到底是詐尸還是另有隱情，我是刑警寧澤闲孤，帶...
沈念sama閱讀 36,346評論 5贊 350
?日本核電站爆炸內幕
正文年R本政府宣布谆级，位于F島的核電站，受9級特大地震影響讼积，放射性物質發(fā)生泄漏肥照。R本人自食惡果不足惜，卻給世界環(huán)境...
茶點故事閱讀 42,025評論 3贊 334
男人毒藥：我在死后第九天來索命
文/蒙蒙一勤众、第九天我趴在偏房一處隱蔽的房頂上張望舆绎。院中可真熱鬧，春花似錦们颜、人聲如沸吕朵。這莊子的主人今日做“春日...
開封第一講書人閱讀 32,511評論 0贊 24
一樁弒父案，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽努溃。三九已至硫嘶，卻和暖如春，著一層夾襖步出監(jiān)牢的瞬間梧税，已是汗流浹背沦疾。一陣腳步聲響...
開封第一講書人閱讀 33,611評論 1贊 272
情欲美人皮
我被黑心中介騙來泰國打工，沒想到剛下飛機就差點兒被人妖公主榨干…… 1. 我叫王不留第队，地道東北人哮塞。一個月前我還...
沈念sama閱讀 49,081評論 3贊 377
代替公主和親
正文我出身青樓，卻偏偏與公主長得像斥铺，于是被迫代替她去往敵國和親彻桃。傳聞我的和親對象是個殘疾皇子，可洞房花燭夜當晚...
茶點故事閱讀 45,675評論 2贊 359