Task:
1. The base data is Wikipedia Page Counts (>1TB compressed, available since 2008).
2. Each record contains the following fields:
1) project (e.g. "en", "fr", etc., usually a language)
2) title of the page (uri), urlencoded
3) number of requests
4) size of the content returned
3. The file name encodes the date and hour of the records (e.g. 2008013101 covers requests from 01:00 to 02:00 on January 31, 2008); see the parsing sketch after this list.
4. The goals are:
1) find the top 10 pages by number of requests per day in the English Wikipedia
2) support searching for arbitrary words, so that questions like "compare the total number of requests for pages about 'myspace' with the total for pages about 'facebook'" can be answered
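For illustration, here is a small helper for item 3 above. It is not from the original post, and the exact stamp format (YYYYMMDDHH) is an assumption based on the example given:

from datetime import datetime

# Hypothetical helper: turn a file-name stamp such as "2008013101" into
# a date string plus an hour, per the YYYYMMDDHH format assumed above.
def parse_file_stamp(stamp):
    dt = datetime.strptime(stamp, "%Y%m%d%H")
    return dt.strftime("%Y-%m-%d"), dt.hour

print(parse_file_stamp("2008013101"))   # ('2008-01-31', 1), i.e. 01:00-02:00 on Jan 31, 2008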
The MySQL-only pipeline looks like this (a rough sketch in code follows the list):
1. Uncompress the files (>10TB uncompressed), load the data into a staging table, and add a date field derived from the file name.
2. Aggregate and insert into the final table.
3. Decode the urlencoded titles (using a UDF).
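A rough sketch of these three steps, assuming a pymysql connection, already-uncompressed files, and made-up table and column names (wikistats_full, wikistats_by_day, num_requests, ...); the original post uses its own load scripts:

import glob
import pymysql

# Assumed connection parameters and schema; adjust to the real environment.
conn = pymysql.connect(host="127.0.0.1", user="wikistats", password="secret",
                       database="wikistats", local_infile=True)
with conn.cursor() as cur:
    # Step 1: load each uncompressed hourly file into a staging table and
    # derive request_date from the file-name stamp (e.g. ...-2008013101).
    for path in sorted(glob.glob("/data/wikistats/pagecounts-*")):
        stamp = path.rsplit("-", 1)[-1]          # "2008013101"
        cur.execute(
            "LOAD DATA LOCAL INFILE %s INTO TABLE wikistats_full "
            "FIELDS TERMINATED BY ' ' "
            "(project, title, num_requests, content_size) "
            "SET request_date = STR_TO_DATE(%s, '%%Y%%m%%d%%H')",
            (path, stamp))

    # Step 2: aggregate into the final per-day table.
    cur.execute(
        "INSERT INTO wikistats_by_day "
        "SELECT DATE(request_date), title, SUM(num_requests) "
        "FROM wikistats_full GROUP BY DATE(request_date), title")

    # Step 3: decode the urlencoded titles; the post does this with a UDF,
    # e.g. UPDATE wikistats_by_day SET title = url_decode(title)
conn.commit()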
那么滔以,整個 過程需要多長時間呢?
According to my calculations it should take > 1 month to do the whole pipeline for 6 years of data (this does not include the uncompress time, nor the load-time degradation as the table gets bigger and bigger and indexes need to be updated). There are a lot of things we can do here to speed it up, of course, e.g. load into different MySQL instances, load into a MEMORY table first and then group by into InnoDB, etc.
A much faster approach is to use PySpark; a sketch of the script is shown below
(it reads the files, sums the number of requests per URL per day, and then writes the aggregates to MySQL).
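The full script is in the post linked at the bottom; a condensed sketch of the same idea, using the Spark 1.x API with assumed paths, field layout, and MySQL URL, looks roughly like this:

# Sketch only: read one day of hourly pagecount files, keep the English
# Wikipedia project, aggregate requests per URL per day, append to MySQL.
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="wikistats_by_day")
sqlContext = SQLContext(sc)

mydate = "20080131"          # one day = 24 hourly files
mysql_url = "jdbc:mysql://127.0.0.1:3306/wikistats?user=wikistats&password=secret"

# Each line: project page_title num_requests content_size
# (textFile also reads .gz files directly, so no separate uncompress step)
lines = sc.textFile("/data/wikistats/pagecounts-%s*" % mydate)

def parse(line):
    fields = line.split(" ")
    if len(fields) != 4 or not fields[2].isdigit():
        return None
    return Row(project=fields[0], url=fields[1], requests=int(fields[2]))

rows = lines.map(parse).filter(lambda r: r is not None and r.project == "en")
sqlContext.createDataFrame(rows).registerTempTable("wikistats")

# Sum the 24 hourly counts into per-URL daily totals.
group_res = sqlContext.sql(
    "SELECT '%s' AS mydate, url, sum(requests) AS tot_visits "
    "FROM wikistats GROUP BY url" % mydate)

# Append the aggregated day to MySQL.
group_res.write.jdbc(url=mysql_url, table="wikistats.wikistats_by_day_spark", mode="append")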
The script took about an hour on 3 boxes to process 1 month of data and load the aggregated data to MySQL (a single instance). We can estimate that loading all 6 years (aggregated) into MySQL will take ~3 days.
Because the work is expressed as Spark RDD transformations and runs in parallel, the processing time shrinks from roughly one month to about three days.
Other performance-related points include:
1) group_res.write.jdbc(url=mysql_url, table="wikistats.wikistats_by_day_spark", mode="append")
Here Spark writes to MySQL in parallel: each partition opens its own JDBC connection, so the table is loaded by several concurrent writers rather than a single thread.
2) group_res.saveAsParquetFile("/ssd/wikistats_parquet_bydate/mydate=" + mydate)
This saves the aggregated result as Parquet files (they can also be written to a directory on HDFS). The Parquet files can then replace MySQL as the query backend: as the screenshots in the original post show, the query for the 10 most-requested pages takes 1 hour 22 minutes against MySQL but just over 20 minutes against the Parquet files. A minimal query sketch follows.
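Once the per-day Parquet files exist, they can be queried from the same SQLContext. This is a sketch under the assumptions used above (directory layout, and columns mydate, url, tot_visits):

# Sketch: query the Parquet files instead of MySQL.
parquet = sqlContext.read.parquet("/ssd/wikistats_parquet_bydate")
parquet.registerTempTable("wikistats_by_day")

# e.g. total requests for pages whose URL mentions "myspace" (cf. the task above)
sqlContext.sql(
    "SELECT lower(url) AS page, sum(tot_visits) AS total "
    "FROM wikistats_by_day "
    "WHERE lower(url) LIKE '%myspace%' "
    "GROUP BY lower(url) ORDER BY total DESC LIMIT 10").show()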
Original post: https://www.percona.com/blog/2015/10/07/using-apache-spark-mysql-data-analysis/