需求分析
計算部門的平均薪資和年齡
- 只統(tǒng)計年齡在20歲以上的員工
- 根據(jù)部門名稱和員工性別為粒度來進行統(tǒng)計
- 統(tǒng)計出每個部門分性別的平均薪資和年齡
關(guān)鍵技術(shù)點
- 導入隱式轉(zhuǎn)化
import spark.implicits._
- 導入spark.sql.fucntions
import org.apache.spark.sql.functions._
- 兩個表的字段的連接條件,需要使用三個等號
$"depId" === $"id"
- groupBy聚合時奉狈,指定表及相應字段
groupBy(department("name"), employee("gender"))
- agg聚合函數(shù)
agg(avg(employee("salary")), avg(employee("age")))
- dataframe == dataset[Row],dataframe的類型是Row是趴,所以是untyped類型,弱類型,dataset的類型通常是我們自定義的case class摹迷,所以是typed類型,強類型
- dataset開發(fā),與rdd開發(fā)有很多的共同點。dataset采用encoder序列化
代碼示例
package com.spark.dataset
import org.apache.spark.sql.SparkSession
/**
* 計算部門的平均薪資和年齡
*
* 需求:
* 1蜂挪、只統(tǒng)計年齡在20歲以上的員工
* 2、根據(jù)部門名稱和員工性別為粒度來進行統(tǒng)計
* 3嗓化、統(tǒng)計出每個部門分性別的平均薪資和年齡
*
*/
object DepartmentAvgSalaryAndAgeStat extends App{
val spark=SparkSession
.builder()
.appName("DepartmentAvgSalaryAndAgeStat")
.master("local")
.config("spark.sql.warehouse.dir","E:\\worksplace\\spark\\spark-warehouse")
.getOrCreate()
//導入隱式轉(zhuǎn)換
import spark.implicits._
//spark sql functions
import org.apache.spark.sql.functions._
/**
+---+--------------------+
| id| name|
+---+--------------------+
| 1|Technical Department|
| 2|Financial Department|
| 3| HR Department|
+---+--------------------+
*/
val department=spark.read.json("E:\\worksplace\\spark\\src\\main\\resources\\department.json")
/**
+---+-----+------+------+------+
|age|depId|gender| name|salary|
+---+-----+------+------+------+
| 25| 1| male| Leo| 20000|
| 30| 2|female| Marry| 25000|
| 35| 1| male| Jack| 15000|
| 42| 3| male| Tom| 18000|
| 21| 3|female|Kattie| 21000|
| 30| 2|female| Jen| 28000|
| 19| 2|female| Jen| 8000|
+---+-----+------+------+------+
*/
val employee=spark.read.json("E:\\worksplace\\spark\\src\\main\\resources\\employee.json")
//department.show()
//employee.show()
//1.過濾20歲以上的員工
val filtedEmployee=employee.filter("age>20")
//filtedEmployee.show()
/**
+---+-----+------+------+------+---+--------------------+
|age|depId|gender| name|salary| id| name|
+---+-----+------+------+------+---+--------------------+
| 25| 1| male| Leo| 20000| 1|Technical Department|
| 30| 2|female| Marry| 25000| 2|Financial Department|
| 35| 1| male| Jack| 15000| 1|Technical Department|
| 42| 3| male| Tom| 18000| 3| HR Department|
| 21| 3|female|Kattie| 21000| 3| HR Department|
| 30| 2|female| Jen| 28000| 2|Financial Department|
+---+-----+------+------+------+---+--------------------+
*/
// 注意:untyped join棠涮,兩個表的字段的連接條件,需要使用三個等號
val joined=filtedEmployee.join(department, $"depId" === $"id")
val result=employee
// 先對employee進行過濾刺覆,只統(tǒng)計20歲以上的員工
.filter("age>20")
// 需要跟department數(shù)據(jù)進行join故爵,然后才能根據(jù)部門名稱和員工性別進行聚合
// 注意:untyped join,兩個表的字段的連接條件隅津,需要使用三個等號
.join(department, $"depId" === $"id")
// 根據(jù)部門名稱和員工性別進行分組
.groupBy(department("name"), employee("gender"))
// 最后執(zhí)行聚合函數(shù)
.agg(avg(employee("salary")), avg(employee("age")))
// 執(zhí)行action操作,將結(jié)果顯示出來
/**
+--------------------+------+-----------+--------+
| name|gender|avg(salary)|avg(age)|
+--------------------+------+-----------+--------+
| HR Department|female| 21000.0| 21.0|
|Technical Department| male| 17500.0| 30.0|
|Financial Department|female| 26500.0| 30.0|
| HR Department| male| 18000.0| 42.0|
+--------------------+------+-----------+--------+
*/
result.show()
}