Function
functions.scala
hobbies.txt
alice	jogging,Coding,cooking	3
lina	travel,dance	2
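case class Likes(name: String, likes: String)
import spark.implicits._  // needed for toDF() when not running in spark-shell
val likes = spark.sparkContext.textFile("file:///home/hadoop/data/hobbies.txt")
// each line is tab-separated: name first, then a comma-separated hobby list
val likeDF = likes.map(_.split("\t")).map(x => Likes(x(0), x(1))).toDF()
likeDF.createOrReplaceTempView("t_likes")
// register a UDF that counts the hobbies in one row
spark.udf.register("likes_num", (x: String) => x.split(",").size)
spark.sql("select name,likes,likes_num(likes) from t_likes").show(false)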
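The function passed to spark.udf.register is an ordinary Scala function, so its logic can be checked without a SparkSession. A minimal sketch, assuming the two sample rows from hobbies.txt; the object name LikesNum and its fields are hypothetical and not part of the original notes:

```scala
// Hypothetical helper object; the name LikesNum is not from the notes.
object LikesNum {
  // Same body as the function registered under "likes_num":
  // split the comma-separated hobby list and count the pieces.
  val likesNum: String => Int = (x: String) => x.split(",").size

  // The two sample rows from hobbies.txt as (name, hobbies) pairs.
  val sample: Seq[(String, String)] = Seq(
    "alice" -> "jogging,Coding,cooking",
    "lina"  -> "travel,dance"
  )

  // Apply the function to every row, as the SQL query does per record.
  val counts: Seq[(String, Int)] =
    sample.map { case (name, likes) => (name, likesNum(likes)) }
}
```

LikesNum.counts pairs alice with 3 and lina with 2. Note that an empty hobby string still yields 1, because "".split(",") returns a one-element array.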
Spark SQL vision
1) Write less code
2) Read less data
3) Let the optimizer do the hard work
json
{"name":"zhangsan", "gender":"F", "height":160}
{"name":"lisi", "gender":"M", "height":175, "age":30}
{"name":"wangwu", "gender":"M", "height":180.3}
case class Person(name:String, age:Int, salary:Double)
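Salary greater than 30000: only name is needed, not age or salary
sc.textFile(path).map(x => {
  val Array(name, age, salary) = x.split(",")
  Person(name, age.toInt, salary.toDouble)  // split yields Strings, so convert
}).filter(_.salary > 30000).map(_.name)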
select name
from (
select name,salary from xxx
) t
where t.salary>30000;
select name
from (
select name,salary from xxx where salary>30000
) t;
Automatic optimization: this comes from Spark SQL, not Spark Core
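The two queries return the same names; the second simply filters before projecting, so fewer rows pass through the later steps. A minimal plain-Scala sketch of that equivalence, with no Spark involved. The sample rows and the object name PushdownSketch are made up for illustration:

```scala
// Hypothetical illustration object; not part of the original notes.
object PushdownSketch {
  // Same shape as the Person case class in the notes.
  case class Person(name: String, age: Int, salary: Double)

  // Made-up sample rows, for illustration only.
  val people: Seq[Person] = Seq(
    Person("a", 25, 20000.0),
    Person("b", 30, 35000.0),
    Person("c", 41, 50000.0)
  )

  // Shape of the first query: project (name, salary), then filter outside.
  val filterOutside: Seq[String] =
    people.map(p => (p.name, p.salary)).filter(_._2 > 30000).map(_._1)

  // Shape of the second query: filter first, then project.
  val filterInside: Seq[String] =
    people.filter(_.salary > 30000).map(p => (p.name, p.salary)).map(_._1)
}
```

Both values hold the names b and c. With the RDD API the order of filter and map is whatever the code says; with Spark SQL the notes' point is that the optimizer chooses the cheaper order for you.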
Spark2.x
Catalog
DF vs DS
DataFrame = Dataset[Row]
tableName: id, name
                   SQL        DF         DS
Syntax Errors      Runtime    Compile    Compile
Analysis Errors    Runtime    Runtime    Compile
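seletc name from xx    -- SQL, misspelled keyword: syntax error, found only at runtime
df.seletc("name")      // DF, misspelled method: syntax error, caught at compile time
df.select("nname")     // DF, wrong column name: analysis error, found only at runtime
ds.seletc("name")      // DS, misspelled method: syntax error, caught at compile time
ds.map(_.nname)        // DS, wrong field name: analysis error, caught at compile time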
Analysis errors are reported before a distributed job starts.
1) LogApp ==> DF/DS
XXXXXX
DESC
DOMAIN1
TOP10
DOMAIN2
TOP10
DOMAIN3
TOP10
2) Optional: an external data source for text