Fun with Big Data #44: Spark Experiment 2 (Spark-SQL)

I. Introduction


            This experiment follows Chapter 9 (Spark SQL) of Spark: The Definitive Guide.

II. Experiment Contents


            Use spark-sql to practice the following operations:

1.Databases

    1) Show Databases

    2) Creating Databases

    3) Use Databases

    4) Dropping Databases

2.Tables

    1) Creating Tables

    2) Show Tables

    3) Inserting Into Tables

    4) Describing Table Metadata

    5) Refreshing Table Metadata

    6) Dropping Tables

3.Views

    1) Show Views

    2) Creating Views

    3) Dropping Views

4.Select Statements

    1) SELECT Syntax

    2) Case When Then Statements

5.Advanced Topics

    1) Complex Types

            Structs

            Sets and Lists

    2) Functions

            SYSTEM FUNCTIONS

            User Defined Functions

    3) Spark Managed Tables

            Creating External Tables

            Dropping Unmanaged Tables

    4) Subqueries

            Uncorrelated Predicate Subqueries

            Correlated Predicate Subqueries

            Uncorrelated Scalar Queries

III. Preparation


1. Experiment Data

        See the Preparation section of "Fun with Big Data #43: Spark Experiment 1 (Reading and Writing External Data Sources with spark-shell)".

2. Changing the Log Output Level

        See the same Preparation section of #43, under "Changing the Log Output Level".

IV. Startup


            1. Start spark-sql

                  spark-sql

                  and wait until the spark-sql prompt appears.

Starting spark-sql

Note: the console log level in conf/log4j.properties has already been changed to ERROR here:

                        log4j.rootCategory=ERROR, console

            2. A quick smoke test

                        select 1+1;

The result should be 2.

V. Experiment Procedure


Unless otherwise noted, everything below is executed at the spark-sql (default)> prompt; the code you need to type appears between the begin and end markers.

Note: every statement must end with a semicolon (";") before it will execute.

1.Databases

===================== Databases begin =====================

1)Show Databases

show databases;


2)Creating Databases

create database testdb;

show databases;


3)Use Databases

use testdb;


4)Dropping Databases

create database testdb2;

show databases;

drop database testdb2;


5) Behind the scenes

In fact, the results of all these commands end up in the Hadoop file system; the exact location is determined by the following property in Hive's config file conf/hive-site.xml:

<property>

    <name>hive.metastore.warehouse.dir</name>

    <value>/mylab/soft/apache-hive-3.1.2-bin/working/metastore.warehouse</value>

    <description>location of default database for the warehouse</description>

</property>


Open a new Terminal:

hadoop fs -ls /mylab/soft/apache-hive-3.1.2-bin/working/metastore.warehouse

hadoop fs -ls /mylab/soft/apache-hive-3.1.2-bin/working/metastore.warehouse/testdb.db

Location of testdb.db

===================== Databases end =====================


2.Tables

===================== Tables begin =====================

1)Creating Tables

# Create a table the ordinary way

create table student(

    id int,

    name string

);

# Create a table with USING

CREATE TABLE flights (

DEST_COUNTRY_NAME STRING,

ORIGIN_COUNTRY_NAME STRING,

count LONG)

USING JSON

OPTIONS (

path '/mylab/mydata/spark/spark_guide_data/flight-data/json/2015-summary.json');

# Create a table with USING, adding a column comment

CREATE TABLE flights_csv (

DEST_COUNTRY_NAME STRING,

ORIGIN_COUNTRY_NAME STRING COMMENT "remember that the most prevalent will",

count LONG)

USING csv

OPTIONS (

inferSchema true,

header true,

path '/mylab/mydata/spark/spark_guide_data/flight-data/csv/2015-summary.csv');

show tables;

Three tables have been created

Run the following commands in a Terminal to check whether the files exist:

hadoop fs -ls /mylab/mydata/spark/spark_guide_data/flight-data/json/2015-summary.json

hadoop fs -ls /mylab/mydata/spark/spark_guide_data/flight-data/csv/2015-summary.csv

hadoop fs -ls /mylab/soft/apache-hive-3.1.2-bin/working/metastore.warehouse/testdb.db

The testdb.db directory contains only the student subdirectory, which shows that a table created with USING keeps its data in the original data file rather than copying it into the warehouse.
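The managed-versus-USING distinction above can be sketched in plain Python (the Catalog class below is illustrative, not a Spark API): dropping a managed table deletes its warehouse directory, while dropping a table created with USING ... path leaves the original data file in place.

```python
# Sketch (NOT Spark code): what DROP TABLE touches for managed vs. USING tables.

class Catalog:
    def __init__(self):
        self.tables = {}            # table name -> {"path": ..., "managed": bool}
        self.filesystem = set()     # directories/files that currently hold data

    def create_managed(self, name):
        path = f"/warehouse/testdb.db/{name}"
        self.filesystem.add(path)   # the warehouse owns the data
        self.tables[name] = {"path": path, "managed": True}

    def create_using(self, name, path):
        # USING ... OPTIONS(path ...): the data stays where it already is
        self.filesystem.add(path)
        self.tables[name] = {"path": path, "managed": False}

    def drop(self, name):
        meta = self.tables.pop(name)
        if meta["managed"]:
            self.filesystem.discard(meta["path"])   # managed: data is deleted too

cat = Catalog()
cat.create_managed("student")
cat.create_using("flights", "/mylab/mydata/spark/spark_guide_data/flight-data/json/2015-summary.json")
cat.drop("student")
cat.drop("flights")
# the original JSON file survives the drop; the student directory does not
print(sorted(cat.filesystem))
```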

# Create a table from a query

CREATE TABLE flights_from_select

AS

SELECT * FROM flights;

CREATE TABLE IF NOT EXISTS flights_from_select

AS

SELECT * FROM flights

LIMIT 5;

show tables;

create a table from a query

Run the following commands in a Terminal to check whether the files exist:

hadoop fs -ls /mylab/soft/apache-hive-3.1.2-bin/working/metastore.warehouse/testdb.db


hadoop fs -ls /mylab/soft/apache-hive-3.1.2-bin/working/metastore.warehouse/testdb.db/flights_from_select

hadoop fs -cat /mylab/soft/apache-hive-3.1.2-bin/working/metastore.warehouse/testdb.db/flights_from_select/part-00000-5bfd19b2-ceb4-4eca-adbe-19118cc4bcd0-c000

You can see that the table created from the query contains data.


# Create a partitioned table

CREATE TABLE partitioned_flights

USING parquet

PARTITIONED BY (DEST_COUNTRY_NAME)

AS

SELECT DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME, count FROM flights

LIMIT 5;

select * from flights limit 5;

select * from partitioned_flights;

Partitioned table

Run in a Terminal:

hadoop fs -ls /mylab/soft/apache-hive-3.1.2-bin/working/metastore.warehouse/testdb.db/partitioned_flights

After partitioning on a key, a separate directory is generated for each distinct value of the partition key.
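The partition layout described above can be simulated in plain Python (a sketch, not Spark code): each distinct value of the partition column becomes a KEY=value subdirectory, and the partition column itself is dropped from the data files stored inside.

```python
# Sketch (not Spark code): Hive-style layout for PARTITIONED BY (DEST_COUNTRY_NAME).
rows = [
    {"DEST_COUNTRY_NAME": "United States", "ORIGIN_COUNTRY_NAME": "Romania", "count": 15},
    {"DEST_COUNTRY_NAME": "United States", "ORIGIN_COUNTRY_NAME": "Croatia", "count": 1},
    {"DEST_COUNTRY_NAME": "China", "ORIGIN_COUNTRY_NAME": "United States", "count": 920},
]

layout = {}
for row in rows:
    # the partition value is encoded in the directory name, not in the data file
    directory = f"partitioned_flights/DEST_COUNTRY_NAME={row['DEST_COUNTRY_NAME']}"
    data = {k: v for k, v in row.items() if k != "DEST_COUNTRY_NAME"}
    layout.setdefault(directory, []).append(data)

for d in sorted(layout):
    print(d)
```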

2) Show Tables

        show tables;

        show tables in testdb;

3)Inserting Into Tables

# Insert into an ordinary table

insert into student values(1,'zhang');

insert into student values(2,'zhao');

insert into student values(3,'wang');

insert into student values(4,'li');

select * from student;


hadoop fs -ls /mylab/soft/apache-hive-3.1.2-bin/working/metastore.warehouse/testdb.db/student

hadoop fs -cat /mylab/soft/apache-hive-3.1.2-bin/working/metastore.warehouse/testdb.db/student/*

hadoop fs -cat /mylab/soft/apache-hive-3.1.2-bin/working/metastore.warehouse/testdb.db/student/part-00000-4d48fb12-7412-46b0-9460-27ba674bfce9-c000

Each INSERT statement generates its own part file.
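This one-file-per-INSERT behavior can be simulated in plain Python (a sketch with made-up part-file names, not Spark code): each INSERT runs as its own write job and produces a fresh part file in the table directory.

```python
import os
import tempfile
import uuid

def insert_into(table_dir, row):
    """Sketch: each INSERT is a separate write job producing a new part file."""
    fname = f"part-00000-{uuid.uuid4()}-c000"
    with open(os.path.join(table_dir, fname), "w") as f:
        f.write(",".join(map(str, row)) + "\n")

table_dir = tempfile.mkdtemp()
for row in [(1, "zhang"), (2, "zhao"), (3, "wang"), (4, "li")]:
    insert_into(table_dir, row)

print(len(os.listdir(table_dir)))  # 4 inserts -> 4 part files
```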

# Insert into a table created with USING

insert into flights_csv values("beijing","shanghai",15);

The insert fails: rows cannot be inserted into this table created with USING.

# Insert into a partitioned table

# Insert via VALUES

INSERT INTO partitioned_flights

PARTITION (DEST_COUNTRY_NAME="United States")

values('China',1);

INSERT INTO partitioned_flights

PARTITION (DEST_COUNTRY_NAME="China")

values('japan',10);

INSERT INTO partitioned_flights

PARTITION (DEST_COUNTRY_NAME="China")

values('Singapore',15);

# Insert via SELECT

INSERT INTO partitioned_flights

PARTITION (DEST_COUNTRY_NAME="United States")

SELECT ORIGIN_COUNTRY_NAME,count FROM flights

WHERE DEST_COUNTRY_NAME='United States'

LIMIT 1;

select * from partitioned_flights;

The newly inserted records

Run in a Terminal:

hadoop fs -ls /mylab/soft/apache-hive-3.1.2-bin/working/metastore.warehouse/testdb.db/partitioned_flights

The newly added partition

hadoop fs -ls /mylab/soft/apache-hive-3.1.2-bin/working/metastore.warehouse/testdb.db/partitioned_flights/DEST_COUNTRY_NAME=China

The newly added records

4) Describing Table Metadata

DESCRIBE TABLE flights;

DESC TABLE flights;

DESC TABLE partitioned_flights;


Describing Table Metadata

Run in a Terminal:

hadoop fs -ls /mylab/soft/apache-hive-3.1.2-bin/working/metastore.warehouse/testdb.db/partitioned_flights

The two newly added directories

5) Refreshing Table Metadata

        Two ways:

            REFRESH TABLE partitioned_flights;

            MSCK REPAIR TABLE partitioned_flights;

Refreshing Table Metadata

6)Dropping Tables

        DROP TABLE flights_csv;

        DROP TABLE IF EXISTS flights_csv;

=====================Tables end=====================


3.Views

===================== Views begin =====================

1) Show Views

    show views;

    show views in testdb;

2) Creating Views

CREATE OR REPLACE VIEW just_usa_view AS

SELECT *

FROM flights

WHERE dest_country_name = 'United States';

CREATE OR REPLACE TEMP VIEW just_usa_view_temp AS

SELECT *

FROM flights

WHERE dest_country_name = 'United States';

CREATE OR REPLACE GLOBAL TEMP VIEW just_usa_global_view_temp AS

SELECT *

FROM flights

WHERE dest_country_name = 'United States';

SELECT * FROM just_usa_view;

SELECT * FROM just_usa_view_temp;

SELECT * FROM just_usa_global_view_temp; (this fails: a global temporary view is registered in the system database global_temp, so it must be queried as SELECT * FROM global_temp.just_usa_global_view_temp;)

3)Dropping Views

        drop view just_usa_view;

        drop view just_usa_view_temp;

        drop view just_usa_global_view_temp; (this fails for the same reason: the view lives in global_temp, so use drop view global_temp.just_usa_global_view_temp; instead)

===================== Views end =====================

4.Select Statements

===================== Select begin =====================

1) SELECT Syntax

SELECT [ALL|DISTINCT] named_expression[, named_expression, ...]

FROM relation[, relation, ...]

[lateral_view[, lateral_view, ...]]

[WHERE boolean_expression]

[aggregation [HAVING boolean_expression]]

[ORDER BY sort_expressions]

[CLUSTER BY expressions]

[DISTRIBUTE BY expressions]

[SORT BY sort_expressions]

[WINDOW named_window[, WINDOW named_window, ...]]

[LIMIT num_rows]

named_expression:

: expression [AS alias]

relation:

| join_relation

| (table_name|query|relation) [sample] [AS alias]

: VALUES (expressions)[, (expressions), ...]

[AS (column_name[, column_name, ...])]

expressions:

: expression[, expression, ...]

sort_expressions:

: expression [ASC|DESC][, expression [ASC|DESC], ...]

2)Case When Then Statements

SELECT

CASE

WHEN DEST_COUNTRY_NAME = 'UNITED STATES' THEN 1

WHEN DEST_COUNTRY_NAME = 'Egypt' THEN 0

ELSE -1 END

FROM

partitioned_flights;
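The CASE expression maps each row independently; in plain Python the same mapping is (note the comparison is case-sensitive, so rows storing 'United States' fall through to the ELSE branch):

```python
def case_when(dest_country_name):
    # mirrors: CASE WHEN ...='UNITED STATES' THEN 1
    #               WHEN ...='Egypt' THEN 0 ELSE -1 END
    if dest_country_name == 'UNITED STATES':
        return 1
    if dest_country_name == 'Egypt':
        return 0
    return -1

print([case_when(d) for d in ['UNITED STATES', 'Egypt', 'China']])  # [1, 0, -1]
```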

===================== Select end =====================


5.Advanced Topics

===================== Advanced Topics begin =====================

1) Complex Types

    1.1) Structs

CREATE VIEW IF NOT EXISTS

nested_data

AS

SELECT

(DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME) as country,

count

FROM flights;

SELECT * FROM nested_data limit 5;

Structs

SELECT country.DEST_COUNTRY_NAME, count

FROM nested_data;

SELECT country.*, count

FROM nested_data;

    1.2) Sets and Lists

SELECT

DEST_COUNTRY_NAME as new_name,

collect_list(count) as flight_counts,

collect_set(ORIGIN_COUNTRY_NAME) as origin_set

FROM

partitioned_flights

GROUP BY

DEST_COUNTRY_NAME;

Sets and Lists
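What collect_list and collect_set do per group can be sketched in plain Python (illustrative rows, not the real flight data): collect_list keeps duplicates in arrival order, while collect_set drops them.

```python
# Sketch (plain Python) of GROUP BY + collect_list / collect_set.
rows = [
    ("China", "japan", 10),
    ("China", "Singapore", 15),
    ("China", "Singapore", 15),
]

flight_counts, origin_set = {}, {}
for dest, origin, count in rows:
    flight_counts.setdefault(dest, []).append(count)   # collect_list keeps duplicates
    origin_set.setdefault(dest, set()).add(origin)     # collect_set removes them

print(flight_counts["China"], sorted(origin_set["China"]))
```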

SELECT

DEST_COUNTRY_NAME as new_name,

collect_list(count)[0]

FROM

partitioned_flights

GROUP BY

DEST_COUNTRY_NAME;

Array query syntax


CREATE OR REPLACE TEMP VIEW flights_agg

AS

SELECT

DEST_COUNTRY_NAME,

collect_list(count) as collected_counts

FROM

partitioned_flights

GROUP BY

DEST_COUNTRY_NAME;

select * from flights_agg;

SELECT explode(collected_counts), DEST_COUNTRY_NAME

FROM flights_agg;

convert an array back into rows using explode
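explode is the inverse of collect_list: it emits one output row per array element. A plain-Python sketch (illustrative data, not Spark code):

```python
# Sketch: explode(collected_counts) turns each array element into its own row.
flights_agg = [("China", [10, 15, 15]), ("United States", [1])]

exploded = [(count, dest) for dest, counts in flights_agg for count in counts]
print(exploded)
```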


2)Functions

    2.1) ALL FUNCTIONS

            SHOW FUNCTIONS;

            See the appendix for the full list of functions.

    2.2) SYSTEM FUNCTIONS

            SHOW SYSTEM FUNCTIONS;

? ? 2.3)User Defined Functions

? ??????????SHOW USER FUNCTIONS;

    2.4) Showing a subset

            SHOW FUNCTIONS "s*";

            SHOW FUNCTIONS LIKE "collect*";

    2.5) Creating a user-defined function

            Switch to spark-shell:

                    def power3(number:Double):Double = {

                            number * number * number

                    }

                    spark.udf.register("power3", power3(_:Double):Double)

Defined in spark-shell

                Switch back to spark-sql (this does not work: the UDF was registered in the spark-shell SparkSession, and a separate spark-sql CLI session cannot see it)

    SELECT count, power3(count)

    FROM flights;
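The UDF body itself is trivial; here is the same cube function as a plain Python sketch (not Spark's udf API), applied the way SELECT count, power3(count) would apply it per row:

```python
def power3(number: float) -> float:
    # same body as the Scala UDF: number * number * number
    return number * number * number

counts = [15, 1, 344]
print([(c, power3(c)) for c in counts])
```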

3)Spark Managed Tables

Creating External Tables

Dropping Unmanaged Tables

4)Subqueries

Uncorrelated Predicate Subqueries

Correlated Predicate Subqueries

Uncorrelated Scalar Queries

===================== Advanced Topics end =====================


exit;    (quit spark-sql)

VI. Appendix: SYSTEM FUNCTIONS


!

!=

%

&

*

+

-

/

<

<=

<=>

<>

=

==

>

>=

^

abs

acos

acosh

add_months

aggregate

and

any

approx_count_distinct

approx_percentile

array

array_contains

array_distinct

array_except

array_intersect

array_join

array_max

array_min

array_position

array_remove

array_repeat

array_sort

array_union

arrays_overlap

arrays_zip

ascii

asin

asinh

assert_true

atan

atan2

atanh

avg

base64

between

bigint

bin

binary

bit_and

bit_count

bit_length

bit_or

bit_xor

bool_and

bool_or

boolean

bround

cardinality

case

cast

cbrt

ceil

ceiling

char

char_length

character_length

chr

coalesce

collect_list

collect_set

concat

concat_ws

conv

corr

cos

cosh

cot

count

count_if

count_min_sketch

covar_pop

covar_samp

crc32

cube

cume_dist

current_database

current_date

current_timestamp

date

date_add

date_format

date_part

date_sub

date_trunc

datediff

day

dayofmonth

dayofweek

dayofyear

decimal

decode

degrees

dense_rank

div

double

e

element_at

elt

encode

every

exists

exp

explode

explode_outer

expm1

extract

factorial

filter

find_in_set

first

first_value

flatten

float

floor

forall

format_number

format_string

from_csv

from_json

from_unixtime

from_utc_timestamp

get_json_object

greatest

grouping

grouping_id

hash

hex

hour

hypot

if

ifnull

in

initcap

inline

inline_outer

input_file_block_length

input_file_block_start

input_file_name

instr

int

isnan

isnotnull

isnull

java_method

json_tuple

kurtosis

lag

last

last_day

last_value

lcase

lead

least

left

length

levenshtein

like

ln

locate

log

log10

log1p

log2

lower

lpad

ltrim

make_date

make_interval

make_timestamp

map

map_concat

map_entries

map_filter

map_from_arrays

map_from_entries

map_keys

map_values

map_zip_with

max

max_by

md5

mean

min

min_by

minute

mod

monotonically_increasing_id

month

months_between

named_struct

nanvl

negative

next_day

not

now

ntile

nullif

nvl

nvl2

octet_length

or

overlay

parse_url

percent_rank

percentile

percentile_approx

pi

pmod

posexplode

posexplode_outer

position

positive

pow

power

printf

quarter

radians

rand

randn

random

rank

reflect

regexp_extract

regexp_replace

repeat

replace

reverse

right

rint

rlike

rollup

round

row_number

rpad

rtrim

schema_of_csv

schema_of_json

second

sentences

sequence

sha

sha1

sha2

shiftleft

shiftright

shiftrightunsigned

shuffle

sign

signum

sin

sinh

size

skewness

slice

smallint

some

sort_array

soundex

space

spark_partition_id

split

sqrt

stack

std

stddev

stddev_pop

stddev_samp

str_to_map

string

struct

substr

substring

substring_index

sum

tan

tanh

timestamp

tinyint

to_csv

to_date

to_json

to_timestamp

to_unix_timestamp

to_utc_timestamp

transform

transform_keys

transform_values

translate

trim

trunc

typeof

ucase

unbase64

unhex

unix_timestamp

upper

uuid

var_pop

var_samp

variance

version

weekday

weekofyear

when

window

xpath

xpath_boolean

xpath_double

xpath_float

xpath_int

xpath_long

xpath_number

xpath_short

xpath_string

xxhash64

year

zip_with

|

~

Copyright belongs to the author. For reprints or content collaboration, please contact the author.