11.spark sql之RDD轉(zhuǎn)換DataSet

簡(jiǎn)介

??Spark SQL提供了兩種方式用于將RDD轉(zhuǎn)換為Dataset雀瓢。

  • 使用反射機(jī)制推斷RDD的數(shù)據(jù)結(jié)構(gòu)

??當(dāng)spark應(yīng)用可以推斷RDD數(shù)據(jù)結(jié)構(gòu)時(shí),可使用這種方式玉掸。這種基于反射的方法可以使代碼更簡(jiǎn)潔有效刃麸。

  • 通過編程接口構(gòu)造一個(gè)數(shù)據(jù)結(jié)構(gòu),然后映射到RDD上

??當(dāng)spark應(yīng)用無法推斷RDD數(shù)據(jù)結(jié)構(gòu)時(shí)排截,可使用這種方式嫌蚤。

反射方式

  • scala
// For implicit conversions from RDDs to DataFrames
import spark.implicits._

// Create an RDD of Person objects from a text file, convert it to a Dataframe
val peopleDF = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
  .toDF()
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by Spark
val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")

// The columns of a row in the result can be accessed by field index
teenagersDF.map(teenager => "Name: " + teenager(0)).show()
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+

// or by field name
teenagersDF.map(teenager => "Name: " + teenager.getAs[String]("name")).show()
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+

// No pre-defined encoders for Dataset[Map[K,V]], define explicitly
implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
// Primitive types and case classes can be also defined as
// implicit val stringIntMapEncoder: Encoder[Map[String, Any]] = ExpressionEncoder()

// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagersDF.map(teenager => teenager.getValuesMap[Any](List("name", "age"))).collect()
// Array(Map("name" -> "Justin", "age" -> 19))
  • java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;

// Create an RDD of Person objects from a text file
JavaRDD<Person> peopleRDD = spark.read()
  .textFile("examples/src/main/resources/people.txt")
  .javaRDD()
  .map(line -> {
    String[] parts = line.split(",");
    Person person = new Person();
    person.setName(parts[0]);
    person.setAge(Integer.parseInt(parts[1].trim()));
    return person;
  });

// Apply a schema to an RDD of JavaBeans to get a DataFrame
Dataset<Row> peopleDF = spark.createDataFrame(peopleRDD, Person.class);
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people");

// SQL statements can be run by using the sql methods provided by spark
Dataset<Row> teenagersDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19");

// The columns of a row in the result can be accessed by field index
Encoder<String> stringEncoder = Encoders.STRING();
Dataset<String> teenagerNamesByIndexDF = teenagersDF.map(
    (MapFunction<Row, String>) row -> "Name: " + row.getString(0),
    stringEncoder);
teenagerNamesByIndexDF.show();
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+

// or by field name
Dataset<String> teenagerNamesByFieldDF = teenagersDF.map(
    (MapFunction<Row, String>) row -> "Name: " + row.<String>getAs("name"),
    stringEncoder);
teenagerNamesByFieldDF.show();
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+
  • python
from pyspark.sql import Row

sc = spark.sparkContext

# Load a text file and convert each line to a Row.
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))

# Infer the schema, and register the DataFrame as a table.
schemaPeople = spark.createDataFrame(people)
schemaPeople.createOrReplaceTempView("people")

# SQL can be run over DataFrames that have been registered as a table.
teenagers = spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

# The results of SQL queries are Dataframe objects.
# rdd returns the content as an :class:`pyspark.RDD` of :class:`Row`.
teenNames = teenagers.rdd.map(lambda p: "Name: " + p.name).collect()
for name in teenNames:
    print(name)
# Name: Justin

編程方式

  • scala
import org.apache.spark.sql.types._

// Create an RDD
val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string
val schemaString = "name age"

// Generate the schema based on the string of schema
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

// Convert records of the RDD (people) to Rows
val rowRDD = peopleRDD
  .map(_.split(","))
  .map(attributes => Row(attributes(0), attributes(1).trim))

// Apply the schema to the RDD
val peopleDF = spark.createDataFrame(rowRDD, schema)

// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL can be run over a temporary view created using DataFrames
val results = spark.sql("SELECT name FROM people")

// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
results.map(attributes => "Name: " + attributes(0)).show()
// +-------------+
// |        value|
// +-------------+
// |Name: Michael|
// |   Name: Andy|
// | Name: Justin|
// +-------------+
  • java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Create an RDD
JavaRDD<String> peopleRDD = spark.sparkContext()
  .textFile("examples/src/main/resources/people.txt", 1)
  .toJavaRDD();

// The schema is encoded in a string
String schemaString = "name age";

// Generate the schema based on the string of schema
List<StructField> fields = new ArrayList<>();
for (String fieldName : schemaString.split(" ")) {
  StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
  fields.add(field);
}
StructType schema = DataTypes.createStructType(fields);

// Convert records of the RDD (people) to Rows
JavaRDD<Row> rowRDD = peopleRDD.map((Function<String, Row>) record -> {
  String[] attributes = record.split(",");
  return RowFactory.create(attributes[0], attributes[1].trim());
});

// Apply the schema to the RDD
Dataset<Row> peopleDataFrame = spark.createDataFrame(rowRDD, schema);

// Creates a temporary view using the DataFrame
peopleDataFrame.createOrReplaceTempView("people");

// SQL can be run over a temporary view created using DataFrames
Dataset<Row> results = spark.sql("SELECT name FROM people");

// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
Dataset<String> namesDS = results.map(
    (MapFunction<Row, String>) row -> "Name: " + row.getString(0),
    Encoders.STRING());
namesDS.show();
// +-------------+
// |        value|
// +-------------+
// |Name: Michael|
// |   Name: Andy|
// | Name: Justin|
// +-------------+
  • python
# Import data types
from pyspark.sql.types import *

sc = spark.sparkContext

# Load a text file and convert each line to a Row.
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
# Each line is converted to a tuple.
people = parts.map(lambda p: (p[0], p[1].strip()))

# The schema is encoded in a string.
schemaString = "name age"

fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)

# Apply the schema to the RDD.
schemaPeople = spark.createDataFrame(people, schema)

# Creates a temporary view using the DataFrame
schemaPeople.createOrReplaceTempView("people")

# SQL can be run over DataFrames that have been registered as a table.
results = spark.sql("SELECT name FROM people")

results.show()
# +-------+
# |   name|
# +-------+
# |Michael|
# |   Andy|
# | Justin|
# +-------+
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
  • 序言:七十年代末,一起剝皮案震驚了整個(gè)濱河市断傲,隨后出現(xiàn)的幾起案子脱吱,更是在濱河造成了極大的恐慌,老刑警劉巖认罩,帶你破解...
    沈念sama閱讀 222,000評(píng)論 6 515
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件箱蝠,死亡現(xiàn)場(chǎng)離奇詭異,居然都是意外死亡,警方通過查閱死者的電腦和手機(jī)宦搬,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 94,745評(píng)論 3 399
  • 文/潘曉璐 我一進(jìn)店門牙瓢,熙熙樓的掌柜王于貴愁眉苦臉地迎上來,“玉大人间校,你說我怎么就攤上這事矾克。” “怎么了憔足?”我有些...
    開封第一講書人閱讀 168,561評(píng)論 0 360
  • 文/不壞的土叔 我叫張陵胁附,是天一觀的道長(zhǎng)。 經(jīng)常有香客問我滓彰,道長(zhǎng)控妻,這世上最難降的妖魔是什么? 我笑而不...
    開封第一講書人閱讀 59,782評(píng)論 1 298
  • 正文 為了忘掉前任揭绑,我火速辦了婚禮弓候,結(jié)果婚禮上,老公的妹妹穿的比我還像新娘他匪。我一直安慰自己菇存,他們只是感情好,可當(dāng)我...
    茶點(diǎn)故事閱讀 68,798評(píng)論 6 397
  • 文/花漫 我一把揭開白布诚纸。 她就那樣靜靜地躺著撰筷,像睡著了一般陈惰。 火紅的嫁衣襯著肌膚如雪畦徘。 梳的紋絲不亂的頭發(fā)上,一...
    開封第一講書人閱讀 52,394評(píng)論 1 310
  • 那天抬闯,我揣著相機(jī)與錄音井辆,去河邊找鬼。 笑死溶握,一個(gè)胖子當(dāng)著我的面吹牛杯缺,可吹牛的內(nèi)容都是我干的。 我是一名探鬼主播睡榆,決...
    沈念sama閱讀 40,952評(píng)論 3 421
  • 文/蒼蘭香墨 我猛地睜開眼萍肆,長(zhǎng)吁一口氣:“原來是場(chǎng)噩夢(mèng)啊……” “哼!你這毒婦竟也來了胀屿?” 一聲冷哼從身側(cè)響起塘揣,我...
    開封第一講書人閱讀 39,852評(píng)論 0 276
  • 序言:老撾萬榮一對(duì)情侶失蹤,失蹤者是張志新(化名)和其女友劉穎宿崭,沒想到半個(gè)月后亲铡,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體,經(jīng)...
    沈念sama閱讀 46,409評(píng)論 1 318
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡,尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 38,483評(píng)論 3 341
  • 正文 我和宋清朗相戀三年奖蔓,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了赞草。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片。...
    茶點(diǎn)故事閱讀 40,615評(píng)論 1 352
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡吆鹤,死狀恐怖厨疙,靈堂內(nèi)的尸體忽然破棺而出,到底是詐尸還是另有隱情疑务,我是刑警寧澤轰异,帶...
    沈念sama閱讀 36,303評(píng)論 5 350
  • 正文 年R本政府宣布,位于F島的核電站暑始,受9級(jí)特大地震影響搭独,放射性物質(zhì)發(fā)生泄漏。R本人自食惡果不足惜廊镜,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 41,979評(píng)論 3 334
  • 文/蒙蒙 一牙肝、第九天 我趴在偏房一處隱蔽的房頂上張望。 院中可真熱鬧嗤朴,春花似錦配椭、人聲如沸。這莊子的主人今日做“春日...
    開封第一講書人閱讀 32,470評(píng)論 0 24
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽。三九已至吱雏,卻和暖如春敦姻,著一層夾襖步出監(jiān)牢的瞬間,已是汗流浹背歧杏。 一陣腳步聲響...
    開封第一講書人閱讀 33,571評(píng)論 1 272
  • 我被黑心中介騙來泰國(guó)打工镰惦, 沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留,地道東北人犬绒。 一個(gè)月前我還...
    沈念sama閱讀 49,041評(píng)論 3 377
  • 正文 我出身青樓旺入,卻偏偏與公主長(zhǎng)得像,于是被迫代替她去往敵國(guó)和親凯力。 傳聞我的和親對(duì)象是個(gè)殘疾皇子茵瘾,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 45,630評(píng)論 2 359

推薦閱讀更多精彩內(nèi)容