## Pig data types
### 1: Basic data types
### 2: Complex types (map, tuple, bag)
2.1 map: a key-value mapping from chararray keys to data elements.

2.2 tuple: a fixed-length, ordered collection of Pig data elements. (A tuple corresponds to a row in SQL, and the fields of a tuple correspond to SQL columns.)
Tuple constants use parentheses to mark the tuple structure and commas to separate the fields, e.g. ('bob',55)

2.3 bag: an unordered collection of tuples. Because it is unordered, the tuples in a bag cannot be accessed by position.
Bag constants are delimited by curly braces, and the tuples inside the bag are separated by commas, e.g. {('bob',55),('sally',52),('john',25)}.
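As a rough sketch of how the three complex types appear in a schema and how their contents are accessed, consider the snippet below. The file `people.txt` and all field names (`info`, `friends`, `props`, `city`) are hypothetical and only serve the illustration.

```pig
-- Hypothetical input file and field names, for illustration only.
people = LOAD 'people.txt'
    AS (info:tuple(name:chararray, age:int),
        friends:bag{t:tuple(name:chararray, age:int)},
        props:map[]);

-- Tuple fields are reached with '.', map values with '#'.
names = FOREACH people GENERATE info.name AS person, props#'city' AS city;

-- A bag has no positional access; FLATTEN turns its tuples into rows.
friend_rows = FOREACH people GENERATE info.name AS person, FLATTEN(friends);
```

The last statement reflects the point made above: since a bag is unordered, its tuples are processed as a whole (here by flattening) rather than picked out by index.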
### 3: Pig compared with database tables
3.1 relation --> table: a relation is a bag

3.2 bag: a bag is a collection of tuples, written with {}

3.3 tuple --> row: tuples are not required to all have the same number of fields, nor are fields at the same position in different tuples required to have the same data type (rather loose, isn't it?)
## 1. Data preparation (upload the data to HDFS)
### 1: student.txt
```
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
```
### 2: student_details.txt
```
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
```
### 3: customers.txt
```
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00
```
Load it as the relation customers1:
```
customers1 = LOAD 'hdfs://mycluster/test/test_data/customers.txt' USING PigStorage(',')
    AS (id:int, name:chararray, age:int, address:chararray, salary:int);
```
### 4: orders.txt
```
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
```
Load it as the relation orders:
```
orders = LOAD 'hdfs://mycluster/test/test_data/orders.txt' USING PigStorage(',')
    AS (oid:int, date:chararray, customer_id:int, amount:int);
```
## 2. Loading data
1: Load the data
```
-- The schema must list all six comma-separated fields of the file, including age.
student = LOAD '/test/test_data/student.txt' USING PigStorage(',')
    AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
-- Specify the cluster explicitly (hdfs://mycluster)
student_details = LOAD 'hdfs://mycluster/test/test_data/student_details.txt' USING PigStorage(',')
    AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
```
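## 3. Diagnostic operators
The LOAD statement simply loads the data into the named relation in Apache Pig. To verify that the LOAD statement worked, you have to use a diagnostic operator. Pig Latin provides four kinds of diagnostic operators:
1 DUMP operator (runs the statements and prints the result)
2 DESCRIBE operator (shows the schema of the data)
3 EXPLAIN operator (shows the execution plans)
4 ILLUSTRATE operator (presents sample data in table form)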
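A minimal sketch of the four operators applied to one relation, reusing the `student_details` file and schema from the loading section (the path is the one used there and is assumed to exist on your cluster):

```pig
student_details = LOAD 'hdfs://mycluster/test/test_data/student_details.txt' USING PigStorage(',')
    AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

DUMP student_details;        -- runs the job and prints every tuple
DESCRIBE student_details;    -- prints the schema of the relation
EXPLAIN student_details;     -- prints the execution plans
ILLUSTRATE student_details;  -- runs on a small sample and shows each step
```

DESCRIBE only inspects the schema, so it is the cheapest way to check that the AS clause was declared as intended; DUMP actually executes the pipeline.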
## 4. Pig GROUP operator
### 1: Group the records/tuples in the relation by age
```
-- Group by a single column
group_data = GROUP student BY age;
-- Group by multiple columns
group_data = GROUP student_details BY (age, city);
```
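To see what a grouped relation looks like, the sketch below groups `student_details` by age and counts the tuples in each group. The alias `age_counts` and the field name `n` are introduced here for illustration only.

```pig
student_details = LOAD 'hdfs://mycluster/test/test_data/student_details.txt' USING PigStorage(',')
    AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

group_data = GROUP student_details BY age;

-- Each grouped tuple has two fields: 'group' (the key) and a bag
-- named after the input relation that holds the matching tuples.
DESCRIBE group_data;

-- Count the tuples collected in each group's bag.
age_counts = FOREACH group_data GENERATE group AS age, COUNT(student_details) AS n;
DUMP age_counts;
```

In the sample file every age (21, 22, 23, 24) occurs twice, so each group should report a count of 2.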
## 5. Pig JOIN
1: Self-join
2: Inner-join
3: Outer-join: left join, right join, and full join

Note: the difference between an outer join and an inner join is that an outer join returns all rows from at least one of the relations.
### 5.1 Self join
A self-join joins a table with itself as if the table were two relations, temporarily renaming at least one of them. In Apache Pig, a self-join is usually performed by loading the same data several times under different aliases (names). So load the contents of customers.txt as two relations, as shown below.
```
customers1 = LOAD 'hdfs://mycluster/test/test_data/customers.txt' USING PigStorage(',')
    AS (id:int, name:chararray, age:int, address:chararray, salary:int);
customers2 = LOAD 'hdfs://mycluster/test/test_data/customers.txt' USING PigStorage(',')
    AS (id:int, name:chararray, age:int, address:chararray, salary:int);
customers3 = JOIN customers1 BY id, customers2 BY id;
```
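### 5.2 Inner join
Inner join is the most frequently used join; it is also called an equijoin. An inner join returns rows only when there is a match in both tables. It creates a new relation by combining the column values of two relations (say A and B) based on a join predicate. The query compares each row of A with each row of B to find all pairs of rows that satisfy the join predicate. When the predicate is satisfied, the column values of each matching pair of rows from A and B are combined into one result row.
```
customer_orders = JOIN customers1 BY id, orders BY customer_id;
```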
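As a worked sketch with the sample files from the data preparation section: after a join, every field keeps its origin as a prefix (`customers1::`, `orders::`), which can be used to project just the columns of interest. The alias `order_summary` is an illustrative name, not part of the original notes.

```pig
customers1 = LOAD 'hdfs://mycluster/test/test_data/customers.txt' USING PigStorage(',')
    AS (id:int, name:chararray, age:int, address:chararray, salary:int);
orders = LOAD 'hdfs://mycluster/test/test_data/orders.txt' USING PigStorage(',')
    AS (oid:int, date:chararray, customer_id:int, amount:int);

customer_orders = JOIN customers1 BY id, orders BY customer_id;

-- Keep only the customer name and the order id and amount.
order_summary = FOREACH customer_orders
    GENERATE customers1::name AS name, orders::oid AS oid, orders::amount AS amount;
DUMP order_summary;
```

With the sample data, only customers 2, 3 and 4 have orders, so the result should contain four tuples: one for Khilan, two for kaushik, and one for Chaitali.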
### 5.3 Outer join
Outer join: unlike an inner join, an outer join returns all rows from at least one of the relations. An outer join can be performed in three ways:
```
-- Left join
outer_left = JOIN customers1 BY id LEFT OUTER, orders BY customer_id;
-- Right join
outer_right = JOIN customers1 BY id RIGHT OUTER, orders BY customer_id;
-- Full join
outer_full = JOIN customers1 BY id FULL OUTER, orders BY customer_id;
```
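One common use of a left outer join is finding rows on the left side that have no match on the right. The sketch below, with the illustrative aliases `no_orders` and `names`, lists customers without any order by checking the null-padded order fields.

```pig
customers1 = LOAD 'hdfs://mycluster/test/test_data/customers.txt' USING PigStorage(',')
    AS (id:int, name:chararray, age:int, address:chararray, salary:int);
orders = LOAD 'hdfs://mycluster/test/test_data/orders.txt' USING PigStorage(',')
    AS (oid:int, date:chararray, customer_id:int, amount:int);

outer_left = JOIN customers1 BY id LEFT OUTER, orders BY customer_id;

-- Customers without a matching order have nulls in all order fields.
no_orders = FILTER outer_left BY orders::oid IS NULL;
names = FOREACH no_orders GENERATE customers1::name;
DUMP names;
```

For the sample files this should return Ramesh, Hardik, Komal and Muffy, since only customers 2, 3 and 4 appear in orders.txt.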
### 5.4 Joining on multiple keys
```
customer_orders = JOIN customers1 BY (id, salary), orders BY (customer_id, amount);
```
## Pig deduplication
Note: Pig's DISTINCT removes duplicates by whole tuple (the entire row). To deduplicate by a single field, you need to use GROUP BY.
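The difference can be sketched as follows, using the customers relation from earlier sections. The aliases `unique_rows`, `by_age`, `one_per_age` and `top1` are illustrative, and which tuple survives per age is arbitrary because no ordering is applied.

```pig
customers1 = LOAD 'hdfs://mycluster/test/test_data/customers.txt' USING PigStorage(',')
    AS (id:int, name:chararray, age:int, address:chararray, salary:int);

-- DISTINCT compares whole tuples: a row is dropped only if every field matches.
unique_rows = DISTINCT customers1;

-- Deduplicate by one field: group on it, then keep a single tuple per group.
by_age = GROUP customers1 BY age;
one_per_age = FOREACH by_age {
    top1 = LIMIT customers1 1;
    GENERATE FLATTEN(top1);
};
DUMP one_per_age;
```

In the sample data two customers share the age 25, so `unique_rows` keeps all seven rows while `one_per_age` should keep six.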