由于數(shù)據(jù)量本身通常過于浩大竖螃,我們需要一部分數(shù)據(jù)來加速數(shù)據(jù)分析過程笔刹。我們就需要運用抽樣技術來發(fā)現(xiàn)整個數(shù)據(jù)集的模式和趨勢句伶。HQL中有三種抽樣技術:隨機取樣(random sampling)篮撑、表取樣(bucket table sampling)和塊取樣(block sampling)。
- 隨機取樣(random sampling)
隨機取樣使用rand()函數(shù)和LIMIT關鍵字來獲取樣例數(shù)據(jù)流济,請看以下示例
> SELECT name FROM employee_hr
> DISTRIBUTE BY rand() SORT BY rand() LIMIT 2;
+--------+
| name |
+--------+
| Will |
| Steven |
+--------+
2 rows selected (52.399 seconds)
以上語句中DISTRIBUTE和SORT關鍵字用于確保數(shù)據(jù)有效地在mapper和reducer中隨機分布锐锣,也可以使用ORDER BY rand()來代替,但性能很差绳瘟。
- 表抽樣(bucket table sampling)
這種抽樣方式專門針對bucket table進行了優(yōu)化雕憔。如下例所示,
-- Sampling based on the whole row
> SELECT name FROM employee_trans
> TABLESAMPLE(BUCKET 1 OUT OF 2 ON rand()) a;
+--------+
| name |
+--------+
| Steven |
+--------+
1 row selected (0.129 seconds)
-- Sampling based on the bucket column, which is efficient
> SELECT name FROM employee_trans
> TABLESAMPLE(BUCKET 1 OUT OF 2 ON emp_id) a;
+---------+
| name |
+---------+
| Lucy |
| Steven |
| Michael |
+---------+
3 rows selected (0.136 seconds)
如果抽樣列同時是CLUSTERED BY列糖声,抽樣效率更高斤彼。
- 塊抽樣(Block sampling)
塊抽樣隨機提取n行數(shù)據(jù)記錄,百分之n大小的數(shù)據(jù)記錄或者n個字節(jié)的數(shù)據(jù)記錄姨丈。這個抽樣的粒度為HDFS塊級別畅卓。請看以下示例,
-- Sample by number of rows
> SELECT name
> FROM employee TABLESAMPLE(1 ROWS) a;
+----------+
| name |
+----------+
| Michael |
+----------+
1 rows selected (0.075 seconds)
-- Sample by percentage of data size
> SELECT name
> FROM employee TABLESAMPLE(50 PERCENT) a;
+----------+
| name |
+----------+
| Michael |
| Will |
+----------+
2 rows selected (0.041 seconds)
-- Sample by data size
-- Support b/B, k/K, m/M, g/G
> SELECT name FROM employee TABLESAMPLE(1B) a;
+----------+
| name |
+----------+
| Michael |
+----------+
1 rows selected (0.075 seconds)