Apache Sqoop - Overview

Using Hadoop for analytics and data processing requires loading data into clusters and processing it in conjunction with other data that often resides in production databases across the enterprise. Loading bulk data into Hadoop from production systems, or accessing it from map-reduce applications running on large clusters, can be a challenging task. Users must consider details such as ensuring data consistency, the consumption of production system resources, and preparing data for provisioning downstream pipelines. Transferring data with scripts is inefficient and time-consuming, while accessing data on external systems directly from map-reduce applications both complicates those applications and exposes the production system to the risk of excessive load originating from cluster nodes.

This is where Apache Sqoop fits in. Apache Sqoop is currently undergoing incubation at the Apache Software Foundation. More information on the project can be found at http://incubator.apache.org/sqoop.

Sqoop allows easy import and export of data from structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems. Using Sqoop, you can load data from external systems onto HDFS and populate tables in Hive and HBase. Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks. Sqoop uses a connector-based architecture that supports plugins providing connectivity to new external systems.

Running Sqoop looks very simple, but what happens underneath the covers? The dataset being transferred is sliced into partitions, and a map-only job is launched in which individual mappers are responsible for transferring a slice of the dataset. Each record is handled in a type-safe manner, because Sqoop uses the database metadata to infer the data types.

In the rest of this post we will walk through an example that shows the various ways you can use Sqoop. The goal of this post is to give an overview of Sqoop operation rather than to dive into the details of advanced functionality.

Importing Data

The following command imports all data from a table called ORDERS in a MySQL database into the cluster:

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password ****

The options specified in this command are as follows:

import: This sub-command instructs Sqoop to initiate an import.

--connect, --username, --password: These are the connection parameters used to connect to the database. They are no different from the parameters you would use when connecting to the database via JDBC.

--table: This parameter specifies the table to import.

The import is done in the two steps depicted in Figure 1 below. In the first step, Sqoop introspects the database to gather the necessary metadata for the data being imported. In the second step, Sqoop submits a map-only job to the Hadoop cluster; it is this job that performs the actual data transfer, using the metadata captured in the previous step.

Figure 1: Sqoop Import Overview

The imported data is saved in an HDFS directory named after the table being imported. As with most aspects of Sqoop operation, the user can specify an alternative directory where the files should be written.

By default these files contain comma-delimited fields, with new lines separating records. You can easily override this format by explicitly specifying the field separator and record terminator characters.
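
For example, a minimal sketch of overriding the defaults to produce tab-delimited output, using the --fields-terminated-by, --lines-terminated-by, and --target-dir options (the directory /user/arvind/ORDERS_TSV is illustrative, not from the original post):

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --fields-terminated-by '\t' --lines-terminated-by '\n' \
  --target-dir /user/arvind/ORDERS_TSV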

Sqoop also supports importing data in other data formats. For example, you can import data in Avro format simply by specifying the --as-avrodatafile option with the import command.
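
Building on the same ORDERS import, that would look roughly like:

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** --as-avrodatafile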

Sqoop provides many other options that can be used to further tune the import operation to suit your specific requirements.
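
As one illustration, the degree of parallelism can be tuned with --split-by and --num-mappers; this is only a sketch, and the split column ORDER_ID is hypothetical:

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --split-by ORDER_ID --num-mappers 8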

Importing Data into Hive

In most cases, importing data into Hive amounts to running the import and then using Hive to create and load a table or partition. Doing this manually requires you to know the correct type mappings as well as other details such as the serialization format and delimiters. Sqoop takes care of populating the Hive metastore with the appropriate metadata for the table and also invokes the necessary commands to load the table or partition, as the case may be. All of this is done simply by adding the --hive-import option to the import command.

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** --hive-import

When you run a Hive import, Sqoop converts the data from the native data types of the external datastore into the corresponding Hive types, and it automatically chooses the native delimiter set used by Hive. If the data being imported contains new lines or other Hive delimiter characters, Sqoop allows you to remove those characters so that the data is correctly populated for consumption in Hive.
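
A minimal sketch of stripping such characters during a Hive import, using the --hive-drop-import-delims option:

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --hive-import --hive-drop-import-delims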

Once the import is complete, you can see and operate on the table just like any other table in Hive.

Importing Data into HBase

You can use Sqoop to populate data in a particular column family of an HBase table. Much like a Hive import, this is done by specifying additional options that identify the HBase table and column family being populated. All data imported into HBase is converted to its string representation and inserted as UTF-8 bytes.

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --hbase-create-table --hbase-table ORDERS --column-family mysql

The options specified in this command are as follows:

--hbase-create-table: This option instructs Sqoop to create the HBase table.

--hbase-table: This option specifies the name of the HBase table to use.

--column-family: This option specifies the name of the column family to use.

The rest of the options are the same as for a regular import.

Exporting Data

In some cases, data processed by Hadoop pipelines may be needed in production systems to help run additional critical business functions. Sqoop can be used to export such data into external datastores as necessary. Continuing our example from above: if the data generated by the pipeline on Hadoop corresponds to the ORDERS table in a database somewhere, you could populate it using the following command:

$ sqoop export --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --export-dir /user/arvind/ORDERS

The options specified in this command are as follows:

export: This sub-command instructs Sqoop to initiate an export.

--connect, --username, --password: These are the connection parameters used to connect to the database. They are no different from the parameters you would use when connecting to the database via JDBC.

--table: This parameter specifies the table to be populated.

--export-dir: This is the HDFS directory from which data will be exported.

The export is done in the two steps depicted in Figure 2. The first step introspects the database for metadata, and the second step transfers the data. Sqoop divides the input dataset into splits and then uses individual map tasks to push the splits to the database. Each map task performs this transfer over many transactions in order to ensure optimal throughput and minimal resource utilization.


Figure 2: Sqoop Export Overview
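
If the default parallelism puts too much load on the database, the number of map tasks can be lowered with --num-mappers; a sketch (the value 4 is illustrative):

$ sqoop export --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --export-dir /user/arvind/ORDERS --num-mappers 4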

Some connectors support staging tables, which help isolate the production table from possible corruption when a job fails for any reason. The staging table is first populated by the map tasks and then merged into the target table once all of the data has been delivered.
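
A sketch of an export through a staging table, using --staging-table and --clear-staging-table (the table name ORDERS_STAGING is hypothetical and must already exist with the same schema as ORDERS):

$ sqoop export --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --export-dir /user/arvind/ORDERS \
  --staging-table ORDERS_STAGING --clear-staging-table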

Sqoop Connectors

Using specialized connectors, Sqoop can connect to external systems that have optimized import and export facilities, or that do not support native JDBC. Connectors are plugin components built on Sqoop's extension framework and can be added to any existing Sqoop installation. Once a connector is installed, Sqoop can use it to efficiently transfer data between Hadoop and the external store supported by that connector.

By default, Sqoop includes connectors for various popular databases such as MySQL, PostgreSQL, Oracle, SQL Server, and DB2. It also includes fast-path connectors for MySQL and PostgreSQL; these are specialized connectors that use database-specific batch tools to transfer data with high throughput. Sqoop also includes a generic JDBC connector that can be used to connect to any database accessible via JDBC.
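
For instance, the MySQL fast-path connector is selected with the --direct flag; this sketch assumes the MySQL client tools (such as mysqldump) are available on the cluster nodes:

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** --direct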

Apart from the built-in connectors, many companies have developed their own connectors that can be plugged into Sqoop, ranging from specialized connectors for enterprise data warehouses to connectors for NoSQL datastores.

Wrapping Up

In this post you saw how easy it is to transfer large datasets between Hadoop and external datastores such as relational databases. Beyond this, Sqoop offers many advanced features such as different data formats, compression, and working with queries instead of tables. We encourage you to try out Sqoop and give us your feedback.

More information regarding Sqoop can be found at:

Project Website: http://incubator.apache.org/sqoop

Wiki: https://cwiki.apache.org/confluence/display/SQOOP

Project Status: http://incubator.apache.org/projects/sqoop.html

Mailing Lists: https://cwiki.apache.org/confluence/display/SQOOP/Mailing+Lists


