Transferring data between HDFS and S3 with DistCp

Part of the article series: Big Data Security in Practice http://www.reibang.com/p/76627fd8399c

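  • The bandwidth between HDFS and S3 is the upper bound on copy speed.
    • -m <num_maps>: the number of mappers
    • -bandwidth: the bandwidth per mapper
    • Increasing the number of mappers and the bandwidth per mapper may speed up the transfer, but (number of mappers × bandwidth) cannot exceed the bandwidth ceiling between HDFS and S3.
  • The farther the Hadoop cluster is from S3, the lower the bandwidth and the slower the copy. Keeping S3 and the hosts in the same region usually improves speed.
  • Even when Hadoop itself runs on cloud infrastructure, the copy can still be slow because S3 rate limiting has been triggered. When a single directory tree is under too heavy a load, S3 may delay or reject requests.
    • A large copy operation with many mappers may slow down uploads to S3.
    • When increasing the number of mappers actually slows the copy down, the likely cause is that this limit has been triggered.
  • The -p[rbugpcaxt] option is meaningless for S3, yet using it makes every transfer a full copy.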
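The two tuning knobs above can be sketched as a small shell helper. This is a minimal sketch rather than a recommended setting: the function name `distcp_cmd` and the link capacity of 1000 MB/s are hypothetical, while 20 mappers and 100 MB/s per mapper are the `maxMaps` and `mapBandwidth` values shown in the Input Options line of the log in this post. The helper only prints the command for review, and it warns when mappers × bandwidth exceeds the assumed link capacity, because that extra headroom cannot be used.

```shell
#!/bin/sh
# Hypothetical helper: print a distcp command line and flag settings whose
# aggregate cap (mappers x per-mapper bandwidth) exceeds the link capacity.
# Usage: distcp_cmd <maps> <mb_per_sec_per_map> <link_mb_per_sec> <src> <dst>
distcp_cmd() {
    maps=$1
    per_map_mb=$2
    link_mb=$3
    src=$4
    dst=$5

    # Upper bound the job could request if every mapper ran at full speed.
    aggregate=$((maps * per_map_mb))
    if [ "$aggregate" -gt "$link_mb" ]; then
        echo "note: $aggregate MB/s requested, link allows about $link_mb MB/s" >&2
    fi

    # -m sets the maximum number of mappers, -bandwidth the MB/s per mapper.
    echo "hadoop distcp -m $maps -bandwidth $per_map_mb -update $src $dst"
}

# 1000 MB/s is an assumed link capacity, used only for illustration.
distcp_cmd 20 100 1000 /user/hive/warehouse/ads.db s3a://bucket/user/hive/warehouse/ads.db
```

If the copy gets slower as the mapper count goes up, lowering the first argument is the first thing to try, since that pattern points at S3 rate limiting rather than at the link.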
hive@cdh-slave01:/tmp$ hadoop distcp -Dfs.s3a.access.key=fdsfdsfdsfsdIMNH6ZBsdf -Dfs.s3a.secret.key=fasdfds -update /user/hive/warehouse/ads.db s3a://bucket/user/hive/warehouse/ads.db
19/04/10 06:59:01 INFO tools.OptionsParser: parseChunkSize: blocksperchunk false
19/04/10 06:59:02 WARN impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
19/04/10 06:59:02 INFO impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
19/04/10 06:59:02 INFO impl.MetricsSystemImpl: s3a-file-system metrics system started
19/04/10 06:59:03 INFO Configuration.deprecation: fs.s3a.server-side-encryption-key is deprecated. Instead, use fs.s3a.server-side-encryption.key
19/04/10 06:59:03 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=true, deleteMissing=false, ignoreFailures=false, overwrite=false, append=false, useDiff=false, useRdiff=false, fromSnapshot=null, toSnapshot=null, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=100, sslConfigurationFile='null', copyStrategy='uniformsize', preserveStatus=[], preserveRawXattrs=false, atomicWorkPath=null,logPath=null, sourceFileListing=null, sourcePaths=[/user/hive/warehouse/ads.db], targetPath=s3a://bucket/user/hive/warehouse/ads.db, targetPathExists=true, filtersFile='null', blocksPerChunk=0, copyBufferSize=8192}
19/04/10 06:59:04 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 2843; dirCnt = 19
19/04/10 06:59:04 INFO tools.SimpleCopyListing: Build file listing completed.
19/04/10 06:59:04 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
19/04/10 06:59:04 INFO Configuration.deprecation: io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor
19/04/10 06:59:04 INFO tools.DistCp: Number of paths in the copy list: 2843
19/04/10 06:59:04 INFO tools.DistCp: Number of paths in the copy list: 2843
19/04/10 06:59:04 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm721
19/04/10 06:59:04 INFO mapreduce.JobSubmitter: number of splits:21
19/04/10 06:59:04 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1554779117367_2179
19/04/10 06:59:05 INFO impl.YarnClientImpl: Submitted application application_1554779117367_2179
19/04/10 06:59:05 INFO mapreduce.Job: The url to track the job: http://cdh-master2:8088/proxy/application_1554779117367_2179/
19/04/10 06:59:05 INFO tools.DistCp: DistCp job-id: job_1554779117367_2179
19/04/10 06:59:05 INFO mapreduce.Job: Running job: job_1554779117367_2179
19/04/10 06:59:10 INFO mapreduce.Job: Job job_1554779117367_2179 running in uber mode : false
19/04/10 06:59:10 INFO mapreduce.Job:  map 0% reduce 0%
19/04/10 06:59:18 INFO mapreduce.Job:  map 10% reduce 0%
19/04/10 06:59:19 INFO mapreduce.Job:  map 19% reduce 0%
19/04/10 06:59:20 INFO mapreduce.Job:  map 52% reduce 0%
19/04/10 06:59:21 INFO mapreduce.Job:  map 67% reduce 0%
19/04/10 06:59:22 INFO mapreduce.Job:  map 86% reduce 0%
19/04/10 06:59:23 INFO mapreduce.Job:  map 95% reduce 0%
19/04/10 06:59:24 INFO mapreduce.Job:  map 100% reduce 0%
19/04/10 06:59:25 INFO mapreduce.Job: Job job_1554779117367_2179 completed successfully
19/04/10 06:59:25 INFO mapreduce.Job: Counters: 40
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=3304256
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=985334
        HDFS: Number of bytes written=413302
        HDFS: Number of read operations=5797
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=42
        S3A: Number of bytes read=0
        S3A: Number of bytes written=32647
        S3A: Number of read operations=3014
        S3A: Number of large read operations=0
        S3A: Number of write operations=37
    Job Counters
        Launched map tasks=21
        Other local map tasks=21
        Total time spent by all maps in occupied slots (ms)=183711
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=183711
        Total vcore-milliseconds taken by all map tasks=183711
        Total megabyte-milliseconds taken by all map tasks=564360192
    Map-Reduce Framework
        Map input records=2843
        Map output records=2822
        Input split bytes=2394
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=9536
        CPU time spent (ms)=136920
        Physical memory (bytes) snapshot=9398771712
        Virtual memory (bytes) snapshot=56006500352
        Total committed heap usage (bytes)=15602810880
    File Input Format Counters
        Bytes Read=950293
    File Output Format Counters
        Bytes Written=413302
    DistCp Counters
        Bytes Copied=32647
        Bytes Expected=32647
        Bytes Skipped=1800151
        Files Copied=21
        Files Skipped=2822
19/04/10 06:59:25 INFO impl.MetricsSystemImpl: Stopping s3a-file-system metrics system...
19/04/10 06:59:25 INFO impl.MetricsSystemImpl: s3a-file-system metrics system stopped.
19/04/10 06:59:25 INFO impl.MetricsSystemImpl: s3a-file-system metrics system shutdown complete.

# Check the directory count, file count, and total bytes of the directory from the command line
hive@cdh-slave01:/tmp$ hdfs dfs -count /user/hive/warehouse/ads.db
          20         2824            1832798 /user/hive/warehouse/ads.db

Only when the DistCp counters include the Skipped lines was the copy incremental; otherwise it was a full copy.
How to verify:
- In the run output:
Number of paths in the copy list: 2843 covers files and directories together = Files Copied + Files Skipped = file count + directory count
Total bytes of the directory = Bytes Copied + Bytes Skipped
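The two identities can be checked with plain shell arithmetic. The sketch below uses the counter values from the run above and the byte total from `hdfs dfs -count`; the helper name `check_sum` is hypothetical. The directory count is left out of the check: `hdfs dfs -count` reports 20 directories while the DistCp listing reports dirCnt = 19, presumably because `-count` includes the root directory itself, which is an assumption here.

```shell
#!/bin/sh
# Values copied from the DistCp counters and the hdfs dfs -count output above.
paths_in_copy_list=2843
files_copied=21
files_skipped=2822
bytes_copied=32647
bytes_skipped=1800151
count_bytes=1832798   # third column of hdfs dfs -count

# check_sum is a hypothetical helper: succeed when a + b equals the expected total.
check_sum() {
    label=$1
    a=$2
    b=$3
    expected=$4
    if [ $((a + b)) -eq "$expected" ]; then
        echo "$label: OK ($a + $b = $expected)"
    else
        echo "$label: MISMATCH ($a + $b != $expected)"
        return 1
    fi
}

check_sum "paths" "$files_copied" "$files_skipped" "$paths_in_copy_list"
check_sum "bytes" "$bytes_copied" "$bytes_skipped" "$count_bytes"
```

Both checks pass for this run, which is consistent with an incremental copy in which only 32647 bytes were actually uploaded.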

References

  • DistCp official documentation

  • Improving DistCp Performance

  • hadoop fs -count <hdfs path>

    Counts the directories, the files, and the total file size under the given HDFS path.

    The output columns are: directory count, file count, total file size, input path.

    For example:

    hadoop fs -count /data/dltb3yi/
    
     1        24000       253953854502 /data/dltb3yi/      (24000 files found)
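To make the four columns easier to read in scripts, the output line can be labeled with awk. This is a minimal sketch: `parse_count` is a hypothetical helper name, the sample line is the one from the example above, and paths containing spaces are not handled.

```shell
#!/bin/sh
# parse_count is a hypothetical helper: label the four columns of one
# `hadoop fs -count` output line (dirs, files, bytes, path) read from stdin.
parse_count() {
    awk '{ printf "dirs=%s files=%s bytes=%s path=%s\n", $1, $2, $3, $4 }'
}

# Sample line taken from the example above; on a cluster, pipe the real
# command instead: hadoop fs -count /data/dltb3yi/ | parse_count
echo "     1        24000       253953854502 /data/dltb3yi/" | parse_count
```

The fields are printed as strings rather than numbers so that very large byte totals are passed through unchanged.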
    
    