MapReduce

參考鏈接:https://github.com/wangzhiwubigdata/God-Of-BigData/blob/master/%E5%A4%A7%E6%95%B0%E6%8D%AE%E6%A1%86%E6%9E%B6%E5%AD%A6%E4%B9%A0/Hadoop-MapReduce.md

MapReduce 概述

分布式計算框架
A MapReduce job consists of a number of
– Map tasks(Map task performs data transformation)
– Reduce tasks(Reduce task combines results of map tasks)
– (Internally) shuffle tasks (Shuffle task sends output of map tasks to right reduce tasks)

比較簡單的流程(后續(xù)有詳解):
input data -> blocks in hdfs -> map function in parallel -> reduce -> output

MapReduce 1.x 流程(MapReduce最原始的處理流程,2.x之后使用YARN)

hadoop cluster.png

JobTracker: ===>集群資源管理與作業(yè)調(diào)度 + 與client進行通信

  • Only one job tracker in a Hadoop cluster
  • Takes requests from clients (MapReduce programs)
  • Ask name node for location of data
  • Assign tasks to task trackers near the data
  • Reassign tasks if failed

TaskTracker ===>匯報心跳,一個是執(zhí)行命令

  • Accept (map, reduce, shuffle) tasks from job trackers
  • Send heartbeats to job trackers: I am alive
  • Monitor status of tasks and notify job tracker
JobTracker & TaskTracker.png

從 Map function和Recuce function 說起

python 中map和reduce的使用

https://www.runoob.com/python/python-func-map.html

map函數(shù): map(function, iterable, ...)
>>> map(lambda x: x ** 2, [1, 2, 3, 4, 5])  
[1, 4, 9, 16, 25]
# 提供了兩個列表故黑,對相同位置的列表數(shù)據(jù)進行相加
>>> map(lambda x, y: x + y, [1, 3, 5, 7, 9], [2, 4, 6, 8, 10])
[3, 7, 11, 15, 19]

reduce 函數(shù) reduce(function, iterable[, initializer])
from functools import reduce
>>> reduce(lambda x, y: x+y, [1,2,3,4,5])  
15

MapReduce 中map和reduce的使用

  • map和reduce的輸入輸出均為<k,v>鍵值對
  • map的輸出作為reduce的輸入(map輸出其實還有一步group by的過程)
  • System groups the intermediate key‐value pairs from map tasks by key
    E.g., <hello, 1> <hello, 1> <hello, 1> <this, 1> => <hello, [1, 1, 1]>, <this, [1]>
MapReduce diagram.png

MapReduce in parallel.png

In hadoop

  • A node may run multiple map/reduce tasks
  • Typically, one map task per input split (chunk of data)
  • One reduce task per partition of map output E.g., partition by key range or hashing

wordcount 例子:

word.txt
hello
hello world

Map function:

  • Input: <offset of line, line> // line = a line of text in a document =>{0:"hello",6:"hello world"}
  • Output: for each word in line, output <word, 1> => ["hello":1,"hello":1,"world":1]

Reduce function:
– Input: <word, list of 1’s> => {"hello":[1,1],"world":[1]} 所以其實reduce的輸入在某種程度上來說不是map的輸出
– Output: <word, count> where count is the number of 1's in the input list =>
{"hello":2,"world":1}

Map 和 Reduce 中的Java實現(xiàn)

對于MapReduce框架肯污,其更像一套八股文,我們只需要編寫Map和Reduce函數(shù)即可彩届,其他的細節(jié)框架已經(jīng)幫我們實現(xiàn)了伪冰,在Java中通過Mapper和Reducer類封裝,所以我們可以編寫子類繼承父類然后overide相應(yīng)的方法即可

Each map task runs an instance of Mapper
– Mapper has a map function
– Map task invokes the map function of the Mapper once for each input key‐value pair

Mapper.png

Each reduce task runs an instance of Reducer
– Reducer has a reduce function
– Reduce task invokes the reduce function of the Reducer once for every different intermediate key
– For the reduce function, values are NOT in any particular order

Reducer.png

Map 之后樟蠕, Reduce之前的細節(jié) --- Shuffling

Shuffle

  • Process of distributing intermediate key‐values to the right reduce tasks
  • It is the only communication among map and reduce tasks
    • Individual map tasks do not exchange data directly with other map tasks
    • They are not even aware of existence of their peer

Map可能在不同的機器上并行處理的贮聂,需要通過 shuffling 將相同 key 值的數(shù)據(jù)分發(fā)到同一個節(jié)點上去合并,這樣才能統(tǒng)計出最終的結(jié)果寨辩,并且這一步也將一個由Map生成的<k,v> 變成<k,[v1,v2]>傳給后續(xù)的Reduce來操作


Shuffle.png

圖片來源:
https://xinze.fun/2020/01/10/Spark-Shuffle-%E5%92%8C-Spill-%E7%9A%84%E5%8C%BA%E5%88%AB/

Internal of shuffling
1.Map side:Partition, sort, spill & merge

  • Partition data in the buffer into R parts (R = # of reduce tasks) # partition = # reduce task
  • Sort data in each partition by key
  • Spill/write data in the buffer to disk
  • Merge the spills
  • Notify job tracker: output complete

2.Reduce side: Fetch & merge

  • Task tracker notified by job tracker: data ready
  • Fetch/copy data from map side
  • Merge data( Some data may sit on disk once fetched , this depends on the buffer size)
  • Figure out groups from sorted data
details about Shuffle.png

回到開頭 數(shù)據(jù)的輸入輸出格式

mapreduce with 2 nodes.png

輸入

InputFormat

  • Determine how input files are split and read
  • Defined in the Java interface InputFormat
  • Job:
    • Split input file into chunks called InputSplits(logic concepts)
    • Implement RecordReader to read data from splits

在Java中InputFormat是個抽象類吓懈,其有諸多的子類去讀取特定的輸入
? FileInputFormat (input from files in given dirs) --- 用的最多
? DBInputFormat (input data from a database)
? CombineFileInputFormat (input data by combining multiple files)

FileInputFormat
– Takes paths to files
– Read all files in the paths
Divide each file into one or more InputSplits
FileInputFormat也是一個抽象類,其下有對應(yīng)的子類來實現(xiàn)對應(yīng)的內(nèi)容輸入
– TextInputFormat
– KeyValueTextInputFormat
– SequenceFileInputFormat

Subclasses of FileInputFormat.png

InputFormat 干的第一件事情:Split input file into InputSplits

聊一聊InputSplit靡狞,注意一點耻警,InputSplit是一個邏輯劃分的概念
? If a file is big, multiple splits may be created Typical split size = 128MB
? A map task is created for each split (a chunk of some input file)

InputFormat 干的第二件事情:Implement RecordReader to read data from InputSplits

  • InputFormat defines an instanceof RR E.g., TextInputFormat provides LineRecordReader
  • LineRecordReader
    • Form a key‐value pair for every line of file
    • Data type for key: LongWritable; value: Text
  • Reader is repeatedly called until all data in the split are processed

輸出

OutputFormat

  • Define the format of output from Reducers(Output stored in a file)
  • Defined in the Java interface OutputFormat
  • Implemention: FileOutputFormat
    • Subclasses: TextOutputFormat, SequenceFileOutputFormat


      Subclasses of OutputFormat.png
  • All Reducers write to the same directory
  • Set by FileOutputFormat.setOutputPath() method
  • OutputFormat defines aRecordWrite which handles the write

可選操作:Combiner

個人理解:combiner 約等于 一個本地化的reduer

  • Run on the node running the Mapper
    • Perform local (or mini‐) reduction
  • Combine Mapper results
    • Before they are sent to the Reducers
    • Reduce communication costs
    • E.g., may use a combiner in WordCount – (cat, 1), (cat, 1), (cat, 1) => (cat, 3)

注意,使用combiner是為了節(jié)省空間成本、加快計算榕栏,但是不能對最終的結(jié)果有影響畔勤,例如說求個平均數(shù)什么的就不能用combiner
專業(yè)的說法就是:
May directly use the combiner

  • If it is commutative and associative
  • Meaning operations can be grouped & performed in any order
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
  • 序言:七十年代末,一起剝皮案震驚了整個濱河市扒磁,隨后出現(xiàn)的幾起案子庆揪,更是在濱河造成了極大的恐慌,老刑警劉巖妨托,帶你破解...
    沈念sama閱讀 211,743評論 6 492
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件缸榛,死亡現(xiàn)場離奇詭異,居然都是意外死亡兰伤,警方通過查閱死者的電腦和手機内颗,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 90,296評論 3 385
  • 文/潘曉璐 我一進店門,熙熙樓的掌柜王于貴愁眉苦臉地迎上來敦腔,“玉大人均澳,你說我怎么就攤上這事》危” “怎么了找前?”我有些...
    開封第一講書人閱讀 157,285評論 0 348
  • 文/不壞的土叔 我叫張陵,是天一觀的道長判族。 經(jīng)常有香客問我躺盛,道長,這世上最難降的妖魔是什么形帮? 我笑而不...
    開封第一講書人閱讀 56,485評論 1 283
  • 正文 為了忘掉前任槽惫,我火速辦了婚禮,結(jié)果婚禮上辩撑,老公的妹妹穿的比我還像新娘界斜。我一直安慰自己,他們只是感情好槐臀,可當(dāng)我...
    茶點故事閱讀 65,581評論 6 386
  • 文/花漫 我一把揭開白布锄蹂。 她就那樣靜靜地躺著,像睡著了一般水慨。 火紅的嫁衣襯著肌膚如雪得糜。 梳的紋絲不亂的頭發(fā)上,一...
    開封第一講書人閱讀 49,821評論 1 290
  • 那天晰洒,我揣著相機與錄音朝抖,去河邊找鬼。 笑死谍珊,一個胖子當(dāng)著我的面吹牛治宣,可吹牛的內(nèi)容都是我干的。 我是一名探鬼主播,決...
    沈念sama閱讀 38,960評論 3 408
  • 文/蒼蘭香墨 我猛地睜開眼侮邀,長吁一口氣:“原來是場噩夢啊……” “哼坏怪!你這毒婦竟也來了?” 一聲冷哼從身側(cè)響起绊茧,我...
    開封第一講書人閱讀 37,719評論 0 266
  • 序言:老撾萬榮一對情侶失蹤铝宵,失蹤者是張志新(化名)和其女友劉穎,沒想到半個月后华畏,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體鹏秋,經(jīng)...
    沈念sama閱讀 44,186評論 1 303
  • 正文 獨居荒郊野嶺守林人離奇死亡,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點故事閱讀 36,516評論 2 327
  • 正文 我和宋清朗相戀三年亡笑,在試婚紗的時候發(fā)現(xiàn)自己被綠了侣夷。 大學(xué)時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片。...
    茶點故事閱讀 38,650評論 1 340
  • 序言:一個原本活蹦亂跳的男人離奇死亡仑乌,死狀恐怖百拓,靈堂內(nèi)的尸體忽然破棺而出,到底是詐尸還是另有隱情晰甚,我是刑警寧澤耐版,帶...
    沈念sama閱讀 34,329評論 4 330
  • 正文 年R本政府宣布,位于F島的核電站压汪,受9級特大地震影響,放射性物質(zhì)發(fā)生泄漏古瓤。R本人自食惡果不足惜止剖,卻給世界環(huán)境...
    茶點故事閱讀 39,936評論 3 313
  • 文/蒙蒙 一、第九天 我趴在偏房一處隱蔽的房頂上張望落君。 院中可真熱鬧穿香,春花似錦、人聲如沸绎速。這莊子的主人今日做“春日...
    開封第一講書人閱讀 30,757評論 0 21
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽纹冤。三九已至洒宝,卻和暖如春,著一層夾襖步出監(jiān)牢的瞬間萌京,已是汗流浹背雁歌。 一陣腳步聲響...
    開封第一講書人閱讀 31,991評論 1 266
  • 我被黑心中介騙來泰國打工, 沒想到剛下飛機就差點兒被人妖公主榨干…… 1. 我叫王不留知残,地道東北人靠瞎。 一個月前我還...
    沈念sama閱讀 46,370評論 2 360
  • 正文 我出身青樓,卻偏偏與公主長得像,于是被迫代替她去往敵國和親乏盐。 傳聞我的和親對象是個殘疾皇子佳窑,可洞房花燭夜當(dāng)晚...
    茶點故事閱讀 43,527評論 2 349

推薦閱讀更多精彩內(nèi)容