MapReduce Overview
A distributed computing framework.
A MapReduce job consists of a number of:
– Map tasks (a map task performs data transformation)
– Reduce tasks (a reduce task combines the results of map tasks)
– (Internally) shuffle tasks (a shuffle task sends the output of map tasks to the right reduce tasks)
A fairly simple workflow (detailed later):
input data -> blocks in HDFS -> map function in parallel -> reduce -> output
MapReduce 1.x workflow (the original MapReduce processing flow; from 2.x onward, YARN is used)
JobTracker ===> cluster resource management and job scheduling + communication with clients
- Only one JobTracker in a Hadoop cluster
- Takes requests from clients (MapReduce programs)
- Asks the NameNode for the location of the data
- Assigns tasks to TaskTrackers near the data
- Reassigns tasks if they fail
TaskTracker ===> two duties: reporting heartbeats and executing tasks
- Accepts (map, reduce, shuffle) tasks from the JobTracker
- Sends heartbeats to the JobTracker: "I am alive"
- Monitors the status of tasks and notifies the JobTracker
Starting from the map function and the reduce function
Using map and reduce in Python
https://www.runoob.com/python/python-func-map.html
The map function: map(function, iterable, ...) (in Python 3, map returns an iterator, hence the list() wrapper below)
>>> list(map(lambda x: x ** 2, [1, 2, 3, 4, 5]))
[1, 4, 9, 16, 25]
# Given two lists, elements at the same positions are added together
>>> list(map(lambda x, y: x + y, [1, 3, 5, 7, 9], [2, 4, 6, 8, 10]))
[3, 7, 11, 15, 19]
The reduce function: reduce(function, iterable[, initializer])
>>> from functools import reduce
>>> reduce(lambda x, y: x + y, [1, 2, 3, 4, 5])
15
Using map and reduce in MapReduce
- The input and output of both map and reduce are <k, v> key-value pairs
- The output of map becomes the input of reduce (the map output actually goes through a group-by step first)
- System groups the intermediate key‐value pairs from map tasks by key
E.g., <hello, 1> <hello, 1> <hello, 1> <this, 1> => <hello, [1, 1, 1]>, <this, [1]>
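To make this grouping step concrete, here is a minimal, framework-free Java sketch (the class name and data are illustrative, not part of the Hadoop API):

import java.util.*;

public class GroupByKeyDemo {
    public static void main(String[] args) {
        // Intermediate key-value pairs as emitted by map tasks
        List<Map.Entry<String, Integer>> pairs = List.of(
                Map.entry("hello", 1), Map.entry("hello", 1),
                Map.entry("hello", 1), Map.entry("this", 1));

        // Group the values by key, as the system does between map and reduce:
        // <hello,1> <hello,1> <hello,1> <this,1> => <hello,[1,1,1]>, <this,[1]>
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());

        System.out.println(grouped); // {hello=[1, 1, 1], this=[1]}
    }
}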
In Hadoop
- A node may run multiple map/reduce tasks
- Typically, one map task per input split (chunk of data)
- One reduce task per partition of the map output, e.g., partitioned by key range or by hashing
WordCount example:
word.txt
hello
hello world
Map function:
- Input: <offset of line, line> // line = a line of text in a document => {0: "hello", 6: "hello world"}
- Output: for each word in the line, emit <word, 1> => ["hello": 1, "hello": 1, "world": 1]
Reduce function:
– Input: <word, list of 1's> => {"hello": [1, 1], "world": [1]} (so strictly speaking, the input of reduce is not exactly the raw output of map)
– Output: <word, count> where count is the number of 1's in the input list =>
{"hello": 2, "world": 1}
The Java implementation of Map and Reduce
The MapReduce framework is rather formulaic: we only need to write the map and reduce functions, and the framework takes care of all the other details. In Java these are encapsulated by the Mapper and Reducer classes, so we simply write subclasses that extend them and override the corresponding methods.
Each map task runs an instance of Mapper
– Mapper has a map function
– Map task invokes the map function of the Mapper once for each input key‐value pair
Each reduce task runs an instance of Reducer
– Reducer has a reduce function
– Reduce task invokes the reduce function of the Reducer once for every different intermediate key
– For the reduce function, values are NOT in any particular order
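Putting the two together, below is a minimal sketch of the classic WordCount Mapper and Reducer written against the org.apache.hadoop.mapreduce API (the class names are our own; the types and methods are the standard API):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// The map task calls map() once per input pair: <offset of line, line>
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // emit <word, 1>
        }
    }
}

// The reduce task calls reduce() once per distinct intermediate key: <word, [1, 1, ...]>
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> ones, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable one : ones) sum += one.get();
        context.write(word, new IntWritable(sum)); // emit <word, count>
    }
}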
Details after Map and before Reduce --- Shuffling
Shuffle
- The process of distributing intermediate key-values to the right reduce tasks
- It is the only communication between map and reduce tasks
- Individual map tasks do not exchange data directly with other map tasks
- They are not even aware of the existence of their peers
Map tasks may run in parallel on different machines, so shuffling is needed to distribute data with the same key to the same node for merging; only then can the final result be computed. This step is also what turns the <k, v> pairs produced by Map into the <k, [v1, v2, ...]> pairs consumed by the subsequent Reduce.
Image source:
https://xinze.fun/2020/01/10/Spark-Shuffle-%E5%92%8C-Spill-%E7%9A%84%E5%8C%BA%E5%88%AB/
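How does shuffling decide which reduce task receives a given key? In Hadoop this is the Partitioner's job, and the default HashPartitioner behaves essentially like this sketch (the class name is illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Mirrors the default HashPartitioner: pairs with the same key always go
// to the same reduce task, which is what makes grouping by key possible.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the index is non-negative, then bucket by hash
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}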
Internals of shuffling
1. Map side: partition, sort, spill & merge
- Partition the data in the buffer into R parts (R = number of reduce tasks; # partitions = # reduce tasks)
- Sort the data in each partition by key
- Spill/write the data in the buffer to disk
- Merge the spills
- Notify the JobTracker: output complete
2. Reduce side: fetch & merge
- The TaskTracker is notified by the JobTracker: data ready
- Fetch/copy the data from the map side
- Merge the data (some data may sit on disk once fetched, depending on the buffer size)
- Figure out the groups from the sorted data
Back to the beginning: input and output data formats
Input
InputFormat
- Determines how input files are split and read
- Defined in the Java interface InputFormat
- Its job:
- Split input files into chunks called InputSplits (a logical concept)
- Implement a RecordReader to read data from the splits
In Java, InputFormat is an abstract class with many subclasses that read specific kinds of input:
- FileInputFormat (input from files in given dirs) --- the most commonly used
- DBInputFormat (input data from a database)
- CombineFileInputFormat (input data by combining multiple files)
FileInputFormat
– Takes paths to files
– Reads all files in the paths
– Divides each file into one or more InputSplits
FileInputFormat is itself an abstract class, with subclasses that implement the corresponding kinds of input:
– TextInputFormat
– KeyValueTextInputFormat
– SequenceFileInputFormat
The first thing InputFormat does: split input files into InputSplits
A note on InputSplit: it is a logical division of the data, not a physical one
- If a file is big, multiple splits may be created; typical split size = 128MB
- A map task is created for each split (a chunk of some input file)
The second thing InputFormat does: implement a RecordReader to read data from the InputSplits
- InputFormat defines an instance of a RecordReader (RR); e.g., TextInputFormat provides LineRecordReader
- LineRecordReader
- Forms a key-value pair for every line of the file
- Data type of the key: LongWritable; of the value: Text
- The reader is called repeatedly until all data in the split are processed
Output
OutputFormat
- Defines the format of the output from Reducers (the output is stored in a file)
- Defined in the Java interface OutputFormat
- Implementation: FileOutputFormat
- Subclasses: TextOutputFormat, SequenceFileOutputFormat
- All Reducers write to the same directory
- Set by the FileOutputFormat.setOutputPath() method
- OutputFormat defines a RecordWriter which handles the write
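To tie the input and output sides together, here is a minimal driver sketch; the paths come from command-line arguments, and TokenizerMapper/IntSumReducer refer to the classes sketched earlier (all calls are the standard mapreduce API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(WordCountDriver.class);

        // Input: TextInputFormat splits the files and reads each split
        // with a LineRecordReader, producing <LongWritable, Text> pairs
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // The Mapper and Reducer sketched earlier
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Output: all Reducers write into the same directory via a RecordWriter
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}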
Optional step: the Combiner
My understanding: a combiner is roughly a localized reducer
- Run on the node running the Mapper
- Perform local (or mini‐) reduction
- Combine Mapper results
- Before they are sent to the Reducers
- Reduce communication costs
- E.g., WordCount may use a combiner: (cat, 1), (cat, 1), (cat, 1) => (cat, 3)
Note that the combiner exists to save space and speed up the computation, but it must not affect the final result; something like computing an average, for example, cannot reuse the reducer as a combiner.
More formally:
The reduce function may be used directly as the combiner
- If it is commutative and associative
- Meaning the operations can be grouped & performed in any order
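As a concrete check of this rule, the sketch below notes the one-line driver change for WordCount and shows numerically why averaging fails (the class name is illustrative):

public class CombinerCaveat {
    public static void main(String[] args) {
        // WordCount can reuse its reducer as the combiner with one extra
        // line in the driver above, because integer addition is commutative
        // and associative:
        //     job.setCombinerClass(IntSumReducer.class);

        // Averaging is not associative: combining partial averages changes
        // the result, so a mean-computing reducer must not be reused as a
        // combiner unchanged.
        double avgAll = (1 + 2 + 3) / 3.0;            // 2.0
        double avgOfAvgs = ((1 + 2) / 2.0 + 3) / 2.0; // 2.25 != 2.0
        System.out.println(avgAll + " vs " + avgOfAvgs);
    }
}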