Reducer
Reducer reduces a set of intermediate records that share a key to a smaller set of values.
The number of reduces for the job is set by the user via Job.setNumReduceTasks(int).
Overall, the Reducer implementation is handed to the job via Job.setReducerClass(Class), and it can override setup(Context) to initialize itself. The framework then calls the reduce method once for each group of inputs that share a key, and applications can override the cleanup(Context) method to perform any required cleanup.
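As a concrete sketch, assuming the classic word-count shape (Text keys, IntWritable counts; the class name is illustrative), such a Reducer might look like this:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Illustrative sum-style Reducer: emits <key, sum of values> for each group.
    public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);  // handed to the job's OutputFormat
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Release any per-task resources here; nothing to clean up in this sketch.
        }
    }

In the driver, the job would then be wired up with job.setReducerClass(IntSumReducer.class) and a reduce count such as job.setNumReduceTasks(2).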
A Reducer has 3 primary phases: shuffle, sort, and reduce.
Shuffle
Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of every mapper via HTTP.
Sort
In this stage the framework groups the Reducer's input by key (different mappers may have output the same key).
The shuffle and sort phases occur simultaneously; map outputs are merged as they are fetched.
Secondary Sort
If the rules for grouping the intermediate keys need to differ from the rules for sorting keys before reduction, a comparator can be set via Job.setSortComparatorClass(Class). Since Job.setGroupingComparatorClass(Class) controls how the intermediate keys are grouped, the two can be used in conjunction to simulate a secondary sort on values, as sketched below.
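A minimal sketch of the grouping side, assuming a composite Text key of the form "naturalKey\tsecondaryField" and a hypothetical full-key comparator named CompositeKeySortComparator:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Groups records by the natural-key prefix only, so a single reduce() call
    // sees all values whose composite keys share that prefix.
    public class NaturalKeyGroupingComparator extends WritableComparator {

        protected NaturalKeyGroupingComparator() {
            super(Text.class, true);  // true: instantiate keys for comparison
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            String left = a.toString().split("\t", 2)[0];
            String right = b.toString().split("\t", 2)[0];
            return left.compareTo(right);
        }
    }

    // In the driver:
    //   job.setSortComparatorClass(CompositeKeySortComparator.class);      // orders by the full key
    //   job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);

Because the sort comparator orders by the full key while the grouping comparator ignores the secondary field, the values arrive at reduce() already ordered by that field.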
Reduce
In this phase the reduce method is called for each <key, (collection of values)> pair in the grouped inputs.
The reduce task typically writes its output to the FileSystem via Context.write(WritableComparable, Writable).
Applications can use a Counter to report statistics.
The output of the Reducer is not sorted.
How Many Reduces?
The right number of reduces seems to be 0.95 or 1.75 multiplied by (number of nodes * maximum number of containers per node).
With 0.95, all of the reduces can launch immediately and start transferring map output as the maps finish. With 1.75, the faster nodes finish their first round of reduces and launch a second wave, which does a much better job of load balancing.
Increasing the number of reduces increases the framework overhead, but it improves load balancing and lowers the cost of failures.
The scaling factors above are slightly less than whole numbers so that a few reduce slots are held in reserve for speculative tasks and failed tasks.
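For illustration only, with made-up cluster numbers, the heuristic works out like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReduceCountSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical cluster: 10 nodes, at most 8 containers per node.
            int nodes = 10;
            int containersPerNode = 8;

            int singleWave = (int) (0.95 * nodes * containersPerNode);  // 76: everything launches at once
            int doubleWave = (int) (1.75 * nodes * containersPerNode);  // 140: faster nodes run a second wave

            Job job = Job.getInstance(new Configuration(), "reduce-count-sketch");
            job.setNumReduceTasks(singleWave);  // or doubleWave, per the trade-off above
        }
    }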
Reducer NONE
It is legal to set the number of reduce-tasks to zero when no reduction is desired.
In that case the map tasks write their output directly to the FileSystem, into the path set by FileOutputFormat.setOutputPath(Job, Path). The framework does not sort the map output before writing it to the FileSystem.
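A minimal sketch of such a map-only job; the job name and output path are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlyJobSketch {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "map-only");
            job.setNumReduceTasks(0);  // no reduce phase at all
            // Map output goes directly, unsorted, to this (hypothetical) path.
            FileOutputFormat.setOutputPath(job, new Path("/tmp/map-only-out"));
        }
    }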
Partitioner
Partitioner partitions the key space.
Partitioner controls how the keys of the intermediate map output are partitioned before they reach the Reducer. The partition is typically derived by hashing the key (or a subset of it). The total number of partitions equals the number of reduce tasks for the job, so this determines which reduce task each intermediate key, and hence each record, is sent to for reduction.
HashPartitioner is the default Partitioner.
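For reference, a custom Partitioner that routes keys the same way HashPartitioner does by default might look like this (the class name is illustrative):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Routes each key by its hash, modulo the number of reduce tasks.
    public class ModuloPartitioner extends Partitioner<Text, IntWritable> {

        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // Mask off the sign bit so the result is a valid partition index.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

It would be enabled in the driver with job.setPartitionerClass(ModuloPartitioner.class).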
Counter
Counter is a facility for MapReduce applications to report their statistics.
Mapper and Reducer implementations can use Counters to report statistics.
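A sketch of reporting one such statistic from inside reduce(); the enum and what it counts are illustrative:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CountingReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        public enum Stats { GROUPS_PROCESSED }  // hypothetical counter

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // One group per reduce() call; the framework aggregates counts across tasks.
            context.getCounter(Stats.GROUPS_PROCESSED).increment(1);
            // ... normal reduction work would follow here ...
        }
    }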
Hadoop MapReduce comes bundled with a library of generally useful Mappers, Reducers, and Partitioners.