Disclaimer: I am still at the learning stage myself, so some of my understanding may be shallow or even contain errors. Please read with a critical eye, and if you spot a mistake, leave a comment or contact me directly.
WEEK 4 summary (MapReduce): 1) Design Pattern 4: Value-to-Key Conversion; 2) Miscellaneous
Keywords: Value-to-Key Conversion; composite key; natural key; Boolean Text Retrieval; Inverted Index; Ranked Text Retrieval; TF; IDF; Miscellaneous; Counter; SequenceFile; Input Formats; Methods to Write MapReduce Jobs; Number of Maps and Reduces; MapReduce Advantages
1)Design Pattern 4: Value-to-Key Conversion
Question: how can we sort the values in the mapper output (key, list(value))?
Since the reducer does not sort values automatically, there are two ways to solve the value-sorting problem.
1)Buffer all values sharing the same key in memory, then sort them. This works for small datasets, but on large datasets it can run out of memory.
2)Use "Value-to-Key Conversion": in the mapper, fold the value of each (key, value) pair into the key to form a composite key (K, S), where the original key K is called the natural key. The mapper emits composite keys, and the MapReduce framework sorts on the S component of (K, S) for us (the framework sorts using the cluster nodes). But how do we handle multiple key-value pairs? We set a custom partitioner that hashes only the natural key, so all composite keys sharing a natural key are dispatched to the same reducer; during this shuffling the framework sorts the composite keys. A sketch of the idea follows below.
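A minimal sketch of the composite-key idea in the Hadoop Java API, assuming a Text natural key and an IntWritable value; the class names here are illustrative, not from the lecture:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Partitioner;

// Composite key (K, S): the natural key K plus the value S pulled up out of the value slot.
public class CompositeKey implements WritableComparable<CompositeKey> {
    private final Text naturalKey = new Text();
    private final IntWritable value = new IntWritable();

    public void set(String k, int v) { naturalKey.set(k); value.set(v); }
    public Text getNaturalKey() { return naturalKey; }

    @Override public void write(DataOutput out) throws IOException {
        naturalKey.write(out);
        value.write(out);
    }

    @Override public void readFields(DataInput in) throws IOException {
        naturalKey.readFields(in);
        value.readFields(in);
    }

    // The framework sorts on the whole composite key: natural key first, then
    // value, so each reducer receives the values of one natural key in order.
    @Override public int compareTo(CompositeKey other) {
        int cmp = naturalKey.compareTo(other.naturalKey);
        return cmp != 0 ? cmp : value.compareTo(other.value);
    }
}

// Partition on the natural key only, so every composite key that shares a
// natural key is sent to the same reducer.
class NaturalKeyPartitioner extends Partitioner<CompositeKey, IntWritable> {
    @Override public int getPartition(CompositeKey key, IntWritable val, int numPartitions) {
        return (key.getNaturalKey().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

In practice a grouping comparator that compares only the natural key is also registered (job.setGroupingComparatorClass(...)), so a single reduce() call sees all values of one natural key, already sorted.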
問題阁将,MapReduce在現(xiàn)實生產(chǎn)中可以運用到哪些領(lǐng)域呢?
答:搜索引擎
這里有2中信息檢索方式:1) Boolean Text Retrieval; 2) Ranked Text Retrieval
1)Boolean Text Retrieval:
Characteristics:
(1)Each document or query is treated as a “bag” of words or terms. Word sequence is not considered
(2)Query terms are combined logically using the Boolean operators AND, OR, and NOT.
(2.1) E.g., ((data AND mining) AND (NOT text))
(3) Retrieval
(3.1) Given a Boolean query, the system retrieves every document that makes the query logically true.
(3.2) Called exact match
(4) The retrieval results are usually quite poor because term frequency is not considered and results are not ranked.
Boolean text retrieval is usually performed through an inverted index. The characteristics of an inverted index are:
(1)The inverted index of a document collection is basically a data structure that
1.1)associates each distinct term with a list of all documents that contain the term.
1.2)The documents containing a term are sorted in the list
(2)Thus, in retrieval, it takes constant time to
2.1)find the documents that contain a query term.
2.2)multiple query terms are also easy to handle, as we will see soon.
The inverted-index search procedure can be seen below.
Given a query q over the words blue, cat, egg ... two, the search runs in the following three steps (the top, left, and right parts of the original figure):
(1)(vocabulary search): find each term/word in q in the inverted index.
(2)(results merging): Merge results to find documents that contain all or some of the words/terms in q.
(3)(rank score computation): rank the resulting documents/pages, using:
(3.1)content-based ranking
(3.2)link-based ranking
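For step (2), merging the results of an AND query amounts to intersecting sorted docid lists, which a two-pointer scan does in linear time. A small illustrative sketch, not from the lecture:

```java
import java.util.ArrayList;
import java.util.List;

public class PostingsMerge {
    // Intersect two sorted postings lists with a two-pointer scan,
    // returning the docids that contain both query terms (an AND query).
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = Integer.compare(a.get(i), b.get(j));
            if (cmp == 0) { out.add(a.get(i)); i++; j++; }
            else if (cmp < 0) i++;
            else j++;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> blue = List.of(1, 4, 7, 9);   // postings list for "blue" (made up)
        List<Integer> cat  = List.of(2, 4, 9, 12);  // postings list for "cat" (made up)
        System.out.println(intersect(blue, cat));   // prints [4, 9]
    }
}
```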
Question: how do we build an inverted index?
Take input records (docid, doc) and produce (term, list(docid)), where list(docid) is sorted, and even if a term occurs several times within one docid, that docid appears only once in the term's list(docid).
So how do we build the inverted index in MapReduce? (A sketch follows after this list.)
(1)The mapper lists every term in the doc of each (docid, doc) pair and emits (term, docid). Note that each term is recorded only once per document, not once per occurrence (e.g., for (1, ‘how long will I love you, as long as you father told you’) the mapper emits only (how, 1), (long, 1), (will, 1), (I, 1), (love, 1), (you, 1), (as, 1), (father, 1), (told, 1)).
(2)The reducer receives (term, list(docid)), but the docids arrive unsorted, so we must sort them ourselves. This sorting happens in the reducer's memory, so the longest list(docid) must not exceed the available memory.
This approach has two drawbacks: 1) it is inefficient; 2) docids are sorted in the reducers (could we improve this with the value-to-key method?).
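A hedged sketch of this baseline construction, with illustrative class names; it assumes an input format that already delivers (docid, doc) as Text pairs, e.g. KeyValueTextInputFormat. The in-memory TreeSet in the reducer is exactly the memory-bound sorting drawback noted above:

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {
    // Mapper: input (docid, doc); emit (term, docid) once per distinct term.
    public static class IndexMapper extends Mapper<Text, Text, Text, Text> {
        @Override protected void map(Text docid, Text doc, Context ctx)
                throws IOException, InterruptedException {
            Set<String> seen = new HashSet<>();  // dedupe terms within this document
            for (String term : doc.toString().toLowerCase().split("\\W+")) {
                if (!term.isEmpty() && seen.add(term)) {
                    ctx.write(new Text(term), docid);
                }
            }
        }
    }

    // Reducer: input (term, list(docid)); sort the docids in memory and emit.
    // The whole list must fit in memory -- the drawback noted above.
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override protected void reduce(Text term, Iterable<Text> docids, Context ctx)
                throws IOException, InterruptedException {
            Set<String> sorted = new TreeSet<>();
            for (Text d : docids) sorted.add(d.toString());
            ctx.write(term, new Text(String.join(",", sorted)));
        }
    }
}
```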
Summary: Boolean text retrieval extracts keywords from documents and groups together the ids of documents that share a keyword. When a user searches by keywords, the AND, OR, and NOT operators filter the results down to what the user asked for. The drawback is that this kind of keyword search has no weighting: results are simply displayed in the order the documents are stored in the database (i.e., by id), which is inefficient for the user. (For example, if a user searches for "Zhang San and Li Si" and the documents with id = (1,7,8,9...119965) contain those keywords, the pages are shown in id order.)
2)Ranked Text Retrieval
Before describing "Ranked Text Retrieval", a few terms need to be introduced:
(1)weight: the importance of a term across the whole corpus (here the corpus means the collection of all documents).
(2)TF: term frequency, i.e., how often a given term occurs in one particular document (a single local file), so TF is local in scope.
(3)IDF: inverse document frequency, a measure of how informative a term is in general. The idf of a term is obtained by dividing the total number of documents by the number of documents containing the term, then taking the base-10 logarithm of the quotient. Because its value depends on the total number of documents, IDF is global in scope.
Here, terms that appear often in a document should get high weights, while terms that appear in many documents should get low weights.
We can therefore take W = TF * IDF.
In the TF formula (see the reconstruction just below), the numerator is the number of occurrences of the term in document dj, and the denominator is the total number of occurrences of all terms in dj.
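The formula these notes refer to appears to have been an image; a reconstruction assuming the standard TF-IDF definitions, consistent with the prose above:

```latex
\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}, \qquad
\mathrm{idf}_{i} = \log_{10}\frac{|D|}{|\{\, d : t_i \in d \,\}|}, \qquad
w_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_{i}
```

where n_{i,j} is the number of occurrences of term t_i in document d_j, and |D| is the total number of documents.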
How do we build the index for ranked text retrieval in MapReduce?
Here the reducer input is (term, list(docid, term_frequency)) and the output is (term, list(docid, weight)).
Two drawbacks are immediately apparent:
(1)all document ids are sorted inside the reducer (heavy memory use);
(2)computing the weight requires IDF first, and computing IDF requires both the number of documents containing the term (the document frequency) and the total number of documents, so the reducer must buffer all the (term, (docid, term_frequency)) pairs to obtain the document frequency (heavy memory use, which may overflow and fail the task).
Solving these two problems takes two steps:
Step 1: use the value-to-key method to combine (term, docid) into a composite key, so the docids are no longer sorted inside the reducer.
Step 2: we now have pairs of the form ((term, docid), term_frequency), which we can loosely call bags. Because of the partitioner, all bags with the same term are pulled by the same reducer (assume one reducer handles the data for a single term). But the reducer also needs the number of documents containing the term and the total number of documents, so, imitating the order-inversion pattern, we emit a special pair that packs up the document counts (the number of documents traversed and the number containing the term) and arrange for it to reach the corresponding reducer first. Computing IDF then becomes easy; see the sketch below.
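A hedged sketch of the reducer side of step 2, assuming composite keys are encoded as Text strings "term\t*" (the special order-inversion pair) and "term\tdocid", with term frequencies as values; for simplicity this sketch passes the total document count through the job configuration rather than the special pair. All names are assumptions for illustration:

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

// Partition on the term part of the composite key "term\tdocid", so the
// special pair "term\t*" and all real pairs for that term reach one reducer.
public class TermPartitioner extends Partitioner<Text, IntWritable> {
    @Override public int getPartition(Text key, IntWritable val, int numPartitions) {
        String term = key.toString().split("\t", 2)[0];
        return (term.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Because "*" sorts before any alphanumeric docid, the reducer sees "term\t*"
// first and can total the document frequency before any real pair arrives.
class WeightReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
    private long totalDocs;  // |D|, passed in via the job configuration (assumption)
    private long df;         // document frequency of the current term

    @Override protected void setup(Context ctx) {
        totalDocs = ctx.getConfiguration().getLong("total.docs", 1);
    }

    @Override protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
            throws IOException, InterruptedException {
        if (key.toString().endsWith("\t*")) {          // the special order-inversion pair
            df = 0;
            for (IntWritable v : vals) df += v.get();  // one 1 per document containing the term
        } else {
            int tf = 0;
            for (IntWritable v : vals) tf += v.get();
            double weight = tf * Math.log10((double) totalDocs / df);
            ctx.write(key, new DoubleWritable(weight));
        }
    }
}
```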
Summary: the advantage of ranked text retrieval is that search results carry weights, and higher-weighted results are displayed first, so users are more likely to find what they are looking for, which improves the user experience; the improved construction method also runs faster.
2)Miscellaneous
This part introduces several MapReduce features:
(1)Counter:
1)Hadoop maintains some built-in counters for every job.
2)Several groups for built-in counters
2.1)File System Counters – number of bytes read and written
2.2)Job Counters – document the number of map and reduce tasks launched and the number of failed tasks
2.3)Map-Reduce Task Counters – mapper, reducer, and combiner input and output record counts, plus time and memory statistics
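Besides the built-in groups, user-defined counters can be incremented from inside a task; a minimal sketch (the group and counter names are made up for illustration):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        if (line.toString().trim().isEmpty()) {
            // Increment a user-defined counter; its total appears alongside
            // the built-in counters in the job's final report.
            ctx.getCounter("Quality", "EMPTY_LINES").increment(1);
            return;
        }
        ctx.write(line, offset);
    }
}
```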
(2)SequenceFile:
1)File operations based on binary format rather than text format
2)SequenceFile class provides a persistent data structure for binary key-value pairs, e.g.,
2.1)Key: timestamp represented by a LongWritable
2.2)Value: quantity being logged represented by a Writable
3)Use SequenceFile in MapReduce:
3.1)job.setInputFormatClass(SequenceFileInputFormat.class);
3.2)job.setOutputFormatClass(SequenceFileOutputFormat.class);
3.3)By default, MapReduce uses TextInputFormat
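A small sketch of writing a SequenceFile directly, matching the timestamp/quantity example above; the output path and the value type are assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;

public class SequenceFileWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("log.seq");  // hypothetical output path

        // Keys are timestamps (LongWritable); values are the logged quantity
        // (IntWritable here, but any Writable works).
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(IntWritable.class))) {
            for (int i = 0; i < 100; i++) {
                writer.append(new LongWritable(System.currentTimeMillis()),
                              new IntWritable(i));
            }
        }
    }
}
```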
(3)Input Formats:
1)InputSplit
1.1)A chunk of the input processed by a single map
1.2)Each split is divided into records
1.3)Split is just a reference to the data (doesn’t contain the input data)
2)RecordReader
2.1)Iterate over records
2.2)Used by the map task to generate record key-value pairs
4)As MapReduce application programmers, we do not need to deal with InputSplits directly, as they are created by the InputFormat
4)In MapReduce, by default TextInputFormat and LineRecordReader
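With the default TextInputFormat, the LineRecordReader hands the mapper one record per line: the key is the line's byte offset (LongWritable) and the value is the line itself (Text). A minimal mapper showing the types:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With TextInputFormat, LineRecordReader delivers the byte offset of each
// line as the key (LongWritable) and the line itself as the value (Text).
public class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(line, offset);  // just echo (line, offset) for illustration
    }
}
```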
(4)Methods to Write MapReduce Jobs:
1)Typical – usually written in Java
1.1)MapReduce 2.0 API
1.2)MapReduce 1.0 API
2)Streaming
2.1)Uses stdin and stdout
2.2)Can use any language to write Map and Reduce functions (C#, Python, JavaScript, etc.)
3)Pipes
3.1)Often used with C++
4)Abstraction libraries
4.1)Hive, Pig, etc. – write in a higher-level language; the library generates one or more MapReduce jobs
(5)Number of Maps and Reduces
1)Maps
1.1)The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.
1.2)The right level of parallelism for maps seems to be around 10-100 maps per node, although it has been set up to 300 maps for very CPU-light map tasks.
1.3)If you expect 10TB of input data and have a blocksize of 128MB, you’ll end up with 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.
2)Reduces
2.1)The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>)
2.2)With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing.
2.3)Use job.setNumReduceTasks(int) to set the number
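A hedged sketch of applying the 0.95 heuristic in a driver; the node and container counts are assumptions you would replace with your cluster's actual figures:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReduceCountDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reduce-count-demo");

        // Assumed cluster figures, for illustration only.
        int nodes = 10;             // number of worker nodes (assumption)
        int containersPerNode = 8;  // max containers per node (assumption)

        // 0.95 * total container slots => all reduces launch in a single wave;
        // use 1.75 instead for two waves and better load balancing.
        job.setNumReduceTasks((int) (0.95 * nodes * containersPerNode));
    }
}
```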
(6)MapReduce Advantages:
1)Automatic Parallelization:
1.1)Depending on the size of the RAW INPUT DATA → instantiate multiple MAP tasks
1.2)Similarly, depending upon the number of intermediate <key, value> partitions → instantiate multiple REDUCE tasks
2)Run-time:
2.1)Data partitioning
2.2)Task scheduling
2.3)Handling machine failures
2.4)Managing inter-machine communication
3)Completely transparent to the programmer/analyst/user