IR chapter 1: Boolean retrieval


Information retrieval

meaning

Information retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from
within large collections (usually stored on computers).

keywords: unstructured, large scale. Compared with daunting database-style searching, IR offers a more natural and acceptable way of human-machine interaction, but it also makes data organization and query processing more challenging. (In fact, almost no data is truly unstructured.)

IR also covers supporting users in browsing or filtering document
collections, or in further processing a set of retrieved documents.

scale

  • web search
    billions of documents stored on millions of computers
    gather the documents to be indexed
    build systems that work efficiently at this scale
    exploit hypertext
    resist site providers who manipulate page content to boost their rankings
  • personal information retrieval
    desktop search such as Spotlight and Instant Search
    email programs: search and classification
  • enterprise, institutional, and domain-specific search

An example information retrieval problem

Which plays in Shakespeare's Collected Works contain the words Brutus AND Caesar AND NOT Calpurnia?

grep: a linear scan through the text

(A linear scan is not enough when we need to process larger collections quickly, support more flexible queries, or return ranked results.)

incidence matrix

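incidence matrix for Shakespeare's collections
query processing: take the row vectors for Brutus, Caesar and Calpurnia, complement the last one, and do a bitwise AND

extremely sparse: almost all entries are 0, so it is better to record only the 1 entries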
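
As a rough sketch of the bitwise query processing described here, the rows of an incidence matrix can be held as Python integers, one bit per document. The four-document collection and its term lists are made up for illustration and are not the real text of the plays; `row` and `all_docs` are hypothetical names.

```python
# Toy collection: term lists are illustrative, not the real play text.
docs = {
    "Antony and Cleopatra": ["antony", "brutus", "caesar", "cleopatra"],
    "Julius Caesar": ["antony", "brutus", "caesar", "calpurnia"],
    "The Tempest": ["mercy", "worser"],
    "Hamlet": ["brutus", "caesar", "mercy", "worser"],
}
names = list(docs)

def row(term):
    """Incidence row of `term` as an int: bit i is set if names[i] contains the term."""
    bits = 0
    for i, name in enumerate(names):
        if term in docs[name]:
            bits |= 1 << i
    return bits

# Mask with one bit per document, so that NOT stays inside the collection.
all_docs = (1 << len(names)) - 1

# Brutus AND Caesar AND NOT Calpurnia
answer = row("brutus") & row("caesar") & (all_docs & ~row("calpurnia"))
hits = [name for i, name in enumerate(names) if (answer >> i) & 1]
print(hits)  # ['Antony and Cleopatra', 'Hamlet']
```

A dense matrix like this stores one bit for every term-document pair, which is why it stops being practical for large collections.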

terminology

  • Boolean retrieval model
    a model for information retrieval in which we can pose any query that is in the form of a Boolean expression of terms.
  • term
    the indexed unit, usually a word; the model views each document as a set of terms
  • document
    units we have decided to build a retrieval system over
  • collection/corpus
    the group of documents
  • information need
    the topic about which the user desires to know more.
  • query
    what the user conveys to the computer in an attempt to communicate the information need.
  • relevant
    a document is relevant if it is the one that the user perceives as containing information of value with respect to their personal informational need.
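  • effectiveness
    the quality of a system's search results, assessed by precision and recall
    the four outcomes: TP (relevant and retrieved), FP (nonrelevant but retrieved), FN (relevant but not retrieved), TN (nonrelevant and not retrieved)
  • precision
    the fraction of returned results that are relevant: TP/(TP+FP)
  • recall
    the fraction of relevant documents in the collection that were returned: TP/(TP+FN)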
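
The two formulas can be checked with a small sketch. It assumes the retrieved results and the relevant documents are given as collections of document IDs; the function name `precision_recall` and the sample IDs are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # relevant and retrieved
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but missed
    # Guard against empty sets; returning 0.0 here is a convention chosen for this sketch.
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# 4 documents returned, 3 documents actually relevant, 2 in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p, r)
```

Here TP = 2, FP = 2 and FN = 1, so precision is 2/4 and recall is 2/3.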

inverted index/inverted file/index

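part of the inverted index for Shakespeare's collections
  • vocabulary/lexicon
    the set of terms
  • dictionary
    the data structure that holds the terms, each one pointing to its postings list
  • posting
    each item in a postings list, recording that the term appeared in a document
  • postings list
    the list of postings for one term, sorted by document ID
  • postings
    all the postings lists taken together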

a first take at building an inverted index

  1. collect documents to be indexed
  2. tokenize the text, turning each document into a list of tokens
  3. do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms
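  4. index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
4th step in detail
  • sorting
    sort the (term, document ID) pairs by term and then by document ID, merge duplicates, and group them into the dictionary and the postings lists
  • storage
    the dictionary is usually kept in memory and the postings lists on disk; in memory, a postings list can be a linked list, a variable-length array, or a linked list of fixed-length arrays for each term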
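
The four steps can be sketched in a few lines of Python. This is a minimal in-memory version, not a production indexer: the regular-expression tokenizer and the lowercase-only normalization are simplifying assumptions, and the names `tokenize`, `normalize` and `build_index` are made up for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Step 2: split a document into tokens (here: runs of letters and digits)."""
    return re.findall(r"[A-Za-z0-9]+", text)

def normalize(token):
    """Step 3: linguistic preprocessing, reduced here to lowercasing."""
    return token.lower()

def build_index(docs):
    """Step 4: map each term to a sorted list of the document IDs it occurs in."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[normalize(token)].add(doc_id)  # a set drops duplicate postings
    # Sort the dictionary by term and each postings list by document ID.
    return {term: sorted(ids) for term, ids in sorted(index.items())}

# Step 1: collect the documents to be indexed.
docs = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}
index = build_index(docs)
print(index["brutus"], index["killed"], index["ambitious"])  # [1, 2] [1] [2]
```

Keeping each postings list sorted by document ID is what later makes a linear-time merge possible.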

processing boolean queries

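  • simple conjunctive query
    e.g. Brutus AND Calpurnia: locate both terms in the dictionary, retrieve their postings lists, and intersect them
    merge algorithm: walk through the two sorted postings lists in step, taking O(x + y) operations for lists of length x and y
  • query optimization
    process terms in order of increasing document frequency, so that intermediate results stay small
    algorithm for conjunctive queries: start from the shortest postings list and intersect it with each remaining list in turn
  • asymmetric
    the intermediate result is held in memory and is never longer than the list it is intersected with
  • difference is large
    when the two list lengths differ greatly, alternatives such as binary search in the longer list become attractive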
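
A minimal sketch of the merge algorithm and of the shortest-list-first ordering, assuming postings lists are plain Python lists of document IDs sorted in increasing order. The function names `intersect` and `intersect_many` and the sample postings are illustrative.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2)) steps."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer that sits on the smaller document ID
        else:
            j += 1
    return answer

def intersect_many(index, terms):
    """AND together several terms, processing the shortest postings list first."""
    lists = sorted((index.get(term, []) for term in terms), key=len)
    if not lists:
        return []
    result = lists[0]
    for postings in lists[1:]:
        if not result:
            break  # an empty intermediate result cannot grow again
        result = intersect(result, postings)
    return result

index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}
print(intersect(index["brutus"], index["calpurnia"]))            # [2, 31]
print(intersect_many(index, ["brutus", "caesar", "calpurnia"]))  # [2]
```

The length of a postings list is the term's document frequency, so sorting the lists by length is the ordering heuristic from the notes.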

The extended Boolean model versus ranked retrieval

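  • ranked retrieval model
    such as the vector space model, in which users largely use free text queries instead of a precise language of operators, and the system decides which documents best satisfy the query
  • the extended Boolean model
    adds operators to the basic Boolean model, such as the proximity operator, which specifies that two terms in a query must occur close to each other in a document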
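
One way to picture a proximity operator is the sketch below. It assumes that "close" means within k token positions in an already tokenized document. The names `positions` and `near` are made up for illustration, and real systems define closeness in their own ways (number of words, same sentence, same paragraph).

```python
def positions(tokens):
    """Map each token to the list of positions at which it occurs."""
    pos = {}
    for i, token in enumerate(tokens):
        pos.setdefault(token, []).append(i)
    return pos

def near(tokens, a, b, k):
    """True if some occurrence of `a` lies within k positions of some occurrence of `b`."""
    pos = positions(tokens)
    return any(abs(i - j) <= k
               for i in pos.get(a, [])
               for j in pos.get(b, []))

tokens = "the noble brutus hath told you caesar was ambitious".split()
print(near(tokens, "brutus", "caesar", 4))  # True: positions 2 and 6
print(near(tokens, "brutus", "caesar", 3))  # False
```

Comparing every pair of positions is quadratic; it is fine for a toy example but not how a large index would answer such queries.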