1. Introduction
The problem of clustering has been studied widely in the database and statistics literature in the context of a wide variety of data mining tasks [50, 54]. The clustering problem is defined to be that of finding groups of similar objects in the data. The similarity between the objects is measured with the use of a similarity function. The problem of clustering can be very useful in the text domain, where the objects to be clustered can be of different granularities such as documents, paragraphs, sentences or terms. Clustering is especially useful for organizing documents to improve retrieval and support browsing [11, 26].
The study of the clustering problem precedes its applicability to the text domain. Traditional methods for clustering have generally focused on the case of quantitative data [44, 71, 50, 54, 108], in which the attributes of the data are numeric. The problem has also been studied for the case of categorical data [10, 41, 43], in which the attributes may take on nominal values. A broad overview of clustering (as it relates to generic numerical and categorical data) may be found in [50, 54]. A number of implementations of common text clustering algorithms may be found in several toolkits such as Lemur [114] and the BOW toolkit [64]. The problem of clustering finds applicability for a number of tasks:
- Document Organization and Browsing: The hierarchical organization of documents into coherent categories can be very useful for systematic browsing of the document collection. A classical example of this is the Scatter/Gather method [25], which provides a systematic browsing technique with the use of clustered organization of the document collection.
- Corpus Summarization: Clustering techniques provide a coherent summary of the collection in the form of cluster-digests [83] or word-clusters [17, 18], which can be used in order to provide summary insights into the overall content of the underlying corpus. Variants of such methods, especially sentence clustering, can also be used for document summarization, a topic discussed in detail in Chapter 3. The problem of clustering is also closely related to that of dimensionality reduction and topic modeling. Such dimensionality reduction methods are all different ways of summarizing a corpus of documents, and are covered in Chapter 5.
- Document Classification: While clustering is inherently an unsupervised learning method, it can be leveraged in order to improve the quality of the results in its supervised variant. In particular, word-clusters [17, 18] and co-training methods [72] can be used in order to improve the classification accuracy of supervised applications with the use of clustering techniques.
We note that many classes of algorithms, such as the k-means algorithm or hierarchical algorithms, are general-purpose methods which can be extended to any kind of data, including text data. A text document can be represented in the form of binary data, where we use the presence or absence of a word in the document in order to create a binary vector. In such cases, it is possible to directly use a variety of categorical data clustering algorithms [10, 41, 43] on the binary representation. A more enhanced representation would include refined weighting methods based on the frequencies of the individual words in the document as well as frequencies of words in an entire collection (e.g., TF-IDF weighting [82]). Quantitative data clustering algorithms [44, 71, 108] can be used in conjunction with these frequencies in order to determine the most relevant groups of objects in the data.
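As a concrete illustration of this general-purpose route, the sketch below builds a binary term-document matrix and a TF-IDF weighted one, and hands the weighted representation to a standard k-means implementation. The toy corpus, the choice of two clusters, and the use of the scikit-learn library are illustrative assumptions rather than part of the cited methods.

```python
# General-purpose clustering applied to vectorized text: binary or TF-IDF
# representations fed into an off-the-shelf quantitative algorithm (k-means).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "text clustering groups similar documents together",
    "k-means is a general purpose quantitative clustering algorithm",
    "tf-idf weighting normalizes raw term frequencies",
    "clustered organization of documents supports browsing and retrieval",
]

# Binary presence/absence representation (usable with categorical clustering methods).
binary_matrix = CountVectorizer(binary=True).fit_transform(docs)

# Frequency-based representation with TF-IDF weighting.
tfidf_matrix = TfidfVectorizer().fit_transform(docs)

# Standard k-means on the weighted representation.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf_matrix)
print(labels)
```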
However, such naive techniques do not typically work well for clustering text data. This is because text data has a number of unique properties which necessitate the design of specialized algorithms for the task. The distinguishing characteristics of the text representation are as follows:
- The dimensionality of the text representation is very large, but the underlying data is sparse. In other words, the lexicon from which the documents are drawn may be of the order of 10^5, but a given document may contain only a few hundred words. This problem is even more serious when the documents to be clustered are very short (e.g., when clustering sentences or tweets).
- While the lexicon of a given corpus of documents may be large, the words are typically correlated with one another. This means that the number of concepts (or principal components) in the data is much smaller than the feature space. This necessitates the careful design of algorithms which can account for word correlations in the clustering process.
- The number of words (or non-zero entries) in the different documents may vary widely. Therefore, it is important to normalize the document representations appropriately during the clustering task.
The sparse and high-dimensional representation of the different documents necessitates the design of text-specific algorithms for document representation and processing, a topic heavily studied in the information retrieval literature, where many techniques have been proposed to optimize document representation for improving the accuracy of matching a document with a query [82, 13]. Most of these techniques can also be used to improve document representation for clustering.
In order to enable an effective clustering process, the word frequencies need to be normalized in terms of their relative frequency of presence in the document and over the entire collection. In general, a common representation used for text processing is the vector-space based TF-IDF representation [81]. In the TF-IDF representation, the term frequency for each word is normalized by the inverse document frequency, or IDF. The inverse document frequency normalization reduces the weight of terms which occur more frequently in the collection. This reduces the importance of common terms in the collection, ensuring that the matching of documents is more strongly influenced by the more discriminative words which have relatively low frequencies in the collection. In addition, a sub-linear transformation function is often applied to the term frequencies in order to avoid the undesirable dominating effect of any single term that might be very frequent in a document. The work on document normalization is itself a vast area of research, and a variety of other techniques which discuss different normalization methods may be found in [86, 82].
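The weighting scheme described above can be written out directly. The sketch below applies a logarithmic damping function to the raw term frequencies, multiplies by the inverse document frequency, and normalizes each document vector to unit length; the particular damping and smoothing choices are one plausible instantiation rather than the exact formula of [81].

```python
# Illustrative TF-IDF weighting with sub-linear (logarithmic) damping of the
# raw term frequency and unit-length normalization of each document vector.
import math
from collections import Counter

def tf_idf_vectors(tokenized_docs):
    n = len(tokenized_docs)
    # Document frequency of each term over the collection.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        vec = {term: (1.0 + math.log(count)) * math.log(n / df[term])
               for term, count in tf.items()}
        # Length normalization so that long and short documents are comparable.
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors

docs = [["text", "clustering", "text"], ["clustering", "algorithms"], ["text", "retrieval"]]
print(tf_idf_vectors(docs)[0])
```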
Text clustering algorithms are divided into a wide variety of different types such as agglomerative clustering algorithms, partitioning algorithms, and standard parametric modeling based methods such as the EM-algorithm. Furthermore, text representations may also be treated as strings (rather than bags of words). These different representations necessitate the design of different classes of clustering algorithms. Different clustering algorithms have different tradeoffs in terms of effectiveness and efficiency. An experimental comparison of different clustering algorithms may be found in [90, 111]. In this chapter we will discuss a wide variety of algorithms which are commonly used for text clustering. We will also discuss text clustering algorithms for related scenarios such as dynamic data, network-based text data and semi-supervised scenarios.
This chapter is organized as follows. In section 2, we will present feature selection and transformation methods for text clustering. Section 3 describes a number of common algorithms which are used for distance-based clustering of text documents. Section 4 contains the description of methods for clustering with the use of word patterns and phrases. Methods for clustering text streams are described in section 5. Section 6 describes methods for probabilistic clustering of text data. Section 7 contains a description of methods for clustering text which naturally occurs in the context of social or web-based networks. Section 8 discusses methods for semi-supervised clustering. Section 9 presents the conclusions and summary.
2. Feature Selection and Transformation Methods for Text Clustering
The quality of any data mining method such as classification and clustering is highly dependent on the noisiness of the features that are used for the clustering process. For example, commonly used words such as “the”, may not be very useful in improving the clustering quality. Therefore, it is critical to select the features effectively, so that the noisy words in the corpus are removed before the clustering. In addition to feature selection, a number of feature transformation methods such as Latent Semantic Indexing (LSI), Probabilistic Latent Semantic Analysis (PLSA), and Non-negative Matrix Factorization (NMF) are available to improve the quality of the document representation and make it more amenable to clustering. In these techniques (often called dimension reduction), the correlations among the words in the lexicon are leveraged in order to create features, which correspond to the concepts or principal components in the data. In this section, we will discuss both classes of methods. A more in-depth discussion of dimension reduction can be found in Chapter 5.
2.1 Feature Selection Methods
Feature selection is more common and easy to apply in the problem of text categorization [99] in which supervision is available for the feature selection process. However, a number of simple unsupervised methods can also be used for feature selection in text clustering. Some examples of such methods are discussed below.
2.1.1 Document Frequency-based Selection. The simplest possible method for feature selection in document clustering is that of the use of document frequency to filter out irrelevant features. While the use of inverse document frequencies reduces the importance of such words, this may not alone be sufficient to reduce the noise effects of very frequent words. In other words, words which are too frequent in the corpus can be removed because they are typically common words such as “a”, “an”, “the”, or “of” which are not discriminative from a clustering perspective. Such words are also referred to as stop words. A variety of methods are commonly available in the literature [76] for stop-word removal.
Typically commonly available stop word lists of about 300 to 400 words are used for the retrieval process. In addition, words which occur extremely infrequently can also be removed from the collection. This is because such words do not add anything to the similarity computations which are used in most clustering methods. In some cases, such words may be misspellings or typographical errors in documents. Noisy text collections which are derived from the web, blogs or social networks are more likely to contain such terms. We note that some lines of research define document frequency based selection purely on the basis of very infrequent terms, because these terms contribute the least to the similarity calculations. However, it should be emphasized that very frequent words should also be removed, especially if they are not discriminative between clusters. Note that the TF-IDF weighting method can also naturally filter out very common words in a “soft” way. Clearly, the standard set of stop words provide a valid set of words to prune. Nevertheless, we would like a way of quantifying the importance of a term directly to the clustering process, which is essential for more aggressive pruning. We will discuss a number of such methods below.
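A minimal sketch of this document-frequency based pruning is shown below; the specific cutoffs (a minimum document frequency and a maximum fraction of documents) are illustrative assumptions, since suitable thresholds depend on the corpus, and stop-word-list removal can be applied in addition.

```python
# Prune terms whose document frequency is too high (stop-word-like) or too low
# (misspellings, extremely rare terms). Thresholds are illustrative only.
from collections import Counter

def df_prune(tokenized_docs, min_df=2, max_df_fraction=0.5):
    n = len(tokenized_docs)
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    keep = {t for t, count in df.items()
            if count >= min_df and count / n <= max_df_fraction}
    return [[t for t in doc if t in keep] for doc in tokenized_docs]
```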
2.1.2 Term Strength. A much more aggressive technique for stop-word removal is proposed in [94]. The core idea of this approach is to extend techniques which are used in supervised learning to the unsupervised case. The term strength is essentially used to measure how informative a word is for identifying two related documents. For example, for two related documents x and y, the term strength s(t) of term t is defined in terms of the following probability:
s(t) = P(t ∈ y | t ∈ x)
In other words, s(t) is estimated as the fraction of related document pairs in which the term t occurs in both documents, among those pairs in which it occurs in the first document.
Here, the first document of the pair may simply be picked randomly. In order to prune features, the term strength may be compared to the expected strength of a term which is randomly distributed in the training documents with the same frequency. If the term strength of t is not at least two standard deviations greater than that of the random word, then it is removed from the collection.
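A small sketch of the estimation step is shown below. It assumes that pairs of related documents have already been formed (for example, by thresholding a cosine similarity, with the first member of each pair picked randomly); the pairing procedure and the random-baseline pruning test are omitted.

```python
# Term strength s(t) = P(t occurs in the second document | t occurs in the
# first document), estimated over a set of related document pairs.
from collections import defaultdict

def term_strength(related_pairs):
    """related_pairs: iterable of (x, y) tuples, where x and y are sets of terms."""
    in_first = defaultdict(int)   # pairs whose first document contains t
    in_both = defaultdict(int)    # pairs in which t occurs in both documents
    for x, y in related_pairs:
        for t in x:
            in_first[t] += 1
            if t in y:
                in_both[t] += 1
    return {t: in_both[t] / in_first[t] for t in in_first}

pairs = [({"text", "cluster"}, {"text", "mining"}),
         ({"text", "graph"}, {"graph", "cluster"})]
print(term_strength(pairs))
```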
One advantage of this approach is that it requires no initial supervision or training data for the feature selection, which is a key requirement in the unsupervised scenario. Of course, the approach can also be used for feature selection in either supervised clustering [4] or categorization [100], when such training data is indeed available. One observation about this approach to feature selection is that it is particularly suited to similarity-based clustering because the discriminative nature of the underlying features is defined on the basis of similarities in the documents themselves.
2.1.3 Entropy-based Ranking. The entropy-based ranking approach was proposed in [27]. In this case, the quality of the term is measured by the entropy reduction when it is removed. Here the entropy E(t) of the term t in a collection of n documents is defined as follows:
E(t) = − Σ_{i=1..n} Σ_{j=1..n} ( S(i,j) · log S(i,j) + (1 − S(i,j)) · log(1 − S(i,j)) )
Here S(i,j) is the similarity between the ith and jth document in the collection, after the term t is removed, and is defined as follows:
S(i,j) = 2^( −dist(i,j) / avgdist )
Here dist(i,j) is the distance between the documents i and j after the term t is removed, and avgdist is the average distance between the documents after the term t is removed. We note that the computation of E(t) for each term t requires O(n^2) operations. This is impractical for a very large corpus containing many terms. It has been shown in [27] how this method may be made much more efficient with the use of sampling methods.
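A sketch of this criterion is shown below. It scores a single candidate term on a random sample of documents, since the full computation over all pairs is impractical; the simple set-based distance used here is an illustrative stand-in for whatever document distance the clustering system actually uses.

```python
# Entropy-based quality of a term t: remove t from a sampled set of documents,
# compute pairwise similarities S(i,j) = 2^(-dist(i,j)/avgdist), and accumulate
# the entropy -sum(S*log S + (1-S)*log(1-S)) over the sampled pairs.
import math
import random

def entropy_after_removal(docs, t, sample_size=50, seed=0):
    """docs: list of sets of terms; t: the candidate term to score."""
    rng = random.Random(seed)
    sample = rng.sample(docs, min(sample_size, len(docs)))
    reduced = [d - {t} for d in sample]
    if len(reduced) < 2:
        return 0.0

    def dist(a, b):  # illustrative set-based (Jaccard) distance between documents
        union = len(a | b) or 1
        return 1.0 - len(a & b) / union

    pairs = [(i, j) for i in range(len(reduced)) for j in range(i + 1, len(reduced))]
    dists = [dist(reduced[i], reduced[j]) for i, j in pairs]
    avgdist = (sum(dists) / len(dists)) or 1e-9
    entropy = 0.0
    for d in dists:
        s = 2.0 ** (-d / avgdist)
        for p in (s, 1.0 - s):
            if p > 0.0:
                entropy -= p * math.log(p)
    return entropy

docs = [{"text", "mining"}, {"text", "cluster"}, {"graph", "cluster"}, {"graph", "mining"}]
print(entropy_after_removal(docs, "mining"))
```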
2.1.4 Term Contribution. The concept of term contribution [62] is based on the fact that the results of text clustering are highly dependent on document similarity. Therefore, the contribution of a term can be viewed as its contribution to document similarity. For example, in the case of dot-product based similarity, the similarity between two documents is defined as the dot product of their normalized frequencies. Therefore, the contribution of a term to the similarity of two documents is the product of their normalized frequencies in the two documents. This needs to be summed over all pairs of documents in order to determine the term contribution. As in the previous case, this method requires O(n^2) time for each term, and therefore sampling methods may be required to speed up the computation. A major criticism of this method is that it tends to favor highly frequent words without regard to the specific discriminative power within a clustering process.
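The following sketch computes the term contribution from normalized document vectors. It uses the identity that the sum of products over unordered document pairs equals ((Σ w)^2 − Σ w^2)/2, which avoids the explicit pairwise loop; this algebraic shortcut is an implementation convenience rather than part of the formulation in [62].

```python
# Term contribution: the total contribution of each term to all pairwise
# dot-product similarities, i.e. the sum over document pairs of the product of
# the term's normalized weights in the two documents.
def term_contribution(weighted_docs):
    """weighted_docs: list of dicts {term: normalized (e.g., TF-IDF) weight}."""
    contribution = {}
    vocabulary = {t for doc in weighted_docs for t in doc}
    for t in vocabulary:
        weights = [doc.get(t, 0.0) for doc in weighted_docs]
        total = sum(weights)
        # Sum over unordered pairs i != j of w_i * w_j, computed in linear time.
        contribution[t] = (total * total - sum(w * w for w in weights)) / 2.0
    return contribution

docs = [{"text": 0.8, "mining": 0.6}, {"text": 0.5, "cluster": 0.87}, {"mining": 1.0}]
print(term_contribution(docs))
```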
In most of these methods, the optimization of term selection is based on some pre-assumed similarity function (e.g., cosine). While this strategy makes these methods unsupervised, there is a concern that the term selection might be biased due to the potential bias of the assumed similarity function. That is, if a different similarity function is assumed, we may end up having different results for term selection. Thus the choice of an appropriate similarity function may be important for these methods.
2.2 LSI-based Methods
In feature selection, we attempt to explicitly select out features from the original data set. Feature transformation is a different method in which the new features are defined as a functional representation of the features in the original data set. The most common class of methods is that of dimensionality reduction [53] in which the documents are transformed to a new feature space of smaller dimensionality in which the features are typically a linear combination of the features in the original data. Methods such as Latent Semantic Indexing (LSI) [28] are based on this common principle. The overall effect is to remove a lot of dimensions in the data which are noisy for similarity based applications such as clustering. The removal of such dimensions also helps magnify the semantic effects in the underlying data.
Since LSI is closely related to the problem of Principal Component Analysis (PCA) or Singular Value Decomposition (SVD), we will first discuss this method, and its relationship to LSI. For a d-dimensional data set, PCA constructs the symmetric d × d covariance matrix C of the data, in which the (i, j)th entry is the covariance between dimensions i and j. This matrix is positive semi-definite, and can be diagonalized as follows:
C = P · D · P^T
Here P is a matrix whose columns contain the orthonormal eigenvectors of C and D is a diagonal matrix containing the corresponding eigenvalues. We note that the eigenvectors represent a new orthonormal basis system along which the data can be represented. In this context, the eigenvalues correspond to the variance when the data is projected along this basis system. This basis system is also one in which the second order covariances of the data are removed, and most of the variance in the data is captured by preserving the eigenvectors with the largest eigenvalues. Therefore, in order to reduce the dimensionality of the data, a common approach is to represent the data in this new basis system, which is further truncated by ignoring those eigenvectors for which the corresponding eigenvalues are small. This is because the variances along those dimensions are small, and the relative behavior of the data points is not significantly affected by removing them from consideration. In fact, it can be shown that the Euclidean distances between data points are not significantly affected by this transformation and corresponding truncation. The method of PCA is commonly used for similarity search in database retrieval applications.
LSI is quite similar to PCA, except that we use an approximation of the covariance matrix C which is quite appropriate for the sparse and high-dimensional nature of text data. Specifically, let A be the n × d term-document matrix in which the (i,j)th entry is the normalized frequency for term j in document i. Then, A^T · A is a d × d matrix which is a close (scaled) approximation of the covariance matrix, in which the means have not been subtracted out. In other words, the value of A^T · A would be the same as a scaled version (by factor n) of the covariance matrix, if the data is mean-centered. While text-representations are not mean-centered, the sparsity of text ensures that the use of A^T · A is quite a good approximation of the (scaled) covariances. As in the case of numerical data, we use the eigenvectors of A^T · A with the largest variance in order to represent the text. In typical collections, only about 300 to 400 eigenvectors are required for the representation. One excellent characteristic of LSI [28] is that the truncation of the dimensions removes the noise effects of synonymy and polysemy, and the similarity computations are more closely affected by the semantic concepts in the data. This is particularly useful for a semantic application such as text clustering. However, if finer granularity clustering is needed, such low-dimensional space representation of text may not be sufficiently discriminative; in information retrieval, this problem is often solved by mixing the low-dimensional representation with the original high-dimensional word-based representation (see, e.g., [105]).
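The retained eigenvectors of A^T · A are the right singular vectors of A, so LSI is typically computed with a truncated SVD. The sketch below illustrates the projection on a tiny dense matrix; the example matrix and the choice of k = 2 are purely illustrative, and a real collection would use a sparse solver.

```python
# LSI sketch: keep the top-k right singular vectors of the term-document matrix
# (the dominant eigenvectors of A^T A) and project each document onto them.
import numpy as np

def lsi_project(A, k):
    """A: n x d matrix with documents in rows. Returns the n x k projection."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return A @ Vt[:k].T          # equivalently U[:, :k] scaled by s[:k]

A = np.array([[2., 1., 0., 0.],
              [1., 2., 0., 0.],
              [0., 0., 1., 2.],
              [0., 0., 2., 1.]])
print(lsi_project(A, 2))
```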
A similar technique to LSI, but based on probabilistic modeling is Probabilistic Latent Semantic Analysis (PLSA) [49]. The similarity and equivalence of PLSA and LSI are discussed in [49].
2.2.1 Concept Decomposition using Clustering. One interesting observation is that while feature transformation is often used as a pre-processing technique for clustering, the clustering itself can be used for a novel dimensionality reduction technique known as concept decomposition [2, 29]. This of course leads to the issue of circularity in the use of this technique for clustering, especially if clustering is required in order to perform the dimensionality reduction. Nevertheless, it is still possible to use this technique effectively for pre-processing with the use of two separate phases of clustering.
The technique of concept decomposition uses any standard clustering technique [2, 29] on the original representation of the documents. The frequent terms in the centroids of these clusters are used as basis vectors which are almost orthogonal to one another. The documents can then be represented in a much more concise way in terms of these basis vectors. We note that this condensed conceptual representation allows for enhanced clustering as well as classification. Therefore, a second phase of clustering can be applied on this reduced representation in order to cluster the documents much more effectively. Such a method has also been tested in [87] by using word-clusters in order to represent documents. We will describe this method in more detail later in this chapter.
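A two-phase sketch of this idea is shown below. It uses ordinary k-means for both phases and least-squares coordinates against the centroid vectors; the concept decomposition of [2, 29] is built around spherical k-means and a least-squares matrix approximation, so this should be read as an approximate illustration rather than a faithful reimplementation.

```python
# Concept decomposition sketch: a first clustering pass produces k centroid
# "concept" vectors, each document is re-expressed by its least-squares
# coordinates against those vectors, and a second pass clusters the condensed
# representation.
import numpy as np
from sklearn.cluster import KMeans

def concept_decomposition(X, k, random_state=0):
    """X: n x d dense document matrix. Returns (n x k representation, labels)."""
    first = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
    C = first.cluster_centers_                     # k x d concept basis vectors
    Z, *_ = np.linalg.lstsq(C.T, X.T, rcond=None)  # solve C^T Z = X^T
    Z = Z.T                                        # n x k condensed representation
    labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(Z)
    return Z, labels
```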
2.3 Non-negative Matrix Factorization
The non-negative matrix factorization (NMF) technique is a latent-space method, and is particularly suitable to clustering [97]. As in the case of LSI, the NMF scheme represents the documents in a new axis-system which is based on an analysis of the term-document matrix. However, the NMF method has a number of critical differences from the LSI scheme from a conceptual point of view. In particular, the NMF scheme is a feature transformation method which is particularly suited to clustering. The main conceptual characteristics of the NMF scheme, which are very different from LSI are as follows:
- In LSI, the new basis system consists of a set of orthonormal vectors. This is not the case for NMF.
- In NMF, the vectors in the basis system directly correspond to cluster topics. Therefore, the cluster membership for a document may be determined by examining the largest component of the document along any of the vectors. The coordinate of any document along a vector is always non-negative. The expression of each document as an additive combination of the underlying semantics makes a lot of sense from an intuitive perspective. Therefore, the NMF transformation is particularly suited to clustering, and it also provides an intuitive understanding of the basis system in terms of the clusters.
Let A be the n × d term document matrix. Let us assume that we wish to create k clusters from the underlying document corpus. Then, the non-negative matrix factorization method attempts to determine the matrices U and V which minimize the following objective function:
J = (1/2) · ||A − U · V^T||
Here || · || represents the sum of the squares of all the elements in the matrix, U is an n × k non-negative matrix, and V is a d × k non-negative matrix. We note that the columns of V provide the k basis vectors which correspond to the k different clusters.
What is the significance of the above optimization problem? Note that by minimizing J, we are attempting to factorize A approximately as:
A ≈ U · V^T
For each row a of A (document vector), we can rewrite the above equation as:
a ≈ u · V^T = Σ_{i=1..k} u_i · v_i^T
Here u is the corresponding row of U, u_i is its ith entry, and v_i is the ith column of V. Therefore, the document vector a can be rewritten as an approximate linear (non-negative) combination of the basis vectors which correspond to the k columns of V. If the value of k is relatively small compared to the corpus, this can only be done if the column vectors of V discover the latent structure in the data. Furthermore, the non-negativity of the matrices U and V ensures that the documents are expressed as a non-negative combination of the key concepts (or clustered) regions in the term-based feature space.
Next, we will discuss how the optimization problem for J above is actually solved. The squared norm of any matrix Q can be expressed as the trace of the matrix Q · Q^T. Therefore, we can express the objective function above as follows:
J = (1/2) · tr( (A − U · V^T) · (A − U · V^T)^T )
  = (1/2) · ( tr(A · A^T) − 2 · tr(A · V · U^T) + tr(U · V^T · V · U^T) )
Thus, we have an optimization problem with respect to the matrices U and V, whose entries u_ij and v_ij are the variables with respect to which we need to optimize this problem. In addition, since the matrices are non-negative, we have the constraints that u_ij ≥ 0 and v_ij ≥ 0. This is a typical constrained non-linear optimization problem, and can be solved using the Lagrange method. Let α = [α_ij] and β = [β_ij] be matrices with the same dimensions as U and V respectively. The elements of the matrices α and β are the corresponding Lagrange multipliers for the non-negativity conditions on the different elements of U and V respectively. We note that tr(α · U^T) is simply equal to Σ_{i,j} α_ij · u_ij and tr(β · V^T) is simply equal to Σ_{i,j} β_ij · v_ij. These correspond to the Lagrange expressions for the non-negativity constraints. Then, we can express the Lagrangian optimization problem as follows:
L = J + tr(α · U^T) + tr(β · V^T)
Then, we can express the partial derivatives of L with respect to U and V as follows, and set them to 0:
∂L/∂U = −A · V + U · V^T · V + α = 0
∂L/∂V = −A^T · U + V · U^T · U + β = 0
We can then multiply the (i,j)th entry of the above (two matrices of) conditions with u_ij and v_ij respectively. Using the Kuhn-Tucker conditions α_ij · u_ij = 0 and β_ij · v_ij = 0, we get the following:
(A · V)_ij · u_ij − (U · V^T · V)_ij · u_ij = 0
(A^T · U)_ij · v_ij − (V · U^T · U)_ij · v_ij = 0
We note that these conditions are independent of α and β. This leads to the following iterative updating rules for u_ij and v_ij:
u_ij ← u_ij · (A · V)_ij / (U · V^T · V)_ij
v_ij ← v_ij · (A^T · U)_ij / (V · U^T · U)_ij
It has been shown in [58] that the objective function continuously improves under these update rules, and converges to an optimal solution.
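These multiplicative updates are straightforward to implement. The sketch below applies them to a small random non-negative matrix; the random initialization, the fixed iteration count, and the small constant added to the denominators to avoid division by zero are implementation choices, not part of the derivation above.

```python
# Multiplicative update rules for NMF derived above:
#   u_ij <- u_ij * (A V)_ij / (U V^T V)_ij
#   v_ij <- v_ij * (A^T U)_ij / (V U^T U)_ij
import numpy as np

def nmf(A, k, iterations=200, eps=1e-9, seed=0):
    """A: non-negative n x d term-document matrix; k: number of clusters."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    U = rng.random((n, k))
    V = rng.random((d, k))
    for _ in range(iterations):
        U *= (A @ V) / (U @ V.T @ V + eps)
        V *= (A.T @ U) / (V @ U.T @ U + eps)
    return U, V

A = np.random.default_rng(1).random((6, 4))
U, V = nmf(A, 2)
print(U.argmax(axis=1))              # cluster membership: largest component per document
print(np.linalg.norm(A - U @ V.T))   # reconstruction error
```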
One interesting observation about the matrix factorization technique is that it can also be used to determine word-clusters instead of document clusters. Just as the columns of V provide a basis which can be used to discover document clusters, we can use the columns of U to discover a basis which correspond to word clusters. As we will see later, document clusters and word clusters are closely related, and it is often useful to discover both simultaneously, as in frameworks such as co-clustering [30, 31, 75]. Matrix-factorization provides a natural way of achieving this goal. It has also been shown both theoretically and experimentally [33, 93] that the matrix-factorization technique is equivalent to another graph-structure based document clustering technique known as spectral clustering. An analogous technique called concept factorization was proposed in [98], which can also be applied to data points with negative values in them.
3. Distance-based Clustering Algorithms
Distance-based clustering algorithms are designed by using a similarity function to measure the closeness between the text objects. The most well known similarity function which is used commonly in the text domain is the cosine similarity function. Let U = (f(u_1) . . . f(u_k)) and V = (f(v_1) . . . f(v_k)) be the damped and normalized frequency term vectors in two different documents U and V. The values u_i and v_i represent the (normalized) term frequencies, and the function f(·) represents the damping function. Typical damping functions for f(·) could represent either the square-root or the logarithm [25]. Then, the cosine similarity between the two documents is defined as follows:
cosine(U, V) = Σ_{i=1..k} f(u_i) · f(v_i) / ( √(Σ_{i=1..k} f(u_i)^2) · √(Σ_{i=1..k} f(v_i)^2) )
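A direct transcription of this similarity function is shown below, with square-root damping as one possible choice of f(·); representing each document as a sparse dictionary of term frequencies is an implementation convenience.

```python
# Cosine similarity between two damped, normalized term-frequency vectors.
import math

def cosine(doc_u, doc_v, damp=math.sqrt):
    """doc_u, doc_v: dicts mapping term -> (normalized) term frequency."""
    shared = set(doc_u) & set(doc_v)
    dot = sum(damp(doc_u[t]) * damp(doc_v[t]) for t in shared)
    norm_u = math.sqrt(sum(damp(x) ** 2 for x in doc_u.values()))
    norm_v = math.sqrt(sum(damp(x) ** 2 for x in doc_v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

print(cosine({"text": 3.0, "cluster": 1.0}, {"text": 1.0, "mining": 2.0}))
```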
Computation of text similarity is a fundamental problem in information retrieval. Although most of the work in information retrieval has focused on how to assess the similarity of a keyword query and a text document, rather than the similarity between two documents, many weighting heuristics and similarity functions can also be applied to optimize the similarity function for clustering. Effective information retrieval models generally capture three heuristics, i.e., TF weighting, IDF weighting, and document length normalization [36]. One effective way to assign weights to terms when representing a document as a weighted term vector is the BM25 term weighting method [78], where the normalized TF not only addresses length normalization, but also has an upper bound which improves the robustness as it avoids overly rewarding the matching of any particular term. A document can also be represented with a probability distribution over words (i.e., unigram language models), and the similarity can then be measured based on an information-theoretic measure such as cross entropy or Kullback-Leibler divergence [105]. For clustering, symmetric variants of such a similarity function may be more appropriate.