[PED07]Feature Selection for Clustering:A Review聚類特征選擇綜述

@[toc]

0.1 introduction介紹

高通量技術(shù)導(dǎo)致數(shù)據(jù)維度以及樣本數(shù)量呈指數(shù)增長(zhǎng)，使得對(duì)數(shù)據(jù)集進(jìn)行手動(dòng)處理顯得不太實(shí)際。但是由于收集數(shù)據(jù)的技術(shù)不完善或者數(shù)據(jù)本身來(lái)源的性質(zhì)挠锥，導(dǎo)致數(shù)據(jù)噪聲。因此如何從龐大而嘈雜的數(shù)據(jù)集中提取有用的知識(shí)是一項(xiàng)艱巨的任務(wù)。
降維是一種可以消除噪聲和冗余屬性（特征）的技術(shù)鹏往。降維技術(shù)可以分為特征提取（feature extraction）和特征選擇（feature selection）骇塘。
- 特征提取：特征被投影到一個(gè)新的低維空間伊履。
  常見的特征提取技術(shù)有：PCA韩容、LDA、SVD唐瀑。（Principle Component Analysis 群凶，Linear Discriminant Analysis ，Singular Value Decomposition）
- 特征選擇：從特征中選出一個(gè)子集來(lái)最小化冗余和最大化與目標(biāo)的相關(guān)性哄辣。
  常用的特征選擇方法有：Information Gain信息增益请梢，Relief，Chi Squares力穗，F(xiàn)isher Score毅弧，Lasso。
特征提取和特征選擇方法都能提高學(xué)習(xí)性能睛廊，降低計(jì)算開銷并獲得更加泛化的模型形真。但是特征選擇優(yōu)于特征提取，因?yàn)樘卣鬟x擇有更好的可讀性和可解釋性超全，因?yàn)樗匀槐３衷瓉?lái)的特征咆霜，只是去掉了一些認(rèn)為冗余的。而特征提取將特征從原始空間映射到新的低維空間嘶朱，得到的轉(zhuǎn)換的特征沒有物理含義蛾坯。
特征選擇被分為四種類型：
- filter model
- wrapper model
- embedded model
- hybrid model
特征選擇選擇能夠區(qū)分不同類樣本的特征。監(jiān)督學(xué)習(xí)中疏遏，將帶標(biāo)簽的樣本作為訓(xùn)練集以選擇特征脉课，如果 $f_i$ 和 $c_j$ 高度相關(guān)，則稱特征 $f_i$ 與類 $c_j$ 相關(guān)财异。無(wú)監(jiān)督學(xué)習(xí)中相關(guān)性就比較難定義倘零，但是特征選擇可以類似于改進(jìn)監(jiān)督學(xué)習(xí)的方式改進(jìn)無(wú)監(jiān)督學(xué)習(xí)。最常用的無(wú)監(jiān)督學(xué)習(xí)的方法是聚類戳寸，通過(guò)最大化類內(nèi)相似性呈驶，最小化類間相似性得到不同的簇。利用特征選擇使用好的特征子集可以幫助聚類產(chǎn)生好的結(jié)果并且可以大幅降低計(jì)算開銷疫鹊。

0.1.1 Data Clustering 聚類

數(shù)據(jù)量太大袖瞻，人工做標(biāo)簽非常困難。通常用聚類的方式進(jìn)行數(shù)據(jù)標(biāo)記拆吆。在聚類中聋迎，給出未標(biāo)記的數(shù)據(jù)，將類似的樣本放在一個(gè)簇中枣耀，不同的樣本應(yīng)該在不同的簇中霉晕。
聚類在很多機(jī)器學(xué)習(xí)和數(shù)據(jù)挖掘任務(wù)中很有用，如：圖像分割，信息檢索娄昆，模式識(shí)別佩微，模式分類，網(wǎng)絡(luò)分析等萌焰。它可以被視為探索性任務(wù)或預(yù)處理步驟哺眯。如果目標(biāo)是探索和揭示數(shù)據(jù)中隱藏的模式，那么聚類本身就是一個(gè)獨(dú)立的探索任務(wù)扒俯。但是奶卓，如果生成的聚類結(jié)果將用于促進(jìn)另一個(gè)數(shù)據(jù)挖掘或機(jī)器學(xué)習(xí)任務(wù)，則在這種情況下撼玄，集群將是預(yù)處理步驟夺姑。
有許多聚類方法。這些方法可以大致分為：
- 分區(qū)方法
  - 使用基于距離的度量來(lái)基于它們的相似性對(duì)點(diǎn)進(jìn)行聚類掌猛。 K-means和k-medoids是流行的分區(qū)算法盏浙。
- 分層方法
  - 分層方法將數(shù)據(jù)劃分為不同級(jí)別，形成層次結(jié)構(gòu)荔茬。這種聚類有助于數(shù)據(jù)可視化和摘要废膘。分層聚類可以以自下而上（agglomerative匯聚）方式或自上而下（divisive分裂）方式進(jìn)行。這種類型的聚類的例子是BIRCH慕蔚，Chameleon丐黄，AGNES追城，DIANA氢妈。
- 基于密度的方法
  - 與這兩種聚類技術(shù)不同撬碟，基于密度的聚類可以捕獲任意形狀的聚類甩骏，例如S形。密集區(qū)域中的數(shù)據(jù)點(diǎn)將形成簇仁堪，而來(lái)自不同簇的數(shù)據(jù)點(diǎn)將由低密度區(qū)域分開阶捆。 DBSCAN和OPTICS是基于密度的聚類方法的流行示例倦卖。

0.1.2 Feature Selection Models 特征選擇

高維數(shù)據(jù)的維度之咒鸠匀，使得降維非常重要接校。特征選擇是降維的一種重要手段。
特征選擇是根據(jù)某些相關(guān)性評(píng)估標(biāo)準(zhǔn)狮崩，從原始特征中選擇一小部分相關(guān)特征，這通常會(huì)帶來(lái)更好的學(xué)習(xí)性能鹿寻，例如：更高的學(xué)習(xí)準(zhǔn)確性睦柴，更低的計(jì)算成本和更好的模型可解釋性。特征選擇已成功應(yīng)用于許多實(shí)際應(yīng)用毡熏，如模式識(shí)別坦敌，文本分類，圖像處理，生物信息學(xué)等狱窘。
特征選擇的分類
- 1杜顺、根據(jù)是否使用標(biāo)簽，可以分為無(wú)監(jiān)督蘸炸、半監(jiān)督躬络、有監(jiān)督算法。
- 2搭儒、根據(jù)不同的選擇策略穷当，特征選擇算法可以分為：
  - Filter模型
    - 獨(dú)立于任何分類器，通過(guò)使用某些統(tǒng)計(jì)標(biāo)準(zhǔn)研究特征的相關(guān)性來(lái)評(píng)估特征的相關(guān)性淹禾。
    - Relief [59]馁菜，F(xiàn)isher score[16]，CFS [24]和FCBF [76]是Filter模型中最具代表性的算法铃岔。
  - Wrapper模型
    - 利用分類器作為選擇標(biāo)準(zhǔn)汪疮，使用給定的分類器選擇一組具有最大判別力的特征，例如：SVM毁习，KNN等智嚷。
    - 例子有FSSEM[17]， $l_1$ SVM蜓洪。Wrapper模型的其他示例可以是優(yōu)先搜索策略和給定分類器的任何組合纤勒。
    - 由于Wrapper模型依賴于給定的分類器，因此評(píng)估過(guò)程中通常需要交叉驗(yàn)證隆檀。它們通常在計(jì)算上更昂貴并且依賴選擇的分類器摇天。因此實(shí)際應(yīng)用中，F(xiàn)ilter模型更受歡迎恐仑，特別是對(duì)大型數(shù)據(jù)集的問(wèn)題泉坐。但經(jīng)驗(yàn)證明，Wrapper模型在分類精度方面優(yōu)于Filter模型裳仆。
  - Hybrid模型
    - 混合模型[13,40]被提出來(lái)彌補(bǔ)Filter和Wrapper模型之間的差距腕让。首先，它結(jié)合了統(tǒng)計(jì)標(biāo)準(zhǔn)歧斟，如Filter模型那樣纯丸，選出幾個(gè)給定基數(shù)的候選特征子集。然后静袖，從中選擇具有最高分類精度的子集[40]觉鼻。因此，混合模型通扯映龋可以實(shí)現(xiàn)與Wrapper相當(dāng)?shù)木_度和與Filter模型相當(dāng)?shù)男省?/li>
    - 混合模型的代表性特征選擇算法包括：BBHFS [13]坠陈，HGA [53]萨惑。
  - Embedded模型
    - Embedded模型在學(xué)習(xí)時(shí)間內(nèi)執(zhí)行特征選擇。換句話說(shuō)仇矾，它同時(shí)實(shí)現(xiàn)了模型訓(xùn)練和特征選擇庸蔼。
    - Embedded模型的例子包括：C4.5 [54]，BlogReg [21]和SBMLR [21]贮匕。
特征選擇的輸出：
- 1）子集選擇
  - 返回選擇的子集姐仅，通過(guò)特征的索引標(biāo)識(shí)。
- 2）特征加權(quán)
  - 返回對(duì)應(yīng)每個(gè)特征的權(quán)重粗合。
  - 特征加權(quán)被認(rèn)為是特征選擇的推廣萍嬉。在特征選擇中，為特征分配二進(jìn)制權(quán)重隙疚，1表示選擇特征壤追，0表示不選擇。而特征加權(quán)為特征分配一個(gè)值供屉，通常在區(qū)間[0行冰，1]或[-1，1]中伶丐。該值越大悼做，該特征就越顯著。在特征相關(guān)性得分不同的任務(wù)中哗魂，特征加權(quán)被發(fā)現(xiàn)優(yōu)于特征選擇肛走，這在大多數(shù)現(xiàn)實(shí)問(wèn)題中都是如此。如果設(shè)置閾值來(lái)根據(jù)權(quán)重選擇特征录别，則特征加權(quán)也可以簡(jiǎn)化為特征選擇朽色。因此，本章中提到的大多數(shù)特征選擇算法都可以使用特征加權(quán)方案來(lái)考慮组题。
- 3）子集選擇和特征加權(quán)
  - 返回一個(gè)排好序的特征子集葫男。
特征選擇步驟：
- 1）子集生成
- 2）子集評(píng)估
- 3）停止標(biāo)準(zhǔn)
- 4）結(jié)果驗(yàn)證
- 首先基于給定的搜索策略來(lái)選擇候選特征子集 ；這些子集在第二步驟中根據(jù)某個(gè)評(píng)估標(biāo)準(zhǔn)被評(píng)估崔列；將從滿足停止標(biāo)準(zhǔn)之后的所有候選中選擇最佳子集梢褐；最后，使用領(lǐng)域知識(shí)或驗(yàn)證集來(lái)驗(yàn)證所選擇的子集赵讯。

0.1.3 Feature Selection for Clustering 聚類的特征選擇

從聚類的角度來(lái)看盈咳，刪除不相關(guān)的特征不會(huì)對(duì)聚類準(zhǔn)確性產(chǎn)生負(fù)面影響，且可以減少所需的存儲(chǔ)和計(jì)算時(shí)間边翼。
圖2表示 $f_1$ 可以區(qū)分出兩個(gè)簇鱼响，而 $f_2$ 和 $f_3$ 不能區(qū)分（(b)中 $f_2$ 方向上藍(lán)色紅色都從0到1都有分布，故 $f_2$ 無(wú)法區(qū)分讯私；而 $f_1$ 方向上藍(lán)色分布在2-3热押，紅色分布在4-5，所以可以區(qū)分斤寇。）桶癣，所以 $f_2$ 和 $f_3$ 不會(huì)向聚類添加任何重要信息，刪除也不會(huì)影響聚類娘锁。

在這里插入圖片描述
相關(guān)特征的不同子集可能導(dǎo)致不同的聚類
圖3(a)顯示了利用特征 $f_1$ 和 $f_2$ 形成的的四個(gè)簇牙寞，而圖3(b)顯示了僅使用 $f_1$ 形成了兩個(gè)簇。類似地莫秆，( c )顯示了僅使用 $f_2$ 形成了兩個(gè)簇间雀。因此，相關(guān)特征的不同子集可能導(dǎo)致不同的聚類镊屎，這極大地幫助發(fā)現(xiàn)數(shù)據(jù)中的不同隱藏模式惹挟。

在這里插入圖片描述

受這些事實(shí)的啟發(fā)，提出了很多不同的聚類技術(shù)缝驳，通過(guò)利用特征選擇方法消除不相關(guān)和冗余的特征连锯，同時(shí)保留相關(guān)特征，以提高聚類效率和質(zhì)量用狱。后面我們將描述基于域的不同的特征選擇聚類（FSC）方法运怖。介紹：傳統(tǒng)FSC，文本數(shù)據(jù)中的FSC夏伊，流數(shù)據(jù)中的FSC和FSC鏈接數(shù)據(jù)摇展。

與監(jiān)督學(xué)習(xí)的特征選擇類似，用于聚類的特征選擇也被分類為Filter[15]溺忧、Wrapper[55]咏连、Hybrid[19]。

Wrapper模型通過(guò)聚類質(zhì)量評(píng)估候選特征子集砸狞。
Filter模型獨(dú)立于聚類算法捻勉。Filter模型在計(jì)算時(shí)間方面更好，并且對(duì)任何聚類方法都是無(wú)偏的刀森。但是如果我們事先知道聚類方法踱启，Wrapper模型產(chǎn)生更好的聚類。
為減輕Wrapper模型的計(jì)算成本研底，利用過(guò)濾標(biāo)準(zhǔn)來(lái)選擇Hybrid中的候選特征子集埠偿。

0.1.3.1 Filter Model

不是使用聚類算法測(cè)試特征的質(zhì)量，通過(guò)一個(gè)確定的標(biāo)準(zhǔn)來(lái)給特征的打分榜晦，然后選擇最高評(píng)分的特征冠蒋。

Dash等人在總結(jié)Ben-Bassat等人、Doak等人的工作后將評(píng)價(jià)準(zhǔn)則分為五類：
距離度量（Distance Measure）乾胶、
信息增益度量（Information Gain Measure）抖剿、
依賴性度量（Dependence Measure）朽寞、
一致性度量（Consistency Measure）、
分類器錯(cuò)誤率度量（Classifier Error Rate Measure）斩郎。

（1）距離度量：距離度量一般認(rèn)為是差異性或者分離性的度量脑融，常用的距離度量方法有歐式距離等。對(duì)于一個(gè)二元分類問(wèn)題缩宜，對(duì)于兩個(gè)特征f1f1和f2f2肘迎，如果特征f1f1引起的兩類條件概率差異大于特征f2f2，則認(rèn)為特征f1f1優(yōu)于特征f2f2锻煌。
（2）信息增益度量：特征f的信息增益定義為使用特征f的先驗(yàn)不確定性與期望的后驗(yàn)不確性之間的差異妓布。若特征f1f1的信息增益大于特征f2f2的信息增益，則認(rèn)為特征f1f1優(yōu)于特征f2f2宋梧。
（3）依賴性度量：依賴性度量又稱為相關(guān)性度量（Correlation Measure）匣沼、通常可采用皮爾遜相關(guān)系數(shù)（Pearson correlation coefficient）來(lái)計(jì)算特征f與類別C之間的相關(guān)度乃秀，若特征f1f1與類別C之間的相關(guān)性大于特征f2f2與類別C之間的相關(guān)性肛著，則認(rèn)為特征f1f1優(yōu)于特征f2f2。同樣也可以計(jì)算得到屬性與屬性之間的相關(guān)度跺讯，屬性與屬性之間的相關(guān)性越低越好枢贿。
（4）一致性度量：假定兩個(gè)樣本，若它們的特征值相同刀脏，且所屬類別也相同局荚，則認(rèn)為它們是一致的：否則，則稱它們不一致愈污。一致性常用不一致率來(lái)衡量耀态，其嘗試找出與原始特征集具有一樣辨別能力的最小的屬性子集。
（5）分類器錯(cuò)誤率度量：該度量使用學(xué)習(xí)器的性能作為最終的評(píng)價(jià)閾值暂雹。它傾向于選擇那些在分類器上表現(xiàn)較好的子集首装。

-以上5種度量方法中，距離度量（Distance Measure）杭跪、信息增益度量（Information Gain Measure）仙逻、依賴性度量（Dependence Measure）、一致性度量（Consistency Measure）常用于過(guò)濾式（filter）涧尿；
-分類器錯(cuò)誤率度量(Classifier Error Rate Measure)則用于包裹式(wrapper）系奉。
https://blog.csdn.net/u012328159/article/details/53954522
特征評(píng)估可以是單變量(univariate)或多變量(multivariate)。
- 單變量意味著每個(gè)特征的評(píng)估與特征空間無(wú)關(guān)姑廉。比多變量更快缺亮、更有效。
- 多變量可以根據(jù)其他特征評(píng)估特征桥言。與單變量方法不同萌踱，多變量能夠處理冗余特征葵礼。
算法：SPEC（0.2.1.1），是單變量Filter模型的一個(gè)例子并鸵，在[78]中擴(kuò)展到多變量方法章咧。feature dependency [62]（特征依賴）, entropy-based distance [15]（基于熵的距離）, and laplacian score [26, 80]（拉普拉斯分?jǐn)?shù)）。

0.1.3.2 Wrapper Mode

Wrapper模型是利用聚類算法進(jìn)行評(píng)估的特征選擇模型能真。
- 首先找一個(gè)特征子集。
- 然后使用這個(gè)特征子集進(jìn)行聚類扰柠，評(píng)估聚類效果粉铐。
- 重復(fù)上述兩個(gè) 過(guò)程直到得到期望的效果出現(xiàn)。
問(wèn)題：評(píng)估所有的可能的特征子集對(duì)于高維數(shù)據(jù)集是不可能的卤档，所以常常采用啟發(fā)式搜索策略來(lái)縮小搜索空間蝙泼。即便如此 Wrapper模型比Filter模型計(jì)算復(fù)雜性上還是要昂貴的多。
算法：[18]中提出的方法是一個(gè)包含最大似然準(zhǔn)則劝枣、特征選擇和高斯混合作為聚類方法的包裝器的例子汤踏。[32]中是使用傳統(tǒng)的聚類方法，如k-means和任何搜索策略作為特征選擇器舔腾。

0.1.3.3 Hybrid Model

結(jié)合Filter和Wrapper模型：
- 利用Filter的標(biāo)準(zhǔn)選擇出不同的候選特征子集
- 評(píng)估候選特征子集的聚類結(jié)果的質(zhì)量
- 聚類結(jié)果最好的那個(gè)子集就是我們要的特征選擇的集合
比Filter的聚類效果好溪胶，比Wrapper的效率高。

0.2 Feature Selection for Clustering 聚類的特征選擇

一些算法處理文本數(shù)據(jù)稳诚，一些算法處理流數(shù)據(jù)哗脖。還有一些算法能夠處理不同類型的數(shù)據(jù)。在本節(jié)中扳还，我們將討論一下算法以及它們可以處理的數(shù)據(jù)類型才避。

0.2.1 Algorithms for Generic Data 通用數(shù)據(jù)算法

能夠處理通用數(shù)據(jù)集的聚類特征選擇

0.2.1.1 Spectral Feature Selection (SPEC)譜特征選擇

SPEC[80]既可以監(jiān)督也可以無(wú)監(jiān)督學(xué)習(xí)，這里作為<font color=red>Filter模型無(wú)監(jiān)督特征選擇</font>方法氨距。

[80]提出了一種基于"譜圖理論"（spectral graph）的特征選取框架桑逝，像Laplacian score 和 ReliefF 都屬于這個(gè)框架的一個(gè)特殊情況而已。而這個(gè)框架的假設(shè)俏让，依然是本著原數(shù)據(jù)最重要的原則楞遏，假設(shè)一個(gè)好的特征應(yīng)該與原來(lái)（訓(xùn)練）數(shù)據(jù)構(gòu)成的圖有著相似的結(jié)構(gòu)。當(dāng)然一個(gè)特征畢竟是有限的（比如用性別來(lái)區(qū)分人有沒有錢）舆驶，可是這個(gè)特征與訓(xùn)練數(shù)據(jù)的相關(guān)性越大橱健，我們就覺得這個(gè)特征越好，越可取沙廉。
通過(guò)評(píng)估 從相似矩陣S導(dǎo)出的譜矩陣的特征一致性 來(lái)評(píng)估特征相關(guān)性拘荡。
使用徑向基函數(shù)（Radial Basis Function）作為樣本 $x_i$ 和 $x_j$ 之間的相似度函數(shù)。徑向基函數(shù)是某種沿徑向?qū)ΨQ的標(biāo)量函數(shù)撬陵，通常定義為樣本到數(shù)據(jù)中心之間徑向距離（通常是歐氏距離）的單調(diào)函數(shù)珊皿。常用的高斯徑向基函數(shù)形如：
<img src="https://img-blog.csdnimg.cn/2019080716164376.png" width="30%">
算法：
<img src="https://img-blog.csdnimg.cn/20190809104130163.png" width="60%">
- 1.構(gòu)建數(shù)據(jù)的相似性矩陣S网缝，以及由此基礎(chǔ)推出的圖的表示G，和D蟋定，W粉臊，L。
  - 在這里插入圖片描述
  - $\overline{D}_{ii}=\sum_{j=1}^nW_{ij}$ 驶兜， $\overline{D}$ 是對(duì)角矩陣扼仲。
  - 在這里插入圖片描述
    
    G由S構(gòu)造，鄰接矩陣W由G構(gòu)造抄淑。
- 2.使用三個(gè)權(quán)重函數(shù)評(píng)估特征的權(quán)重屠凶。函數(shù)來(lái)源于正則化割函數(shù)和圖譜，并可以擴(kuò)展到更加一般的形式肆资。我們假設(shè)給定特征向量 $f_i$ 矗愧，每個(gè)函數(shù) $\psi$ 基于歸一化拉普拉斯算子 $L$ 返回權(quán)重。

0.2.1.2 Laplacian Score (LS)拉普拉斯分?jǐn)?shù)

如果將SPEC 中<img src="https://img-blog.csdnimg.cn/20190809104238212.png" width="15%" align=center>替換為:
<img src="https://img-blog.csdnimg.cn/20190809104931193.png" width="60%" align=center>
則LS拉普拉斯分?jǐn)?shù)是SPEC的一個(gè)特殊的案例郑原。

LS在數(shù)據(jù)大小方面非常有效唉韭。與SPEC相似，LS中最耗時(shí)的是構(gòu)造相似矩陣s犯犁。該算法的優(yōu)點(diǎn)是既能處理帶標(biāo)記的數(shù)據(jù)属愤，又能處理無(wú)標(biāo)記的數(shù)據(jù)。

0.2.1.3 Feature Selection for Sparse Clustering稀疏聚類特征選擇

[71]用Lasso和 $L_1$ 范數(shù)作為特征選擇方法嵌入在聚類過(guò)程中酸役。特征選擇的數(shù)量L使用gap statistics選擇春塌，類似于[67]中的選擇聚類數(shù)量。

目標(biāo)函數(shù)：
<img src="https://img-blog.csdnimg.cn/20190810101249215.png" width="65%" align=center>
$n_c$ 是某一類中樣本數(shù)簇捍。
$Sim(i,i',j)$ 是只使用特征 $j$ 時(shí)樣本 $i$ 和樣本 $i'$ 的相似度只壳。
<img src="https://img-blog.csdnimg.cn/20190810100705149.png" width="65%" align=center>
優(yōu)化：
采用交替優(yōu)化方法，首先固定 $w$ 暑塑，優(yōu)化關(guān)于 $\{C_1,…,C_K\}$ 的(0.4)式吼句，在這一步，僅使用第 $j$ 個(gè)特征對(duì) $n\times n$ 的相似度矩陣上用標(biāo)準(zhǔn)K-means聚類事格。得到一個(gè)聚類之后再優(yōu)化關(guān)于 $w$ 的(0.4)式惕艳。
算法：
<img src="https://img-blog.csdnimg.cn/20190810101842989.png" width="70%" align=center>

0.2.1.4 Localized Feature Selection Based on Scatter Separability(LFSBSS) 基于離散分離性的局部特征選擇

[35]借鑒了Dy和Brodley[18]中離散分離性的概念，并將其作為局部特征選擇驹愚。他們將分散可分性定義為:
<img src="https://img-blog.csdnimg.cn/20190810104222900.png" width="20%">
其中 $S_w^{-1}$ 是類內(nèi)分離性的逆远搪， $S_b$ 是類間分離性。
只要聚類任務(wù)不變逢捺， $\Omega_i$ 隨維數(shù)增加單調(diào)遞增谁鳍。為了解決這個(gè)問(wèn)題，分離性標(biāo)準(zhǔn)必須根據(jù)特征選擇的維數(shù)進(jìn)行標(biāo)準(zhǔn)化。此外倘潜，由于局部的特征選擇嘗試為每個(gè)簇選擇不同的相關(guān)特征集绷柒，因此簇之間的分離性也需要進(jìn)行適當(dāng)?shù)囊?guī)范化。這是通過(guò)對(duì)單個(gè)簇的交叉投影來(lái)實(shí)現(xiàn)的涮因。
LFSBSS采用序列向后特性選擇废睦。這意味著，集群首先使用整個(gè)特征空間生成养泡。然后嗜湃，迭代地從每個(gè)集群中刪除基于a的不相關(guān)或有噪聲的特性。
算法：
<img src="https://img-blog.csdnimg.cn/20190810102457344.png" width="80%">

0.2.1.5 Multi-Cluster Feature Selection (MCFS)

0.2.1.6 Feature Weighting k-means

0.2.2 Algorithms for Text Data

0.2.2.1 Term Frequency (TF)

0.2.2.2 Inverse Document Frequency (IDF)

0.2.2.3 Term Frequency-Inverse Document Frequency (TF-IDF)

0.2.2.4 Chi Square statistic

0.2.2.5 Frequent Term-Based Text Clustering

0.2.2.6 Frequent Term Sequence

0.2.3 Algorithms for Streaming Data

0.2.3.1 Text Stream Clustering Based on Adaptive Feature Selection (TSC-AFS)

0.2.3.2 High-dimensional Projected Stream Clustering (HPStream)

0.2.4 Algorithms for Linked Data

0.2.4.1 Challenges and Opportunities

0.2.4.2 LUFS: An Unsupervised Feature Selection Framework for Linked Data

0.2.4.3 Conclusion and Future Work for Linked Data

0.3 Discussions and Challenges

0.3.1 The Chicken or the Egg Dilemma

0.3.2 Model Selection: K and l

0.3.4 Stability

Bibliography

[1] Feature selection for dna methylation based cancer classi_cation. Bioinformatics, 17Suppl 1:S157-S164, 2001.
[2] A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19):2507-2517, Oct 2007.
[3] C.C. Aggarwal, J. Han, J. Wang, and P.S. Yu. A framework for clustering evolving data streams. In Proceedings of the 29th international conference on Very large data bases-Volume 29, pages 81-92. VLDB Endowment, 2003.
[4] C.C. Aggarwal, J. Han, J. Wang, and P.S. Yu. A framework for projected clustering of high dimensional data streams. In Proceedings of the Thirtieth international conference on Very large data bases-Volume 30, pages 852-863. VLDB Endowment, 2004.
[5] C.C. Aggarwal, J.L. Wolf, P.S. Yu, C. Procopiuc, and J.S. Park. Fast algorithms for projected clustering. ACM SIGMOD Record, 28(2):61-72, 1999.
[6] T.M. Akhriza, Y. Ma, and J. Li. Text clustering using frequent contextual termset. In Information Management, Innovation Management and Industrial Engineering(ICIII), 2011 International Conference on, volume 1, pages 339-342. IEEE, 2011.
[7] Salem Alelyani, LeiWang, and Huan Liu. The e_ect of the characteristics of the dataset on the selection stability. In Proceedings of the 23rd IEEE International Conference on Tools with Arti_cial Intelligence, 2011.
[8] F. Beil, M. Ester, and X. Xu. Frequent term-based text clustering. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 436-442. ACM, 2002.
[9] C. Boutsidis, M.W. Mahoney, and P. Drineas. Unsupervised feature selection for the k-means clustering problem. Advances in Neural Information Processing Systems, 22:153-161, 2009.
[10] P.S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support vector machines. pages 82-90. Morgan Kaufmann, 1998.
[11] D. Cai, C. Zhang, and X. He. Unsupervised feature selection for multi-cluster data. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 333-342. ACM, 2010.
[12] A.C. Carvalho, R.F. Mello, S. Alelyani, H. Liu, et al. Quantifying features using false nearest neighbors: An unsupervised approach. In Tools with Arti_cial Intelligence(ICTAI), 2011 23rd IEEE International Conference on, pages 994-997. IEEE, 2011.
[13] Sanmay Das. Filters, wrappers and a boosting-based hybrid for feature selection. In ICML '01: Proceedings of the Eighteenth International Conference on Machine Learning, pages 74-81, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.29 30
[14] M. Dash and Y.S. Ong. Relief-c: E_cient feature selection for clustering over noisy data. In Tools with Arti_cial Intelligence (ICTAI), 2011 23rd IEEE International Conference on, pages 869-872. IEEE, 2011.
[15] Manoranjan Dash, Kiseok Choi, Peter Scheuermann, and Huan Liu. Feature selection for clustering - a filter solution. In In Proceedings of the Second International Conference on Data Mining, pages 115-122, 2002.
[16] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classi_cation. John Wiley & Sons, New York, 2 edition, 2001.
[17] Jennifer G. Dy and Carla E. Brodley. Feature subset selection and order identi_cation for unsupervised learning. In In Proc. 17th International Conf. on Machine Learning, pages 247-254. Morgan Kaufmann, 2000.
[18] Jennifer G. Dy and Carla E. Brodley. Feature selection for unsupervised learning. J. Mach. Learn. Res., 5:845-889, 2004.
[19] J.G. Dy. Unsupervised feature selection. Computational Methods of Feature Selection, pages 19-39, 2008.
[20] B.C.M. Fung, K. Wang, and M. Ester. Hierarchical document clustering using frequent itemsets. In Proceedings of the SIAM International Conference on Data Mining, volume 30, pages 59-70, 2003.
[21] Nicola L. C. Talbot Gavin C. Cawley and Mark Girolami. Sparse multinomial logistic regression via bayesian l1 regularisation. In NIPS, 2006.
[22] L. Gong, J. Zeng, and S. Zhang. Text stream clustering algorithm based on adaptive feature selection. Expert Systems with Applications, 38(3):1393-1399, 2011.
[23] I. Guyon and A. Elissee. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157-1182, 2003.
[24] Mark A. Hall. Correlation-based feature selection for machine learning. Technical report, 1999.
[25] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2001.
[26] X. He, D. Cai, and P. Niyogi. Laplacian score for feature selection. Advances in Neural Information Processing Systems, 18:507, 2006.
[27] J.Z. Huang, M.K. Ng, H. Rong, and Z. Li. Automated variable weighting in k-means type clustering. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(5):657-668, 2005.
[28] Anil Jain and Douglas Zongker. Feature selection: Evaluation, application, and small sample performance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19:153-158, 1997.
[29] D. Jensen and J. Neville. Linkage and autocorrelation cause feature selection bias in relational learning. In ICML, pages 259-266, 2002.
[30] L. Jing, M.K. Ng, and J.Z. Huang. An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data. Knowledge and Data Engineering, IEEE Transactions on, 19(8):1026-1041, 2007.31
[31] Thorsten Joachims, Fachbereich Informatik, Fachbereich Informatik, Fachbereich Informatik, Fachbereich Informatik, and Lehrstuhl Viii. Text categorization with support vector machines: Learning with many relevant features, 1997.
[32] Y.S. Kim, W.N. Street, and F. Menczer. Evolutionary model selection in unsupervised learning. Intelligent Data Analysis, 6(6):531-556, 2002.
[33] Ron Kohavi and George H. John. Wrappers for feature subset selection, 1996.
[34] Y. Li, S.M. Chung, and J.D. Holt. Text document clustering based on frequent word meaning sequences. Data & Knowledge Engineering, 64(1):381-404, 2008.
[35] Y. Li, M. Dong, and J. Hua. Localized feature selection for clustering. Pattern Recognition Letters, 29(1):10-18, 2008.
[36] Y. Li, C. Luo, and S.M. Chung. Text clustering with feature selection by using statistical data. Knowledge and Data Engineering, IEEE Transactions on, 20(5):641-652,2008.
[37] H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and Data Mining. Boston: Kluwer Academic Publishers, 1998.
[38] H. Liu and H. Motoda, editors. Computational Methods of Feature Selection. Chapman and Hall/CRC Press, 2007.
[39] Huan Liu and Rudy Setiono. A probabilistic approach to feature selection - a filter solution. pages 319-327. Morgan Kaufmann.
[40] Huan Liu and Lei Yu. Toward integrating feature selection algorithms for classi_cation and clustering. Knowledge and Data Engineering, IEEE Transactions on, 17(4):491 -502, April 2005.
[41] B. Long, Z.M. Zhang, X. Wu, and P.S. Yu. Spectral clustering for multi-type relational data. In Proceedings of the 23rd international conference on Machine learning, pages 585-592. ACM, 2006.
[42] B. Long, Z.M. Zhang, and P.S. Yu. A probabilistic framework for relational clustering. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 470-479. ACM, 2007.
[43] H.P. Luhn. A statistical approach to mechanized encoding and searching of literary information. IBM Journal of research and development, 1(4):309-317, 1957.
[44] Ulrike Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416, 2007.
[45] S.A. Macskassy and F. Provost. Classi_cation in networked data: A toolkit and a univariate case study. The Journal of Machine Learning Research, 8:935-983, 2007.
[46] Pabitra Mitra, Student Member, C. A. Murthy, and Sankar K. Pal. Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 24:301-312, 2002.
[47] D.S. Modha and W.S. Spangler. Feature weighting in k-means clustering. Machine learning, 52(3):217-237, 2003.32
[48] K. Morik, A. Kaspari, M. Wurst, and M. Skirzynski. Multi-objective frequent termset clustering. Knowledge and Information Systems, pages 1-24, 2012.
[49] MSK Mugunthadevi, M. Punitha, and M. Punithavalli. Survey on feature selection in document clustering. International Journal, 3, 2011.
[50] Mark E. J. Newman and Michelle Girvan. Finding and evaluating community structure in networks. Physical review E, 69(2):26113, 2004.
[51] Andrew Y. Ng. On feature selection: Learning with exponentially many irrelevant features as training examples. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 404-412. Morgan Kaufmann, 1998.
[52] Kamal Nigam, Andrew Kachites Mccallum, Sebastian Thrun, and Tom Mitchell. Text classi_cation from labeled and unlabeled documents using em. In Machine Learning,pages 103-134, 1999.
[53] I.S. Oh, J.S. Lee, and B.R. Moon. Hybrid genetic algorithms for feature selection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26(11):1424-1437,2004.
[54] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993. [55] V. Roth and T. Lange. Feature selection in clustering problems. Advances in neural information processing systems, 16, 2003.
[56] Yong Rui and Thomas S. Huang. Image retrieval: Current techniques, promising directions and open issues. Journal of Visual Communication and Image Representation, 10:39-62, 1999.
[57] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad. Collective classi_cation in network data. AI magazine, 29(3):93, 2008.
[58] Wojciech Siedlecki and Jack Sklansky. On automatic feature selection. pages 63-87, 1993.
[59] M. R. Sikonja and I. Kononenko. Theoretical and empirical analysis of Relief and ReliefF. Machine Learning, 53:23-69, 2003.
[60] L. Song, A. Smola, A. Gretton, K. Borgwardt, and J. Bedo. Supervised feature selection via dependence estimation. In International Conference on Machine Learning, 2007.
[61] C. Su, Q. Chen, X. Wang, and X. Meng. Text clustering approach based on maximal frequent term sets. In Systems, Man and Cybernetics, 2009. SMC 2009. IEEE International Conference on, pages 1551-1556. IEEE, 2009.
[62] L. Talavera. Feature selection as a preprocessing step for hierarchical clustering. In MACHINE LEARNING-INTERNATIONAL WORKSHOP THEN CONFERENCE-, pages 389-397. MORGAN KAUFMANN PUBLISHERS, INC., 1999.
[63] Jiliang Tang and Huan Liu. Feature selection with linked data in social media. In SDM, 2012.
[64] Jiliang Tang and Huan Liu. Unsupervised feature selection for linked social media data. In KDD, 2012.33
[65] L. Tang and H. Liu. Relational learning via latent social dimensions. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 817-826. ACM, 2009.
[66] B. Taskar, P. Abbeel, M.F. Wong, and D. Koller. Label and link prediction in relational data. In Proceedings of the IJCAI Workshop on Learning Statistical Models from Relational Data. Citeseer, 2003.
[67] R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411-423, 2001.
[68] C.Y. Tsai and C.C. Chiu. Developing a feature weight self-adjustment mechanism for a k-means clustering algorithm. Computational statistics & data analysis, 52(10):4658-4672, 2008.
[69] J. Weston, A. Elisse, B. Schoelkopf, and M. Tipping. Use of the zero norm with linear odels and kernel methods. Journal of Machine Learning Research, 3:1439-1461, 2003.
[70] Dietrich Wettschereck, David W. Aha, and Takao Mohri. A review and empirical valuation of feature weighting methods for a class of lazy learning algorithms. Arti_cial ntelligence Review, 11:273-314, 1997.
[71] D.M. Witten and R. Tibshirani. A framework for feature selection in clustering. Journal f the American Statistical Association, 105(490):713-726, 2010.
[72] I.H. Witten and E. Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann Pub, 2005.
[73] Zenglin Xu, Rong Jin, Jieping Ye, Michael R. Lyu, and Irwin King. Discriminative semi-upervised feature selection via manifold regularization. In IJCAI' 09: Proceedings of he 21th International Joint Conference on Arti_cial Intelligence, 2009.
[74] Yiming Yang and Jan O. Pedersen. A comparative study on feature selection in text ategorization. pages 412-420. Morgan Kaufmann Publishers, 1997.
[75] L. Yu and H. Liu. Feature selection for high-dimensional data: A fast correlation-based filter solution. In T. Fawcett and N. Mishra, editors, Proceedings of the 20th Inter-
national Conference on Machine Learning (ICML-03),, pages 856-863, Washington,D.C., August 21-24, 2003 2003. Morgan Kaufmann.
[76] L. Yu and H. Liu. E_cient feature selection via analysis of relevance and redundancy.Journal of Machine Learning Research (JMLR), 5(Oct):1205-1224, 2004.
[77] W. Zhang, T. Yoshida, X. Tang, and Q. Wang. Text clustering using frequent itemsets. Knowledge-Based Systems, 23(5):379-388, 2010.
[78] Z. Zhao and H. Liu. Spectral Feature Selection for Data Mining. Chapman & Hall/Crc Data Mining and Knowledge Discovery. Taylor & Francis, 2011.
[79] Zheng Zhao and Huan Liu. Semi-supervised feature selection via spectral analysis. In Proceedings of SIAM International Conference on Data Mining (SDM), 2007.
[80] Zheng Zhao and Huan Liu. Spectral feature selection for supervised and unsupervised learning. In ICML '07: Proceedings of the 24th international conference on Machine

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者

人面猴
序言：七十年代末澜掩，一起剝皮案震驚了整個(gè)濱河市净蚤，隨后出現(xiàn)的幾起案子，更是在濱河造成了極大的恐慌输硝，老刑警劉巖，帶你破解...
沈念sama閱讀 221,548評(píng)論 6贊 515
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件程梦，死亡現(xiàn)場(chǎng)離奇詭異点把，居然都是意外死亡，警方通過(guò)查閱死者的電腦和手機(jī)屿附，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 94,497評(píng)論 3贊 399
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進(jìn)店門郎逃，熙熙樓的掌柜王于貴愁眉苦臉地迎上來(lái)，“玉大人挺份，你說(shuō)我怎么就攤上這事褒翰。” “怎么了匀泊？”我有些...
開封第一講書人閱讀 167,990評(píng)論 0贊 360
道士緝兇錄：失蹤的賣姜人
文/不壞的土叔我叫張陵优训，是天一觀的道長(zhǎng)。經(jīng)常有香客問(wèn)我各聘，道長(zhǎng)揣非，這世上最難降的妖魔是什么？我笑而不...
開封第一講書人閱讀 59,618評(píng)論 1贊 296
?港島之戀（遺憾婚禮）
正文為了忘掉前任躲因，我火速辦了婚禮早敬，結(jié)果婚禮上，老公的妹妹穿的比我還像新娘大脉。我一直安慰自己搞监，他們只是感情好，可當(dāng)我...
茶點(diǎn)故事閱讀 68,618評(píng)論 6贊 397
惡毒庶女頂嫁案：這布局不是一般人想出來(lái)的
文/花漫我一把揭開白布镰矿。她就那樣靜靜地躺著琐驴，像睡著了一般。火紅的嫁衣襯著肌膚如雪。梳的紋絲不亂的頭發(fā)上棍矛，一...
開封第一講書人閱讀 52,246評(píng)論 1贊 308
城市分裂傳說(shuō)
那天安疗，我揣著相機(jī)與錄音，去河邊找鬼够委。笑死荐类，一個(gè)胖子當(dāng)著我的面吹牛，可吹牛的內(nèi)容都是我干的茁帽。我是一名探鬼主播玉罐，決...
沈念sama閱讀 40,819評(píng)論 3贊 421
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開眼，長(zhǎng)吁一口氣：“原來(lái)是場(chǎng)噩夢(mèng)啊……” “哼潘拨！你這毒婦竟也來(lái)了吊输？” 一聲冷哼從身側(cè)響起，我...
開封第一講書人閱讀 39,725評(píng)論 0贊 276
萬(wàn)榮殺人案實(shí)錄
序言：老撾萬(wàn)榮一對(duì)情侶失蹤铁追，失蹤者是張志新（化名）和其女友劉穎季蚂，沒想到半個(gè)月后，有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體琅束，經(jīng)...
沈念sama閱讀 46,268評(píng)論 1贊 320
?護(hù)林員之死
正文獨(dú)居荒郊野嶺守林人離奇死亡扭屁，尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛以下內(nèi)容為張勛視角年9月15日...
茶點(diǎn)故事閱讀 38,356評(píng)論 3贊 340
?白月光啟示錄
正文我和宋清朗相戀三年，在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了涩禀。大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片料滥。...
茶點(diǎn)故事閱讀 40,488評(píng)論 1贊 352
活死人
序言：一個(gè)原本活蹦亂跳的男人離奇死亡，死狀恐怖艾船，靈堂內(nèi)的尸體忽然破棺而出葵腹，到底是詐尸還是另有隱情，我是刑警寧澤屿岂，帶...
沈念sama閱讀 36,181評(píng)論 5贊 350
?日本核電站爆炸內(nèi)幕
正文年R本政府宣布践宴，位于F島的核電站，受9級(jí)特大地震影響爷怀，放射性物質(zhì)發(fā)生泄漏浴井。R本人自食惡果不足惜，卻給世界環(huán)境...
茶點(diǎn)故事閱讀 41,862評(píng)論 3贊 333
男人毒藥：我在死后第九天來(lái)索命
文/蒙蒙一霉撵、第九天我趴在偏房一處隱蔽的房頂上張望磺浙。院中可真熱鬧，春花似錦徒坡、人聲如沸撕氧。這莊子的主人今日做“春日...
開封第一講書人閱讀 32,331評(píng)論 0贊 24
一樁弒父案喇完，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽(yáng)伦泥。三九已至，卻和暖如春，著一層夾襖步出監(jiān)牢的瞬間不脯，已是汗流浹背府怯。一陣腳步聲響...
開封第一講書人閱讀 33,445評(píng)論 1贊 272
情欲美人皮
我被黑心中介騙來(lái)泰國(guó)打工，沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留防楷，地道東北人牺丙。一個(gè)月前我還...
沈念sama閱讀 48,897評(píng)論 3贊 376
代替公主和親
正文我出身青樓，卻偏偏與公主長(zhǎng)得像复局，于是被迫代替她去往敵國(guó)和親冲簿。傳聞我的和親對(duì)象是個(gè)殘疾皇子，可洞房花燭夜當(dāng)晚...
茶點(diǎn)故事閱讀 45,500評(píng)論 2贊 359