Introduction to Data Mining

I. Why Mine Data?

    1. Data is abundant, but information is scarce

    2. Computers are cheap and powerful

    3. The volume of data exceeds what people can comprehend

    4. Data is collected and stored at high speed (science)

    5. Traditional tools are infeasible for raw data (science)

    6. Data warehouses hold large amounts of data (industry)

    7. Business competition is intense (industry)

Note 1: Data mining turns a large collection of data into knowledge.


II. What Is Data Mining (DM)?

    Data mining: the process of semi-automatically analyzing large databases to find patterns that are:

        Valid: hold on new data with some certainty

        Novel: non-obvious to the system

        Useful: it should be possible to act on the item

        Understandable: humans should be able to interpret the pattern


Note 1: Data mining is the process of discovering interesting patterns from massive amounts of data.


III. What Kinds of Data Can Be Mined?

    1. Database data

    2. Data warehouses

    3. Transaction data

    4. Spatial-temporal data

    5. Graph and networked data

    6. Hypertext and multimedia data: text, image, video, and audio data

    7. Time-related sequence data: e.g., historical records, stock exchange data

    8. Data streams: e.g., video surveillance


IV. What Kinds of Patterns Can Be Mined?

    1. Class/Concept Description: Characterization and Discrimination

        Data Characterization:

            Tools: statistical measures and plots

            Outputs: pie charts, bar charts, curves, multi-dimensional data cubes, multi-dimensional tables, and generalized relations.
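The "statistical measures" mentioned above can be illustrated with a minimal sketch that summarizes one numeric attribute of a target class. The function name `characterize` and the customer-spend figures are hypothetical and chosen only for illustration.

```python
import statistics


def characterize(values):
    """Summarize one numeric attribute of the target class."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


# Hypothetical data: yearly spend of customers in the target class.
spend = [120, 150, 150, 180, 400]
summary = characterize(spend)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A summary like this is the kind of input that would feed a bar chart or a row of a generalized relation.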


        Data Discrimination:

            Comparison of the general features of the target class data objects against the general features of objects from one or multiple contrasting classes.

            Outputs: comparative measures that help to distinguish between the target and contrasting classes
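As a rough sketch of a comparative measure, the snippet below contrasts the per-attribute means of a target class with those of one contrasting class. The helper `discriminate`, the attribute names, and the data are assumptions made for illustration; real systems use richer measures than a difference of means.

```python
import statistics


def discriminate(target, contrast):
    """Compare the mean of each shared numeric attribute across two classes."""
    report = {}
    for attr in sorted(target.keys() & contrast.keys()):
        t_mean = statistics.mean(target[attr])
        c_mean = statistics.mean(contrast[attr])
        report[attr] = {
            "target_mean": t_mean,
            "contrast_mean": c_mean,
            "difference": t_mean - c_mean,
        }
    return report


# Hypothetical classes: frequent buyers (target) vs. rare buyers (contrast).
frequent = {"age": [25, 30, 35], "income": [40, 50, 60]}
rare = {"age": [45, 50, 55], "income": [45, 50, 55]}

report = discriminate(frequent, rare)
for attr, row in report.items():
    print(attr, row)
```

In this toy data, age separates the two classes while mean income does not, which is the kind of distinction a discrimination report is meant to surface.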


    2. Mining Frequent Patterns, Associations and Correlations

        Mining frequent patterns leads to the discovery of interesting associations and correlations within data
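To make "frequent patterns" and "associations" concrete, here is a minimal sketch that counts item pairs in market baskets and computes the support of each pair and the confidence of a rule. The function names, the `min_support` threshold, and the baskets are hypothetical; this brute-force count is not an optimized algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return item pairs whose support (fraction of baskets) meets min_support."""
    n = len(transactions)
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent in basket | antecedent in basket)."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    hits = sum(1 for t in with_antecedent if consequent in t)
    return hits / len(with_antecedent)


# Hypothetical market-basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

pairs = frequent_pairs(baskets, min_support=0.5)
rule_conf = confidence(baskets, "bread", "milk")
print(pairs)
print(rule_conf)
```

Support measures how often a pattern occurs overall, while confidence measures how reliably the rule "bread implies milk" holds among baskets containing bread.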

    3. Classification and Regression for Predictive Analysis

        Classification

            Training data: data objects whose class labels are known

            Output: a model that describes and distinguishes data classes or concepts
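One of the simplest possible classifiers, shown here only as a sketch, is a 1-nearest-neighbor rule: the labeled training data itself acts as the model, and a new object receives the label of its closest training example. The function name, the labels, and the points are hypothetical; the notes above do not prescribe any particular classifier.

```python
import math


def nearest_neighbor_predict(training, point):
    """Return the label of the training example closest to `point`."""
    _, label = min(training, key=lambda example: math.dist(example[0], point))
    return label


# Hypothetical labeled training data: (feature vector, class label).
training = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((8.0, 8.0), "high"),
    ((9.0, 7.5), "high"),
]

print(nearest_neighbor_predict(training, (2.0, 1.5)))
print(nearest_neighbor_predict(training, (8.5, 8.0)))
```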

        Regression Analysis

            Regression models continuous-valued functions
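As an illustration of modeling a continuous-valued function, the sketch below fits a straight line by ordinary least squares. The function `fit_line` and the sample points are assumptions made for this example; regression in general covers many other model families.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5 + intercept  # predict a continuous value for x = 5
print(slope, intercept, prediction)
```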

    4. Cluster Analysis

        Points (objects) that are "close" in the attribute (feature) space are assigned to the same cluster
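A minimal k-means sketch shows one way "close" points can end up in the same cluster. The notes do not name a clustering algorithm, so k-means, the fixed starting centroids, and the sample points are all assumptions chosen to keep the example deterministic.

```python
import math


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster
            else centroids[i]  # keep an empty cluster's centroid in place
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical 2-D points forming two visually separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

Unlike classification, no class labels are supplied: the grouping emerges only from distances in the feature space.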

    5. Outlier Analysis

        In some applications, such as fraud detection, rare events can be more interesting than regularly occurring ones. The analysis of outlier data is referred to as outlier analysis or anomaly mining

        Outliers may be detected using statistical tests or distance measures
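A simple statistical check, sketched below, flags values whose z-score (distance from the mean in standard deviations) exceeds a threshold. The function name, the threshold of 2.0, and the transaction amounts are assumptions for illustration; in small samples a single extreme value inflates the standard deviation, so the threshold needs to be chosen with care.

```python
import statistics


def z_score_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical transaction amounts with one suspicious entry.
amounts = [20, 22, 19, 21, 20, 23, 18, 500]
outliers = z_score_outliers(amounts)
print(outliers)
```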


Note 1: A big data-mining risk is that you will "discover" patterns that are meaningless.

Note 2: A pattern is interesting if it is valid on test data with some degree of certainty, novel, and potentially useful.

V. Which Technologies Are Used?

        Data mining draws on machine learning, statistics, artificial intelligence, databases, and visualization, but places more stress on:

        Scalability in the number of features and instances

        Algorithms and architectures, whereas the foundations of methods and formulations are provided by statistics and machine learning

        Automation for handling large, heterogeneous data



Note 1: Data mining, as a highly application-driven domain, has incorporated knowledge from many other domains.


VI. What Kinds of Applications Are Targeted?

    1. Web mining:

        Decide the importance of pages: PageRank algorithm
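The idea behind PageRank can be sketched with a small power iteration: each page repeatedly shares its current rank among the pages it links to, and a damping factor models a surfer who occasionally jumps to a random page. The link graph, the damping value of 0.85, and the iteration count are assumptions for illustration; this is a simplified sketch, not a production implementation.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page importance by power iteration.

    `links` maps each page to the list of pages it links to; every
    linked page must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page without outlinks spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web graph.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 4))
```

Page C ranks highest here because three of the four pages link to it, while D, which nothing links to, keeps only the baseline share.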

    2. Market and sales

    3. Medicine:

        Disease outcome, effectiveness of treatments

    4. Molecular/Pharmaceutical:

        Identify new drugs

    5. Scientific data analysis:

        Identify new galaxies by searching for subclusters


Note 1: Data mining has many successful applications, such as business intelligence, Web search, bioinformatics, health informatics, finance, digital libraries, and digital governments.
