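I. Why Mine Data?
    1. Data is abundant, but information is scarce.
    2. Computers are cheap and powerful.
    3. The volume of data exceeds what people can comprehend.
    4. Data is collected and stored at high speed (science).
    5. Traditional tools are not feasible for raw data (science).
    6. Data warehouses hold large amounts of data (industry).
    7. Business competition is intense (industry).
Note 1: Data mining turns a large collection of data into knowledge.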
II. What Is Data Mining (DM)?
    Data mining: the process of semi-automatically analyzing large databases to find patterns that are:
        Valid: hold on new data with some certainty
        Novel: non-obvious to the system
        Useful: it should be possible to act on the pattern
        Understandable: humans should be able to interpret the pattern
Note 1: Data mining is the process of discovering interesting patterns from massive amounts of data.
III. What Kinds of Data Can Be Mined?
    1. Database data
    2. Data warehouses
    3. Transaction data
    4. Spatial-temporal data
    5. Graph and networked data
    6. Hypertext and multimedia data: text, image, video, and audio data
    7. Time-related sequence data: e.g., historical records, stock exchange data
    8. Data streams: e.g., video surveillance
IV. What Kinds of Patterns Can Be Mined?
    1. Class/Concept Description: Characterization and Discrimination
        Data Characterization:
            Tools: statistical measures and plots
            Outputs: pie charts, bar charts, curves, multi-dimensional data cubes, multi-dimensional tables, and generalized relations
        Data Discrimination:
            Comparison of the general features of the target class data objects against the general features of objects from one or more contrasting classes
            Outputs: comparative measures that help to distinguish between the target and contrasting classes
    2. Mining Frequent Patterns, Associations and Correlations
        Mining frequent patterns leads to the discovery of interesting associations and correlations within data
    3. Classification and Regression for Predictive Analysis
        Classification
            Training data: data objects whose class labels are known
            Output: a model that describes and distinguishes data classes or concepts
        Regression Analysis
            Regression models continuous-valued functions
    4. Cluster Analysis
        Points (objects) that are “close” in the attribute (feature) space are assigned to the same cluster
    5. Outlier Analysis
        In some applications, such as fraud detection, rare events can be more interesting than regularly occurring ones. The analysis of outlier data is referred to as outlier analysis or anomaly mining.
        Outliers may be detected using statistical tests or distance measures.
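
The two detection approaches named under Outlier Analysis can be sketched as follows. This is a minimal illustration, assuming a z-score cutoff as the statistical test and the gap to the nearest neighbor as the distance measure. The function names, both thresholds, and the sample amounts are hypothetical and not part of the original notes.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return values whose z-score magnitude exceeds threshold (assumed cutoff)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


def distance_outliers(values, min_gap=10.0):
    """Return values whose nearest neighbor is farther than min_gap (assumed cutoff)."""
    if len(values) < 2:
        return []  # no neighbor to compare against
    outliers = []
    for i, v in enumerate(values):
        nearest = min(abs(v - w) for j, w in enumerate(values) if j != i)
        if nearest > min_gap:
            outliers.append(v)
    return outliers


# Hypothetical transaction amounts; 500 is the planted rare event.
amounts = [20, 22, 19, 21, 23, 20, 18, 500]
print(zscore_outliers(amounts))
print(distance_outliers(amounts))
```

Both functions flag the same value here, but they can disagree in general: the z-score test is sensitive to how much a single extreme value inflates the standard deviation, while the distance test depends only on local gaps.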
Note 1: A big data-mining risk is that you will “discover” patterns that are meaningless.
Note 2: A pattern is interesting if it is valid on test data with some degree of certainty, novel, and potentially useful.
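
One way to picture the validity requirement in the notes above is to mine frequent patterns on one set of transactions and then check whether they remain frequent on held-out data. The sketch below does this for item pairs only. The baskets, the function names, and the 0.5 support threshold are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_support(transactions):
    """Map each item pair to the fraction of transactions that contain it."""
    n = len(transactions)
    if n == 0:
        return {}
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: c / n for pair, c in counts.items()}


def frequent_pairs(transactions, min_support=0.5):
    """Return pairs whose support meets min_support (an assumed threshold)."""
    return {p: s for p, s in pair_support(transactions).items() if s >= min_support}


# Hypothetical market baskets, split into a mining set and a held-out test set.
train = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["bread", "milk", "beer"],
    ["beer", "eggs"],
]
test = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["milk", "eggs"],
    ["bread", "beer"],
]

mined = frequent_pairs(train)
test_support = pair_support(test)
# A pattern counts as valid here only if it stays frequent on unseen data.
validated = {p: test_support[p] for p in mined if test_support.get(p, 0.0) >= 0.5}
print(mined)
print(validated)
```

A pair that is frequent in the mining set but rare in the test set would be dropped from `validated`, which is the kind of meaningless pattern the first note warns about.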
V. Which Technologies Are Used?
    Machine learning, statistics, artificial intelligence, databases, and visualization, but with more stress on:
        Scalability in the number of features and instances
        Algorithms and architectures, whereas the foundations of methods and formulations are provided by statistics and machine learning
        Automation for handling large, heterogeneous data
Note 1: Data mining, as a highly application-driven domain, has incorporated knowledge from many other domains.
VI. What Kinds of Applications Are Targeted?
    1. Web Mining:
        Decide the importance of pages: PageRank algorithm
    2. Market and Sales
    3. Medicine:
        Disease outcome, effectiveness of treatments
    4. Molecular/Pharmaceutical:
        Identify new drugs
    5. Scientific data analysis:
        Identify new galaxies by searching for sub-clusters
Note 1: Data mining has many successful applications, such as business intelligence, Web search, bioinformatics, health informatics, finance, digital libraries, and digital governments.
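
The PageRank algorithm mentioned under Web Mining can be approximated by power iteration. The sketch below is a simplified version, not a production implementation. The damping factor of 0.85, the iteration count, the even redistribution of rank from pages without outgoing links, and the four-page example web are all assumptions not stated in the notes.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    damping and iterations are assumed defaults, not values from the notes.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    if n == 0:
        return {}
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-page web in which every other page links to "home".
web = {
    "home": ["about"],
    "about": ["home"],
    "blog": ["home", "about"],
    "contact": ["home"],
}
ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 4))
```

In this example "home" ends up with the highest score because it receives links from every other page, while "blog" and "contact", which no page links to, keep only the random-jump share.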