Data Mining, Chapter 1

What is Big Data?
“Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” — Gartner

“Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.” — McKinsey & Company

Data mining
People have been analysing and investigating data for centuries.

Statistics
Mean, Variance, Correlation, Distribution …
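As a small illustration of these classical summaries, the sketch below computes the mean, the population variance and the Pearson correlation in plain Python. The `heights` and `weights` samples are made-up numbers used only for this example.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    """Population variance (divides by n, not n - 1)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / sqrt(variance(xs) * variance(ys))

# Hypothetical data, chosen so that weight is an exact linear function of height.
heights = [150.0, 160.0, 170.0, 180.0]
weights = [50.0, 58.0, 66.0, 74.0]

print(mean(heights), variance(heights), correlation(heights, weights))
```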

Today, data are often far beyond human comprehension.
Diversity, Volume, Dimensionality

Definition
Data Mining is the process of automatically extracting interesting and useful hidden patterns from usually massive, incomplete and noisy data.

Not a fully automatic process
Human interventions are often inevitable.
Domain Knowledge
Data Collection and Pre-processing

Synonym: Knowledge Discovery

Data Integration & Analysis

Process of Data Mining

DM Techniques - Classification
“Classification is a procedure in which individual items are placed into groups based on quantitative information on one or more characteristics (referred to as variables) and based on a training set of previously labeled items.”

Given a training set {(x1, y1), …, (xn, yn)}, produce a classifier (a function) that maps any unseen object x to its class label y.
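To make the "training set in, classifier out" idea concrete, here is a minimal sketch of a k-nearest-neighbours classifier, one of the algorithms listed in these notes. The function name `knn_classify`, the choice k = 3 and the toy points are assumptions made for illustration only.

```python
from collections import Counter
from math import dist

def knn_classify(training_set, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    neighbours = sorted(training_set, key=lambda pair: dist(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical labelled items (x_i, y_i): two well-separated groups in the plane.
training_set = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 6.5), "B"), ((6.5, 7.0), "B"),
]

print(knn_classify(training_set, (1.2, 1.4)))
print(knn_classify(training_set, (6.4, 6.2)))
```

Here the "classifier" is simply the training set plus a voting rule; other algorithms such as decision trees build an explicit model instead.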

Algorithms
Decision Trees
K-Nearest Neighbours
Neural Networks
Support Vector Machines

Applications
Churn Prediction
Medical Diagnosis
Classification Boundaries

Overfitting – Classification

Confusion Matrix

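TP, FN, TN and FP are the counts of true positives, false negatives, true negatives and false positives.

TPR = TP / (TP + FN)

TNR = TN / (TN + FP)

Accuracy = (TP + TN) / (P + N), where P = TP + FN and N = TN + FP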
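The three ratios above can be checked on a tiny example. The sketch below counts the four confusion-matrix cells from hypothetical true and predicted labels (1 = positive, 0 = negative) and then applies the formulas; the function names and the data are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for binary labels."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def rates(tp, fp, tn, fn):
    """Return (TPR, TNR, accuracy) using the formulas from the notes."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)  # P = TP + FN, N = TN + FP
    return tpr, tnr, accuracy

# Hypothetical labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(tp, fp, tn, fn)
print(rates(tp, fp, tn, fn))
```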

Receiver Operating Characteristic
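An ROC curve plots TPR against the false positive rate (FPR = 1 - TNR) as the decision threshold of a scoring classifier is varied. The sketch below traces those points for a handful of hypothetical scores and computes the area under the curve with the trapezoidal rule; `roc_points`, `auc` and the data are assumptions for illustration, not part of the original notes.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by lowering the threshold one distinct score at a time."""
    positives = sum(1 for t in y_true if t == 1)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the piecewise-linear curve (trapezoidal rule)."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical classifier scores; higher means "more likely positive".
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
print(points)
print(auc(points))
```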


DM Techniques - Clustering
“Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense.”

Distance Metrics
Euclidean Distance
Manhattan Distance
Mahalanobis Distance
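The three metrics can be written in a few lines each. In this sketch the Mahalanobis distance takes the inverse covariance matrix as an argument (a nested list named `inv_cov`, an assumption made here to keep the example dependency-free); with the identity matrix it reduces to the Euclidean distance.

```python
from math import sqrt

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def mahalanobis(a, b, inv_cov):
    """sqrt((a - b)^T S^-1 (a - b)), where inv_cov is S^-1 given as a list of rows."""
    d = [x - y for x, y in zip(a, b)]
    # Each row of S^-1 dotted with d gives one component of S^-1 d.
    sd = [sum(s * dj for s, dj in zip(row, d)) for row in inv_cov]
    return sqrt(sum(di * sdi for di, sdi in zip(d, sd)))

a, b = (0.0, 0.0), (3.0, 4.0)
identity = [[1.0, 0.0], [0.0, 1.0]]
scaled = [[1.0 / 9.0, 0.0], [0.0, 1.0 / 16.0]]  # inverse of the covariance diag(9, 16)

print(euclidean(a, b), manhattan(a, b))
print(mahalanobis(a, b, identity), mahalanobis(a, b, scaled))
```

With the `scaled` matrix each coordinate is measured in units of its own standard deviation (3 and 4), so the point lies one standard deviation away along each axis.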

Algorithms
K-Means
Sequential Leader
Affinity Propagation
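As an illustration of the first algorithm in this list, the following is a minimal sketch of K-Means in its common Lloyd-style form: assign each point to the nearest centroid, then move each centroid to the mean of its cluster. The fixed iteration count, the hand-picked initial centroids and the toy points are assumptions for the example.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Alternate the assignment step and the centroid-update step."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to its cluster mean (unchanged if the cluster is empty).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    # clusters reflects the last assignment step; after convergence it matches the returned centroids.
    return centroids, clusters

# Hypothetical 2-D observations forming two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, centroids=[(1.0, 1.0), (9.0, 8.0)])
print(centroids)
print(clusters)
```

In practice the result depends on the initial centroids and on k, so several restarts are commonly tried.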

Applications
Market Research
Image Segmentation
Social Network Analysis

Hierarchical Clustering

DM Techniques – Association Rule

DM Techniques – Regression

Overfitting – Regression

Data Preprocessing
Real data are often surprisingly dirty.
A Major Challenge for Data Mining

Typical Issues
Missing Attribute Values
Different Coding/Naming Schemes
Infeasible Values
Inconsistent Data
Outliers

Data Quality
Accuracy
Completeness
Consistency
Interpretability
Credibility
Timeliness

Data Cleaning
Fill in missing values.
Correct inconsistent data.
Identify outliers and noisy data.
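Two of these cleaning steps can be sketched briefly. The example below fills missing values with the median of the observed ones and flags outliers by their distance from the median, measured in units of the median absolute deviation (MAD). Both rules, the cut-off k = 5 and the `ages` data are assumptions chosen for illustration; other imputation and outlier rules are equally common.

```python
from statistics import median

def fill_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def flag_outliers(values, k=5.0):
    """Return values further than k * MAD from the median.

    If the MAD is 0, every value that differs from the median is flagged.
    """
    m = median(values)
    mad = median(abs(v - m) for v in values)
    return [v for v in values if abs(v - m) > k * mad]

# Hypothetical attribute with one missing entry and one infeasible value.
ages = [23, 25, None, 27, 24, 26, 250]

filled = fill_missing(ages)
print(filled)
print(flag_outliers(filled))
```

The median is used for filling because, unlike the mean, it is barely affected by the infeasible value 250.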

Data Integration
Combine data from different sources.

Data Transformation
Normalization
Aggregation
Type Conversion
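For the normalization item, two widely used variants are min-max scaling to [0, 1] and z-score standardization. The sketch below shows both on made-up values; it assumes the values are not all identical (otherwise both would divide by zero).

```python
from math import sqrt

def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    m = sum(values) / len(values)
    sd = sqrt(sum((v - m) ** 2 for v in values) / len(values))
    return [(v - m) / sd for v in values]

incomes = [10.0, 20.0, 30.0, 40.0, 50.0]  # hypothetical attribute values

print(min_max(incomes))
print(z_score(incomes))
```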

Data Reduction
Feature Selection
Sampling

Privacy Protection
Data: A Double-Edged Sword
People can benefit greatly from data analysis.
The consequence of information leakage can be catastrophic.

People may be reluctant to give sensitive information due to privacy concerns.
Drug use, tax, sexuality …

How to find out the percentage of people with a certain attribute?
The interviewer should not know the true answer of each respondent.

Randomized Response
Used in structured survey research.
Can maintain the confidentiality of respondents.
Two questions are presented:
Q1: I have the attribute A.
Q2: I do not have the attribute A.

The respondent uses a random device to:
Answer Q1 with probability p.
Answer Q2 with probability 1-p.
The interviewer does not know which question was answered.
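Although individual answers reveal nothing certain, the overall proportion can still be recovered. If π is the true fraction of people with attribute A, a respondent says "yes" with probability λ = p·π + (1 - p)·(1 - π), so π = (λ + p - 1) / (2p - 1) provided p ≠ 0.5. The simulation below is a sketch of this estimate; the true rate 0.3, p = 0.7, the sample size and the function names are assumptions for illustration.

```python
import random

def simulate_survey(true_rate, p, n, rng):
    """Return the fraction of 'yes' answers among n simulated truthful respondents."""
    yes = 0
    for _ in range(n):
        has_attribute = rng.random() < true_rate
        answers_q1 = rng.random() < p
        # Q1 ("I have A") gets 'yes' from holders of A;
        # Q2 ("I do not have A") gets 'yes' from everyone else.
        if has_attribute == answers_q1:
            yes += 1
    return yes / n

def estimate_rate(yes_fraction, p):
    """Invert lambda = p*pi + (1 - p)*(1 - pi) to recover pi."""
    if p == 0.5:
        raise ValueError("p = 0.5 makes the answers uninformative")
    return (yes_fraction + p - 1) / (2 * p - 1)

rng = random.Random(0)  # fixed seed so the run is repeatable
observed = simulate_survey(true_rate=0.3, p=0.7, n=100_000, rng=rng)
print(observed, estimate_rate(observed, p=0.7))
```

The closer p is to 0.5, the better the privacy but the noisier the estimate, since the denominator 2p - 1 shrinks.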

Cloud Computing

Why bother with so many different algorithms?

No algorithm is always superior to others.

No parameter setting is optimal over all problems.

Look for the best match between problem and algorithm.
Experience
Trial and Error

Factors to consider:
Applicability
Computational Complexity
Interpretability

Always start with simple ones.

Grouping
