data mining 知識(shí)大綱

說(shuō)明:該知識(shí)大綱是根據(jù)電子科技大學(xué)計(jì)算機(jī)學(xué)院研究生學(xué)位課《Data Mining》的授課內(nèi)容整理而成毁涉。該課程由邵俊明老師進(jìn)行講授,且是英文授課捞蚂。

Chapter 1

Definition of data mining

Data mining consists of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the data

image.png

Key factors:

  • Data storage

  • Data availability

  • Computation power

Application:

  • Target marketing: Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.

  • Cross-market analysis: Find associations/co-relations between product sales, & predict based on such association.

  • Resource planning: summarize and compare the resources and spending

  • Fraud detection

Task:

  • Association rule mining

  • Cluster analysis

  • Classification

  • Outlier detection

Direction:

  • Volume (Scale of Data)

  • Velocity (Data Stream)

  • Variety (Different format of data, difference sources)

  • Veracity (Uncertainty, missing value)

Chapter 2

Nearest Neighbor

image.png

Predict class label of test instance with major vote strategy

SVM Kernel

image.png

Ensemble learning

  1. bagging: random forest

  2. boosting: adaboost

  3. stacking

image.png
image.png

Chapter 3

Why do we need Hashing?

Challenge in big data applications:

  • Curse of dimensionality

  • Storage cost

  • Query speed

Examples:

  • Information retrieval

  • Storage cost

  • Fast nearest neighbor search

Three steps for similar documents:

  • shingling

  • Min hashing

  • Locality-sensitive hashing

image.png

Min-hashing

  1. Compute signatures of columns = small summaries of columns.

  2. Examine pairs of signatures to find similar signatures.

  3. (Optional) check that columns with similar signatures are really similar.

image.png

Use several (e.g., 100) independent hash functions to create a signature.

Locality-sensitive hashing

  • General idea: Use a function f(x,y) that tells whether or not x and y is a candidate pair : a pair of elements whose similarity must be evaluated.

  • For minhash matrices: Hash columns to many buckets, and make elements of the same bucket candidate pairs.

LSH for min-hash signatures

Matrix M is the matrix of signatures.

image.png

For each band, hash its portion of each column to a hash table with k buckets.

image.png

Tradeoff

Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives/negatives.

Learn to hash

  • PCA hashing: The basic idea is rotating the data to minimize quantization loss.

  • Spectral hashing

image.png

Chapter 4

Definition of sampling

Giving a p(x), we want to draw some samples to represent p(x).

Inverse transform sampling

image.png

Drawbacks: Usually, it’s hard to get the inverse function

Rejection sampling

image.png

Adaptive reject sampling: only if p(x) is log-concave

Importance sampling

image.png
image.png

Markov chain Monte Carlo(MCMC)

image.png

Detailed balance condition: π(i)Pij = π(j)Pij

Acceptance ratio

image.png

Drawbacks: acceptance ratio is too small

Metropolis–Hastings (MH) Sampling

Based on MCMC rewriting the acceptance ratio

image.png

But acceptance ratio still isn’t 100%

Gibbs sampling (based on MCMC)

Idea: Gibbs sampling further make acceptance ratio being 100%

image.png

other features of Gibbs:

  • Do not need p(x)
image.png

Sampling on data stream

  • Bernoulli Sampling

  • Reservoir Sampling: not need to know stream size;
    image.png

Chapter 5

What is data stream?

A data stream is a massive sequence of data objects which have some unique features: One by One; Potentially Unbounded; Concept Drift

image.png

Challenges:

  • Single Pass Handling

  • Memory Limitation

  • Low Time Complexity

  • Concept Drift

What is concept drift?

The probability distribution changes.

image.png
image.png

Concept drift detection

  • Distribution-based detector

  • Error-rate based detector

image.png
image.png

Data stream classification

image.png

Typical algorithm

  • VFDT (very fast decision tree): A decision-tree learning system based on the Hoeffding tree algorithm

  • CVFDT (Concept-adapting Very Fast Decision Tree learner)

VFDT

image.png
image.png
image.png
image.png
image.png

CVFDT

image.png
image.png
image.png
image.png

Data stream clustering

  • Online phase: Summarize the data into memory-efficient data structures

  • Offline phase: Use a clustering algorithm to find the data partition (k-means, decision tree)

Framework

image.png
image.png
image.png
image.png

Chapter 6

Key node identification

image.png
image.png

PageRank

image.png
image.png

Community detection (graph clustering)

  • Minimum cut: find a graph partition such that the number of edges between the two sets is minimized.
    image.png

    But minimum cut always return an imbalanced partition.

  • Normalized cut & ratio cut
    image.png

    凄鼻,prefer a balanced partition.

  • modularity
    image.png
  • Random walk

  • Multi-level clustering

  • Dynamic community detection: a new viewpoint for community detection, the basic idea is Simulate the change of edge distances.

    • View network as dynamical system (Dynamic vs. Static)

    • Simulate the distance dynamics based on different interaction patterns (Distance dynamics vs. Node dynamics)

    • All edge distances will converge, and the community structure is intuitively identified.

Chapter 7

What is hadoop?

  • Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.

  • Hadoop is open-source implementation for Google MapReduce

  • Hadoop is based on a simple programming model called MapReduce

  • Hadoop is based on a simple data model, any data will fit

image.png
image.png
image.png

Core
Filesystems and I/O:

  • Abstraction APIs
  • RPC / Persistence

Avro
Cross-language serialization:

  • RPC / persistence
  • ~ Google ProtoBuf / FB Thrift

MapReduce
Distributed execution (batch)

  • Programming model
  • Scalability / fault-tolerance

HDFS
Distributed storage (read-opt.)

  • Replication / scalability
  • ~ Google filesystem (GFS)

Zoo keeper
Coordination service

  • Locking / configuration
  • ~ Google Chubby

HBase
Column-oriented, sparse store

  • Batch & random access
  • ~ Google BigTable

Pig
Data flow language

  • Procedural SQL-inspired lang.
  • Execution environment

Hive
Distributed data warehouse

  • SQL-like query language
  • Data mgmt / query execution

image.png
image.png
image.png
image.png
image.png

Limitation of MapReduce

  • Inefficient for multi-pass algorithm

  • No efficient primitives for data sharing

    • State between steps goes to distributed file system

    • Slow due to replication & disk storage

Spark

Apache Spark is a fast and general-purpose cluster computing system. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for streaming processing.

image.png
image.png
image.png
image.png
image.png
image.png
image.png

Row-key is unique for a row

image.png
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
  • 序言:七十年代末,一起剝皮案震驚了整個(gè)濱河市还栓,隨后出現(xiàn)的幾起案子,更是在濱河造成了極大的恐慌传轰,老刑警劉巖剩盒,帶你破解...
    沈念sama閱讀 206,126評(píng)論 6 481
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件,死亡現(xiàn)場(chǎng)離奇詭異慨蛙,居然都是意外死亡辽聊,警方通過(guò)查閱死者的電腦和手機(jī),發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 88,254評(píng)論 2 382
  • 文/潘曉璐 我一進(jìn)店門(mén)期贫,熙熙樓的掌柜王于貴愁眉苦臉地迎上來(lái)跟匆,“玉大人,你說(shuō)我怎么就攤上這事通砍÷瓯郏” “怎么了?”我有些...
    開(kāi)封第一講書(shū)人閱讀 152,445評(píng)論 0 341
  • 文/不壞的土叔 我叫張陵封孙,是天一觀的道長(zhǎng)迹冤。 經(jīng)常有香客問(wèn)我,道長(zhǎng)虎忌,這世上最難降的妖魔是什么泡徙? 我笑而不...
    開(kāi)封第一講書(shū)人閱讀 55,185評(píng)論 1 278
  • 正文 為了忘掉前任,我火速辦了婚禮呐籽,結(jié)果婚禮上锋勺,老公的妹妹穿的比我還像新娘。我一直安慰自己狡蝶,他們只是感情好庶橱,可當(dāng)我...
    茶點(diǎn)故事閱讀 64,178評(píng)論 5 371
  • 文/花漫 我一把揭開(kāi)白布。 她就那樣靜靜地躺著贪惹,像睡著了一般苏章。 火紅的嫁衣襯著肌膚如雪。 梳的紋絲不亂的頭發(fā)上,一...
    開(kāi)封第一講書(shū)人閱讀 48,970評(píng)論 1 284
  • 那天枫绅,我揣著相機(jī)與錄音泉孩,去河邊找鬼。 笑死并淋,一個(gè)胖子當(dāng)著我的面吹牛寓搬,可吹牛的內(nèi)容都是我干的。 我是一名探鬼主播县耽,決...
    沈念sama閱讀 38,276評(píng)論 3 399
  • 文/蒼蘭香墨 我猛地睜開(kāi)眼句喷,長(zhǎng)吁一口氣:“原來(lái)是場(chǎng)噩夢(mèng)啊……” “哼!你這毒婦竟也來(lái)了兔毙?” 一聲冷哼從身側(cè)響起唾琼,我...
    開(kāi)封第一講書(shū)人閱讀 36,927評(píng)論 0 259
  • 序言:老撾萬(wàn)榮一對(duì)情侶失蹤,失蹤者是張志新(化名)和其女友劉穎澎剥,沒(méi)想到半個(gè)月后锡溯,有當(dāng)?shù)厝嗽跇?shù)林里發(fā)現(xiàn)了一具尸體,經(jīng)...
    沈念sama閱讀 43,400評(píng)論 1 300
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡哑姚,尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 35,883評(píng)論 2 323
  • 正文 我和宋清朗相戀三年祭饭,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片蜻懦。...
    茶點(diǎn)故事閱讀 37,997評(píng)論 1 333
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡甜癞,死狀恐怖,靈堂內(nèi)的尸體忽然破棺而出宛乃,到底是詐尸還是另有隱情,我是刑警寧澤蒸辆,帶...
    沈念sama閱讀 33,646評(píng)論 4 322
  • 正文 年R本政府宣布征炼,位于F島的核電站,受9級(jí)特大地震影響躬贡,放射性物質(zhì)發(fā)生泄漏谆奥。R本人自食惡果不足惜,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 39,213評(píng)論 3 307
  • 文/蒙蒙 一拂玻、第九天 我趴在偏房一處隱蔽的房頂上張望酸些。 院中可真熱鬧,春花似錦檐蚜、人聲如沸魄懂。這莊子的主人今日做“春日...
    開(kāi)封第一講書(shū)人閱讀 30,204評(píng)論 0 19
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽(yáng)市栗。三九已至,卻和暖如春,著一層夾襖步出監(jiān)牢的瞬間填帽,已是汗流浹背蛛淋。 一陣腳步聲響...
    開(kāi)封第一講書(shū)人閱讀 31,423評(píng)論 1 260
  • 我被黑心中介騙來(lái)泰國(guó)打工, 沒(méi)想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留篡腌,地道東北人褐荷。 一個(gè)月前我還...
    沈念sama閱讀 45,423評(píng)論 2 352
  • 正文 我出身青樓,卻偏偏與公主長(zhǎng)得像嘹悼,于是被迫代替她去往敵國(guó)和親诚卸。 傳聞我的和親對(duì)象是個(gè)殘疾皇子,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 42,722評(píng)論 2 345

推薦閱讀更多精彩內(nèi)容