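Principal component analysis (PCA) is a mathematical dimensionality-reduction method.
PCA dimensionality-reduction procedure:
1) Standardize the data
2) Compute the covariance matrix
3) Sort the eigenvectors
4) Build the projection matrix
5) Transform the data
Compute the covariance matrix of the sample data across its dimensions (variables), then solve for the eigenvalues and corresponding eigenvectors of that covariance matrix. Arrange the eigenvectors in descending order of their eigenvalues to form a new matrix, called the eigenvector matrix or projection matrix, and use this projection matrix to transform the sample data. Keeping only the first K dimensions of the transformed data achieves the dimensionality reduction.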
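The five steps above can be sketched in base R roughly as follows. This is only an illustration under assumptions: the small random matrix `X`, the seed, and the choice `K <- 2` are made up for the example and are not part of the original tutorial. The cross-check against `prcomp` compares absolute values because the sign of each component is arbitrary.

```r
# Minimal sketch of PCA "by hand" on a small simulated matrix (hypothetical data).
set.seed(1)                                   # arbitrary seed, for reproducibility only
X <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5)

# 1) Standardize: centre each column and scale it to unit variance
Z <- scale(X, center = TRUE, scale = TRUE)

# 2) Covariance matrix of the standardized data (5 x 5)
C <- cov(Z)

# 3) Eigen-decomposition; for a symmetric matrix eigen() returns the
#    eigenvalues (and matching eigenvectors) in decreasing order
e <- eigen(C, symmetric = TRUE)

# 4) Projection matrix: the first K eigenvectors as columns
K <- 2
W <- e$vectors[, 1:K]

# 5) Transform the data: project onto the first K components
scores <- Z %*% W

# Cross-check against prcomp (component signs may differ, hence abs())
ref <- prcomp(X, center = TRUE, scale. = TRUE)
max_diff <- max(abs(abs(scores) - abs(ref$x[, 1:K])))
```

If the sketch is right, `max_diff` should be at the level of floating-point noise, and `e$values` should match `ref$sdev^2`.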
Case 1
Create the dataset
- Use R to simulate a microarray data matrix with 10000 rows (10000 genes) and 100 columns (100 samples), filled with normally distributed random data with mean 0.
chip.data<-matrix(rnorm(10000*100,mean=0),nrow=10000,ncol=100)
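Output:
- Among the 10000 genes, assume that 100 genes differ between the two groups: the first 50 are up-regulated and the other 50 are down-regulated.
1) Create a random permutation of the numbers 1 to 1000 to use as an index
2) Create a 50*10 block of normally distributed values with mean 2. Take entries 1:50 of the random index from the previous step as row numbers and the first 10 columns, and assign the block into chip.data as the up-regulated set.
3) Use the same method to obtain the 50 down-regulated genes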
diff.index<-sample(1:1000,1000)
chip.data[diff.index[1:50],1:10]<-rnorm(50*10,mean=2)
chip.data[diff.index[51:100],1:10]<-rnorm(50*10,mean=-2)
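One way to sanity-check the simulated data is to confirm that the up- and down-regulated rows do not overlap and that their block means sit near +2 and -2. The sketch below assumes the down-regulated genes take entries 51:100 of the index, so that they do not overwrite the up-regulated ones. The seed and the helper names `up.rows`, `down.rows`, `up.mean`, `down.mean` and `rest.mean` are hypothetical.

```r
# Sketch: rebuild the simulated matrix and check the spiked-in blocks.
set.seed(42)                                   # arbitrary seed (assumption)
chip.data <- matrix(rnorm(10000 * 100, mean = 0), nrow = 10000, ncol = 100)

diff.index <- sample(1:1000, 1000)             # random permutation of 1..1000
up.rows    <- diff.index[1:50]                 # 50 up-regulated genes
down.rows  <- diff.index[51:100]               # 50 different, down-regulated genes

chip.data[up.rows,   1:10] <- rnorm(50 * 10, mean = 2)
chip.data[down.rows, 1:10] <- rnorm(50 * 10, mean = -2)

# Block means: close to +2, -2 and 0 respectively
up.mean   <- mean(chip.data[up.rows,   1:10])
down.mean <- mean(chip.data[down.rows, 1:10])
rest.mean <- mean(chip.data[-diff.index[1:100], 1:10])
```

Because `sample(1:1000, 1000)` is a permutation, `up.rows` and `down.rows` cannot share a row number.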
- PCA plot
Usage of the princomp function
Description
princomp performs a principal components analysis on the given numeric data matrix and returns the results as an object of class princomp.
## Default S3 method:
princomp(x, cor = FALSE, scores = TRUE, covmat = NULL,
subset = rep_len(TRUE, nrow(as.matrix(x))), ...)
PCA computation
chip.data.pca<-princomp(chip.data)
Display the data in chip.data
> chip.data
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -8.764830e-01 -2.585436e+00 1.7486665932 0.6825088090 0.8905718598 2.2543743674
[2,] 2.756559e+00 9.191507e-01 1.7224333465 2.5164729313 0.3655551313 0.3940460436
[3,] 9.754316e-01 -9.121371e-01 -0.0534088859 0.4711108467 -0.6567994543 -0.9404594391
[4,] -1.443449e+00 6.328793e-01 0.7067575122 -2.0083705142 -0.0641474431 0.5404051953
[5,] -1.678596e+00 -4.086325e-01 -0.6946972480 0.9941794052 1.9677986393 0.4281278343
[6,] 2.318705e+00 2.574536e+00 2.4483722951 3.7352614791 0.6849518201 2.5269332706
[7,] 1.368299e+00 -6.396757e-01 -0.3016863422 -0.9881343210 0.7250075490 -1.1474935276
[8,] 4.547110e-01 -1.388434e+00 0.5724884590 1.3446862438 0.2708813623 0.0768302649
[9,] -3.320154e-01 1.015236e+00 0.0524039788 0.8327729956 1.5803932962 -1.1469311968
[10,] 1.442150e+00 -1.005228e+00 0.9377764607 1.5061633084 -0.7742683227 -1.9687078752
Display the summary statistics
> summary(chip.data.pca)
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
Standard deviation 3.240085 3.2099856 3.1956557 3.1691590 3.1505363 3.13960683 3.11757677 3.10222437 3.07273039 3.05572866
Proportion of Variance 0.105799 0.1038424 0.1029174 0.1012178 0.1000317 0.09933886 0.09794967 0.09698734 0.09515192 0.09410186
Cumulative Proportion 0.105799 0.2096414 0.3125588 0.4137765 0.5138082 0.61314710 0.71109677 0.80808411 0.90323603 0.99733790
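Standard deviation # standard deviation of each component
Proportion of Variance # proportion of variance explained (contribution)
Cumulative Proportion # cumulative proportion of variance explained
The first 10 principal components together already account for 0.99733790 of the variance in the data.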
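The three rows of the summary can be reproduced from the `sdev` component of a `princomp` object: the variance of each component is its standard deviation squared, the proportion is that variance divided by the total, and the cumulative proportion is the running sum. The sketch below uses a small hypothetical matrix `m` rather than the tutorial's data, so its numbers will not match the table above.

```r
# Sketch: recompute the "Importance of components" rows by hand.
set.seed(7)                                    # arbitrary seed (assumption)
m <- matrix(rnorm(500 * 10), nrow = 500, ncol = 10)  # hypothetical stand-in data

pca <- princomp(m)

variances <- pca$sdev^2                        # variance of each component
prop      <- variances / sum(variances)        # Proportion of Variance
cum.prop  <- cumsum(prop)                      # Cumulative Proportion

# Side-by-side view, one column per component
importance <- rbind("Standard deviation"     = pca$sdev,
                    "Proportion of Variance" = prop,
                    "Cumulative Proportion"  = cum.prop)
```

The last entry of `cum.prop` is 1 by construction, since all components together carry all of the variance.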
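- Plotting
1) Set the colours for the two groups (100 points in total, 50 per group). The colours of the two groups can be changed by replacing "2" and "7" with other colour numbers.
2) plot3d(three-dimensional data for the x, y and z axes, group colours, plot type, radius)
Below, type="s" draws each point as a sphere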
colour<-c(rep(2,50),rep(7,50))
library(rgl)
plot3d(chip.data.pca$loadings[,1:3],col=colour,type="s",radius = 0.025)
The result is a 3D plot that can be rotated and zoomed in or out with the mouse until the clearest viewing angle is found.
plot3d(chip.data.pca$loadings[,1:3],col=colour,type="l",radius = 0.025)
Line-plot result:
Case 2
Load the packages and the dataset
rm(list=ls())
library(pca3d)
library(rgl)
data(metabo)
head(metabo)
Dataset description
Metabolic profiles in tuberculosis.
Description
Relative abundances of metabolites from serum samples of three groups of individuals
Details
A data frame with 136 observations on 425 metabolic variables.
Serum samples from three groups of individuals were compared: tuberculin skin test negative (NEG), positive (POS) and clinical tuberculosis (TB).
PCA computation
Usage of the prcomp function
Principal Components Analysis
Description
Performs a principal components analysis on the given data matrix and returns the results as an object of class prcomp.
## Default S3 method:
prcomp(x, retx = TRUE, center = TRUE, scale. = FALSE,
tol = NULL, rank. = NULL, ...)
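Since this walkthrough uses both `princomp` and `prcomp`, note that their standard deviations differ slightly on the same data: per the R documentation, `princomp` uses divisor N for the covariance while `prcomp` uses N - 1. The sketch below illustrates this on a hypothetical random matrix `m`. The seed and sizes are arbitrary assumptions.

```r
# Sketch: compare princomp and prcomp standard deviations on the same data.
set.seed(3)                                    # arbitrary seed (assumption)
n <- 100
m <- matrix(rnorm(n * 4), nrow = n, ncol = 4)  # hypothetical data, 100 x 4

p1 <- princomp(m)                              # covariance with divisor n
p2 <- prcomp(m)                                # covariance with divisor n - 1

# Every ratio should equal sqrt((n - 1) / n)
ratio    <- unname(p1$sdev) / p2$sdev
expected <- sqrt((n - 1) / n)
```

The proportions of variance are the same for both functions, because the constant factor cancels when each variance is divided by the total.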
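1) Drop the first column of the dataset (the group labels), use the remaining columns as the data, and standardize them (scale.=TRUE)
2) Use the first column of the dataset as the grouping factor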
metabo.pca <- prcomp(metabo[,-1], scale.=TRUE)
groups <- factor(metabo[,1])
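The same pattern (drop the label column, standardize, keep the labels as a factor) can be tried without installing pca3d. The sketch below uses the built-in `iris` data as a stand-in, which is an assumption for illustration only: there the labels are in column 5 rather than column 1. It also shows that `scale. = TRUE` is equivalent to standardizing the columns with `scale()` before calling `prcomp`.

```r
# Sketch with a stand-in dataset (iris), not the metabo data.
data(iris)

# Drop the label column and standardize each variable to unit variance
iris.pca <- prcomp(iris[, -5], scale. = TRUE)

# Keep the label column as the grouping factor
groups <- factor(iris[, 5])

# Equivalent formulation: standardize explicitly, then run prcomp
iris.pca2 <- prcomp(scale(iris[, -5]))

# With standardized variables the component variances sum to the
# number of variables (the trace of the correlation matrix)
total.var <- sum(iris.pca$sdev^2)
```

Standardizing matters when variables are on different scales, since otherwise the variables with the largest variance dominate the leading components.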
Summary of the results
> summary(metabo.pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14
Standard deviation 5.86992 5.38923 4.74978 4.11434 3.88969 3.81589 3.30208 3.09675 2.9872 2.9157 2.80259 2.71364 2.60341 2.56392
Proportion of Variance 0.08146 0.06866 0.05333 0.04002 0.03577 0.03442 0.02578 0.02267 0.0211 0.0201 0.01857 0.01741 0.01602 0.01554
Cumulative Proportion 0.08146 0.15012 0.20345 0.24347 0.27924 0.31366 0.33944 0.36211 0.3832 0.4033 0.42187 0.43928 0.45530 0.47084
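In the output above, the first 14 components reach a cumulative proportion of only about 0.47, so it can be useful to ask how many components are needed to pass a chosen threshold. A minimal sketch follows, again using the built-in `iris` data as a stand-in and a hypothetical threshold of 0.90. It relies on the `importance` matrix returned by `summary()` for a `prcomp` object.

```r
# Sketch: smallest number of components reaching a variance threshold.
data(iris)
pca <- prcomp(iris[, -5], scale. = TRUE)       # stand-in data (assumption)

# Rows: "Standard deviation", "Proportion of Variance", "Cumulative Proportion"
imp <- summary(pca)$importance
cum <- imp["Cumulative Proportion", ]

threshold <- 0.90                              # hypothetical cut-off
n.comp <- unname(which(cum >= threshold)[1])   # first component at or above it

# Scores restricted to the retained components
reduced <- pca$x[, seq_len(n.comp), drop = FALSE]
```

Applied to `metabo.pca`, the same lines would report how many of its components are needed for the chosen threshold.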
Plotting
Usage of pca3d
pca2d {pca3d} R Documentation
Show a three- or two-dimensional plot of a prcomp object
Description
Show a three- or two-dimensional plot of a prcomp object or a matrix, using different symbols and colors for groups of data
Usage
pca3d(pca, components = 1:3, col = NULL, title = NULL, new = FALSE,
axes.color = "grey", bg = "white", radius = 1, group = NULL,
shape = NULL, palette = NULL, fancy = FALSE, biplot = FALSE,
biplot.vars = 5, legend = NULL, show.scale = FALSE,
show.labels = FALSE, labels.col = "black", show.axes = TRUE,
show.axe.titles = TRUE, axe.titles = NULL, show.plane = TRUE,
show.shadows = FALSE, show.centroids = FALSE, show.group.labels = FALSE,
show.shapes = TRUE, show.ellipses = FALSE, ellipse.ci = 0.95)
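pca3d(PCA object, grouping factor, whether to show confidence ellipses, confidence level of the ellipses (default 0.95, i.e. 95% ellipses), whether to show the separating planes)
pca3d(metabo.pca, group=groups, show.ellipses=TRUE, ellipse.ci=0.75, show.plane=FALSE)
The result is a 3D plot that can be rotated and zoomed in or out with the mouse until the clearest viewing angle is found.
Remove the surrounding separating planes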
pca3d(metabo.pca, group=groups, show.ellipses=TRUE, ellipse.ci=0.75, show.plane=FALSE)
Result: