Following on from the previous workflow, we identify anchors with the FindIntegrationAnchors() function, which takes a list of Seurat objects as input, and then use those anchors to integrate the two datasets with the IntegrateData() function.
ifnb.list
# $CTRL
# An object of class Seurat
# 14053 features across 6548 samples within 1 assay
# Active assay: RNA (14053 features, 2000 variable features)
#
# $STIM
# An object of class Seurat
# 14053 features across 7451 samples within 1 assay
# Active assay: RNA (14053 features, 2000 variable features)
immune.anchors <- FindIntegrationAnchors(object.list = ifnb.list, anchor.features = features)
# this command creates an 'integrated' data assay
immune.combined <- IntegrateData(anchorset = immune.anchors)
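For context, once the integrated assay exists, the standard downstream steps from the public Seurat integration vignette are roughly as follows (a hedged sketch, not part of the source walkthrough below; the parameter values are the vignette defaults):
# downstream of IntegrateData(): analyse the batch-corrected 'integrated' assay
DefaultAssay(immune.combined) <- "integrated"
immune.combined <- ScaleData(immune.combined, verbose = FALSE)
immune.combined <- RunPCA(immune.combined, npcs = 30, verbose = FALSE)
immune.combined <- RunUMAP(immune.combined, reduction = "pca", dims = 1:30)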
Let's first walk through the code of FindIntegrationAnchors() to see how it is implemented.
step1: Arguments. The Seurat objects and the highly variable genes are passed to object.list and anchor.features; everything else is left at its default. See the Seurat documentation for what each parameter means.
object.list = ifnb.list
anchor.features = features
assay = NULL
reference = NULL
scale = TRUE
normalization.method = "LogNormalize"
sct.clip.range = NULL
reduction = "cca"
l2.norm = TRUE
dims = 1:30
k.anchor = 5
k.filter = 200
k.score = 30
max.features = 200
nn.method = "annoy"
n.trees = 50
eps = 0
verbose = TRUE
step2: Parallel execution. FindIntegrationAnchors() can run in parallel; we skip this for now (a sketch of how to switch it on follows the code below). See https://satijalab.org/seurat/articles/future_vignette.html for details.
my.lapply <- ifelse(
test = verbose && future::nbrOfWorkers() == 1,
yes = pbapply::pblapply,
no = future.apply::future_lapply
)
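If parallel execution is wanted, the future vignette linked above suggests setting a plan before calling FindIntegrationAnchors(); a hedged sketch (the worker count and memory limit are arbitrary choices, not values from this walkthrough):
library(future)
plan("multisession", workers = 4)
# the object list is shipped to the workers, so the globals limit may need raising (here ~4 GB)
options(future.globals.maxSize = 4000 * 1024^2)
immune.anchors <- FindIntegrationAnchors(object.list = ifnb.list, anchor.features = features)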
step3: Input validation. Check the number of cells per sample: the default uses 30 dimensions (dims = 1:30), and every sample must contain more cells than that (a sketch of this check follows the code below). Also make sure that each Seurat object's Integration tool slot is empty and that cell barcodes are not duplicated across samples.
object.ncells <- sapply(X = object.list, FUN = function(x) dim(x = x)[2])
object.ncells
# CTRL STIM
# 6548 7451
assay <- sapply(X = object.list, FUN = DefaultAssay)
assay
# CTRL STIM
# "RNA" "RNA"
# check tool
object.list <- lapply(
X = object.list,
FUN = function (obj) {
slot(object = obj, name = "tools")$Integration <- NULL
return(obj)
})
object.list <- CheckDuplicateCellNames(object.list = object.list)
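The cell-count requirement mentioned above amounts to a check along these lines; this is a simplified sketch, not Seurat's verbatim error message:
if (!all(object.ncells > max(dims))) {
  bad <- which(object.ncells <= max(dims))
  stop("Objects ", paste(names(object.ncells)[bad], collapse = ", "),
       " contain fewer cells than max(dims) = ", max(dims))
}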
step4: Scaling of gene expression. ScaleData is run on the highly variable genes of each sample's Seurat object.
slot <- "data"
if (scale) {
if (verbose) {
message("Scaling features for provided objects")
}
object.list <- my.lapply(
X = object.list,
FUN = function(object) {
ScaleData(object = object, features = anchor.features, verbose = FALSE)
}
)
}
# inspect the matrix after scaling
head(object.list[['CTRL']]@assays$RNA@scale.data[anchor.features,1:5])
step5: Using the cca reduction, first work out how many pairwise sample combinations there are. expand.grid() returns every possible combination of its inputs, so 2 samples give exactly 1 pair. Then determine the offsets used when indexing anchors: each sample's offset is the cumulative cell count of all preceding samples (a three-sample illustration follows the code below).
For example, with two samples of 6548 and 7451 cells, the offsets are the cumulative counts of the preceding samples, 0 and 6548; the first sample's offset is always 0.
nn.reduction <- reduction
# determine pairwise combinations
combinations <- expand.grid(1:length(x = object.list), 1:length(x = object.list))
combinations <- combinations[combinations$Var1 < combinations$Var2, , drop = FALSE]
combinations
# Var1 Var2
# 3 1 2
# determine the proper offsets for indexing anchors
objects.ncell <- sapply(X = object.list, FUN = ncol)
objects.ncell
# CTRL STIM
# 6548 7451
offsets <- as.vector(x = cumsum(x = c(0, objects.ncell)))[1:length(x = object.list)]
offsets
# [1] 0 6548
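To make the combination/offset logic concrete, here is the same computation with a hypothetical third sample of 5000 cells (illustrative only, not part of the ifnb data):
ncells3 <- c(CTRL = 6548, STIM = 7451, EXTRA = 5000)
comb3 <- expand.grid(1:3, 1:3)
comb3 <- comb3[comb3$Var1 < comb3$Var2, , drop = FALSE]
comb3
#   Var1 Var2
# 4    1    2
# 7    1    3
# 8    2    3
as.vector(cumsum(c(0, ncells3)))[1:3]
# [1]     0  6548 13999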
step6: Because no reference is given, anchors are searched for between all pairs of samples by calling the core anchoring function, anchoring.fxn, on every combination. anchoring.fxn is fairly involved and in turn calls RunCCA, FindAnchors and other functions.
Our data has a single pair (1-2), so we can use this example to step through anchoring.fxn piece by piece.
step6-1: object.1 and object.2 are the two samples' Seurat objects restricted to the highly variable genes, carrying the data in the RNA assay. A ToIntegrate assay is then created from it, holding the same normalized and scaled data.
object.1 <- DietSeurat(
object = object.list[[i]],
assays = assay[i],
features = anchor.features,
counts = FALSE,
scale.data = TRUE,
dimreducs = reduction
)
object.2 <- DietSeurat(
object = object.list[[j]],
assays = assay[j],
features = anchor.features,
counts = FALSE,
scale.data = TRUE,
dimreducs = reduction
)
head(object.1@assays$RNA@data[1:5,1:5])
head(object.1@assays$RNA@scale.data[1:5,1:5])
# suppress key duplication warning
suppressWarnings(object.1[["ToIntegrate"]] <- object.1[[assay[i]]])
DefaultAssay(object = object.1) <- "ToIntegrate"
object.1 <- DietSeurat(object = object.1, assays = "ToIntegrate", scale.data = TRUE, dimreducs = reduction)
suppressWarnings(object.2[["ToIntegrate"]] <- object.2[[assay[j]]])
DefaultAssay(object = object.2) <- "ToIntegrate"
object.2 <- DietSeurat(object = object.2, assays = "ToIntegrate", scale.data = TRUE, dimreducs = reduction)
head(object.1@assays$ToIntegrate@data[1:5,1:5])
head(object.1@assays$ToIntegrate@scale.data[1:5,1:5])
step6-2: Paired anchors are found differently for the cca, pca and lsi reductions. This step calls RunCCA with the object.1 and object.2 built above; we only follow the cca branch here.
object.pair <- switch(
EXPR = reduction,
'cca' = {
object.pair <- RunCCA(
object1 = object.1,
object2 = object.2,
assay1 = "ToIntegrate",
assay2 = "ToIntegrate",
features = anchor.features,
num.cc = max(dims),
renormalize = FALSE,
rescale = FALSE,
verbose = verbose
)
if (l2.norm){
object.pair <- L2Dim(object = object.pair, reduction = reduction)
reduction <- paste0(reduction, ".l2")
nn.reduction <- reduction
}
reduction.2 <- character()
object.pair
},
'pca' = {... },
'lsi' = {... },
stop("Invalid reduction parameter. Please choose either cca, rpca, or rlsi")
)
step6-3: Into RunCCA. RunCCA is an S3 generic, and since object.1 and object.2 are Seurat objects, dispatch first goes to RunCCA.Seurat.
Arguments passed to RunCCA.Seurat:
object1 = object.1
object2 = object.2
assay1 = "ToIntegrate"
assay2 = "ToIntegrate"
features = anchor.features
num.cc = max(dims)
renormalize = FALSE
rescale = FALSE
compute.gene.loadings = TRUE
add.cell.id1 = NULL
add.cell.id2 = NULL
verbose = TRUE
RunCCA.Seurat is fairly easy to follow: it extracts the scaled expression matrices of the highly variable genes from object.1 and object.2 into data1 and data2. One detail worth noting: CheckFeatures reduces the highly variable genes from 2000 to 1866.
# ToIntegrate
assay1 <- assay1 %||% DefaultAssay(object = object1)
assay2 <- assay2 %||% DefaultAssay(object = object2)
nfeatures <- length(x = features)
if (!(rescale)) {
data.use1 <- GetAssayData(object = object1, assay = assay1, slot = "scale.data")
data.use2 <- GetAssayData(object = object2, assay = assay2, slot = "scale.data")
features <- CheckFeatures(data.use = data.use1, features = features, object.name = "object1", verbose = FALSE)
length(features)
# [1] 1938
features <- CheckFeatures(data.use = data.use2, features = features, object.name = "object2", verbose = FALSE)
length(features)
# [1] 1866
data1 <- data.use1[features, ]
data2 <- data.use2[features, ]
}
cca.results <- RunCCA(
object1 = data1,
object2 = data2,
standardize = TRUE,
num.cc = num.cc,
verbose = verbose
)
問(wèn)題1:CheckFeatures函數(shù)做了什么操作,為什么要處理酱吝?
對(duì)于一個(gè)高可變基因縮放后的表達(dá)矩陣(基因數(shù)為2000)也殖,在兩個(gè)樣本,分別按列計(jì)算每個(gè)基因在所有細(xì)胞的方差务热,去除方差為0的基因忆嗜,去除后剩下1866個(gè)。這些基因的scale.data數(shù)值也為零值崎岂。這些在某個(gè)樣本均為零值的基因會(huì)影響下游找錨點(diǎn)的步驟捆毫。
features.var <- RowVar(x = data.use[features, ])
no.var.features <- features[features.var == 0]
features <- setdiff(x = features, y = no.var.features)
features <- features[!is.na(x = features)]
/* Calculates the variance of rows of a matrix */
// [[Rcpp::export(rng = false)]]
NumericVector RowVar(Eigen::Map<Eigen::MatrixXd> x){
NumericVector out(x.rows());
for(int i=0; i < x.rows(); ++i){
Eigen::ArrayXd r = x.row(i).array();
double rowMean = r.mean();
out[i] = (r - rowMean).square().sum() / (x.cols() - 1);
}
return out;
}
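To make the filtering concrete, here is the same idea in plain R on a made-up toy matrix (illustrative only; apply(x, 1, var) is the base-R counterpart of the C++ RowVar above):
toy <- rbind(geneA = c(1.2, -0.3, 0.8), geneB = c(0, 0, 0), geneC = c(2, 2, 1))
toy.var <- apply(toy, 1, var)   # per-gene (row-wise) variance across cells
rownames(toy)[toy.var > 0]      # zero-variance geneB is dropped
# [1] "geneA" "geneC"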
進(jìn)入到RunCCA.default函數(shù),查看該函數(shù)该镣,cells1和cells2為每個(gè)樣本的細(xì)胞barcode信息冻璃,standardize步驟是針對(duì)每個(gè)樣本高可變基因的縮放數(shù)據(jù)默認(rèn)進(jìn)行standardize標(biāo)準(zhǔn)化响谓。
問(wèn)題2:standardize標(biāo)準(zhǔn)化具體如何執(zhí)行的损合,為何要做此操作省艳?
standardize的源碼位置詳見(jiàn)鏈接,通過(guò)Rcpp調(diào)用C++代碼嫁审,加速計(jì)算運(yùn)行跋炕。
# arguments passed in
object1 = data1
object2 = data2
standardize = TRUE
num.cc = 30
verbose = TRUE
# main code
set.seed(seed = 42)
cells1 <- colnames(x = object1)
cells2 <- colnames(x = object2)
if (standardize) {
object1 <- Standardize(mat = object1, display_progress = FALSE)
object2 <- Standardize(mat = object2, display_progress = FALSE)
}
The Standardize source:
/* Performs column scaling and/or centering. Equivalent to using scale(mat, TRUE, apply(x,2,sd)) in R.
Note: Doesn't handle NA/NaNs in the same way the R implementation does, */
// [[Rcpp::export(rng = false)]]
NumericMatrix Standardize(Eigen::Map<Eigen::MatrixXd> mat, bool display_progress = true){
Progress p(mat.cols(), display_progress);
NumericMatrix std_mat(mat.rows(), mat.cols());
for(int i=0; i < mat.cols(); ++i){
p.increment();
Eigen::ArrayXd r = mat.col(i).array();
double colMean = r.mean();
double colSdev = sqrt((r - colMean).square().sum() / (mat.rows() - 1));
NumericMatrix::Column new_col = std_mat(_, i);
for(int j=0; j < new_col.size(); j++) {
new_col[j] = (r[j] - colMean) / colSdev;
}
}
return std_mat;
}
How should we read this Standardize function? It is equivalent to scale(object, TRUE, apply(object, 2, sd)): each column of the matrix (each cell) is centred and then divided by that column's standard deviation over the highly variable genes.
How does Standardize differ from ScaleData?
ScaleData makes each gene's expression have mean 0 and variance 1 across all cells, i.e. it standardizes a single gene over all cells (row-wise);
Standardize instead operates on a single cell over the whole variable-gene set (column-wise): each column is centred and divided by its own standard deviation, apply(object, 2, sd). A toy contrast of the two follows the verification code below.
testthat::expect_equal(Standardize(object1, display_progress = FALSE), scale(object1, TRUE, apply(object1, 2, sd)),check.attributes = FALSE)
dim(scale(object1, TRUE, apply(object1, 2, sd)))
# [1] 1866 6548
length(apply(object1, 2, sd))
# [1] 6548
data1[1:5, 1:5]
object1[1:5, 1:5]
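A toy contrast of the two scalings on a made-up 3 x 2 matrix (illustrative only; ScaleData additionally clips extreme values, which is ignored here):
m <- matrix(c(1, 4, 2, 8, 3, 6), nrow = 3,
            dimnames = list(c("g1", "g2", "g3"), c("cell1", "cell2")))
t(scale(t(m)))                     # per gene (row): what ScaleData does, conceptually
scale(m, TRUE, apply(m, 2, sd))    # per cell (column): what Standardize does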
Back in RunCCA.default, the remaining steps are:
1) crossprod computes the cross-product of object1 and object2 into mat3; crossprod(a, b) is equivalent to t(a) %*% b but faster;
2) the irlba package is used: an R package for fast truncated singular value decomposition and principal component analysis of large dense and sparse matrices. The point of SVD is to extract the essential structure of a complex matrix and represent it with a much simpler one; when dimensionality becomes unmanageable, SVD and PCA are the standard tools for reducing it;
3) cca.svd$u is the embedding of the first sample's cells and cca.svd$v the embedding of the second sample's cells.
U, V, S are the matrices corresponding to the estimated left and right singular vectors, and diagonal matrix of estimated singular values, respectively.
After all of these steps, the 2000 x Ncells matrix has finally been reduced to a 30-dimensional representation per cell, i.e. from 2000 dimensions down to 30.
As an aside, this is also the essence of CCA (Canonical Correlation Analysis) here: Seurat combines one pair of datasets (two samples) at a time and projects each high-dimensional dataset into a shared lower-dimensional subspace. A small sanity check of points 1) and 2) follows the code below.
mat3 <- crossprod(x = object1, y = object2)
cca.svd <- irlba(A = mat3, nv = num.cc)
cca.data <- rbind(cca.svd$u, cca.svd$v)
colnames(x = cca.data) <- paste0("CC", 1:num.cc)
rownames(cca.data) <- c(cells1, cells2)
cca.data <- apply(
X = cca.data,
MARGIN = 2,
FUN = function(x) {
if (sign(x[1]) == -1) {
x <- x * -1
}
return(x)
}
)
cca.svd$d
# [1] 651320.30 102499.22 94495.34 84746.89 52127.27 49764.53 41900.86
# [8] 38378.88 35422.20 34252.97 32273.63 25540.07 25339.30 23556.11
# [15] 22299.70 22187.58 20179.09 19572.33 18928.07 17757.20 16859.69
# [22] 16709.28 16388.62 16282.70 15720.46 15625.57 15478.67 15203.39
# [29] 14970.87 14797.46
return(list(ccv = cca.data, d = cca.svd$d))
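A small sanity check of points 1) and 2) on random toy matrices (illustrative only; the sizes are arbitrary):
library(irlba)
set.seed(1)
a <- matrix(rnorm(100 * 60), nrow = 100)   # think: genes x cells of sample 1
b <- matrix(rnorm(100 * 80), nrow = 100)   # think: genes x cells of sample 2
all.equal(crossprod(a, b), t(a) %*% b)     # TRUE -- crossprod(a, b) == t(a) %*% b
s1 <- irlba(A = crossprod(a, b), nv = 5)   # truncated SVD: only the top 5 singular triplets
s2 <- svd(crossprod(a, b))                 # full SVD for comparison
all.equal(s1$d, s2$d[1:5], tolerance = 1e-4)  # leading singular values agree closely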
step7: With RunCCA.default done, control returns to RunCCA.Seurat, and the CCA result from above is assigned to cca.results.
From cca.results a combined.object Seurat object is constructed. Since compute.gene.loadings defaults to TRUE, ProjectDim is then called to compute each gene's loading on CC_1 through CC_30, which makes it easy to see which genes carry the most weight in each of the 30 dimensions; a usage sketch follows the code below.
cca.results <- list(ccv = cca.data, d = cca.svd$d)
combined.object <- merge(
x = object1,
y = object2,
merge.data = TRUE
)
dim(combined.object)
# [1] 2000 13999
rownames(x = cca.results$ccv) <- Cells(x = combined.object)
colnames(x = data1) <- Cells(x = combined.object)[1:ncol(x = data1)]
colnames(x = data2) <- Cells(x = combined.object)[(ncol(x = data1) + 1):length(x = Cells(x = combined.object))]
combined.object[['cca']] <- CreateDimReducObject(
embeddings = cca.results$ccv[colnames(combined.object), ],
assay = assay1,
key = "CC_"
)
combined.object[['cca']]@assay.used <- DefaultAssay(combined.object)
combined.scale <- cbind(data1,data2)
combined.object <- SetAssayData(object = combined.object,new.data = combined.scale, slot = "scale.data")
if (compute.gene.loadings) {
combined.object <- ProjectDim(
object = combined.object,
reduction = "cca",
verbose = FALSE,
overwrite = TRUE)
}
combined.object
# An object of class Seurat
# 2000 features across 13999 samples within 1 assay
# Active assay: ToIntegrate (2000 features, 0 variable features)
# 1 dimensional reduction calculated: cca
new.feature.loadings.full <- data.use %*% cell.embeddings
head(new.feature.loadings.full)
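A hedged usage sketch for the projected loadings computed above: ranking which genes dominate the first canonical vector. This assumes genes are the rows and the first column corresponds to CC_1; the same matrix should also be retrievable with Loadings(combined.object[["cca"]], projected = TRUE).
head(sort(abs(new.feature.loadings.full[, 1]), decreasing = TRUE), n = 10)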
step8: Switching back to anchoring.fxn, the combined.object obtained above is assigned to object.pair. In addition, by default the reduced-dimension embeddings are then L2-normalized (l2.norm = TRUE).
object.pair <- combined.object  # RunCCA.Seurat returns combined.object (with the 'cca' reduction)
if (l2.norm){
object.pair <- L2Dim(object = object.pair, reduction = reduction)
reduction <- paste0(reduction, ".l2")
nn.reduction <- reduction
}
reduction.2 <- character()
問(wèn)題3:為什么需要對(duì)典型相關(guān)向量進(jìn)行 L2-normalization脾拆?
Canonical correlation vectors (CCV) project the two datasets into a correlated low-dimensional space, but global differences in scale (for example, differences in normalization between datasets) can still preclude comparing CCV across datasets. To address this, we perform L2-normalization of the cell embeddings.
We perform canonical correlation analysis, followed by L2 normalization of the canonical correlation vectors, to project the datasets into a subspace defined by shared correlation structure across datasets.
典型相關(guān)向量 (CCV) 將兩個(gè)數(shù)據(jù)集投影到相關(guān)的低維空間中,但全局scale差異(例如给梅,數(shù)據(jù)集之間的歸一化差異)仍然可以進(jìn)行跨數(shù)據(jù)集的 CCV 比較假丧。為了解決這個(gè)問(wèn)題,我們對(duì)細(xì)胞嵌入進(jìn)行 L2 歸一化动羽。我們執(zhí)行典型相關(guān)分析包帚,然后對(duì)典型相關(guān)向量進(jìn)行 L2 歸一化,以將數(shù)據(jù)集投影到由跨數(shù)據(jù)集的共享相關(guān)結(jié)構(gòu)定義的子空間中运吓。
當(dāng)批次效應(yīng)與細(xì)胞狀態(tài)之間的生物學(xué)差異具有相似的模式時(shí)渴邦,為了克服這個(gè)問(wèn)題,我們首先使用對(duì)角化 CCA 聯(lián)合降低兩個(gè)數(shù)據(jù)集的維數(shù)拘哨,然后將 L2 歸一化應(yīng)用于典型相關(guān)向量谋梭。
step9: Finally FindAnchors is called. That function is also fairly involved and calls yet more helpers; it deserves a write-up of its own later.
At this point the whole FindIntegrationAnchors function has finished.
anchors <- FindAnchors(
object.pair = object.pair,
assay = c("ToIntegrate", "ToIntegrate"),
slot = slot,
cells1 = colnames(x = object.1),
cells2 = colnames(x = object.2),
internal.neighbors = internal.neighbors,
reduction = reduction,
reduction.2 = reduction.2,
nn.reduction = nn.reduction,
dims = dims,
k.anchor = k.anchor,
k.filter = k.filter,
k.score = k.score,
max.features = max.features,
nn.method = nn.method,
n.trees = n.trees,
eps = eps,
verbose = verbose
)
anchors[, 1] <- anchors[, 1] + offsets[i]
anchors[, 2] <- anchors[, 2] + offsets[j]
return(anchors)
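Once FindIntegrationAnchors() has returned, the anchors found for every sample pair end up in the returned AnchorSet object; a hedged way to peek at them (assuming the table lives in the @anchors slot, holding the two cell indices -- already shifted by the offsets -- plus an anchor score):
head(immune.anchors@anchors)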
相關(guān)資料:
1.https://github.com/satijalab/seurat
2.https://www.cell.com/cell/pdf/S0092-8674(19)30559-8.pdf