最近真的是懶得要死第岖,很多東西真的是做了沒(méi)有及時(shí)整理出來(lái),像這篇文獻(xiàn)已經(jīng)讀過(guò)了侵续,方法也去嘗試過(guò)了,但是就是沒(méi)有整理出來(lái)憨闰,正所謂好記性不如爛筆頭状蜗。因?yàn)樽约阂恢痹诜治鰡渭?xì)胞測(cè)序的數(shù)據(jù),總是畫(huà)了聚類(lèi)圖后鹉动,想著怎么鑒定出細(xì)胞類(lèi)型來(lái)轧坎,然后剛好讀到這篇文獻(xiàn),講了用CHETAH泽示,基于已發(fā)表的文獻(xiàn)的鑒定出來(lái)的細(xì)胞類(lèi)型來(lái)鑒定缸血,這種感覺(jué)就像是以前用已知數(shù)據(jù)來(lái)訓(xùn)練未知數(shù)據(jù)一樣,我感覺(jué)隨著單細(xì)胞測(cè)序數(shù)據(jù)越來(lái)越多械筛,這種鑒定方法將成為一種趨勢(shì)捎泻,以后應(yīng)該會(huì)用更好的算法來(lái)實(shí)現(xiàn),我想想自己就是個(gè)菜鳥(niǎo)哈哈哈变姨,能用能看懂就很開(kāi)心呢了谜诫。
一历涝、文章總體思路
文章的思路:
A:利用已知的數(shù)據(jù),構(gòu)建一棵分類(lèi)樹(shù)聚请。比如一套數(shù)據(jù)中有四種細(xì)胞類(lèi)型 RP1 RP2 RP3 RP4,文章首先是計(jì)算這四種細(xì)胞類(lèi)型的平均值怒竿,然后算著四種類(lèi)型細(xì)胞的相關(guān)性砍鸠,然后進(jìn)行層次聚類(lèi),構(gòu)建出分類(lèi)樹(shù)耕驰。
B: 拿到一個(gè)cell j爷辱,比如假定說(shuō)屬于RP1 類(lèi),那么久計(jì)算這個(gè)細(xì)胞和RP1的相關(guān)性朦肘,也要計(jì)算cell j與除RP1 以外的 【RP3 RP4】(將這兩個(gè)當(dāng)作一個(gè)整體)的相關(guān)性饭弓,為什么不計(jì)算RP2呢,因?yàn)槲沂沁@么想的媒抠,RP2 和RP1屬于都一個(gè)根下弟断,所以都加進(jìn)去計(jì)算會(huì)不太好吧,具體原理不是很清楚趴生。這里計(jì)算完之后阀趴,因?yàn)檫@個(gè)算法是對(duì)每個(gè)細(xì)胞選定前200個(gè)差異基因(RP1 對(duì)RP3 RP4的差異基因昏翰,文章說(shuō)效果比較好)來(lái)計(jì)算相關(guān)性,因此每個(gè)細(xì)胞的200個(gè)基因是不一樣刘急,所以作者這邊就計(jì)算RP1 【RP3 RP4】的理論分布棚菊,然后把相關(guān)性轉(zhuǎn)化為落在理論分布的profile score,之后還計(jì)算了像p值一樣的confidence score叔汁。
C:就是對(duì)每個(gè)細(xì)胞做上述過(guò)程统求,然后得到cell type.
細(xì)胞類(lèi)型的鑒定過(guò)程取決于你所選的reference data的可靠性咯。
作者還比較這個(gè)算法與其他方法的結(jié)果攻柠,與之前的SingleR的結(jié)果不相上下球订,但是SingleR是用bulk RNA數(shù)據(jù)。
二瑰钮、代碼實(shí)現(xiàn)
1. 網(wǎng)址
vignette("CHETAH_introduction")
2. 輸入輸出
1)輸入
If you have your data stored as SingleCellExperiments, continue to the next step. Otherwise, you need the following data before you begin:
- 輸入你要分析的單細(xì)胞數(shù)據(jù)矩陣【 input scRNA-seq count data of the cells to be classified a data.frame or matrix, with cells in the columns and genes in the rows】
- 標(biāo)準(zhǔn)化后的參考的單細(xì)胞數(shù)據(jù)矩陣(!) 【normalized scRNA-seq count data of reference cells in similar format as the input】
- 參考的單細(xì)胞數(shù)據(jù)的細(xì)胞類(lèi)型【the cell types of the reference cells
a named character vector (names corresponding to the colnames of the reference counts)】
- 可選用來(lái)可視化的要分析的單細(xì)胞數(shù)據(jù)的2維的TSNE PCA數(shù)值【(optional) a 2D reduced dimensional representation of your input data for visualization: e.g. t-SNE1, PCA冒滩,a two-column matrix/data.frame, with the cells in the rows and the two dimensions in the columns】
2)輸出
CHETAH constructs a classification tree by hierarchically clustering the reference data. The classification is guided by this tree. In each node of the tree, input cells are either assigned to the right, or the left branch. A confidence score is calculated for each of these assignments. When the confidence score for an assignment is lower than the threshold (default = 0.1), the classification for that cell stops in that node.
This results in two types of classifications:
- 1.最終的細(xì)胞類(lèi)型【 final types: Cells that are classified to one of the leaf nodes of the tree (i.e. a cell type of the reference)】
- 中間細(xì)胞類(lèi)型,其實(shí)就沒(méi)辦法細(xì)分的細(xì)胞類(lèi)型浪谴,也就是處于根的時(shí)候开睡,左右兩邊的細(xì)胞類(lèi)型的得分相似」冻埽【intermediate types: Cells that had a confidence score lower than the threshold in a certain node are assigned to that intermediate node of the tree. This happens when a cell has approximately the same similarity to the cell types in right and the left branch of that node.】
- 3.中間細(xì)胞類(lèi)型在圖中的顯示就是Unassigned “Node1”, “Node2”等
【CHETAH generates generic names for the intermediate types: Unassigned for cells that are classified to the very first node, and “Node1”, “Node2”, etc for the additional nodes】
2.代碼實(shí)例
(1)安裝加載包
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("CHETAH")
vignette("CHETAH_introduction")
library(CHETAH)
(2) 構(gòu)建reference data
1) 首先你需要下載你要參考的單細(xì)胞的數(shù)據(jù)篇恒,我是從Mouse Cell Atlas中下載小鼠腦的數(shù)據(jù),然后導(dǎo)入R中分析
setwd("數(shù)據(jù)存的路徑/Mouse Cell Atls")
adult_brain_exp <- read.csv("adult_brain_cell_exp.csv",header=T)
adult_cell_type <- read.csv("adult_brain_cell.csv",header=T)
## filter gene by a gene is expressed at least in 50 cells 因?yàn)榛驍?shù)目比較多凶杖,我就把在少于50個(gè)細(xì)胞中表達(dá)的基因過(guò)濾了
row_sum <- apply(adult_brain_exp[,-1],1,sum)
filter_exp <- adult_brain_exp[row_sum>50,]#一個(gè)基因至少在50個(gè)細(xì)胞中表達(dá)
genename <- as.character(filter_exp[,1])#genename中有兩個(gè)2-Mar 基因名字
final_exp <- as.matrix(filter_exp[,-1])
index <- which(genename=="2-Mar")
genename[index[1]]<- "2-Mar.1"
rownames(final_exp)<-genename
final_exp <- na.omit(final_exp) # 我數(shù)據(jù)中不知道怎么有缺失值胁艰,然后之前跑一直不成功,我試了那么久才發(fā)現(xiàn)
2)我把數(shù)據(jù)進(jìn)行標(biāo)準(zhǔn)化和中心化智蝠,我使用seurat來(lái)做的腾么,可以自己寫(xiě)代碼
#Normalization and Scale data and use the Top 3000 variable gene
library(Seurat)
exp_1 <- CreateSeuratObject(counts = final_exp)
exp_1 <- SCTransform(exp_1)
exp_2 <- exp_1@assays$SCT@scale.data
3) 因?yàn)閰⒖嫉臄?shù)據(jù)里面可能不同細(xì)胞類(lèi)型的數(shù)目存在差異,作者認(rèn)為細(xì)胞數(shù)目在10-200之間是比較好杈湾,計(jì)算量不會(huì)太大解虱,所以有一步選擇細(xì)胞的過(guò)程。我原始數(shù)據(jù)中有一類(lèi)細(xì)胞有1000多漆撞,如果不選的話殴泰,會(huì)導(dǎo)致構(gòu)建分類(lèi)樹(shù)的時(shí)候,結(jié)果不準(zhǔn)確浮驳。
# Cell selction
cell_selection <- unlist(lapply(unique(adult_cell_type$cell.type), function(type) {
type_cells <- which(adult_cell_type$cell.type == type)
if (length(type_cells) > 200) {
type_cells[sample(length(type_cells), 200)]
} else type_cells
}))
4) 構(gòu)建reference data 并用自身數(shù)據(jù)進(jìn)行分類(lèi)看看結(jié)果
## make reference data
ref_count <- exp_2[,cell_selection]
ref_cellid <- adult_cell_type$cell.id[cell_selection]
ref_celltype <- adult_cell_type$cell.type[cell_selection]
all.equal(as.character(ref_cellid), colnames(ref_count))
adult_brain_ref <- SingleCellExperiment(assays = list(counts = ref_count),
colData = DataFrame(celltypes = ref_celltype))
#Reference QC
CorrelateReference(ref_cells = adult_brain_ref)
ClassifyReference(ref_cells = adult_brain_ref)
(3) 導(dǎo)入自己的數(shù)據(jù)
雖然文章沒(méi)有說(shuō)自己的數(shù)據(jù)需不需要標(biāo)準(zhǔn)化悍汛,但是我還是做了,我感覺(jué)沒(méi)有多大影響
## input data
input_count <- readsCountSM_TSNE@assays$RNA@counts
## Normalization
input_count <- apply(input_count,2,function(column) log2((column/sum(column) * 100000) + 1))
reduced_tsne <- Embeddings(readsCountSM_TSNE, reduction = "tsne")#使用embeddings 函數(shù)來(lái)調(diào)用tsne 值
all.equal(rownames(reduced_tsne), colnames(input_count))
input_drug <- SingleCellExperiment(assays = list(counts = input_count),
reducedDims = SimpleList(TSNE = reduced_tsne))
(4)分類(lèi)至会,然后把分類(lèi)結(jié)果導(dǎo)入從Seurat得到的object中离咐, 就可以了
## Classfication
input_drug <- CHETAHclassifier(input = input_drug,
ref_cells = adult_brain_ref)
PlotCHETAH(input_drug)
input_drug <- Classify(input_drug, 0)
PlotCHETAH(input = input_drug, tree = FALSE)
pdf("cell_type_analyzed_by_adult_brain_type_from_Mouse_cell_atlas.pdf",8,8)
DimPlot(object = readsCountSM_TSNE, reduction = 'tsne',label = TRUE,group.by = 'ident',shape.by="orig.ident",repel=TRUE,label.size=5,pt.size=1.5)
dev.off()
levels(readsCountSM_TSNE$orig.ident)[1]="3-HB"
levels(readsCountSM_TSNE$orig.ident)[2]="Control"
Sample_cluster(readsCountSM_TSNE)
3. 總結(jié)
我感覺(jué)這個(gè)方法還是不錯(cuò)的,結(jié)果還是挺準(zhǔn)的奋献。但是還是取決于所參考的數(shù)據(jù)健霹。