單細胞項目:來自于30個病人的49個組織樣品掂骏,跨越3個治療階段
Therapy-Induced Evolution of Human Lung Cancer Revealed by Single-Cell RNA Sequencing
這篇教程我將分為四個階段完整的闡述單細胞的主流的下游分析流程
- 數(shù)據(jù)預(yù)處理(數(shù)據(jù)準備階段)
- seurat 基礎(chǔ)分析
- 免疫細胞識別
- inferCNV 的實現(xiàn)
先來第一部分數(shù)據(jù)預(yù)處理
這里的數(shù)據(jù)預(yù)處理很自由,其中一部分是必須按照嚴格的標準進行比如計算外源基因墓赴,線粒體, 核糖體等等
# 導(dǎo)入包
library(Seurat)
library(dplyr)
library(Matrix)
library(magrittr)
library(stringr)
library(tidyverse)
options(stringsAsFactors = F)
options(as.is = T)
setwd('/Users/yuansh/Desktop/單細胞/scell_lung_adenocarcinoma-master')
文章的話一共是有 2 個表達矩陣需要處理
我們這里的話要很明確的知道表達譜的預(yù)處理到底預(yù)處理什么東西
- 看一下 count 矩陣是否是 hista 比對結(jié)果眠副,如果是需要剔除 hista 比對的注釋信息。(有的也是已經(jīng)處理完了东涡,但是也得檢查)
- 確定一下 cell 名的組成哩牍,構(gòu)建好 metadata(行名) 和 count(列名) 矩陣能夠?qū)?yīng)
- 過濾掉外源 RNA (ERCC)
- 合并成為 seurat 對象
- 計算線粒體(MT)棚潦,核糖體的 RNA(^RP[SL]) 比例并篩選數(shù)據(jù)
我們先處理第一個表達矩陣
# 這個是治療前的第一個表達矩陣
# 導(dǎo)入數(shù)據(jù)
raw.data = read.csv('csv_files/S01_datafinal.csv', header=T, row.names = 1)
metadata = read.csv("csv_files/S01_metacells.csv", row.names=1, header=T)
dim(raw.data)
dim(metadata)
判斷一下是否是 hista 比對的結(jié)果并且沒有處理,如果是的話后面 5 行是沒用的
然后我們發(fā)現(xiàn)并不是
繼續(xù)觀察發(fā)現(xiàn),raw.data 是 count 矩陣,metadata 是信息矩陣
count 矩陣的列名是 meta 矩陣的 well 和 plate 合并的
稍微對數(shù)據(jù)進行處理一下,然后保存膝昆,確保 count 矩陣的列名和 metadata 的行名一樣
這樣的話下次讀取就很快
> tail(raw.data[,1:4])
A10_1001000329 A10_1001000362 A10_1001000366 A10_1001000367
ZXDC 0 0 0 0
ZYG11A 0 0 0 0
ZYG11B 0 0 0 0
ZYX 292 0 0 0
ZZEF1 66 0 0 0
ZZZ3 0 0 0 0
> raw.data[1:4,1:4]
A10_1001000329 A10_1001000362 A10_1001000366 A10_1001000367
A1BG 3 0 0 0
A1BG-AS1 0 0 0 0
A1CF 0 0 0 0
A2M 68 0 0 0
> metadata[1:4,1:4]
well plate cell_id sample_name
1 A10 1001000329 A10_1001000329 LT_S07
2 A10 1001000362 A10_1001000362 LT_S12
3 A10 1001000366 A10_1001000366 LT_S13
4 A10 1001000367 A10_1001000367 LT_S13
paste0(metadata$well[1],'_',metadata$plate[1]) # 在進行大幅度修改的時候要判斷代碼有沒有寫錯
colnames(raw.data)[1]
# 修改一下 metadata 的行名
rownames(metadata) = paste0(metadata$well,'_',metadata$plate)
# 確定了行名metadata 的行名和 count 矩陣 raw.data 的列名相等
identical(colnames(raw.data),rownames(metadata))
接下來就是基因的處理丸边, 按照上面說的步驟:
1. 剔除外源基因,然后構(gòu)建 seurat 對象 2. 計算線粒體基因(有沒有都算就行了) 3. 計算核糖體基因
# 這里要過濾 ERCC(External RNA Control Consortium)
# ERCC 是外源 RNA 主要的作用就是用來質(zhì)控的
erccs = grep('^ERCC-', x= rownames(x=raw.data),value = T) # value = T 獲取名字
# 計算 ercc 比例
percent.ercc = Matrix::colSums(raw.data[erccs, ])/Matrix::colSums(raw.data)
fivenum(percent.ercc)
ercc.index = grep(pattern = "^ERCC-", x = rownames(x = raw.data), value = FALSE) # 獲取 index
# 刪除 Count 矩陣中的 ERCC
raw.data = raw.data[-ercc.index,]
dim(raw.data)
# [1] 26485 27489
構(gòu)建 seurat 對象
# 構(gòu)建對象
# min.cells 某一個基因至少在多少個基因中表達
# min.features 某個細胞至少表達多少個基因
min.cells = 0 # 3 因為文章沒有篩選外潜,所以這里設(shè)為0
min.features = 0 # 150 # 同上
sce = CreateSeuratObject(counts = raw.data,metadata = metadata,min.cells =min.cells,min.features =min.features)
sce
我們先看一下 sce 對象中的 meta.data,這個是構(gòu)建 seurat 對象自動生成的原环,缺少了很多信息,所以要把 metdata 添加上去
> sce@meta.data[1:5,1:5]
orig.ident nCount_RNA nFeature_RNA sample_name patient_id
A10_1001000329 SeuratProject 681122 2996 LT_S07 TH103
A10_1001000362 SeuratProject 30 17 LT_S12 TH172
A10_1001000366 SeuratProject 29 19 LT_S13 TH169
A10_1001000367 SeuratProject 286 3 LT_S13 TH169
A10_1001000372 SeuratProject 599 240 LT_S14 TH153
sce = AddMetaData(object = sce, metadata = metadata)
sce = AddMetaData(object = sce, percent.ercc, col.name = "percent.ercc")
# Head to check
head(sce@meta.data)
然后就是數(shù)據(jù)清洗
數(shù)據(jù)清洗原則: 1. 不符合質(zhì)量的:這里一般指細胞中基因數(shù)太少或者某些基因只在兩三個細胞中表達 2. 過濾線粒體 DNA占比過高的 3. 過濾核糖體占比過高的 在過濾前先計算
table(grepl("^MT-",rownames(sce)))
table(grepl("^RP[SL][[:digit:]]",rownames(sce)))
sce[["percent.mt"]] = PercentageFeatureSet(sce, pattern = "^MT-")
sce[["percent.rp"]] = PercentageFeatureSet(sce, pattern = "^RP[SL][[:digit:]]")
dim(sce)
summary(sce[["nCount_RNA"]])
summary(sce[["nFeature_RNA"]])
summary(sce[["percent.mt"]] )
summary(sce[["percent.ercc"]] )
summary(sce[["percent.rp"]] )
文章中是啥都沒去处窥,就正常的過濾了一下基因和表達過低的細胞
# 過濾篩選
# 一般根據(jù)實際情況進行篩選嘱吗,這次的篩選主要還是根據(jù)文章里面的流程
# 這個過濾蠻自由的,所以隨便搞一下就行
myData1 = subset(x=sce, subset = nCount_RNA > 50000 & nFeature_RNA > 500)
myData1
save(myData1,file='myData1.Rdata')
第一階段的處理流程就是這樣
接著處理第二個表達矩陣
# 這個是治療前的第一個表達矩陣
# 導(dǎo)入數(shù)據(jù)
raw.data = read.csv('Data_input/csv_files/neo-osi_rawdata.csv', header=T, row.names = 1)
metadata = read.csv("Data_input/csv_files/neo-osi_metadata.csv", row.names=1, header=T)
raw.data[1:4,1:4]
metadata[1:4,1:4]
判斷一下是否是 hista 比對的結(jié)果并且沒有處理滔驾,如果是的話后面 5 行是沒用的
然后發(fā)現(xiàn)果然是
繼續(xù)觀察發(fā)現(xiàn),第二套數(shù)據(jù)的命名規(guī)則和第一套有點不太一樣,多了_S10.homo,因此要把他去掉
稍微對數(shù)據(jù)進行處理一下,然后保存谒麦,確保 count 矩陣的列名和 metadata 的行名一樣 這樣的話下次讀取就很快
> tail(raw.data[,1:4])
A10_B000561_S10.homo A10_B000563_S10.homo A10_B000568_S10.homo A10_B001544_S10.homo
ZZZ3 0 0 0 0
__no_feature 31084 92098 131276 1197071
__ambiguous 353 2231 570 21247
__too_low_aQual 0 0 0 0
__not_aligned 0 0 0 0
__alignment_not_unique 67598 44778 47959 500993
> raw.data[1:4,1:4]
A10_B000561_S10.homo A10_B000563_S10.homo A10_B000568_S10.homo A10_B001544_S10.homo
A1BG 0 0 0 0
A1BG-AS1 0 0 0 0
A1CF 0 0 0 0
A2M 0 0 0 0
> metadata[1:4,1:4]
plate sample_name patient_id sample_type
1 B001567 AZ_01 AZ003 tumor
2 B001544 AZ_01 AZ003 tumor
3 B001546 AZ_01 AZ003 tumor
4 B001481 AZ_01 AZ003 tumor
然后我們進一步的發(fā)現(xiàn),這套數(shù)據(jù)好像是沒有 well 信息的 所以又要進一步處理一下哆致,確保count 矩陣的列名和 metadata 矩陣的行名可以對應(yīng)
gsub('_S[[:digit:]]*.homo','',colnames(raw.data))[1:5]
colnames(raw.data) = gsub('_S[[:digit:]]*.homo','',colnames(raw.data))
raw.data = raw.data[-grep('__',rownames(raw.data)),]
tail(raw.data[,1:4])
metadata$well[1] # 發(fā)現(xiàn)這套數(shù)據(jù)中的 metadata 中沒有 well 信息
metadata$plate[1] # 好在有 plate信息
所以稍微進行處理一下
# 將 count 矩陣的名字拆開自己構(gòu)建 well 信息矩陣
library(stringr)
wellInfo = str_split(colnames(raw.data),'_',simplify = T)%>% as.data.frame()
colnames(wellInfo) = c("well", "plate")
rownames(wellInfo) = paste0(wellInfo[,1],'_',wellInfo[,2])
# 修改一下 metadata 的行名
wellInfo[1:4,]
metadata = left_join(wellInfo, metadata, by = "plate")
metadata[1:4,]
wellInfo[1:4,]
rownames(metadata) = rownames(wellInfo)
# 判斷一下是否一致
identical(colnames(raw.data),rownames(metadata))
> metadata[1:5,1:5]
well plate sample_name patient_id sample_type
A10_B000561 A10 B000561 AZ_04 AZ008_NAT NAT
A10_B000563 A10 B000563 AZ_04 AZ008_NAT NAT
A10_B000568 A10 B000568 AZ_05 AZ008 tumor
A10_B001544 A10 B001544 AZ_01 AZ003 tumor
A10_B001546 A10 B001546 AZ_01 AZ003 tumor
接下來的步驟和上面完全一樣
# ERCC 是外源 RNA 主要的作用就是用來質(zhì)控的
erccs = grep('^ERCC-', x= rownames(x=raw.data),value = T) # value = T 獲取名字
# 計算 ercc 比例
percent.ercc = Matrix::colSums(raw.data[erccs, ])/Matrix::colSums(raw.data)
fivenum(percent.ercc) # 查看前 5 個結(jié)果
ercc.index = grep(pattern = "^ERCC-", x = rownames(x = raw.data), value = FALSE) # 獲取 index
# 刪除 Count 矩陣中的 ERCC
raw.data = raw.data[-ercc.index,]
dim(raw.data)
# [1] 26485 3507
# 構(gòu)建對象
# 構(gòu)建對象
# min.cells 某一個基因至少在多少個基因中表達
# min.features 某個細胞至少表達多少個基因
min.cells = 0 # 3
min.features = 0 # 150 #
sce = CreateSeuratObject(counts = raw.data,metadata = metadata,min.cells =min.cells,min.features =min.features)
# 添加metadata 和 外源基因信息
sce = AddMetaData(object = sce,metadata = metadata)
sce = AddMetaData(object = sce, percent.ercc, col.name = "percent.ercc")
# 2.數(shù)據(jù)清洗
# 數(shù)據(jù)清洗原則
# 1\. 不符合質(zhì)量的
# 2\. 過濾線粒體 DNA
# 3\. 過濾外原DNA
# 在過濾前先計算
table(grepl("^MT-",rownames(sce)))
table(grepl("^RP[SL][[:digit:]]",rownames(sce)))
sce[["percent.mt"]] = PercentageFeatureSet(sce, pattern = "^MT-")
sce[["percent.rp"]] = PercentageFeatureSet(sce, pattern = "^RP[SL][[:digit:]]")
dim(sce)
summary(sce[["nCount_RNA"]])
summary(sce[["nFeature_RNA"]])
summary(sce[["percent.mt"]] )
summary(sce[["percent.ercc"]] )
summary(sce[["percent.rp"]] )
# 過濾篩選
# 一般根據(jù)實際情況進行篩選绕德,這次的篩選主要還是根據(jù)文章里面的流程
myData2 = subset(x=sce, subset = nCount_RNA > 50000 & nFeature_RNA > 500)
myData2
# 26485 features across 2070 samples within 1 assay
save(myData2,file='myData2.Rdata')
第一部分結(jié)束,準備進入第二部分
第二部分 seurat 標準流程
小提琴圖觀察數(shù)據(jù)分布情況
rm(list = ls())
# 小提琴圖可視化
load("myData1.Rdata")
load("myData2.Rdata")
allData = merge(x = myData1, y = myData2)
table(allData@meta.data$sample_name) %>% length()
VlnPlot(allData, features = 'nFeature_RNA', group.by = 'sample_name')
接下來就是走流程 先觀察一下數(shù)據(jù)的分布情況
rm(list = ls())
# 小提琴圖可視化
load("allData.Rdata")
raw_sce = allData
rm(allData)
plot1 = FeatureScatter(raw_sce, feature1 = "nCount_RNA", feature2 = "percent.ercc")
plot2 = FeatureScatter(raw_sce, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
CombinePlots(plots = list(plot1, plot2))
流程: 1. log : NormalizeData 2. 找特征 : FindVariableFeatures 3. 標準化 : ScaleData 4. pca : RunPCA 5. 構(gòu)建圖 : FindNeighbors 6. 聚類 : FindClusters 7. tsne /umap : RunTSNE RunUMAP 8. 差異基因 : FindAllMarkers / FindMarkers
VlnPlot(raw_sce, features = c("percent.rp", "percent.ercc"), ncol = 2)
# 開始走流程
# 1.log
sce = raw_sce
rm(raw_sce)
sce = NormalizeData(object = sce,normalization.method = "LogNormalize", scale.factor = 1e6)
# 2\. 特征提取
sce = FindVariableFeatures(object = sce,selection.method = "vst", nfeatures = 2000)
# 所有特征
VariableFeatures(sce)
top20=head(VariableFeatures(sce),20)# 提取差異最大的 top20 基因
plot1= VariableFeaturePlot(sce)
plot2=LabelPoints(plot = plot1, points = top20, repel = TRUE)
plot2
# 3.標準化
sce = ScaleData(object = sce)
# 4\. PCA
sce = RunPCA(object = sce, do.print = FALSE)
# 可視化
ElbowPlot(sce)
VizDimLoadings(sce, dims = 1:5, reduction = "pca", nfeatures = 10)
DimHeatmap(sce, dims = 1:10, cells = 100, balanced = TRUE)
# 5.構(gòu)建圖
#細胞聚類
#首先我們在PCA空間里根據(jù)歐氏距離構(gòu)建一個KNN圖摊阀,并在任意兩個細胞間要確定它們的邊緣權(quán)重耻蛇,這個過程#用FindNeighbors函數(shù),這里的input就是前面我們定義的dataset的維數(shù)胞此。
# 選 20 個主成分
sce= FindNeighbors(sce, dims = 1:20)
# 6\. 聚類
#再用FindClusters函數(shù)臣咖,該函數(shù)有一個“分辨率”的參數(shù),該參數(shù)設(shè)置下游聚類的“粒度”漱牵,值
#越高夺蛇,得到的聚類數(shù)越多。這個參數(shù)設(shè)置在0.4-1.2之間酣胀,
#對于3千個左右的單細胞數(shù)據(jù)通常會得到
#比較好的結(jié)果刁赦。對于較大的數(shù)據(jù)集娶聘,最佳分辨率通常會增加。
sce = FindClusters(sce, resolution = 0.5)
# 小劇場
# 不同的 resolution 分群的結(jié)果不一樣
# 下面介紹兩種可視化方法
# 首先先用兩種不同的resolution 進行計算
sce = FindClusters(sce, resolution = 0.2)
sce = FindClusters(sce, resolution = 0.8)
library(gplots)
tab.1=table(sce@meta.data$RNA_snn_res.0.2,sce@meta.data$RNA_snn_res.0.8)
# 這個是第一幅圖
balloonplot(tab.1)
# Check clustering stability at given resolution
# Set different resolutions
res.used = c(0.2,0.8)
res.used
# Loop over and perform clustering of different resolutions
for(i in res.used){
sce = FindClusters(object = sce, verbose = T, resolution = res.used)
}
# Make plot
library(clustree)
clustree(sce) +
theme(legend.position = "bottom") +
scale_color_brewer(palette = "Set1") +
scale_edge_color_continuous(low = "grey80", high = "red")
# 把數(shù)據(jù)刪除
sce@meta.data = sce@meta.data[,-grep("snn_res.",ids,value = T)]
# 然后用文章給的 0.5
sce = FindClusters(sce, resolution = 0.5)
這兩張圖顯示了不同的分辨率情況下的分類結(jié)果(這兩張圖要是看不懂就真的極其的過分)
resolution 等于0.2的時候分了 19 類甚脉,等于 0.8 的時候分成了 31 類
# 7.tsne
#Seurat提供了幾種非線性的降維技術(shù)丸升,如tSNE和UMAP。這些算法的目標是在低維空間中將相似的#細胞放在一起宦焦。上面所確定的基于圖的集群中的單元應(yīng)該在這些降維圖上共同定位发钝。作為
#UMAP和tSNE的input顿涣,我們建議使用與聚類分析的輸入相同的PCs作為輸入波闹。
sce=RunTSNE(sce,dims.use = 1:20) ##tsne降維
sce=RunUMAP(sce,dims = 1:20) ##umap降維
##可視化
p1=DimPlot(sce, reduction = "tsne")
p1
p2=DimPlot(sce, reduction = "umap")
p2
CombinePlots(plots =list(p1, p2))
# 用 ggplot 可視化
phe=data.frame(cell=rownames(sce@meta.data),
cluster =sce@meta.data$seurat_clusters)
head(phe)
table(phe$cluster)
tsne_pos=Embeddings(sce,'tsne')
head(phe)
table(phe$cluster)
head(tsne_pos)
dat=cbind(tsne_pos,phe)
library(ggplot2)
p=ggplot(dat,aes(x=tSNE_1,y=tSNE_2,color=cluster))+geom_point(size=0.95)
p=p+stat_ellipse(data=dat,aes(x=tSNE_1,y=tSNE_2,fill=cluster,color=cluster),
geom = "polygon",alpha=0.2,level=0.9,type="t",linetype = 2,show.legend = F)+coord_fixed()
print(p)
theme= theme(panel.grid =element_blank()) + ## 刪去網(wǎng)格
theme(panel.border = element_blank(),panel.background = element_blank()) + ## 刪去外層邊框
theme(axis.line = element_line(size=1, colour = "black"))
p=p+theme+guides(colour = guide_legend(override.aes = list(size=5)))
print(p)
# umpa 也是一樣的就不畫了
# 8.差異基因
# 尋找某幾個簇之間的差異基因
#cluster5.markers = FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
# 找全部的差異基因(要非常久)
sce.markers <- FindAllMarkers(object = sce, only.pos = TRUE, min.pct = 0.25,
thresh.use = 0.25)
DT::datatable(sce.markers)
write.csv(sce.markers,file=paste0(pro,'_sce.markers.csv'))
library(dplyr)
top10 <- sce.markers %>% group_by(cluster) %>% top_n(10, avg_logFC)
# 這個圖要畫太久,所以只選 15 個
DoHeatmap(sce,top10$gene[1:15],size=3)
ggsave(filename=paste0(pro,'_sce.markers_heatmap.pdf'))
sce.first=sce
save(sce.first,sce.markers,file = 'first_sce.Rdata')
rm(list=ls())
options(stringsAsFactors = F)
library(Seurat)
load(file = 'first_sce.Rdata')
sce = sce.first
rm(sce.first)
# 這個圖要畫非常就涛碑,所一就畫一幅基于可以了
# 這個圖是看合個組分之間的差異
for( i in 1 ){
markers_df <- FindMarkers(object = sce, ident.1 = i, min.pct = 0.25)
print(x = head(markers_df))
markers_genes = rownames(head(x = markers_df, n = 5))
VlnPlot(object = sce, features =markers_genes,log =T )
FeaturePlot(object = sce, features=markers_genes )
}
第二部分結(jié)束精堕,準備進入第三部分
第三部分免疫細胞識別
免疫細胞的識別,一般情況下分為兩種
第一種就是你自己一直知道某些基因和自己要找的特定的免疫細胞相關(guān)
這里你可以參考文獻蒲障,也可以自己定義歹篓,非常的自由
這種方法不止針對免疫細胞,而且針對所有你自己想 diy 識別的細胞揉阎,只要能夠找到靶dna
例如以下這些代表基因:
epithelial/cancer (EpCAM+,EPCAM),
immune (CD45+,PTPRC),
stromal (CD10+,MME,fibo or CD31+,PECAM1,endo)
rm(list=ls())
options(stringsAsFactors = F)
library(Seurat)
library(ggplot2)
load(file = 'first_sce.Rdata')
sce=sce.first
rm(sce.first)
genes_to_check = c("PTPRC","EPCAM",'PECAM1','MME',"CD3G","CD3E", "CD79A")
p = DotPlot(sce, features = genes_to_check,
assay='RNA' )
p = p$data
p = p[which((p$pct.exp>50) & p$avg.exp.scaled>-0.5),]
imm = p[which(p$features.plot == 'PTPRC'),]$id %>% as.character()
epi = p[which(p$features.plot == 'EPCAM'),]$id %>% as.character()
stromal=setdiff(as.character(0:27),c(imm,epi));
例如按照文章的定義
genes_to_check = c("PTPRC","EPCAM","CD3G","CD3E", "CD79A", "BLNK","MS4A1", "CD68", "CSF1R",
"MARCO", "CD207", "PMEL", "ALB", "C1QB", "CLDN5", "FCGR3B", "COL1A1")
# All on Dotplot
p = DotPlot(sce, features = genes_to_check) + coord_flip()
p
#然后根據(jù)自定義規(guī)則區(qū)分細胞
sce@meta.data$immune_annotation = ifelse(sce@meta.data$seurat_clusters %in% imm ,'immune',
ifelse(sce@meta.data$seurat_clusters %in% epi ,'epi','stromal') )
# MAke a table
table(sce@meta.data$immune_annotation)
p = TSNEPlot(object = sce, group.by = 'immune_annotation')
p
genes_to_check = c("PTPRC","EPCAM","CD3G","CD3E", "CD79A", "BLNK","MS4A1", "CD68", "CSF1R",
"MARCO", "CD207", "PMEL", "ALB", "C1QB", "CLDN5", "FCGR3B", "COL1A1")
# All on Dotplot
p = DotPlot(sce, features = genes_to_check,group.by = 'immune_annotation') + coord_flip()
p
# Generate immune and non-immune cell lists
phe=sce@meta.data
table(phe$immune_annotation)
save(phe,file = 'phe-of-first-anno.Rdata')
其實庄撮,細胞注釋到這里已經(jīng)結(jié)束了,其實不難發(fā)現(xiàn)毙籽,細胞注釋這個流程洞斯,沒有什么難的,只需要做兩件事情
- 定義基因坑赡,根據(jù)基因集合畫點圖 DotPlot + coord_flip()
- 根據(jù)點圖自己定義什么是什么細胞
那么到這里的話基本上面免疫細胞大群體已經(jīng)區(qū)分開了烙如,下一步就是直接對大群體內(nèi)部細分
rm(list=ls())
options(stringsAsFactors = F)
library(Seurat)
library(ggplot2)
load(file = 'first_sce.Rdata')
load(file = 'phe-of-first-anno.Rdata')
sce=sce.first
table(phe$immune_annotation)
sce@meta.data=phe
# 既然是對免疫群體內(nèi)部分類,則只需要提取免疫細胞就行
cells.use = row.names(sce@meta.data)[which(phe$immune_annotation=='immune')]
length(cells.use)
sce =subset(sce, cells=cells.use)
sce
# 走標準流程毅否,一直走到 TSNE
#1.Normalize data
require(Seurat)
sce = NormalizeData(object = sce, normalization.method = "LogNormalize",scale.factor = 1e6)
#2.Find variable genes
sce = FindVariableFeatures(object = sce, selection.method = "vst", nfeatures = 2000)
#3.Scale data
sce = ScaleData(object = sce)
#4.Perform PCA
sce = RunPCA(object = sce, pc.genes = VariableFeatures(sce))
ElbowPlot(sce)
VizDimLoadings(sce, dims = 1:5, reduction = "pca", nfeatures = 10)
DimHeatmap(sce, dims = 1:10, cells = 100, balanced = TRUE)
pca.obj = sce@reductions$pca
pc.coords = pca.obj@cell.embeddings
df1 = sce@meta.data[,c("nFeature_RNA","nCount_RNA","percent.rp")]
df2 = pc.coords[,c(1:10)]
cordf12 = cor(df1,df2)
# Make a correlation plot
library(corrplot)
# 補一張相關(guān)性圖
corrplot(cordf12, method = "number", main="Correlation of PCs and metadata")
#5\. Construct Neighbor graph
sce = FindNeighbors(object = sce, dims = 1:15)
#6\. 分辨率
tiss_subset = FindClusters(object = tiss_subset, verbose = T, resolution = 1)
#7\. Run and project TSNEs
sce = RunTSNE(sce, dims = 1:15)
# 可視化
DimPlot(sce, reduction = "tsne", label = T)
DimPlot(sce,reduction = "tsne",label=T, group.by = "patient_id")
TSNEPlot(object = sce, do.label = F, group.by = "driver_gene")
TSNEPlot(object = sce, do.label = F, group.by = "patient_id")
tab=table(sce@meta.data$seurat_clusters,sce@meta.data$patient_id)
balloonplot(tab)
根據(jù)參考數(shù)據(jù)庫自動化注釋
sce_for_SingleR = GetAssayData(sce, slot="data")
sce_for_SingleR
library(SingleR)
# 根據(jù)數(shù)據(jù)庫進行自動注釋
hpca.se = HumanPrimaryCellAtlasData() # 這個要搞非常久
hpca.se
clusters=sce@meta.data$seurat_clusters
clusters[1:5]
# test 稀疏矩陣
# ref 參考數(shù)據(jù)庫
table( hpca.se$label.main)
pred.hesc = SingleR(test = sce_for_SingleR, ref = hpca.se, labels = hpca.se$label.main,
method = "cluster", clusters = clusters,
assay.type.test = "logcounts", assay.type.ref = "logcounts")
table(pred.hesc$labels)
celltype = data.frame(ClusterID=rownames(pred.hesc), celltype=pred.hesc$labels, stringsAsFactors = F)
sce@meta.data$singleR=celltype[match(clusters,celltype$ClusterID),'celltype']
DimPlot(sce, reduction = "tsne", group.by = "singleR")
phe=sce@meta.data
table(phe$singleR)
save(phe,file = 'phe-of-subtypes-Immune-by-singleR.Rdata')
小結(jié)
那么這里再一次的簡單總結(jié)一下細胞注釋的流程(并不只是免疫細胞亚铁,任何的都可以):
首先我們需要找出和要注釋的細胞的相關(guān)的基因
graph TB
找出注釋細胞相關(guān)基因-->繪制點圖
繪制點圖-->根據(jù)自定義閾值確定細胞類型
根據(jù)自定義閾值確定細胞類型-->提取感興趣的類
提取感興趣的類-->走標準流程1-7
走標準流程1-7-->數(shù)據(jù)庫自動化注釋
不同的注釋庫的功能不太一樣(這些都是大佬整理好的,講實話我很好奇這些數(shù)據(jù)庫都是哪兒找的)
?MonacoImmuneData
?BlueprintEncodeData
?DatabaseImmuneCellExpressionData
?NovershternHematopoieticData
?MonacoImmuneData
?ImmGenData
?MouseRNAseqData
?HumanPrimaryCellAtlasData
其他的細胞類型注釋流程幾乎完全一樣螟加,如果后續(xù)需要個性化分析徘溢,只需要提取出自己識別好的類,然后在這些類中繼續(xù)走標準流程 1-7捆探,之后還是使用點圖進行驗證然爆,單純的只是多了個 group.by 參數(shù)而已
第三部分結(jié)束,準備進入第四部分
第四部分癌細胞識別
這個癌細胞識別略微小復(fù)雜徐许,通常的流程是在已知的細胞類別上繼續(xù)進行分類施蜜。
這種分類方式大部分也都是屬于人工自己篩選的
例如在上皮細胞中識別癌細胞和非癌細胞
rm(list=ls())
options(stringsAsFactors = F)
library(Seurat)
library(ggplot2)
load(file = 'first_sce.Rdata')
load(file = 'phe-of-first-anno.Rdata')
sce=sce.first
rm(sce.first)
table(phe$immune_annotation)
sce@meta.data=phe
# 那么第一步就是提取上皮細胞中的癌細胞
cells.use = row.names(sce@meta.data)[which(phe$immune_annotation=='epi')]
length(cells.use)
sce = subset(sce, cells=cells.use)
sce
# 走標準流程,一直走到 TSNE
#1.Normalize data
require(Seurat)
sce = NormalizeData(object = sce, normalization.method = "LogNormalize",scale.factor = 1e6)
#2.Find variable genes
sce = FindVariableFeatures(object = sce, selection.method = "vst", nfeatures = 2000)
#3.Scale data
sce = ScaleData(object = sce)
#4.Perform PCA
sce = RunPCA(object = sce, pc.genes = VariableFeatures(sce))
ElbowPlot(sce)
VizDimLoadings(sce, dims = 1:5, reduction = "pca", nfeatures = 10)
DimHeatmap(sce, dims = 1:10, cells = 100, balanced = TRUE)
pca.obj = sce@reductions$pca
pc.coords = pca.obj@cell.embeddings
df1 = sce@meta.data[,c("nFeature_RNA","nCount_RNA","percent.rp")]
df2 = pc.coords[,c(1:10)]
cordf12 = cor(df1,df2)
# Make a correlation plot
library(corrplot)
# 補一張相關(guān)性圖
corrplot(cordf12, method = "number", main="Correlation of PCs and metadata")
#5\. Construct Neighbor graph
sce = FindNeighbors(object = sce, dims = 1:15)
#6\. 分辨率
tiss_subset = FindClusters(object = tiss_subset, verbose = T, resolution = 1)
#7\. Run and project TSNEs
sce = RunTSNE(sce, dims = 1:15)
# 可視化
DimPlot(sce, reduction = "tsne", label = T)
DimPlot(sce,reduction = "tsne",label=T, group.by = "patient_id")
TSNEPlot(object = sce, do.label = F, group.by = "driver_gene")
TSNEPlot(object = sce, do.label = F, group.by = "patient_id")
tab=table(sce@meta.data$seurat_clusters,sce@meta.data$patient_id)
balloonplot(tab)
這里的話就存在一個前提假設(shè)雌隅,如果細胞會跨患者聚類則這些細胞不是癌細胞
tab = tab %>% as.matrix()
ids = apply(tab,1,function(x){ x/sum(x) })
nonCancer = which(apply(ids,2,max)<0.85) %>% colnames(ids)[.]
sce@meta.data$cancer <-ifelse(sce@meta.data$seurat_clusters %in% nonCancer ,'non-cancer','cancer')
# MAke a table
table(sce@meta.data$cancer)
phe=sce@meta.data
# cancer marker CEACAM6, cell-cycle-related gene CCND2, and apoptosis-related gene BAX
genes_to_check = c("CEACAM6","CCND2","BAX")
# All on Dotplot
p <- DotPlot(sce, features = genes_to_check,group.by = 'cancer') + coord_flip()
p
save(phe,file = 'phe-of-cancer-or-not.Rdata')
其實到這里癌細胞注釋和上一步的免疫細胞注釋幾乎一樣
所以下面介紹 inferCNV 的算法
在使用 infercnv其實只有 2 行代碼而已
重要的是引導(dǎo)文件的構(gòu)建
1. 第一個引導(dǎo)文件是 count 矩陣 2. 第二個引導(dǎo)文件是 label 信息 (第一列是細胞 ID翻默;第二列是細對應(yīng)的細胞類型缸沃,這個數(shù)據(jù)是由前面的分析結(jié)果得到的) 3. 第三個引導(dǎo)文件是樣本染色體或基因信息(這個獲取方法有很多種,1.抄大佬代碼(推薦)修械,2.自己下載)
rm(list=ls())
options(stringsAsFactors = F)
library(Seurat)
library(ggplot2)
load(file = 'first_sce.Rdata')
load(file = 'phe-of-first-anno.Rdata')
sce=sce.first
table(phe$immune_annotation)
sce@meta.data=phe
table(phe$immune_annotation,phe$seurat_clusters)
# BiocManager::install("infercnv")
library(infercnv)
##### inferCNV引導(dǎo)文件的構(gòu)建
# 上皮細胞 count 矩陣
epi.cells = row.names(sce@meta.data)[which(phe$immune_annotation=='epi')]
epiMat=as.data.frame(GetAssayData(subset(sce,
cells=epi.cells), slot='counts'))
# 間質(zhì)細胞
cells.use = row.names(sce@meta.data)[which(phe$immune_annotation=='stromal')]
sce =subset(sce, cells=cells.use)
load(file = 'phe-of-subtypes-stromal.Rdata')
fib.cells = row.names(sce@meta.data)[phe$singleR=='Fibroblasts']
endo.cells = row.names(sce@meta.data)[phe$singleR=='Endothelial_cells']
# 成纖細胞 count 矩陣
fibMat=as.data.frame(GetAssayData(subset(sce,
cells=fib.cells), slot='counts'))
# 內(nèi)皮細胞 count 矩陣
endoMat=as.data.frame(GetAssayData(subset(sce,
cells=endo.cells), slot='counts'))
#合并
dat=cbind(epiMat,fibMat,endoMat)
ids1 = data.frame(cellId = colnames(epiMat))
ids2 = data.frame(cellId = colnames(fibMat))
ids3 = data.frame(cellId = colnames(endoMat))
ids1$cellType = 'epi'
ids2$cellType = 'Fibroblasts'
ids3$cellType = 'Endothelial_cells'
data.frame(ids1,ids2,ids3)
groupFiles = rbind(ids1,ids2) %>% rbind(.,ids3) %>% as.data.frame
rm(epiMat,fibMat,endoMat,ids1,ids2,ids3)
library(AnnoProbe)
geneInfor=annoGene(rownames(dat),"SYMBOL",'human')
geneInfor=geneInfor[with(geneInfor, order(chr, start)),c(1,4:6)]
geneInfor=geneInfor[!duplicated(geneInfor[,1]),]
## 這里可以去除性染色體
# 也可以把染色體排序方式改變
dat=dat[rownames(dat) %in% geneInfor[,1],]
dat=dat[match( geneInfor[,1], rownames(dat) ),]
dim(dat)
expFile='expFile.txt'
write.table(dat,file = expFile,sep = '\t',quote = F)
groupFiles='groupFiles.txt'
head(groupinfo)
write.table(groupinfo,file = groupFiles,sep = '\t',quote = F,col.names = F,row.names = F)
head(geneInfor)
geneFile='geneFile.txt'
write.table(geneInfor,file = geneFile,sep = '\t',quote = F,col.names = F,row.names = F)
table(groupinfo[,2])
以上就是構(gòu)建三個引導(dǎo)文件的代碼
最終引導(dǎo)文件分別張這樣:
> dat[1:5,1:5] #count 矩陣
A10_1001000407 A10_1001000408 A10_1001000412 A10_B000863 A10_B001470
DDX11L1 0 0 0 0 0
WASH7P 0 0 0 0 0
MIR6859-1 0 0 0 0 0
MIR1302-2 0 0 0 0 0
FAM138A 0 0 0 0 0
> groupinfo[1:5,]# 分組矩陣
cellId cellType
1 A10_1001000407 epi
2 A10_1001000408 epi
3 A10_1001000412 epi
4 A10_B000863 epi
5 A10_B001470 epi
> geneInfor[1:5,]#染色體或基因信息
SYMBOL chr start end
1 DDX11L1 chr1 11869 14409
2 WASH7P chr1 14404 29570
3 MIR6859-1 chr1 17369 17436
5 MIR1302-2 chr1 30366 30503
6 FAM138A chr1 34554 36081
然后就直接跑流程就可以
# 這個注意一下這個流程跑起來非常慢趾牧,不知道是我的電腦的問題還是怎么回事
# 大概是我中午睡了一覺起來還沒好
rm(list=ls())
options(stringsAsFactors = F)
library(Seurat)
library(ggplot2)
library(infercnv)
expFile='expFile.txt'
groupFiles='groupFiles.txt'
geneFile='geneFile.txt'
infercnv_obj = CreateInfercnvObject(raw_counts_matrix=expFile,
annotations_file=groupFiles,
delim="\t",
gene_order_file= geneFile,
ref_group_names=c('Fibroblasts','Endothelial_cells')) ## 這個是 normalcell 的 grouplabel
## 文獻的代碼:
infercnv_obj2 = infercnv::run(infercnv_obj,
cutoff=1,
out_dir= 'plot_out/inferCNV_output2' ,
cluster_by_groups=F, # cluster
hclust_method="ward.D2", plot_steps=F)
### 可以添加 denoise=TRUE以及 HMM=TRUE
最后就是 inferCNV 的結(jié)果解讀了
link1
link2
全文總結(jié)
其實,單細胞從頭到尾的流程就那么幾個代碼肯污,感覺有 70%以上的都是重復(fù)的 1. 數(shù)據(jù)預(yù)處理翘单,確保 count 數(shù)據(jù)能夠和 metadata 一一對應(yīng) 2. 計算外源RNA的百分比,添加到 metadata蹦渣;刪除 count 矩陣中的外源 RNA 信息 3. 構(gòu)建 seurat 對象哄芜,計算 MT,RP 百分比信息 4. 根據(jù)相應(yīng)的信息自主選擇過濾柬唯,一般都是過濾基因认臊,細胞,MT 百分比锄奢,RP 百分比 5. 第一輪基礎(chǔ)流程失晴,然后保存數(shù)據(jù) 6. 后續(xù)的所謂的個性化分析都是基于特定的細胞群體 7. 細胞群體的識別方法按這個流程的話只有一種,就是找到特定的基因拘央,畫點圖然后提取 8. 反復(fù)的根據(jù)提取的子細胞群落用第 5 步存下來的數(shù)據(jù)一直跑基礎(chǔ)流程涂屁,一直畫點圖,TSNE 9. 如果做 inferCNV 只要做好引導(dǎo)文件即可灰伟,除了修改ref_group_names
以外不懂的情況下一個字別改
Best Regards,
Yuan.SH
---------------------------------------
School of Basic Medical Sciences,
Fujian Medical University,
Fuzhou, Fujian, China.
please contact with me via the following ways:
(a) e-mail :yuansh3354@163.com