Seurat軟件使用操作教程

1哥倔、載入R軟件

(base) ┌─[shpc_a98ef85f8f@SHPC-Pro-1]─[~/SingleronTest/ligang/data/hg19/pbmc3k]
└──? $R
R version 4.0.2 (2020-06-22) -- "Taking Off Again"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-conda_cos6-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

2 下載Seurat軟件

> install.packages("Seurat")
> library(dplyr)

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Warning message:
package 'dplyr' was built under R version 4.0.5 
> library(Seurat)
Attaching SeuratObject
> library(patchwork)
Warning message:
package 'patchwork' was built under R version 4.0.3 
> library(tidyverse)
Registered S3 method overwritten by 'cli':
  method     from         
  print.boxx spatstat.geom
-- Attaching packages ------------------------------------------------------------------------- tidyverse 1.3.1 --
v ggplot2 3.3.5     v purrr   0.3.4
v tibble  3.1.5     v stringr 1.4.0
v tidyr   1.1.4     v forcats 0.5.1
v readr   2.0.2     
-- Conflicts ---------------------------------------------------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
Warning messages:
1: package 'ggplot2' was built under R version 4.0.5 
2: package 'purrr' was built under R version 4.0.3 
3: package 'stringr' was built under R version 4.0.5 
4: package 'forcats' was built under R version 4.0.3

3 數(shù)據(jù)下載

Peripheral Blood Mononuclear Cells (PBMC)10X Genomics dataset page提供的一個(gè)數(shù)據(jù)矿咕,包含2700個(gè)單細(xì)胞吊输,出自Illumina NextSeq 500平臺(tái)育瓜。

PBMCs是來(lái)自健康供體具有相對(duì)少量RNA(around 1pg RNA/cell)的原代細(xì)胞似袁。在Illumina NextSeq 500平臺(tái)存璃,檢測(cè)到2700個(gè)單細(xì)胞淋叶,每個(gè)細(xì)胞獲得69000 reads阎曹。

1. tar -xvf pbmc3k_filtered_gene_bc_matrices.tar
#文件夾下包含3個(gè)文件   這些文件都在操作的當(dāng)前路徑,所以文件路徑輸入“./”即可爸吮。
barcodes.tsv.gz
features.tsv.gz
matrix.mtx.gz

matrix.mtx:matrix.mtx 是 MatrixMarket格式文件芬膝;更多內(nèi)容見(jiàn):http://math.nist.gov/MatrixMarket/formats.html

  • 文件中儲(chǔ)存非零值;

  • 注釋使用%標(biāo)記;

  • 第一行包含文件中總行數(shù)形娇,總列數(shù)锰霜,總的記錄數(shù)

  • 每行中提供記錄的所處的行號(hào)和列號(hào),已經(jīng)記錄的內(nèi)容

4 讀入10X數(shù)據(jù)

> pbmc.data <- Read10X('./')
> dim(pbmc.data)
[1] 32738  2700

總共得到2700個(gè)細(xì)胞和13714個(gè)基因桐早。
count matrix長(zhǎng)什么樣呢癣缅?我們可以看

# 首先看看三個(gè)基因的count值
> pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]
3 x 30 sparse Matrix of class "dgCMatrix"
   [[ suppressing 30 column names 'AAACATACAACCAC-1', 'AAACATTGAGCTAC-1', 'AAACATTGATCAGC-1' ... ]]
                                                                   
CD3D  4 . 10 . . 1 2 3 1 . . 2 7 1 . . 1 3 . 2  3 . . . . . 3 4 1 5
TCL1A . .  . . . . . . 1 . . . . . . . . . . .  . 1 . . . . . . . .
MS4A1 . 6  . . . . . . 1 1 1 . . . . . . . . . 36 1 2 . . 2 . . . .
> summary(colSums(pbmc.data))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    548    1758    2197    2367    2763   15844 

查看每個(gè)細(xì)胞有多少基因被檢測(cè)到

其中的.表示0,即no molecules detected哄酝。當(dāng)然友存,這個(gè)地方還有另外一種含義就是這個(gè)基因是真的沒(méi)有表達(dá)。

由于單細(xì)胞測(cè)序數(shù)據(jù)中大多數(shù)的值都為0陶衅,因此屡立,seurat使用一個(gè)稀疏矩陣來(lái)保存測(cè)序得到的count matrix,這樣有利于數(shù)據(jù)存儲(chǔ)空間的節(jié)省搀军。

我們來(lái)看看使用稀疏矩陣和使用0來(lái)存儲(chǔ)兩種方式的大小對(duì)比膨俐。

> dense.size <- object.size(as.matrix(pbmc.data))
> dense.size
709591472 bytes
> sparse.size <- object.size(pbmc.data)
> sparse.size
29905192 bytes

dense為轉(zhuǎn)換為0后的matrix存儲(chǔ)大小,709591472 bytes罩句,
sparse為.即稀疏矩陣的大小焚刺,29905192 bytes。
兩者比值為23.7 倍门烂,即使用系數(shù)矩陣來(lái)存儲(chǔ)單細(xì)胞水平的基因表達(dá)值非常節(jié)省空間乳愉。
使用pbmc數(shù)據(jù)初始化Seurat對(duì)象

> pbmc
An object of class Seurat 
13714 features across 2700 samples within 1 assay 
Active assay: RNA (13714 features, 0 variable features)
> head(pbmc$RNA@data[,1:5])
6 x 5 sparse Matrix of class "dgCMatrix"
              AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1
AL627309.1                   .                .                .
AP006222.2                   .                .                .
RP11-206L10.2                .                .                .
RP11-206L10.9                .                .                .
LINC00115                    .                .                .
NOC2L                        .                .                .
              AAACCGTGCTTCCG-1 AAACCGTGTATGCG-1
AL627309.1                   .                .
AP006222.2                   .                .
RP11-206L10.2                .                .
RP11-206L10.9                .                .
LINC00115                    .                .
NOC2L                        .                .

5 單細(xì)胞數(shù)據(jù)分析預(yù)處理

預(yù)處理主要包括基于QC指標(biāo)的細(xì)胞和基因過(guò)濾,數(shù)據(jù)標(biāo)準(zhǔn)化和歸一化屯远,高變基因選擇蔓姚。

5.1首先是QC來(lái)篩選高質(zhì)量的細(xì)胞

一般篩選條件有三個(gè):

  • 1.每個(gè)細(xì)胞中檢測(cè)到的唯一基因數(shù)
    • 低質(zhì)量的細(xì)胞或者空的droplet液滴通常含有很少的基因
    • Cell doublets雙胞體或多胞體含有很高的異常的gene counts
  • 2.每個(gè)細(xì)胞中檢測(cè)到的分子總數(shù)
  • 3.線粒體基因含量比例
    • 低質(zhì)量或者死亡細(xì)胞含有很高的線粒體基因
    • 使用PercentageFeatureSet()計(jì)算一個(gè)特征的比例
    • MT-開(kāi)頭的基因認(rèn)為是線粒體基因
# The [[ operator can add columns to object metadata. This is a great place to stash QC stats
# 查看QC指標(biāo)
# Show QC metrics for the first 5 cells
> head(pbmc@meta.data, 5)
                 orig.ident nCount_RNA nFeature_RNA percent.mt
AAACATACAACCAC-1     pbmc3k       2419          779  3.0177759
AAACATTGAGCTAC-1     pbmc3k       4903         1352  3.7935958
AAACATTGATCAGC-1     pbmc3k       3147         1129  0.8897363
AAACCGTGCTTCCG-1     pbmc3k       2639          960  1.7430845
AAACCGTGTATGCG-1     pbmc3k        980          521  1.2244898

我們將使用以下標(biāo)準(zhǔn)進(jìn)行基因過(guò)濾:
We filter cells that have unique feature counts over 2,500 or less than 200
We filter cells that have >5% mitochondrial counts
過(guò)濾前三個(gè)指標(biāo)可視化

> pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')
> pbmc
An object of class Seurat 
13714 features across 2700 samples within 1 assay 
Active assay: RNA (13714 features, 0 variable features)
> head(pbmc$RNA@data[,1:5])
6 x 5 sparse Matrix of class "dgCMatrix"
              AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1
AL627309.1                   .                .                .
AP006222.2                   .                .                .
RP11-206L10.2                .                .                .
RP11-206L10.9                .                .                .
LINC00115                    .                .                .
NOC2L                        .                .                .
              AAACCGTGCTTCCG-1 AAACCGTGTATGCG-1
AL627309.1                   .                .
AP006222.2                   .                .
RP11-206L10.2                .                .
RP11-206L10.9                .                .
LINC00115                    .                .
NOC2L                        .                .
> pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
> head(pbmc@meta.data, 5)
                 orig.ident nCount_RNA nFeature_RNA percent.mt
AAACATACAACCAC-1     pbmc3k       2419          779  3.0177759
AAACATTGAGCTAC-1     pbmc3k       4903         1352  3.7935958
AAACATTGATCAGC-1     pbmc3k       3147         1129  0.8897363
AAACCGTGCTTCCG-1     pbmc3k       2639          960  1.7430845
AAACCGTGTATGCG-1     pbmc3k        980          521  1.2244898
> VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

我們就出了小提請(qǐng)圖。
查看基因數(shù)目, 線粒體基因占比與UMI數(shù)目的關(guān)系

> plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
> plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
> lllg <- plot1 + plot2
> lllg
> ggsave('lllg.pdf')
Saving 7 x 7 in image

6 質(zhì)控

篩選檢測(cè)到基因數(shù)目超過(guò)2500或低于200的細(xì)胞
單個(gè)細(xì)胞中線粒體基因數(shù)目占比超過(guò)>5%

> pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
> pbmc
An object of class Seurat 
13714 features across 2638 samples within 1 assay 
Active assay: RNA (13714 features, 0 variable features)

6.1數(shù)據(jù)標(biāo)準(zhǔn)化

默認(rèn)使用數(shù)據(jù)標(biāo)準(zhǔn)化方法是LogNormalize, 每個(gè)細(xì)胞總的表達(dá)量都標(biāo)準(zhǔn)化到10000慨丐,然后log取對(duì)數(shù)赂乐;結(jié)果存放于pbmc[["RNA"]]@data。
標(biāo)準(zhǔn)化前咖气,每個(gè)細(xì)胞總的表達(dá)量

> hist(colSums(pbmc$RNA@data),
+      breaks = 100,
+      main = "Total expression before normalisation",
+      xlab = "Sum of expression")

6.2 標(biāo)準(zhǔn)化后挨措,每個(gè)細(xì)胞總的表達(dá)量

> pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
Performing log-normalization
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
> hist(colSums(pbmc$RNA@data),
+      breaks = 100,
+      main = "Total expression after normalisation",
+      xlab = "Sum of expression")  

6.3 變化基因鑒定

鑒定在細(xì)胞間表達(dá)高度變化的基因,后續(xù)研究需要集中于這部分基因崩溪。Seurat內(nèi)置的FindVariableFeatures()函數(shù)浅役,首先計(jì)算每一個(gè)基因的均值和方差,并且直接模擬其關(guān)系伶唯。默認(rèn)返回2000個(gè)基因觉既。

> pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
Calculating gene variances
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating feature variances of standardized and clipped values
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
# 10個(gè)表達(dá)變化最為劇烈的基因
> top10 <- head(VariableFeatures(pbmc), 10) #head(pbmc$RNA@var.features,10)
> top10
 [1] "PPBP"   "LYZ"    "S100A9" "IGLL5"  "GNLY"   "FTL"    "PF4"    "FTH1"  
 [9] "GNG11"  "S100A8"
畫(huà)出表達(dá)變化的基因,從而觀察其分布
> plot1 <- VariableFeaturePlot(pbmc)
> ggsave('plot1.pdf')
Saving 7 x 7 in image
Warning messages:
1: Transformation introduced infinite values in continuous x-axis 
2: Removed 1 rows containing missing values (geom_point). 
畫(huà)出表達(dá)變化的基因乳幸,標(biāo)記前10個(gè)基因
> plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
When using repel, set xnudge and ynudge to 0 for optimal results
> ggsave('plot2.pdf')
Saving 7 x 7 in image
Warning messages:
1: Transformation introduced infinite values in continuous x-axis 
2: Removed 1 rows containing missing values (geom_point). 

畫(huà)出表達(dá)變化的基因瞪讼,標(biāo)記前10個(gè)基因

7 數(shù)據(jù)縮放

線性轉(zhuǎn)換縮放數(shù)據(jù),ScaleData()函數(shù)可以實(shí)現(xiàn)此功能粹断。

最終每個(gè)基因均值為0符欠,方差為1。

結(jié)果存放于pbmc[["RNA"]]@scale.data瓶埋。

> all.genes <- rownames(pbmc)
> pbmc <- ScaleData(pbmc, features = all.genes)
Centering and scaling data matrix
  |======================================================================| 100%

設(shè)置參數(shù)features是因?yàn)镾caleData默認(rèn)處理前面鑒定的差異基因希柿。這一步怎么做都不會(huì)影響到后續(xù)pca和聚類,但是會(huì)影響做熱圖养筒。

移除影響方差的因素曾撤。

> pbmc <- ScaleData(pbmc, vars.to.regress = "percent.mt")
Regressing out percent.mt
  |======================================================================| 100%
Centering and scaling data matrix
  |======================================================================| 100%

8 線性降維分析

8.1 PCA

對(duì)縮放后的數(shù)據(jù)進(jìn)行PCA分析,默認(rèn)使用前面鑒定表達(dá)變化大的基因晕粪。使用features參數(shù)可以重新定義數(shù)據(jù)集挤悉。

> pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))

VizDimReduction, DimPlot, 和 DimHeatmap可以從基因或細(xì)胞角度可視化pca結(jié)果
查看對(duì)每個(gè)主成分影響比較大的基因集

PC_ 1 
Positive:  CST3, TYROBP, LST1, AIF1, FTL, FTH1, LYZ, FCN1, S100A9, TYMP 
           FCER1G, CFD, LGALS1, LGALS2, SERPINA1, S100A8, CTSS, IFITM3, SPI1, CFP 
           PSAP, IFI30, COTL1, SAT1, S100A11, NPC2, GRN, LGALS3, GSTP1, PYCARD 
Negative:  MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, CTSW, STK17A, CD27 
           CD247, CCL5, GIMAP5, GZMA, AQP3, CST7, TRAF3IP3, SELL, GZMK, HOPX 
           MAL, MYC, ITM2A, ETS1, LYAR, GIMAP7, KLRG1, NKG7, ZAP70, BEX2 
PC_ 2 
Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74 
           HLA-DMA, HLA-DPB1, HLA-DQA2, CD37, HLA-DRB5, HLA-DMB, HLA-DPA1, FCRLA, HVCN1, LTB 
           BLNK, P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, FCGR2B, SMIM14, PPP1R14A, C16orf74 
Negative:  NKG7, PRF1, CST7, GZMA, GZMB, FGFBP2, CTSW, GNLY, B2M, SPON2 
           CCL4, GZMH, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX 
           TTC38, CTSC, APMAP, S100A4, IGFBP7, ANXA1, ID2, IL32, XCL1, RHOC 
PC_ 3 
Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPA1, HLA-DPB1, CD74, MS4A1, HLA-DRB1, HLA-DRA 
           HLA-DRB5, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, CD37, HVCN1, FCRLA, IRF8 
           PLAC8, BLNK, MALAT1, SMIM14, PLD4, IGLL5, SWAP70, P2RX5, LAT2, FCGR3A 
Negative:  PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, CLU 
           HIST1H2AC, AP001189.4, ITGA2B, CD9, TMEM40, PTCRA, CA2, ACRBP, MMD, TREML1 
           NGFRAP1, F13A1, SEPT5, RUFY1, TSC22D1, CMTM5, MPP1, MYL9, RP11-367G6.3, GP1BA 
PC_ 4 
Positive:  HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1, CD74, HLA-DPB1, HIST1H2AC, HLA-DPA1, HLA-DRB1 
           TCL1A, PF4, HLA-DQA2, SDPR, HLA-DRA, LINC00926, PPBP, GNG11, HLA-DRB5, SPARC 
           GP9, PTCRA, CA2, AP001189.4, CD9, NRGN, RGS18, GZMB, CLU, TUBB1 
Negative:  VIM, IL7R, S100A6, S100A8, IL32, S100A4, GIMAP7, S100A10, S100A9, MAL 
           AQP3, CD14, CD2, LGALS2, FYB, GIMAP4, ANXA1, RBP7, CD27, FCN1 
           LYZ, S100A12, MS4A6A, GIMAP5, S100A11, FOLR3, TRABD2A, AIF1, IL8, TMSB4X 
PC_ 5 
Positive:  GZMB, FGFBP2, S100A8, NKG7, GNLY, CCL4, PRF1, CST7, SPON2, GZMA 
           GZMH, LGALS2, S100A9, CCL3, XCL2, CD14, CLIC3, CTSW, MS4A6A, GSTP1 
           S100A12, RBP7, IGFBP7, FOLR3, AKR1C3, TYROBP, CCL5, TTC38, XCL1, APMAP 
Negative:  LTB, IL7R, CKB, MS4A7, RP11-290F20.3, AQP3, SIGLEC10, VIM, CYTIP, HMOX1 
           LILRB2, PTGES3, HN1, CD2, FAM110A, CD27, ANXA5, CTD-2006K23.1, MAL, VMO1 
           CORO1B, TUBA1B, LILRA3, GDI2, TRADD, ATP1A1, IL32, ABRACL, CCDC109B, PPA1 

查看前五的可變基因

> print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
PC_ 1 
Positive:  CST3, TYROBP, LST1, AIF1, FTL 
Negative:  MALAT1, LTB, IL32, IL7R, CD2 
PC_ 2 
Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1 
Negative:  NKG7, PRF1, CST7, GZMA, GZMB 
PC_ 3 
Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPA1 
Negative:  PPBP, PF4, SDPR, SPARC, GNG11 
PC_ 4 
Positive:  HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1 
Negative:  VIM, IL7R, S100A6, S100A8, IL32 
PC_ 5 
Positive:  GZMB, FGFBP2, S100A8, NKG7, GNLY 
Negative:  LTB, IL7R, CKB, MS4A7, RP11-290F20.3 

可視化對(duì)每個(gè)主成分影響比較大的基因集


> VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
> ggsave('pca.pdf')
Saving 7 x 7 in image

兩個(gè)主成分的展示

> DimPlot(pbmc, reduction = "pca",split.by = 'ident')
> ggsave('dimplotpca2.pdf')
Saving 7 x 7 in image

DimHeatmap繪制基于單個(gè)主成分的熱圖,細(xì)胞和基因的排序都是基于他們的主成分分?jǐn)?shù)巫湘。對(duì)于數(shù)據(jù)異質(zhì)性的探索是很有幫助的装悲,可以幫助用戶選擇用于下游分析的主成分維度。

> DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
> ggplot('dimheatmap.pdf')
Error: `data` must be a data frame, or other object coercible by `fortify()`, not a character vector.
Run `rlang::last_error()` to see where the error occurred.
> ggsave('dimheatmap.pdf')
Saving 7 x 7 in image
> DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
> ggsave('dimheatmap.pdf')
Saving 7 x 7 in image
> DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)
> ggsave(dimp.pdf')
+ '
Error: unexpected string constant in:
"ggsave(dimp.pdf')
'"
> ggsave('dim.pdf')
Saving 7 x 7 in image

8 數(shù)據(jù)維度

為了避免單個(gè)基因影響剩膘,Seurat聚類細(xì)胞時(shí)使用pca結(jié)果衅斩。首先需要確定的是使用多少個(gè)主成分用于后續(xù)分析。常用有兩種方法怠褐,一種是基于零分布的統(tǒng)計(jì)檢驗(yàn)方法畏梆,這種方法耗時(shí)且可能不會(huì)返回明確結(jié)果。另一種是主成分分析常用的啟發(fā)式評(píng)估奈懒。
JackStraw()
在JackStraw()函數(shù)中, 使用基于零分布的置換檢驗(yàn)方法奠涌。隨機(jī)抽取一部分基因(默認(rèn)1%)然后進(jìn)行pca分析得到pca分?jǐn)?shù),將這部分基因的pca分?jǐn)?shù)與先前計(jì)算的pca分?jǐn)?shù)進(jìn)行比較得到顯著性p-Value磷杏,溜畅。根據(jù)主成分(pc)所包含基因的p-value進(jìn)行判斷選擇主成分。最終的結(jié)果是每個(gè)基因與每個(gè)主成分的關(guān)聯(lián)的p-Value极祸。保留下來(lái)的主成分是那些富集小的p-Value基因的主成分慈格。
處理大數(shù)據(jù)時(shí)會(huì)花費(fèi)大量時(shí)間怠晴;ElbowPlot()內(nèi)置了一些其它的方法可以減少運(yùn)行時(shí)間。

> pbmc <- JackStraw(pbmc, num.replicate = 100)
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=06m 05s
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s  
> JackStrawPlot(pbmc, dims = 1:15)
Error in JackStrawPlot(pbmc, dims = 1:15) : 
  Jackstraw procedure not scored for all the provided dims. Please run ScoreJackStraw.
> ggsave('jackstraw.pdf')
Saving 7 x 7 in image
> pbmc <- ScoreJackStraw(pbmc, dims = 1:20)

JackStrawPlot()函數(shù)提供可視化方法浴捆,用于比較每一個(gè)主成分的p-value的分布蒜田,虛線是均勻分布;顯著的主成分富集有小p-Value基因选泻,實(shí)線位于虛線左上方冲粤。下圖表明保留10個(gè)pca主成分用于后續(xù)分析是比較合理的。

> JackStrawPlot(pbmc, dims = 1:15)
Warning message:
Removed 23426 rows containing missing values (geom_point). 
> ggsave('jackstraw.pdf')
Saving 7 x 7 in image
Warning message:
Removed 23426 rows containing missing values (geom_point). 
> ElbowPlot(pbmc)
> ggsave('elbowplot')
Error: Unknown graphics device ''

ElbowPlot

> ElbowPlot(pbmc)
> ggsave('elbowplot')
Error: Unknown graphics device ''

啟發(fā)式評(píng)估方法生成一個(gè)Elbow plot圖页眯。在圖中展示了每個(gè)主成分對(duì)數(shù)據(jù)方差的解釋情況(百分比表示)梯捕,并進(jìn)行排序。根據(jù)自己需要選擇主成分窝撵,圖中發(fā)現(xiàn)第9個(gè)主成分是一個(gè)拐點(diǎn)傀顾,后續(xù)的主成分(PC)變化都不大了。

注意:鑒別數(shù)據(jù)的真實(shí)維度不是件容易的事情忿族;除了上面兩種方法锣笨,Serat官當(dāng)文檔還建議將主成分(數(shù)據(jù)異質(zhì)性的相關(guān)來(lái)源有關(guān))與GSEA分析相結(jié)合。Dendritic cell 和 NK aficionados可能識(shí)別的基因與主成分 12 和 13相關(guān)道批,定義了罕見(jiàn)的免疫亞群 (i.e. MZB1 is a marker for plasmacytoid DCs)错英。如果不是事先知道的情況下,很難發(fā)現(xiàn)這些問(wèn)題隆豹。

Serat官當(dāng)文檔因此鼓勵(lì)用戶使用不同數(shù)量的PC(10椭岩、15,甚至50)重復(fù)下游分析璃赡。其實(shí)也將觀察到的判哥,結(jié)果通常沒(méi)有顯著差異。因此碉考,在選擇此參數(shù)時(shí)塌计,可以盡量選大一點(diǎn)的維度,維度太小的話對(duì)結(jié)果會(huì)產(chǎn)生不好的影響侯谁。

9 細(xì)胞聚類

Seurat v3應(yīng)用基于圖形的聚類方法锌仅,例如KNN方法。具有相似基因表達(dá)模式的細(xì)胞之間繪制邊緣墙贱,然后將他們劃分為一個(gè)內(nèi)聯(lián)群體热芹。

在PhenoGraph中,首先基于pca維度中(先前計(jì)算的pca數(shù)據(jù))計(jì)算歐式距離(the euclidean distance)惨撇,然后根據(jù)兩個(gè)細(xì)胞在局部的重合情況(Jaccard 相似系數(shù))優(yōu)化兩個(gè)細(xì)胞之間的邊緣權(quán)值伊脓。此步驟內(nèi)置于FindNeighbors函數(shù),輸入時(shí)先前確定的pc數(shù)據(jù)魁衙。

為了聚類細(xì)胞报腔,接下來(lái)應(yīng)用模塊化優(yōu)化技術(shù)迭代將細(xì)胞聚集在一起株搔。(the Louvain algorithm (default) or SLM [SLM, Blondel et al., Journal of Statistical Mechanics]),F(xiàn)indClusters函數(shù)實(shí)現(xiàn)這一功能榄笙,其中需要注意resolution參數(shù)邪狞,該參數(shù)設(shè)置下游聚類分析的“granularity”,更大的resolution會(huì)導(dǎo)致跟多的細(xì)胞類群茅撞。3000左右的細(xì)胞量,設(shè)置resolution為0.4-1.2是比較合適的巨朦。細(xì)胞數(shù)據(jù)集越大米丘,需要更大的resolution參數(shù), 會(huì)獲得更多的細(xì)胞聚類。
查看細(xì)胞屬于那個(gè)類群可以使用函數(shù)Idents糊啡。

> pbmc <- FindNeighbors(pbmc, dims = 1:10)
Computing nearest neighbor graph
Computing SNN
> pbmc <- FindClusters(pbmc, resolution = 0.5)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 2638
Number of edges: 95893

Running Louvain algorithm...
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.8735
Number of communities: 9
Elapsed time: 0 seconds
> head(Idents(pbmc), 5)
AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1 
               0                3                2                1 
AAACCGTGTATGCG-1 
               6 
Levels: 0 1 2 3 4 5 6 7 8
>pbmc <- RunUMAP(pbmc, dims = 1:10)
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
08:29:35 UMAP embedding parameters a = 0.9922 b = 1.112
08:29:35 Read 2638 rows and found 10 numeric columns
08:29:35 Using Annoy for neighbor search, n_neighbors = 30
08:29:36 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
08:29:36 Writing NN index file to temp file /tmp/Rtmp1UlrA2/file178084559cfdf3
08:29:36 Searching Annoy index using 1 thread, search_k = 3000
08:29:37 Annoy recall = 100%
08:29:37 Commencing smooth kNN distance calibration using 1 thread
08:29:38 Initializing from normalized Laplacian + noise
08:29:38 Commencing optimization for 500 epochs, with 106338 positive edges
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
08:29:41 Optimization finished
> DimPlot(pbmc, reduction = "umap")
> ggsave('umap.pdf')
Saving 7 x 7 in image

添加細(xì)胞標(biāo)簽

> DimPlot(pbmc, reduction = "umap",label = TRUE)
> LabelClusters(DimPlot(pbmc, reduction = "umap"),id = 'ident')
> ggsave('dim.pdf')
Saving 7 x 7 in image

此時(shí)可以保存數(shù)據(jù)拄查,方便下次直接導(dǎo)入數(shù)據(jù)修改圖形。

10 尋找差異表達(dá)基因 (cluster biomarkers)

Seurat可以通過(guò)差異表達(dá)分析尋找不同細(xì)胞類群的標(biāo)記基因棚蓄。FindMarkers函數(shù)可以進(jìn)行此操作堕扶,但是默認(rèn)尋找單個(gè)類群(參數(shù)ident.1)與其他所有類群陽(yáng)性和陰性標(biāo)記基因。FindAllMarkers函數(shù)會(huì)自動(dòng)尋找每個(gè)類群和其他每個(gè)類群之間的標(biāo)記基因梭依。

min.pct參數(shù):設(shè)定在兩個(gè)細(xì)胞群中任何一個(gè)被檢測(cè)到的百分比稍算,通過(guò)此設(shè)定不檢測(cè)很少表達(dá)基因來(lái)縮短程序運(yùn)行時(shí)間。默認(rèn)0.1

thresh.test參數(shù):設(shè)定在兩個(gè)細(xì)胞群中基因差異表達(dá)量役拴『剑可以設(shè)置為0 ,程序運(yùn)行時(shí)間會(huì)更長(zhǎng)河闰。

max.cells.per.ident參數(shù):每個(gè)類群細(xì)胞抽樣設(shè)置科平;也可以縮短程序運(yùn)行時(shí)間。

> cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=06s  
> head(cluster1.markers, n = 5)
               p_val avg_log2FC pct.1 pct.2     p_val_adj
S100A9  0.000000e+00   5.570063 0.996 0.215  0.000000e+00
S100A8  0.000000e+00   5.477394 0.975 0.121  0.000000e+00
FCN1    0.000000e+00   3.394219 0.952 0.151  0.000000e+00
LGALS2  0.000000e+00   3.800484 0.908 0.059  0.000000e+00
CD14   2.856582e-294   2.815626 0.667 0.028 3.917516e-290
> cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=04s  
> head(cluster5.markers, n = 5)
                      p_val avg_log2FC pct.1 pct.2     p_val_adj
FCGR3A        8.331882e-208   4.261784 0.975 0.040 1.142634e-203
CFD           1.932644e-198   3.423863 0.938 0.036 2.650429e-194
IFITM3        2.710023e-198   3.876058 0.975 0.049 3.716525e-194
CD68          1.069778e-193   3.013656 0.926 0.035 1.467094e-189
RP11-290F20.3 4.218926e-190   2.722303 0.840 0.016 5.785835e-186
> pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
Calculating cluster 0
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s  
Calculating cluster 1
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=02s  
Calculating cluster 2
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s  
Calculating cluster 3
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s  
Calculating cluster 4
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s  
Calculating cluster 5
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=04s  
Calculating cluster 6
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=03s  
Calculating cluster 7
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=04s  
Calculating cluster 8
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=02s  
> pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)
Error: Problem with `filter()` input `..1`.
i Input `..1` is `top_n_rank(2, avg_logFC)`.
x object 'avg_logFC' not found
i The error occurred in group 1: cluster = 0.
Run `rlang::last_error()` to see where the error occurred.
> pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
Calculating cluster 0
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s  
Calculating cluster 1
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=02s  
Calculating cluster 2
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s  
Calculating cluster 3
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s  
Calculating cluster 4
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s  
Calculating cluster 5
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=04s  
Calculating cluster 6
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=03s  
Calculating cluster 7
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=04s  
Calculating cluster 8
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=02s  
> pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)
Error: Problem with `filter()` input `..1`.
i Input `..1` is `top_n_rank(2, avg_logFC)`.
x object 'avg_logFC' not found
i The error occurred in group 1: cluster = 0.
Run `rlang::last_error()` to see where the error occurred.
Seurat可以通過(guò)參數(shù)test.use設(shè)定檢驗(yàn)差異表達(dá)的方法(詳情見(jiàn) 
[DE vignett](https://links.jianshu.com/go?to=https%3A%2F%2Fsatijalab.org%2Fseurat%2Fv3.0%2Fde_vignette.html))姜性。

> cluster1.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=00s  
> head(cluster1.markers, n = 5)
      myAUC  avg_diff power avg_log2FC pct.1 pct.2
RPS12 0.831 0.5132163 0.662  0.7404146 1.000 0.991
RPS6  0.828 0.4730236 0.656  0.6824288 1.000 0.995
RPL32 0.824 0.4362054 0.648  0.6293113 0.999 0.995
RPS27 0.821 0.5010227 0.642  0.7228229 0.999 0.992
RPS14 0.815 0.4366673 0.630  0.6299777 1.000 0.994

Seurat有多種方法可視化標(biāo)記基因的方法
VlnPlot: 基于細(xì)胞類群的基因表達(dá)概率分布
FeaturePlot:在tSNE 或 PCA圖中畫(huà)出基因表達(dá)情況
RidgePlot瞪慧,CellScatter,DotPlot

> VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
> ggsave('vlen123.pdf')
Saving 7 x 7 in image
> VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)
> ggsave('vlenkg73.pdf')
Saving 7 x 7 in image

Seurat有多種方法可視化標(biāo)記基因的方法
VlnPlot: 基于細(xì)胞類群的基因表達(dá)概率分布
FeaturePlot:在tSNE 或 PCA圖中畫(huà)出基因表達(dá)情況
RidgePlot部念,CellScatter弃酌,DotPlot

> FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", 
+     "CD8A"))
> ggsave('featuregene10.pdf')
Saving 7 x 7 in image

DoHeatmap為指定的細(xì)胞和基因花表達(dá)熱圖。每個(gè)類群默認(rèn)展示top 20標(biāo)記基因印机。

> ggsave('featuregene10.pdf')
Saving 7 x 7 in image
> top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
Error: Problem with `filter()` input `..1`.
i Input `..1` is `top_n_rank(10, avg_logFC)`.
x object 'avg_logFC' not found
i The error occurred in group 1: cluster = 0.
Run `rlang::last_error()` to see where the error occurred.

11 Assigning cell type identity to clusters

Cluster ID Markers Cell Type
0 IL7R, CCR7 Naive CD4+ T
1 IL7R, S100A4 Memory CD4+
2 CD14, LYZ CD14+ Mono
3 MS4A1 B
4 CD8A CD8+ T
5 FCGR3A, MS4A7 FCGR3A+ Mono
6 GNLY, NKG7 NK
7 FCER1A, CST3 DC
8 PPBP Platelet

> new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono", 
+     "NK", "DC", "Platelet")
> names(new.cluster.ids) <- levels(pbmc)
> pbmc <- RenameIdents(pbmc, new.cluster.ids)
> DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
> ggsave('wmap123.pdf')
Saving 7 x 7 in image


12 最后保存這個(gè)運(yùn)行過(guò)得文件

> saveRDS(pbmc, file = "pbmc3k_final.rds")
最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
  • 序言:七十年代末矢腻,一起剝皮案震驚了整個(gè)濱河市,隨后出現(xiàn)的幾起案子射赛,更是在濱河造成了極大的恐慌多柑,老刑警劉巖,帶你破解...
    沈念sama閱讀 212,816評(píng)論 6 492
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件楣责,死亡現(xiàn)場(chǎng)離奇詭異竣灌,居然都是意外死亡聂沙,警方通過(guò)查閱死者的電腦和手機(jī),發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 90,729評(píng)論 3 385
  • 文/潘曉璐 我一進(jìn)店門初嘹,熙熙樓的掌柜王于貴愁眉苦臉地迎上來(lái)及汉,“玉大人,你說(shuō)我怎么就攤上這事屯烦】浪妫” “怎么了?”我有些...
    開(kāi)封第一講書(shū)人閱讀 158,300評(píng)論 0 348
  • 文/不壞的土叔 我叫張陵驻龟,是天一觀的道長(zhǎng)温眉。 經(jīng)常有香客問(wèn)我,道長(zhǎng)翁狐,這世上最難降的妖魔是什么类溢? 我笑而不...
    開(kāi)封第一講書(shū)人閱讀 56,780評(píng)論 1 285
  • 正文 為了忘掉前任,我火速辦了婚禮露懒,結(jié)果婚禮上闯冷,老公的妹妹穿的比我還像新娘。我一直安慰自己懈词,他們只是感情好蛇耀,可當(dāng)我...
    茶點(diǎn)故事閱讀 65,890評(píng)論 6 385
  • 文/花漫 我一把揭開(kāi)白布。 她就那樣靜靜地躺著钦睡,像睡著了一般蒂窒。 火紅的嫁衣襯著肌膚如雪。 梳的紋絲不亂的頭發(fā)上荞怒,一...
    開(kāi)封第一講書(shū)人閱讀 50,084評(píng)論 1 291
  • 那天洒琢,我揣著相機(jī)與錄音,去河邊找鬼褐桌。 笑死衰抑,一個(gè)胖子當(dāng)著我的面吹牛,可吹牛的內(nèi)容都是我干的荧嵌。 我是一名探鬼主播呛踊,決...
    沈念sama閱讀 39,151評(píng)論 3 410
  • 文/蒼蘭香墨 我猛地睜開(kāi)眼,長(zhǎng)吁一口氣:“原來(lái)是場(chǎng)噩夢(mèng)啊……” “哼啦撮!你這毒婦竟也來(lái)了谭网?” 一聲冷哼從身側(cè)響起,我...
    開(kāi)封第一講書(shū)人閱讀 37,912評(píng)論 0 268
  • 序言:老撾萬(wàn)榮一對(duì)情侶失蹤赃春,失蹤者是張志新(化名)和其女友劉穎愉择,沒(méi)想到半個(gè)月后,有當(dāng)?shù)厝嗽跇?shù)林里發(fā)現(xiàn)了一具尸體,經(jīng)...
    沈念sama閱讀 44,355評(píng)論 1 303
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡锥涕,尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 36,666評(píng)論 2 327
  • 正文 我和宋清朗相戀三年衷戈,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片层坠。...
    茶點(diǎn)故事閱讀 38,809評(píng)論 1 341
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡殖妇,死狀恐怖,靈堂內(nèi)的尸體忽然破棺而出破花,到底是詐尸還是另有隱情谦趣,我是刑警寧澤,帶...
    沈念sama閱讀 34,504評(píng)論 4 334
  • 正文 年R本政府宣布座每,位于F島的核電站蔚润,受9級(jí)特大地震影響,放射性物質(zhì)發(fā)生泄漏尺栖。R本人自食惡果不足惜,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 40,150評(píng)論 3 317
  • 文/蒙蒙 一烦租、第九天 我趴在偏房一處隱蔽的房頂上張望延赌。 院中可真熱鬧,春花似錦叉橱、人聲如沸挫以。這莊子的主人今日做“春日...
    開(kāi)封第一講書(shū)人閱讀 30,882評(píng)論 0 21
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽(yáng)掐松。三九已至,卻和暖如春粪小,著一層夾襖步出監(jiān)牢的瞬間大磺,已是汗流浹背。 一陣腳步聲響...
    開(kāi)封第一講書(shū)人閱讀 32,121評(píng)論 1 267
  • 我被黑心中介騙來(lái)泰國(guó)打工探膊, 沒(méi)想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留杠愧,地道東北人。 一個(gè)月前我還...
    沈念sama閱讀 46,628評(píng)論 2 362
  • 正文 我出身青樓逞壁,卻偏偏與公主長(zhǎng)得像流济,于是被迫代替她去往敵國(guó)和親。 傳聞我的和親對(duì)象是個(gè)殘疾皇子腌闯,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 43,724評(píng)論 2 351

推薦閱讀更多精彩內(nèi)容