10X單細胞(10X空間轉(zhuǎn)錄組)進行細胞定義的分析策略

hello,大家好治泥,從今天開始,我們開始走上正軌遮精,分享10X單細胞或者10X空間轉(zhuǎn)錄組的分析內(nèi)容居夹,今天我們分享的內(nèi)容就是做細胞定義的分析策略,文章在Tutorial: guidelines for annotating single-cell transcriptomic maps using automated and manual methods本冲,2021年6月發(fā)表于nature protocols准脂,影響因子10分,還是我們一貫的思路檬洞,先分享文獻內(nèi)容狸膏,最后看看示例代碼。

Abstract(我們總結(jié)一下)

We recommend a three-step workflow including automatic cell annotation (wherever possible), manual cell annotation and verification.(自動細胞注釋(應該是軟件)添怔,人工細胞注釋和確認)湾戳。
Frequently encountered challenges are discussed, as well as strategies to address them.(希望對我們有所幫助)。
Guiding principles and specific recommendations for software tools and resources that can be used for each step are covered, and an R notebook is included to help run the recommended workflow.(看來作者已經(jīng)寫好了示例,當然广料,需要我們有一定的基礎(chǔ))院塞。

Introduction

(1)細胞注釋,我們單細胞分析得到的注釋類似于下圖性昭,注釋的結(jié)果必須是可解釋的并且支持生物學的發(fā)現(xiàn)。
圖片.png
To interpret this map biologically, it is necessary to determine which cell types or cell states are represented by clusters or other patterns observed in the data县遣。These interpretations can then be labeled on the map, which helps place them in a conceptual framework useful for better understanding tissue biology.

簡單回顧一下三種降維的優(yōu)缺點
  • t-SNE:preserves local groups of similar cells糜颠,while equalizing the density of cells within each group(同時均衡每組內(nèi)的細胞密度);global relationships between cell types are not maintained, and thus
    cluster-to-cluster relationships cannot be inferred and may be misleading
    萧求。
  • UMAP:UMAP is typically regarded as better for visualizing global relationships and gradients than t-SNE(這個大家應該熟悉)其兴。
  • PCA:PCA can be useful for visualizing cell gradients and states,線性降維

(2)細胞注釋的一般步驟automatic annotation, manual annotation and verification夸政,這一部分大家應該都很熟悉了元旬。
圖片.png
First, automatic annotation uses a predefined set of ‘marker genes’ or reference single-cell data to identify and label individual cells or cell clusters by matching their gene expression patterns (signatures) to those of known cell types.(這個應該是所有做細胞定義的分析過程,當然,也很困難)匀归。
A second major step is manual annotation, which involves studying genes and gene functions specific to each cell cluster or pattern to verify automatic cell annotations and identify novel cell types and states.(尤其存在新的細胞類型坑资,更加難以確認)。
Finally, verification can confirm the identity and function of select cell types using independent methods, such as new validation experiments.接下來穆端,喔們逐步分析看看袱贮。

Step 1: automatic cell annotation 兩種方式,1体啰、marker gene 2攒巍、參考集,各有利弊

A major challenge with automatic cell annotation is that many cell types do not have well-characterized gene expression signatures, resulting in incomplete or inaccurate labeling for some cells.Automated methods typically work better for major cell types and may not be able to effectively distinguish subtypes.(細胞數(shù)多的可靠性高)荒勇。自動細胞注釋有助于快速識別已知細胞類型并突出顯示未知細胞類型以供進一步探索柒莉。

Comparison of the caveats and recommendations for different approaches to cell annotation

Stage of analysis Aspect of analysis Potential caveats Recommendation
Automatic cell annotation All automatic methods Fast, but not effective for poorly characterized cells Use manual annotation for poorly characterized cells
Annotating clusters May miss important differences between cells Use automatic annotation of clusters to get a general idea of cell type and then refine labels manually. In addition, use multiple cluster-based methods and compare results
Annotating individual cells Ideal, but requires high reads per cell Experiments with low reads per cell require cluster-based annotation
Marker-based annotation methods Marker genes not easily accessible for all cell types; may result in conflicting or absent cell labels Requires expert knowledge to curate more extensive marker lists
Reference-based annotation methods Perform poorly with incomplete or poorly matched reference data, which may result in conflicting or absent cell labels Use well-matched reference data or marker-based methods if such data are unavailable
Often requires batch correction, which may reduce the accuracy of results Analyze the reference data for strong biological signals. Use a good experimental protocol that will prevail over batch effects
Mistakes in reference data get carried over to results Analyze reference data for potential errors before using
Comparing results from different automatic annotation methods Results may not agree with each other Compare confidence scores of respective labels and consider label agreement (majority rule); resolve conflicts using manual annotation
Consider the possibility of cell subtypes, new cell types or gradients and cell states
Expert manual cell annotation All manual methods Slow, labor-intensive Whenever possible, begin with automatic annotation to determine general cell labels
Subjective Work with an expert; consider multiple cell-type conclusions
Marker-based annotation Cell types not distinguishable by a single marker Use multiple markers for each cell type
Known markers not distinguishing cell types Curate larger lists of markers from the literature, additional experiments or experts
Conflicting marker gene sets between sources Select a marker gene set that best represents the biological signal being looked for in the data (e.g., if looking for cell subtypes, use more extensive gene sets than what is used for general cell-type annotation)

先來看看第一個,基于marker gene 的注釋

To be successful, the marker gene or gene set (a collection of marker genes) should be specifically and consistently expressed in a given cell, cluster or class of cells沽翔。大而全的marker gene列表是必需的兢孝。
(1)To label individual cells(注意這里是定義單個細胞),one of the most reliable markerbased annotation tools is semi-supervised category identification and assignment(半監(jiān)督的策略,SCINA搀擂,這里需要我們認為提供marker gene列表西潘,原理大家需要看看原文章了)。
(2)AUCell is another good marker-based labeling method that classifies individual cells or clusters.(關(guān)于AUcell大家可以參考文章深入理解R包AUcell對于分析單細胞的作用)哨颂。AUCell ranks the genes in each cell by decreasing expression value, and cells are labeled according to their most active (highly expressed) marker gene sets.(原理喷市,了解一下)
(3)gene set variation analysis (GSVA) has been benchmarked to be fast and reliable。GSVA works similarly to AUCell: given a database of marker gene sets, it identifies sets that are enriched in the gene expression profile of a cluster.(關(guān)于GSVA做細胞定義,這個是目前大多數(shù)公司的做法)。

marker 定義細胞類型的最大問題妆艘,A disadvantage of these tools is that markers are not easily accessible for all cell types.

軟件注釋細胞類型的表格

Tool Type Language Resolution Approach Allows ‘None’ Notes
singleCell Net Reference based R Single cells Relative-expression gene pairs + random forest Yes, but rarely does so even when it should 10–100× slower than other methods; high accuracy
scmap-cluster Reference based R Single cells Consistent correlations Yes Fastest method available; balances falsepositives and false-negatives; includes web interface for use with a large pre-built reference or custom reference set
scmap-cell Reference based R Single cells Approximate nearest neighbors Yes Assigns individual cells to nearest neighbor cells in reference; allows mapping of cell trajectories; fast and scalable
singleR Reference based R Single cells Hierarchical clustering and Spearman correlations No Includes a large marker reference; does not scale to data sets of ≥10,000 cells; includes web interface with marker database
Scikit-learn Reference based Python Multiple possible k-nearest neighbors, support vector machine, random forest, nearest mean classifier and linear discriminant analysis (Optional) Expertise required for correct design and appropriate training of classifier while avoiding overtraining
AUCell Marker based R Single cells Area under the curve to estimate marker gene set enrichment Yes Because of low detection rates at the level of single cells, it requires many markers for every cell type
SCINA Marker based R Single cells Expectation maximization, Gaussian mixture model (Optional) Simultaneously clusters and annotates cells; robust to the inclusion of incorrect marker genes
GSEA/GSVA Marker based R/Java Clusters of cells Enrichment test Yes Marker gene lists must be reformatted in GMT format. Markers must all be differentially expressed in the same direction in the cluster
Harmony Integration R Single cells Iterative clustering and adjustment Yes Integrates only lower-dimensional projection of the data; seamlessly(無縫地) integrated into Seurat pipeline; may overcorrect data
Seurat-canonical correlation analysis Integration R Single cells MNN anchors + canonical correlation analysis Yes Accuracy depends on the accuracy of MNN anchors, which are automatically-identified corresponding cells across data sets
mnnCorrect Integration R Single cells MNN pairs + singular value decomposition Yes Accuracy depends on the accuracy of MNN pairs (cells matched between data sets).
Linked inference of genomic experimental relationships (LIGER) Integration R Single cells Non-negative matrix factorization Yes Allows interpretation of data set–specific and shared factors of variation

接下來看看第二部分 Reference-based automatic cell annotation

this approach is possible only if high-quality and relevant annotated reference single-cell data are available掂恕。目前單細胞已經(jīng)有了一些公共的數(shù)據(jù)庫可以獲取參考集。These atlases typically contain hundreds of thousands of cells and dozens of different annotated cell types垫桂。
這種方法有一個共同的特點,需要一個注釋完整的參考集,一旦參考集是不完整的植酥、缺失的,準確度就會明顯的下降弦牡。
當然友驮,原則上任何做整合分析的方法都可以用于細胞定義,簡單回顧一下常見整合方法的特點驾锰。

  • Harmony iteratively merges data sets represented by top PCs, which are then used to cluster cells. Each cell is iteratively adjusted on the basis of an estimated correction vector to shift it closer to the center of its cluster until convergence. MNN approaches, such as mnnCorrect/FastMNN or Seurat v321, identify the most similar cells (MNNs), called ‘a(chǎn)nchors’, across data sets that are used to estimate and correct the cell type–specific batch effects. LIGER identifies shared (common biology) and unique (biological or technical) factors between data sets using non-negative matrix factorization. LIGER is recommended when specific cell types appear to be present in some of the data sets and missing in others. Integration methods can suffer from overcorrection, where different cell types are merged, or undercorrection, when resulting clusters contain cells from only one input data set. Multiple integration methods may need to be evaluated to find a balance that best represents the data.(批次矯正的方法也要根據(jù)情況來判斷)卸留,過度矯正是目前常見的問題

細化自動注釋

Benchmarking studies show variable performance of automatic annotation tools, depending on the data set and distinctiveness of the gene expression profiles of the cell types to be annotated(軟件之間的也不具有統(tǒng)一性)。

For instance, distinguishing T cells from B cells is relatively straightforward, but automatic tools sometimes cannot accurately distinguish CD8+ cytotoxic T cells from natural killer cells(如下圖)

圖片.png

When applying multiple cell annotation methods to a data set, cells or clusters will acquire multiple, sometimes conflicting, cell-type labels.(這也是最大的問題)椭豫。如果定義的一致耻瑟,那么很容易辨別旨指,容易定義結(jié)果存在矛盾,那么每個軟件提供一個可能性分數(shù)to identify a single high-scoring label.但是不同軟件的判定分數(shù)不具有比較性喳整,但是可以根據(jù)定義結(jié)果出現(xiàn)的頻率來加以識別谆构,出現(xiàn)的頻率越高,越可能是某種細胞類型算柳,如果這些都不行低淡,那就只有人工注釋了。
圖片.png
如果注釋存在矛盾瞬项,很可能說明該cluster還有subtype蔗蹋,但是,如果不能明確定義亞型囱淋,則更通用的細胞類型注釋可能更合適猪杭。 這里舉一個例子,F(xiàn)or example, if a cluster is annotated as regulatory T cells, naive T cells and helper T cells by different methods, it may be most appropriate to assign the general label of ‘T cells’.
If the conflicting annotations are not subtypes of the same cell type, then the cluster may represent an intermediate cell state or gene-expression gradient妥衣。(中間狀態(tài)或者基因表達等級皂吮,這個地方很值得挖掘)。

Finally, a cluster may have a novel cell identity that is absent from the reference data. This often results in widely varying results from automatic annotation methods or insufficient confidence for any tool to assign any label. In such situations, manual annotation must be performed.(新的細胞類型就需要我們?nèi)斯ぷ⑨屃耍?/h4>

Step 2: expert manual cell annotation(人工注釋)

人工注釋細胞類型目前是最可靠的方法(gold-standard method)税手,但是蜂筹,it is slow and labor intensive and can be subjective.主要的人工注釋,就是我們?nèi)斯みx擇marker的過程,很繁瑣芦倒。
圖片.png
所有挑選出來的基因必須進行檢驗和可視化(比如dotplot和熱圖)
圖片.png

圖片.png
Challenges in this approach are that well-known markers are often too few in number to completely annotate an scRNA-seq data set, and some well-known markers may not be as specific within an scRNA-seq data set as expected.
set. Master transcription factors that drive cell fate often make better gene expression markers than cell-surface proteins that are commonly used to classify cell populations(轉(zhuǎn)錄因子基因的識別能力更好)艺挪,因為轉(zhuǎn)錄組水平和蛋白水平并不不能很好的關(guān)聯(lián)。
識別同一個細胞類型的marker gene通常是多個兵扬,尤其在定義subcluster的時候麻裳。
The ideal primary source for cell-defining genes is a singlecell atlas from a relevant organism, organ and disease context.(marker基因具有組織、器官器钟、疾病特異性)津坑。
in some instances a cluster may not express markers of any known cell type; conversely, it may express markers of more than one cell type.(這種情況就是低質(zhì)量的細胞、新的細胞類型或者含有subcluster)傲霸。
Once cell-type information from known markers is exhausted, cells that have not been confidently annotated must be manually examined, cluster by cluster.(marker gene無法起到作用的時候疆瑰,就需要人工進一步檢驗了。潛在的marker gene需要用一些軟件來進行差異分析)昙啄。
圖片.png
All marker genes are then manually researched to find functional information that may help identify the cell type of the cluster with which they are associated穆役。(例如通路富集)。
Some cells may be challenging to annotate, including novel cell types, which can be described on the basis of the function of genes they express跟衅。

Annotating cell states and gradients(針對新的細胞類型)

When analyzing and characterizing novel cell types, it is important to determine whether they represent a stable cell type or contain multiple cell states.(穩(wěn)態(tài)還是多種細胞狀態(tài))。
細胞類型和狀態(tài)的定義尚未標準化播歼,但可能預期穩(wěn)定的細胞類型在整個cluster中具有homo基因表達并且聚類在一起伶跷。
whereas cell gradients appear as a spread-out string of cells and cell states掰读,Expression gradients indicate continuous differences that are present in the cell population, which could represent states like the cell cycle, immune activation, spatial patterning or transient developmental stages。識別有意義的細胞狀態(tài)去除批次效應意義重大叭莫。
圖片.png
注釋細胞發(fā)育的中間階段通常很困難蹈集,因為這些區(qū)域很少表達獨特的標記基因。 It is often easier to label the ends of a gradient and then characterize intermediate stages using the order of specific genes that mark these ends as increasing or decreasing across the gradient雇初。
Extracting the cells in the gradient and performing principal component analysis (PCA) on them is often a useful visualization for gradients, because it preserves the large-scale distances between cells(有的軌跡分析軟件就采用這樣的策略)拢肆。
目前沒有可以自動注釋中間態(tài)的方法,不同細胞的層次只能人工識別靖诗,making use of known structure and celltype transitions relevant to the particular experiment郭怪。
Similarly, homogeneous or similar cell states or cell types are often difficult to annotate because they share many of the same marker genes。(這個時候就需要再分群分析了).
Very fine distinctions between highly similar cell types may not be visible transcriptionally and may be visible only in other genomic layers, such as chromatin state (assay for transposase-accessible chromatin using sequencing (ATAC-seq) and DNA methylation).(多組學識別細胞類型也是很重要的一點)刊橘。

Step 3: annotation verification

需要其他的分析輔助驗證了鄙才,包括多組學,SC-ATAC促绵,CNV等等攒庵。

最后,附上一張做細胞定義的軟件總結(jié)

圖片.png

圖片.png

圖片.png

我們來看一下示例代碼

1. Reference-based automatic annotation

Create the Reference

The first step in performing reference-based annotation is to select an annotated dataset to use as the reference. Here we will use one of the references created by the authors of SingleR and show how it can be used with other tools such as scmap.

Other reference datasets can be found in GEO (https://www.ncbi.nlm.nih.gov/geo/) or at a link provided by the authors of the reference dataset. However, to use a dataset as a reference you will need both the single-cell RNA sequencing data and the cell-type annotations. GEO does not require authors to provide the cell-type annotations of their data, so you may need to contact the authors directly to to get the annotations for some datasets.

# Set a random seed to ensure result reproducibility
set.seed(9742)
# Download singleR reference data for immune cells and save it as the variable "ref"
# The variable is a class called "Summarized Experiment"
# This will take a while
ref <- celldex::DatabaseImmuneCellExpressionData()

Next we need to reformat the data to ensure it is compatible with the tool we are using. We will be demonstrating scmap, which uses data formatted as a ‘SingleCellExperiment object’, and assumes by default that gene names are found in a column named ‘feature_symbol’ while the cell-type labels are in a column named ‘cell_type1’. In addition, scmap requires that you normalize and log-transform the reference data; this has already been done for the SingleR reference data so we skip those steps here.

# Assign cell-type labels in a column named "cell_type1"
colData(ref)$cell_type1 <- colData(ref)$label.fine
# Assign gene names in a column called "feature_symbol"
rowData(ref)$feature_symbol <- rownames(ref)

# Convert the data into a SingleCellExperiment object
ref_sce <- SingleCellExperiment::SingleCellExperiment(assays=list(logcounts=Matrix::Matrix(assays(ref)$logcounts)), 
            colData=colData(ref), rowData=rowData(ref))

Our reference data is ready to be used now. So lets process this data to build the index we will use to map our unlabeled data to. First, we select genes to use, which will be those deemed most informative by scmap after fitting a linear model to the gene expression by gene dropout distribution. Those which are most informative have high expression values and low % dropout rates across cells.

# Create scmap-cluster reference by first selecting the most informative features
ref_sce <- scmap::selectFeatures(ref_sce, suppress_plot=FALSE)
Your object does not contain counts() slot. Dropouts were calculated using logcounts() slot...
image.png
# Inspect the first 50 genes selected by scmap
rownames(ref_sce)[which(rowData(ref_sce)$scmap_features)][1:50]
# You can check and see how many genes were chosen by checking the length of the
# vector of gene names
length(rownames(ref_sce)[which(rowData(ref_sce)$scmap_features)])
[1] 500

Now we can see the genes that scmap has chosen to use. If there are key marker genes missing we can make sure they are included like this:

# Create a list of key markers that you want to use
my_key_markers = c("TRAC", "TRBC1", "TRBC2", "TRDC", "TRGC1", "TRGC2", "IGKC")
# Ensure markers are in the list of features used by scmap
rowData(ref_sce)$scmap_features[rownames(ref_sce) %in% my_key_markers] <- TRUE
# You can check and see if this added any genes by checking the length 
# of the vector of gene names again
length(rownames(ref_sce)[which(rowData(ref_sce)$scmap_features)])
[1] 502

And we can remove genes that we think might be technical artefacts, such as mitochondria RNAs, like this:

# Create a list of mitochondrial genes from the dataset (genes that begin with "MT")
mt_genes <- rownames(ref_sce)[grep("^MT-", rownames(ref_sce))]
# Remove these genes from the features used by scmap
rowData(ref_sce)$scmap_features[rownames(ref_sce) %in% mt_genes] <- FALSE
# Check how many genes this is
length(rownames(ref_sce)[which(rowData(ref_sce)$scmap_features)])
[1] 495
# Extract the features and assign them to a new variable, "scmap_feature_genes"
scmap_feature_genes <- rownames(ref_sce)[which(rowData(ref_sce)$scmap_features)]
# Note that the number of genes/features is identical to what we just checked
length(scmap_feature_genes)
[1] 495

Now we build the reference profiles used in scmap-cluster, for cluster-based cell-type annotation. These profiles can be accessed and plotted from inside the SingleCellExperiment object as follows:

# Create reference profiles;
# Once reference profiles are generated the original data are 
# not needed for scmap-cluster
ref_sce <- scmap::indexCluster(ref_sce)
# Visualize interesting features as a heatmap
# Reformat the data so that they can be used as input to ggplot2
cormat <- reshape2::melt(as.matrix(metadata(ref_sce)$scmap_cluster_index))
# Plot the data
ggplot2::ggplot(cormat, ggplot2::aes(x = Var2, y = Var1, fill = value)) +
  ggplot2::geom_tile() +
  ggplot2::scale_fill_gradient2(low = "blue", high = "darkred",
                                name = "Expression value") +
  ggplot2::theme_minimal() +
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, vjust = 1,
                                            size = 18, hjust = 1),
                 axis.text.y = ggplot2::element_text(size = 15),
                 axis.title.x = ggplot2::element_blank(),
                 axis.title.y = ggplot2::element_blank())
image.png
# Store expression information as a variable
scmap_cluster_reference <- metadata(ref_sce)$scmap_cluster_index

From here on out, scmap only needs this set of reference profiles. So if working with a very large reference, one could save this index separately to your computer and reload it when annotating new datasets. But since that is not the case here, we will simply save this index to a variable for now.

We will also demonstrate scmap-cell to annotate individual cells of our dataset, so we will create that index as well. As before one would first normalize and log-transform the reference data, and select genes to use. As we have already done that, we need only run the command to build the scmap-cell index. There are two parameters we can set: M and k, increasing M and k will give more accurate mapping but increase the size of the index, and the time needed to map cells. Here we use the defaults (you may see a warning message about the defaults that are being used):

# Update the previous reference to also contain the scmap-cell reference
ref_sce <- scmap::indexCell(ref_sce)
Parameter M was not provided, will use M = n_features / 10 (if n_features <= 1000), where n_features is the number of selected features, and M = 100 otherwise.
Parameter k was not provided, will use k = sqrt(number_of_cells)
# Extract the scmap index from the reference and store as a variable
scmap_cell_reference <- metadata(ref_sce)$scmap_cell_index
# Extract the associated cell IDs from the reference and save as a variable
scmap_cell_metadata <- colData(ref_sce)

scmap-cell assigns cells in one dataset to their “nearest neighbours” in the reference dataset. In this case, the “nearest neighbours” are the cells in the reference dataset most similar to the cells in the query dataset.

One can use any rule they like to transfer information, such as cell-type or pseudotime, from these nearest neighbours to the query data. Thus we need to store the associated metadata (cell type ID) for the reference as well (see above). Now we don’t need to use our original reference dataset anymore.

Assign cells from the query dataset to the reference.

The query dataset we will be using is provided by 10X genomics.

download.file("https://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz",
              "pbmc3k_filtered_gene_bc_matrices.tar.gz")
trying URL 'https://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz'
Content type 'application/x-tar' length 7621991 bytes (7.3 MB)
==================================================
downloaded 7.3 MB
untar("pbmc3k_filtered_gene_bc_matrices.tar.gz")

Now we need to load our unlabeled dataset into R. Normal preprocessing including QC filtering, normalizing and log-transforming the data must be done prior to annotating. In addition, scmap is based on the SingleCellExperiment object, so if our data is stored as a Seurat object we must convert it to SingleCellExperiment as shown below.

# This portion of the tutorial is assuming the raw 10X data is in the
# following folder in your directory:
data <- Seurat::Read10X("filtered_gene_bc_matrices/hg19/")
# Make SingleCellExperiment from the raw matrix
query_sce <- SingleCellExperiment::SingleCellExperiment(assays=list(counts=data))

# Make SingleCellExperiment from Seurat object
query_seur <- Seurat::CreateSeuratObject(data)
Feature names cannot have underscores ('_'), replacing with dashes ('-')
query_sce <- Seurat::as.SingleCellExperiment(query_seur)

# normalize the data using the scater package
query_sce <- scater::logNormCounts(query_sce)

# add feature_symbol column (i.e. the gene symbols)
rowData(query_sce)$feature_symbol <- rownames(query_sce)

Now you should have an entry in assays(my_sce) called logcounts with the log-normalized matrix. We are now ready to annotate our data with scmap-cluster. Let’s start with scmap-cluster:

# Run scmapCluster
scmap_cluster_res <- scmap::scmapCluster(projection=query_sce, 
                index_list=list(immune1 = scmap_cluster_reference), 
                threshold=0.1)
# plot the results of our annotation
par(mar=c(13, 4, 1, 0))
barplot(table(scmap_cluster_res$combined_labs), las=2)
圖片.png

# Store this annotation information within the query object
colData(query_sce)$scmap_cluster <- scmap_cluster_res$combined_labs

# Make a UMAP of the cells, labeled with the cell-type annotations from scmapCluster
query_sce <- scater::runUMAP(query_sce)
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="scmap_cluster")
圖片.png

Alternatively we could use scmap-cell, to find the 10 nearest neighbours to each cell (i.e. the 10 most similar cells to each query cell), then pick the annotation that is most common among the neighbours, like this:

# Determine the 10 nearest neighbours from the reference dataset for each
# cell in the query dataset using scmapCell
nearest_neighbours <- scmap::scmapCell(projection=query_sce, 
    index_list = list(immune1 = scmap_cell_reference), 
    w=10)

# Get metadata (cell type IDs) for the neighbours of each cell in the query dataset
mode_label <- function(neighbours, metadata=scmap_cell_metadata$cell_type1) {
    freq <- table(metadata[neighbours])
    label <- names(freq)[which(freq == max(freq))]
    if (length(label) > 1) {return("ambiguous")}
    return(label)
}

# Apply these labels to the query cells
scmap_cell_labs <- apply(nearest_neighbours$immune1$cells, 2, mode_label)

# Add the labels to the query object
colData(query_sce)$scmap_cell <- scmap_cell_labs

# Create a bar plot of how many cells in the query dataset were assigned
# a specific label
par(mar=c(10, 4, 0, 0))
barplot(table(scmap_cell_labs), las=2)
圖片.png

# Make a UMAP and add the new cell-type annotations
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="scmap_cell")
圖片.png

Another option compatible with the SingleCellExperiment Object is SingleR. As before, we need a reference and a query dataset. In the case of SingleR, we need the entirety of the reference dataset, rather than generating a compressed reference index as we did with scmap. In addition, running just this small example demonstrates the difference in run time between the methods (SingleR takes a fair bit of time).

# Run SingleR on the query data and the reference to acquire
# cell-type predictions for the cells in the query dataset
predictions <- SingleR::SingleR(test=query_sce, ref=ref, labels=ref$label.fine)
# You'll notice that some of the cells didn't get assigned a cell identity
# We can count the number here:
sum(is.na(predictions$pruned.labels))
[1] 23
# Change NAs to "ambiguous"
predictions$pruned.labels[which(is.na(predictions$pruned.labels))] <- "ambiguous"
# Add singleR labels to query_sce
colData(query_sce)$singleR <- predictions$pruned.labels

# Create a bar plot of number of cells per assigned cell ID
par(mar=c(13, 4, 2, 0))
barplot(table(predictions$pruned.labels), las=2)
圖片.png

# Make a UMAP and add the cell-type annotations
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="singleR")
image.png

Integration as a form of annotation

Another option is to integrate our query data with our reference data. Then we simply transfer the labels from the annotated reference to the neighbouring query cells in the integrated dataset. Clustering the integrated data is a common approach to transferring labels. We demonstrate how this could be done with Harmony below. But the approach would be the same for any integration tool.

Note: the SingleR reference is not single cells, but averages across many cells. Thus we convert and downsample the reference to a single cell object for demonstration purposes. For a real experiment, one would use the original single cells as the reference when integrating datasets.

set.seed(2891)
# Convert reference and query datasets to Seurat Objects

# Add a "counts" slot to the reference SingleCellExperiment object so we can convert it to a Seurat Object
assays(ref_sce)[["counts"]] <- round(2^assays(ref_sce)[["logcounts"]]) -1
colnames(ref_sce) <- paste("cell", 1:ncol(ref_sce))

# Subset both objects so both the reference and query datasets have the same genes
# First subset the reference
ref_seur <- Seurat::as.Seurat(ref_sce[rownames(ref_sce) %in% rownames(query_sce),])
ref_seur@active.ident <- factor(rep("reference", ncol(ref_seur)))
# Now subset the query
query_seur <- Seurat::as.Seurat(query_sce[rownames(query_seur) %in% rownames(ref_sce),])
query_seur@active.ident <- factor(rep("query", ncol(query_seur)))

# Downsample the reference to be similar to query in terms of total UMIs
totalUMI <- median(query_seur@meta.data$nCount_RNA)
ref_seur@assays$RNA@counts <- Seurat::SampleUMI(ref_seur@assays$RNA@counts,
                                                max.umi=totalUMI, upsample=FALSE)

# Merge the datasets together into a single Seurat object
merged_seur <- merge(ref_seur, query_seur)
merged_seur@meta.data$source <- merged_seur@active.ident

# Normalize the combined data
merged_seur <- Seurat::NormalizeData(merged_seur)
# Rather than choosing new variable features, we will choose
# the genes that had been previously important by scmap for consistency
Seurat::VariableFeatures(merged_seur) <- scmap_feature_genes
# Scale the data and run dimensionality reduction on the combined data
merged_seur <- Seurat::ScaleData(merged_seur)
merged_seur <- Seurat::RunPCA(merged_seur)
merged_seur <- Seurat::RunUMAP(merged_seur, dims=1:15)
Seurat::DimPlot(merged_seur, reduction="umap") + ggplot2::ggtitle("Before Integration")
圖片.png
# Run Harmony to remove batch effects
merged_seur <- harmony::RunHarmony(merged_seur, "source", dims.use=1:15)
merged_seur <- Seurat::RunUMAP(merged_seur, dims=1:15, reduction="harmony")
# Plot the data
Seurat::DimPlot(merged_seur, reduction="umap") + ggplot2::ggtitle("After Integration")
圖片.png

Now that the data is integrated we will cluster the data and look at the annotations of the reference cells present in each cluster. As with all clustering, this may require manual tuning of the resolution parameters to get the best labels.

# Cluster the integrated dataset
merged_seur <- Seurat::FindNeighbors(merged_seur, reduction="harmony", dims=1:15)
merged_seur <- Seurat::FindClusters(merged_seur, resolution=0.5)
# Plot the data
Seurat::DimPlot(merged_seur, reduction="umap") + ggplot2::ggtitle("After Integration")
圖片.png
# Create a table of cluster labels based on integrated data
table(merged_seur@meta.data$label.fine, 
        merged_seur@active.ident)

                                     0   1   2   3   4   5   6   7   8   9
  B cells, naive                     0   0   0 106   0   0   0   0   0   0
  Monocytes, CD14+                   0   0 106   0   0   0   0   0   0   0
  Monocytes, CD16+                   0   0   0   0   0   0 105   0   0   0
  NK cells                           0 105   0   0   0   0   0   0   0   0
  T cells, CD4+, memory TREG         0   0   0   0   0   0   0 104   0   0
  T cells, CD4+, naive               4   0   0   0  95   4   0   0   0   0
  T cells, CD4+, naive TREG          1   0   0   0 102   0   0   1   0   0
  T cells, CD4+, naive, stimulated   0   0   0   0   0   0   0   0 102   0
  T cells, CD4+, TFH                96   0   0   0   5   0   0   3   0   0
  T cells, CD4+, Th1               103   0   0   0   1   0   0   0   0   0
  T cells, CD4+, Th1_17            104   0   0   0   0   0   0   0   0   0
  T cells, CD4+, Th17               99   0   0   0   0   0   0   5   0   0
  T cells, CD4+, Th2                96   0   0   0   6   0   0   2   0   0
  T cells, CD8+, naive               0   0   0   0   1 103   0   0   0   0
  T cells, CD8+, naive, stimulated   0   0   0   0   0   1   0   0   0 101

Here we have a table of the reference annotations (across rows) per cluster (across columns). We can manually label the clusters based on this table or we could create a rule to algorithmically label the clusters based on this table. Since there are only 11 clusters, we assign the labels manually.

cluster_labs <- c("0"="ambiguous", 
    "1"="Monocytes, CD14+", 
    "2"="B cells, naive", 
    "3"="T cells, CD4+, naive TREG",
    "4"="T cells, CD4+, Th1_17",
    "5"="NK cells",
    "6"="T cells, CD8+, naive",
    "7"="Monocytes, CD16+",
    "8"="T cells, CD4+, memory TREG",
    "9"="T cells, CD4+, naive, stimulated",
    "10" = "T cells, CD8+, naive, stimulated")

# Assign cluster label to the associated query cells
# (the query cells that had been assigned the same cluster label)
merged_seur@meta.data$annotation <- cluster_labs[merged_seur@meta.data$RNA_snn_res.0.5]

# Add the results to the SingleCellExperiment Object and plot
query_sce$Harmony_lab <- merged_seur@meta.data$annotation[merged_seur@meta.data$source =="query"]
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="Harmony_lab")
image.png

2. Refining / Consensus annotations

Once we have run several tools, we can use the consensus of the labels to get a more robust annotation. In this case we will simply use the most common label across tools to assign the final automatically annotated label.

Hide

annotation_columns <- c("scmap_cluster", "scmap_cell", "singleR", "Harmony_lab")

#Optional check how consistent the labelling was.
#head(colData(query_sce)[,annotation_columns])

get_consensus_label <- function(labels){
    labels <- labels[labels != "ambiguous"]
    if (length(labels) == 0) {return("ambiguous")}
    freq <- table(labels)
    label <- names(freq)[which(freq == max(freq))]
    if (length(label) > 1) {return("ambiguous")}
    return(label)
}

colData(query_sce)$consensus_lab <- apply(colData(query_sce)[,annotation_columns], 1, get_consensus_label)
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="consensus_lab")
image.png

3. Marker-based automatic annotation

An alternative way for annotation of your query scRNAseq dataset is to utilize Marker-based annotation tools. SCINA is a semi-supervised annotation tool that takes in the signature genes and expression matrix and predicts the potential labels based on the prior knowledge of the cell-type-specific markers. List of markers is usually provided in the gmt format. The PBMC gene set used below have been gathered by Diaz-Mejia JJ et al.

download.file("https://zenodo.org/record/3369934/files/pbmc_22_10x.tar.bz2",
              "pbmc_22_10x.tar.bz2")
untar("pbmc_22_10x.tar.bz2")

The extracted data will by located in the following file: ./MY_PAPER/SUPPLEMENTARY_DATA/pbmc_22_10x/pbmc_22_10x_cell_type_signature_gene_sets.gmt

The results from this annotation tool are not used in the above step to find consensus annotations because the lists of marker genes are not consistent with the cell types identified in the reference dataset. This is because these data come from different sources, and would not have been characterizing the exact same set of cells. If you wish for marker-based and reference-based annotation methods to be combined in the above step of automatically determining consensus annotations, you would have to make sure all of the identified cell subtypes are the same and that they are spelt the exact same way in order for R to recognize the names as identical.

# Import the marker genes as a GMT file and store as a variable
markers <- msigdb::read.gmt('./MY_PAPER/SUPPLEMENTARY_DATA/pbmc_22_10x/pbmc_22_10x_cell_type_signature_gene_sets.gmt')
# Convert the expression data from Seurat object into a matrix data structure
exprMatrix <- as.matrix(Seurat::GetAssayData(query_seur))
# Run SCINA on the query data using the marker genes to identify cell types
# Specifying rm_overlap = FALSE allows the same marker gene to specify multiple cell types which
# may be useful if identifying cell subtypes or other similar types of cells
# Specifying allow_unknown = TRUE allows cells to be labeled as "unknown" instead of being
# assigned a low-confident label
predictions.scina = SCINA::SCINA(exp = exprMatrix, signatures = markers$genesets,
                          rm_overlap = FALSE, allow_unknown = TRUE)
# Add SCINA annotation information to each cell in Seurat object
colData(query_sce)$SCINA <- predictions.scina$cell_labels

# Make a UMAP and add the SCINA cell-type annotations
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="SCINA") +
  ggplot2::theme(legend.position = "bottom",
                 legend.text = ggplot2::element_text(size = 4))

image.png

4. Manual annotation

Retrieving marker genes

If you do not have an extensive list of markers per cell type, or a good quality reference dataset, it is useful to extract the top marker genes from each cluster of your query data. We can easily do this in Seurat, with the data formatted as a *Seurat object (which we created earlier and stored as the variable query_seur). First, the data must be normalized and scaled, and the variable genes between cells must be determined.

Hide

query_seur <- Seurat::NormalizeData(query_seur) # Normalize the data
Performing log-normalization
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
query_seur <- Seurat::FindVariableFeatures(query_seur) # Determine the variable features of the dataset
Calculating gene variances
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating feature variances of standardized and clipped values
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
query_seur <- Seurat::ScaleData(query_seur) # Scale the data based on the variable features

Next, different types of dimensionality reduction must be performed on the data so that the cells can be grouped together in 2D space.

query_seur <- Seurat::RunPCA(query_seur)
query_seur <- Seurat::RunTSNE(query_seur)
# RunUMAP has already been performed on the data, so the following line of code
# does not need to be run in this case:
#query_seur <- Seurat::RunUMAP(query_seur, dims = 1:50)

From this object, we can cluster the data at a chosen resolution that can be modified later on if desired.

# Determine the "nearest neighbours" of each cell
query_seur <- Seurat::FindNeighbors(query_seur, dims = 1:50)
Computing nearest neighbor graph
Computing SNN
# Cluster the cells
query_seur <- Seurat::FindClusters(query_seur, resolution = 0.5)
Seurat::DimPlot(query_seur, reduction = "UMAP")
圖片.png

Now let’s extract the top marker genes, and see which ones correspond with each cluster. This can be done using the FindAllMarkers function within Seurat.

markers_seur <- Seurat::FindAllMarkers(query_seur, only.pos = TRUE)
require(dplyr)
# Retrieve the top 5 marker genes per cluster
# Use whichever genes have the highest values under the AVG_LOG column
top5 <- markers_seur %>% group_by(cluster) %>%
  dplyr::slice_max(get(grep("^avg_log", colnames(markers_seur), value = TRUE)),
                   n = 5)
# Create the dot plot
Seurat::DotPlot(query_seur, features = unique(top5$gene)) +
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, vjust = 1,
                                            size = 8, hjust = 1)) +
  Seurat::NoLegend()
圖片.png
# Create the heatmap
Seurat::DoHeatmap(query_seur, features = unique(top5$gene)) +
  Seurat::NoLegend() +
  ggplot2::theme(axis.text.y = ggplot2::element_text(size = 8))
The following features were omitted as they were not found in the scale.data slot for the RNA assay: NOSIP, CD3E, CD3D, IL7R, LDHB
image.png

Pathway analysis

Pathway analysis can also be done for each cluster to determine significantly up- and downregulated pathways based on known gene function. An easy way to do this is by feeding our current Seurat object into cerebroApp. cerebroApp requires that marker genes be fetched again through before performing simple pathway analysis.

# First get marker genes through cerebro
query_seur <- cerebroApp::getMarkerGenes(query_seur,
                                         groups = c('seurat_clusters'),
                                         assay = "RNA",
                                         organism = "hg")
# Get enriched pathways through cerebro
query_seur <- cerebroApp::getEnrichedPathways(query_seur,
                                              databases = c("GO_Biological_Process_2018",
                                                            "GO_Cellular_Component_2018",
                                                            "GO_Molecular_Function_2018",
                                                            "KEGG_2016",
                                                            "WikiPathways_2016",
                                                            "Reactome_2016",
                                                            "Panther_2016",
                                                            "Human_Gene_Atlas",
                                                            "Mouse_Gene_Atlas"),
                                              adj_p_cutoff = 0.05,
                                              max_terms = 100,
                                              URL_API = "http://amp.pharm.mssm.edu/Enrichr/enrich")
圖片.png

生活很好败晴,有你更好

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
禁止轉(zhuǎn)載浓冒,如需轉(zhuǎn)載請通過簡信或評論聯(lián)系作者。
  • 序言:七十年代末尖坤,一起剝皮案震驚了整個濱河市稳懒,隨后出現(xiàn)的幾起案子,更是在濱河造成了極大的恐慌糖驴,老刑警劉巖僚祷,帶你破解...
    沈念sama閱讀 206,013評論 6 481
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件,死亡現(xiàn)場離奇詭異贮缕,居然都是意外死亡辙谜,警方通過查閱死者的電腦和手機,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 88,205評論 2 382
  • 文/潘曉璐 我一進店門感昼,熙熙樓的掌柜王于貴愁眉苦臉地迎上來装哆,“玉大人,你說我怎么就攤上這事定嗓⊥汕伲” “怎么了?”我有些...
    開封第一講書人閱讀 152,370評論 0 342
  • 文/不壞的土叔 我叫張陵宵溅,是天一觀的道長凌简。 經(jīng)常有香客問我,道長恃逻,這世上最難降的妖魔是什么雏搂? 我笑而不...
    開封第一講書人閱讀 55,168評論 1 278
  • 正文 為了忘掉前任藕施,我火速辦了婚禮,結(jié)果婚禮上凸郑,老公的妹妹穿的比我還像新娘裳食。我一直安慰自己,他們只是感情好芙沥,可當我...
    茶點故事閱讀 64,153評論 5 371
  • 文/花漫 我一把揭開白布诲祸。 她就那樣靜靜地躺著,像睡著了一般而昨。 火紅的嫁衣襯著肌膚如雪救氯。 梳的紋絲不亂的頭發(fā)上,一...
    開封第一講書人閱讀 48,954評論 1 283
  • 那天配紫,我揣著相機與錄音径密,去河邊找鬼。 笑死躺孝,一個胖子當著我的面吹牛享扔,可吹牛的內(nèi)容都是我干的。 我是一名探鬼主播植袍,決...
    沈念sama閱讀 38,271評論 3 399
  • 文/蒼蘭香墨 我猛地睜開眼惧眠,長吁一口氣:“原來是場噩夢啊……” “哼!你這毒婦竟也來了于个?” 一聲冷哼從身側(cè)響起氛魁,我...
    開封第一講書人閱讀 36,916評論 0 259
  • 序言:老撾萬榮一對情侶失蹤,失蹤者是張志新(化名)和其女友劉穎厅篓,沒想到半個月后秀存,有當?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體,經(jīng)...
    沈念sama閱讀 43,382評論 1 300
  • 正文 獨居荒郊野嶺守林人離奇死亡羽氮,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點故事閱讀 35,877評論 2 323
  • 正文 我和宋清朗相戀三年或链,在試婚紗的時候發(fā)現(xiàn)自己被綠了。 大學時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片档押。...
    茶點故事閱讀 37,989評論 1 333
  • 序言:一個原本活蹦亂跳的男人離奇死亡澳盐,死狀恐怖,靈堂內(nèi)的尸體忽然破棺而出令宿,到底是詐尸還是另有隱情叼耙,我是刑警寧澤,帶...
    沈念sama閱讀 33,624評論 4 322
  • 正文 年R本政府宣布粒没,位于F島的核電站筛婉,受9級特大地震影響,放射性物質(zhì)發(fā)生泄漏癞松。R本人自食惡果不足惜爽撒,卻給世界環(huán)境...
    茶點故事閱讀 39,209評論 3 307
  • 文/蒙蒙 一冕碟、第九天 我趴在偏房一處隱蔽的房頂上張望。 院中可真熱鬧匆浙,春花似錦、人聲如沸厕妖。這莊子的主人今日做“春日...
    開封第一講書人閱讀 30,199評論 0 19
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽言秸。三九已至软能,卻和暖如春,著一層夾襖步出監(jiān)牢的瞬間举畸,已是汗流浹背查排。 一陣腳步聲響...
    開封第一講書人閱讀 31,418評論 1 260
  • 我被黑心中介騙來泰國打工, 沒想到剛下飛機就差點兒被人妖公主榨干…… 1. 我叫王不留抄沮,地道東北人跋核。 一個月前我還...
    沈念sama閱讀 45,401評論 2 352
  • 正文 我出身青樓,卻偏偏與公主長得像叛买,于是被迫代替她去往敵國和親砂代。 傳聞我的和親對象是個殘疾皇子,可洞房花燭夜當晚...
    茶點故事閱讀 42,700評論 2 345

推薦閱讀更多精彩內(nèi)容