hello,大家好治泥,從今天開始,我們開始走上正軌遮精,分享10X單細胞或者10X空間轉(zhuǎn)錄組的分析內(nèi)容居夹,今天我們分享的內(nèi)容就是做細胞定義的分析策略,文章在Tutorial: guidelines for annotating single-cell transcriptomic maps using automated and manual methods本冲,2021年6月發(fā)表于nature protocols准脂,影響因子10分,還是我們一貫的思路檬洞,先分享文獻內(nèi)容狸膏,最后看看示例代碼。
Abstract(我們總結(jié)一下)
We recommend a three-step workflow including automatic cell annotation (wherever possible), manual cell annotation and verification.(自動細胞注釋(應該是軟件)添怔,人工細胞注釋和確認)湾戳。
Frequently encountered challenges are discussed, as well as strategies to address them.(希望對我們有所幫助)。
Guiding principles and specific recommendations for software tools and resources that can be used for each step are covered, and an R notebook is included to help run the recommended workflow.(看來作者已經(jīng)寫好了示例,當然广料,需要我們有一定的基礎(chǔ))院塞。
Introduction
(1)細胞注釋,我們單細胞分析得到的注釋類似于下圖性昭,注釋的結(jié)果必須是可解釋的并且支持生物學的發(fā)現(xiàn)。
To interpret this map biologically, it is necessary to determine which cell types or cell states are represented by clusters or other patterns observed in the data县遣。These interpretations can then be labeled on the map, which helps place them in a conceptual framework useful for better understanding tissue biology.
簡單回顧一下三種降維的優(yōu)缺點
-
t-SNE:preserves local groups of similar cells糜颠,while equalizing the density of cells within each group(同時均衡每組內(nèi)的細胞密度);global relationships between cell types are not maintained, and thus
cluster-to-cluster relationships cannot be inferred and may be misleading萧求。 - UMAP:UMAP is typically regarded as better for visualizing global relationships and gradients than t-SNE(這個大家應該熟悉)其兴。
- PCA:PCA can be useful for visualizing cell gradients and states,線性降維
(2)細胞注釋的一般步驟automatic annotation, manual annotation and verification夸政,這一部分大家應該都很熟悉了元旬。
First, automatic annotation uses a predefined set of ‘marker genes’ or reference single-cell data to identify and label individual cells or cell clusters by matching their gene expression patterns (signatures) to those of known cell types.(這個應該是所有做細胞定義的分析過程,當然,也很困難)匀归。
A second major step is manual annotation, which involves studying genes and gene functions specific to each cell cluster or pattern to verify automatic cell annotations and identify novel cell types and states.(尤其存在新的細胞類型坑资,更加難以確認)。
Finally, verification can confirm the identity and function of select cell types using independent methods, such as new validation experiments.接下來穆端,喔們逐步分析看看袱贮。
Step 1: automatic cell annotation 兩種方式,1体啰、marker gene 2攒巍、參考集,各有利弊
A major challenge with automatic cell annotation is that many cell types do not have well-characterized gene expression signatures, resulting in incomplete or inaccurate labeling for some cells.Automated methods typically work better for major cell types and may not be able to effectively distinguish subtypes.(細胞數(shù)多的可靠性高)荒勇。自動細胞注釋有助于快速識別已知細胞類型并突出顯示未知細胞類型以供進一步探索柒莉。
Comparison of the caveats and recommendations for different approaches to cell annotation
Stage of analysis | Aspect of analysis | Potential caveats | Recommendation |
---|---|---|---|
Automatic cell annotation | All automatic methods | Fast, but not effective for poorly characterized cells | Use manual annotation for poorly characterized cells |
Annotating clusters | May miss important differences between cells | Use automatic annotation of clusters to get a general idea of cell type and then refine labels manually. In addition, use multiple cluster-based methods and compare results | |
Annotating individual cells | Ideal, but requires high reads per cell | Experiments with low reads per cell require cluster-based annotation | |
Marker-based annotation methods | Marker genes not easily accessible for all cell types; may result in conflicting or absent cell labels | Requires expert knowledge to curate more extensive marker lists | |
Reference-based annotation methods | Perform poorly with incomplete or poorly matched reference data, which may result in conflicting or absent cell labels | Use well-matched reference data or marker-based methods if such data are unavailable | |
Often requires batch correction, which may reduce the accuracy of results | Analyze the reference data for strong biological signals. Use a good experimental protocol that will prevail over batch effects | ||
Mistakes in reference data get carried over to results | Analyze reference data for potential errors before using | ||
Comparing results from different automatic annotation methods | Results may not agree with each other | Compare confidence scores of respective labels and consider label agreement (majority rule); resolve conflicts using manual annotation | |
Consider the possibility of cell subtypes, new cell types or gradients and cell states | |||
Expert manual cell annotation | All manual methods | Slow, labor-intensive | Whenever possible, begin with automatic annotation to determine general cell labels |
Subjective | Work with an expert; consider multiple cell-type conclusions | ||
Marker-based annotation | Cell types not distinguishable by a single marker | Use multiple markers for each cell type | |
Known markers not distinguishing cell types | Curate larger lists of markers from the literature, additional experiments or experts | ||
Conflicting marker gene sets between sources | Select a marker gene set that best represents the biological signal being looked for in the data (e.g., if looking for cell subtypes, use more extensive gene sets than what is used for general cell-type annotation) |
先來看看第一個,基于marker gene 的注釋
To be successful, the marker gene or gene set (a collection of marker genes) should be specifically and consistently expressed in a given cell, cluster or class of cells沽翔。大而全的marker gene列表是必需的兢孝。
(1)To label individual cells(注意這里是定義單個細胞),one of the most reliable markerbased annotation tools is semi-supervised category identification and assignment(半監(jiān)督的策略,SCINA搀擂,這里需要我們認為提供marker gene列表西潘,原理大家需要看看原文章了)。
(2)AUCell is another good marker-based labeling method that classifies individual cells or clusters.(關(guān)于AUcell大家可以參考文章深入理解R包AUcell對于分析單細胞的作用)哨颂。AUCell ranks the genes in each cell by decreasing expression value, and cells are labeled according to their most active (highly expressed) marker gene sets.(原理喷市,了解一下)
(3)gene set variation analysis (GSVA) has been benchmarked to be fast and reliable。GSVA works similarly to AUCell: given a database of marker gene sets, it identifies sets that are enriched in the gene expression profile of a cluster.(關(guān)于GSVA做細胞定義,這個是目前大多數(shù)公司的做法)。
marker 定義細胞類型的最大問題妆艘,A disadvantage of these tools is that markers are not easily accessible for all cell types.
軟件注釋細胞類型的表格
Tool | Type | Language | Resolution | Approach | Allows ‘None’ | Notes |
---|---|---|---|---|---|---|
singleCell Net | Reference based | R | Single cells | Relative-expression gene pairs + random forest | Yes, but rarely does so even when it should | 10–100× slower than other methods; high accuracy |
scmap-cluster | Reference based | R | Single cells | Consistent correlations | Yes | Fastest method available; balances falsepositives and false-negatives; includes web interface for use with a large pre-built reference or custom reference set |
scmap-cell | Reference based | R | Single cells | Approximate nearest neighbors | Yes | Assigns individual cells to nearest neighbor cells in reference; allows mapping of cell trajectories; fast and scalable |
singleR | Reference based | R | Single cells | Hierarchical clustering and Spearman correlations | No | Includes a large marker reference; does not scale to data sets of ≥10,000 cells; includes web interface with marker database |
Scikit-learn | Reference based | Python | Multiple possible | k-nearest neighbors, support vector machine, random forest, nearest mean classifier and linear discriminant analysis | (Optional) | Expertise required for correct design and appropriate training of classifier while avoiding overtraining |
AUCell | Marker based | R | Single cells | Area under the curve to estimate marker gene set enrichment | Yes | Because of low detection rates at the level of single cells, it requires many markers for every cell type |
SCINA | Marker based | R | Single cells | Expectation maximization, Gaussian mixture model | (Optional) | Simultaneously clusters and annotates cells; robust to the inclusion of incorrect marker genes |
GSEA/GSVA | Marker based | R/Java | Clusters of cells | Enrichment test | Yes | Marker gene lists must be reformatted in GMT format. Markers must all be differentially expressed in the same direction in the cluster |
Harmony | Integration | R | Single cells | Iterative clustering and adjustment | Yes | Integrates only lower-dimensional projection of the data; seamlessly(無縫地) integrated into Seurat pipeline; may overcorrect data |
Seurat-canonical correlation analysis | Integration | R | Single cells | MNN anchors + canonical correlation analysis | Yes | Accuracy depends on the accuracy of MNN anchors, which are automatically-identified corresponding cells across data sets |
mnnCorrect | Integration | R | Single cells | MNN pairs + singular value decomposition | Yes | Accuracy depends on the accuracy of MNN pairs (cells matched between data sets). |
Linked inference of genomic experimental relationships (LIGER) | Integration | R | Single cells | Non-negative matrix factorization | Yes | Allows interpretation of data set–specific and shared factors of variation |
接下來看看第二部分 Reference-based automatic cell annotation
this approach is possible only if high-quality and relevant annotated reference single-cell data are available掂恕。目前單細胞已經(jīng)有了一些公共的數(shù)據(jù)庫可以獲取參考集。These atlases typically contain hundreds of thousands of cells and dozens of different annotated cell types垫桂。
這種方法有一個共同的特點,需要一個注釋完整的參考集,一旦參考集是不完整的植酥、缺失的,準確度就會明顯的下降弦牡。
當然友驮,原則上任何做整合分析的方法都可以用于細胞定義,簡單回顧一下常見整合方法的特點驾锰。
- Harmony iteratively merges data sets represented by top PCs, which are then used to cluster cells. Each cell is iteratively adjusted on the basis of an estimated correction vector to shift it closer to the center of its cluster until convergence. MNN approaches, such as mnnCorrect/FastMNN or Seurat v321, identify the most similar cells (MNNs), called ‘a(chǎn)nchors’, across data sets that are used to estimate and correct the cell type–specific batch effects. LIGER identifies shared (common biology) and unique (biological or technical) factors between data sets using non-negative matrix factorization. LIGER is recommended when specific cell types appear to be present in some of the data sets and missing in others. Integration methods can suffer from overcorrection, where different cell types are merged, or undercorrection, when resulting clusters contain cells from only one input data set. Multiple integration methods may need to be evaluated to find a balance that best represents the data.(批次矯正的方法也要根據(jù)情況來判斷)卸留,過度矯正是目前常見的問題
細化自動注釋
Benchmarking studies show variable performance of automatic annotation tools, depending on the data set and distinctiveness of the gene expression profiles of the cell types to be annotated(軟件之間的也不具有統(tǒng)一性)。
For instance, distinguishing T cells from B cells is relatively straightforward, but automatic tools sometimes cannot accurately distinguish CD8+ cytotoxic T cells from natural killer cells(如下圖)
When applying multiple cell annotation methods to a data set, cells or clusters will acquire multiple, sometimes conflicting, cell-type labels.(這也是最大的問題)椭豫。如果定義的一致耻瑟,那么很容易辨別旨指,容易定義結(jié)果存在矛盾,那么每個軟件提供一個可能性分數(shù)to identify a single high-scoring label.但是不同軟件的判定分數(shù)不具有比較性喳整,但是可以根據(jù)定義結(jié)果出現(xiàn)的頻率來加以識別谆构,出現(xiàn)的頻率越高,越可能是某種細胞類型算柳,如果這些都不行低淡,那就只有人工注釋了。
如果注釋存在矛盾瞬项,很可能說明該cluster還有subtype蔗蹋,但是,如果不能明確定義亞型囱淋,則更通用的細胞類型注釋可能更合適猪杭。 這里舉一個例子,F(xiàn)or example, if a cluster is annotated as regulatory T cells, naive T cells and helper T cells by different methods, it may be most appropriate to assign the general label of ‘T cells’.
If the conflicting annotations are not subtypes of the same cell type, then the cluster may represent an intermediate cell state or gene-expression gradient妥衣。(中間狀態(tài)或者基因表達等級皂吮,這個地方很值得挖掘)。
Finally, a cluster may have a novel cell identity that is absent from the reference data. This often results in widely varying results from automatic annotation methods or insufficient confidence for any tool to assign any label. In such situations, manual annotation must be performed.(新的細胞類型就需要我們?nèi)斯ぷ⑨屃耍?/h4>
Step 2: expert manual cell annotation(人工注釋)
人工注釋細胞類型目前是最可靠的方法(gold-standard method)税手,但是蜂筹,it is slow and labor intensive and can be subjective.主要的人工注釋,就是我們?nèi)斯みx擇marker的過程,很繁瑣芦倒。
所有挑選出來的基因必須進行檢驗和可視化(比如dotplot和熱圖)
Challenges in this approach are that well-known markers are often too few in number to completely annotate an scRNA-seq data set, and some well-known markers may not be as specific within an scRNA-seq data set as expected.
set. Master transcription factors that drive cell fate often make better gene expression markers than cell-surface proteins that are commonly used to classify cell populations(轉(zhuǎn)錄因子基因的識別能力更好)艺挪,因為轉(zhuǎn)錄組水平和蛋白水平并不不能很好的關(guān)聯(lián)。
識別同一個細胞類型的marker gene通常是多個兵扬,尤其在定義subcluster的時候麻裳。
The ideal primary source for cell-defining genes is a singlecell atlas from a relevant organism, organ and disease context.(marker基因具有組織、器官器钟、疾病特異性)津坑。
in some instances a cluster may not express markers of any known cell type; conversely, it may express markers of more than one cell type.(這種情況就是低質(zhì)量的細胞、新的細胞類型或者含有subcluster)傲霸。
Once cell-type information from known markers is exhausted, cells that have not been confidently annotated must be manually examined, cluster by cluster.(marker gene無法起到作用的時候疆瑰,就需要人工進一步檢驗了。潛在的marker gene需要用一些軟件來進行差異分析)昙啄。
All marker genes are then manually researched to find functional information that may help identify the cell type of the cluster with which they are associated穆役。(例如通路富集)。
Some cells may be challenging to annotate, including novel cell types, which can be described on the basis of the function of genes they express跟衅。
Annotating cell states and gradients(針對新的細胞類型)
When analyzing and characterizing novel cell types, it is important to determine whether they represent a stable cell type or contain multiple cell states.(穩(wěn)態(tài)還是多種細胞狀態(tài))。
細胞類型和狀態(tài)的定義尚未標準化播歼,但可能預期穩(wěn)定的細胞類型在整個cluster中具有homo基因表達并且聚類在一起伶跷。
whereas cell gradients appear as a spread-out string of cells and cell states掰读,Expression gradients indicate continuous differences that are present in the cell population, which could represent states like the cell cycle, immune activation, spatial patterning or transient developmental stages。識別有意義的細胞狀態(tài)去除批次效應意義重大叭莫。
注釋細胞發(fā)育的中間階段通常很困難蹈集,因為這些區(qū)域很少表達獨特的標記基因。 It is often easier to label the ends of a gradient and then characterize intermediate stages using the order of specific genes that mark these ends as increasing or decreasing across the gradient雇初。
Extracting the cells in the gradient and performing principal component analysis (PCA) on them is often a useful visualization for gradients, because it preserves the large-scale distances between cells(有的軌跡分析軟件就采用這樣的策略)拢肆。
目前沒有可以自動注釋中間態(tài)的方法,不同細胞的層次只能人工識別靖诗,making use of known structure and celltype transitions relevant to the particular experiment郭怪。
Similarly, homogeneous or similar cell states or cell types are often difficult to annotate because they share many of the same marker genes。(這個時候就需要再分群分析了).
Very fine distinctions between highly similar cell types may not be visible transcriptionally and may be visible only in other genomic layers, such as chromatin state (assay for transposase-accessible chromatin using sequencing (ATAC-seq) and DNA methylation).(多組學識別細胞類型也是很重要的一點)刊橘。
Step 3: annotation verification
需要其他的分析輔助驗證了鄙才,包括多組學,SC-ATAC促绵,CNV等等攒庵。
最后,附上一張做細胞定義的軟件總結(jié)
我們來看一下示例代碼
1. Reference-based automatic annotation
Create the Reference
The first step in performing reference-based annotation is to select an annotated dataset to use as the reference. Here we will use one of the references created by the authors of SingleR and show how it can be used with other tools such as scmap.
Other reference datasets can be found in GEO (https://www.ncbi.nlm.nih.gov/geo/) or at a link provided by the authors of the reference dataset. However, to use a dataset as a reference you will need both the single-cell RNA sequencing data and the cell-type annotations. GEO does not require authors to provide the cell-type annotations of their data, so you may need to contact the authors directly to to get the annotations for some datasets.
# Set a random seed to ensure result reproducibility
set.seed(9742)
# Download singleR reference data for immune cells and save it as the variable "ref"
# The variable is a class called "Summarized Experiment"
# This will take a while
ref <- celldex::DatabaseImmuneCellExpressionData()
Next we need to reformat the data to ensure it is compatible with the tool we are using. We will be demonstrating scmap, which uses data formatted as a ‘SingleCellExperiment object’, and assumes by default that gene names are found in a column named ‘feature_symbol’ while the cell-type labels are in a column named ‘cell_type1’. In addition, scmap requires that you normalize and log-transform the reference data; this has already been done for the SingleR reference data so we skip those steps here.
# Assign cell-type labels in a column named "cell_type1"
colData(ref)$cell_type1 <- colData(ref)$label.fine
# Assign gene names in a column called "feature_symbol"
rowData(ref)$feature_symbol <- rownames(ref)
# Convert the data into a SingleCellExperiment object
ref_sce <- SingleCellExperiment::SingleCellExperiment(assays=list(logcounts=Matrix::Matrix(assays(ref)$logcounts)),
colData=colData(ref), rowData=rowData(ref))
Our reference data is ready to be used now. So lets process this data to build the index we will use to map our unlabeled data to. First, we select genes to use, which will be those deemed most informative by scmap after fitting a linear model to the gene expression by gene dropout distribution. Those which are most informative have high expression values and low % dropout rates across cells.
# Create scmap-cluster reference by first selecting the most informative features
ref_sce <- scmap::selectFeatures(ref_sce, suppress_plot=FALSE)
Your object does not contain counts() slot. Dropouts were calculated using logcounts() slot...
# Inspect the first 50 genes selected by scmap
rownames(ref_sce)[which(rowData(ref_sce)$scmap_features)][1:50]
# You can check and see how many genes were chosen by checking the length of the
# vector of gene names
length(rownames(ref_sce)[which(rowData(ref_sce)$scmap_features)])
[1] 500
Now we can see the genes that scmap has chosen to use. If there are key marker genes missing we can make sure they are included like this:
# Create a list of key markers that you want to use
my_key_markers = c("TRAC", "TRBC1", "TRBC2", "TRDC", "TRGC1", "TRGC2", "IGKC")
# Ensure markers are in the list of features used by scmap
rowData(ref_sce)$scmap_features[rownames(ref_sce) %in% my_key_markers] <- TRUE
# You can check and see if this added any genes by checking the length
# of the vector of gene names again
length(rownames(ref_sce)[which(rowData(ref_sce)$scmap_features)])
[1] 502
And we can remove genes that we think might be technical artefacts, such as mitochondria RNAs, like this:
# Create a list of mitochondrial genes from the dataset (genes that begin with "MT")
mt_genes <- rownames(ref_sce)[grep("^MT-", rownames(ref_sce))]
# Remove these genes from the features used by scmap
rowData(ref_sce)$scmap_features[rownames(ref_sce) %in% mt_genes] <- FALSE
# Check how many genes this is
length(rownames(ref_sce)[which(rowData(ref_sce)$scmap_features)])
[1] 495
# Extract the features and assign them to a new variable, "scmap_feature_genes"
scmap_feature_genes <- rownames(ref_sce)[which(rowData(ref_sce)$scmap_features)]
# Note that the number of genes/features is identical to what we just checked
length(scmap_feature_genes)
[1] 495
Now we build the reference profiles used in scmap-cluster, for cluster-based cell-type annotation. These profiles can be accessed and plotted from inside the SingleCellExperiment object as follows:
# Create reference profiles;
# Once reference profiles are generated the original data are
# not needed for scmap-cluster
ref_sce <- scmap::indexCluster(ref_sce)
# Visualize interesting features as a heatmap
# Reformat the data so that they can be used as input to ggplot2
cormat <- reshape2::melt(as.matrix(metadata(ref_sce)$scmap_cluster_index))
# Plot the data
ggplot2::ggplot(cormat, ggplot2::aes(x = Var2, y = Var1, fill = value)) +
ggplot2::geom_tile() +
ggplot2::scale_fill_gradient2(low = "blue", high = "darkred",
name = "Expression value") +
ggplot2::theme_minimal() +
ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, vjust = 1,
size = 18, hjust = 1),
axis.text.y = ggplot2::element_text(size = 15),
axis.title.x = ggplot2::element_blank(),
axis.title.y = ggplot2::element_blank())
# Store expression information as a variable
scmap_cluster_reference <- metadata(ref_sce)$scmap_cluster_index
From here on out, scmap only needs this set of reference profiles. So if working with a very large reference, one could save this index separately to your computer and reload it when annotating new datasets. But since that is not the case here, we will simply save this index to a variable for now.
We will also demonstrate scmap-cell to annotate individual cells of our dataset, so we will create that index as well. As before one would first normalize and log-transform the reference data, and select genes to use. As we have already done that, we need only run the command to build the scmap-cell index. There are two parameters we can set: M and k, increasing M and k will give more accurate mapping but increase the size of the index, and the time needed to map cells. Here we use the defaults (you may see a warning message about the defaults that are being used):
# Update the previous reference to also contain the scmap-cell reference
ref_sce <- scmap::indexCell(ref_sce)
Parameter M was not provided, will use M = n_features / 10 (if n_features <= 1000), where n_features is the number of selected features, and M = 100 otherwise.
Parameter k was not provided, will use k = sqrt(number_of_cells)
# Extract the scmap index from the reference and store as a variable
scmap_cell_reference <- metadata(ref_sce)$scmap_cell_index
# Extract the associated cell IDs from the reference and save as a variable
scmap_cell_metadata <- colData(ref_sce)
scmap-cell assigns cells in one dataset to their “nearest neighbours” in the reference dataset. In this case, the “nearest neighbours” are the cells in the reference dataset most similar to the cells in the query dataset.
One can use any rule they like to transfer information, such as cell-type or pseudotime, from these nearest neighbours to the query data. Thus we need to store the associated metadata (cell type ID) for the reference as well (see above). Now we don’t need to use our original reference dataset anymore.
Assign cells from the query dataset to the reference.
The query dataset we will be using is provided by 10X genomics.
download.file("https://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz",
"pbmc3k_filtered_gene_bc_matrices.tar.gz")
trying URL 'https://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz'
Content type 'application/x-tar' length 7621991 bytes (7.3 MB)
==================================================
downloaded 7.3 MB
untar("pbmc3k_filtered_gene_bc_matrices.tar.gz")
Now we need to load our unlabeled dataset into R. Normal preprocessing including QC filtering, normalizing and log-transforming the data must be done prior to annotating. In addition, scmap is based on the SingleCellExperiment object, so if our data is stored as a Seurat object we must convert it to SingleCellExperiment as shown below.
# This portion of the tutorial is assuming the raw 10X data is in the
# following folder in your directory:
data <- Seurat::Read10X("filtered_gene_bc_matrices/hg19/")
# Make SingleCellExperiment from the raw matrix
query_sce <- SingleCellExperiment::SingleCellExperiment(assays=list(counts=data))
# Make SingleCellExperiment from Seurat object
query_seur <- Seurat::CreateSeuratObject(data)
Feature names cannot have underscores ('_'), replacing with dashes ('-')
query_sce <- Seurat::as.SingleCellExperiment(query_seur)
# normalize the data using the scater package
query_sce <- scater::logNormCounts(query_sce)
# add feature_symbol column (i.e. the gene symbols)
rowData(query_sce)$feature_symbol <- rownames(query_sce)
Now you should have an entry in assays(my_sce)
called logcounts
with the log-normalized matrix. We are now ready to annotate our data with scmap-cluster. Let’s start with scmap-cluster:
# Run scmapCluster
scmap_cluster_res <- scmap::scmapCluster(projection=query_sce,
index_list=list(immune1 = scmap_cluster_reference),
threshold=0.1)
# plot the results of our annotation
par(mar=c(13, 4, 1, 0))
barplot(table(scmap_cluster_res$combined_labs), las=2)
# Store this annotation information within the query object
colData(query_sce)$scmap_cluster <- scmap_cluster_res$combined_labs
# Make a UMAP of the cells, labeled with the cell-type annotations from scmapCluster
query_sce <- scater::runUMAP(query_sce)
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="scmap_cluster")
Alternatively we could use scmap-cell, to find the 10 nearest neighbours to each cell (i.e. the 10 most similar cells to each query cell), then pick the annotation that is most common among the neighbours, like this:
# Determine the 10 nearest neighbours from the reference dataset for each
# cell in the query dataset using scmapCell
nearest_neighbours <- scmap::scmapCell(projection=query_sce,
index_list = list(immune1 = scmap_cell_reference),
w=10)
# Get metadata (cell type IDs) for the neighbours of each cell in the query dataset
mode_label <- function(neighbours, metadata=scmap_cell_metadata$cell_type1) {
freq <- table(metadata[neighbours])
label <- names(freq)[which(freq == max(freq))]
if (length(label) > 1) {return("ambiguous")}
return(label)
}
# Apply these labels to the query cells
scmap_cell_labs <- apply(nearest_neighbours$immune1$cells, 2, mode_label)
# Add the labels to the query object
colData(query_sce)$scmap_cell <- scmap_cell_labs
# Create a bar plot of how many cells in the query dataset were assigned
# a specific label
par(mar=c(10, 4, 0, 0))
barplot(table(scmap_cell_labs), las=2)
# Make a UMAP and add the new cell-type annotations
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="scmap_cell")
Another option compatible with the SingleCellExperiment Object is SingleR. As before, we need a reference and a query dataset. In the case of SingleR, we need the entirety of the reference dataset, rather than generating a compressed reference index as we did with scmap. In addition, running just this small example demonstrates the difference in run time between the methods (SingleR takes a fair bit of time).
# Run SingleR on the query data and the reference to acquire
# cell-type predictions for the cells in the query dataset
predictions <- SingleR::SingleR(test=query_sce, ref=ref, labels=ref$label.fine)
# You'll notice that some of the cells didn't get assigned a cell identity
# We can count the number here:
sum(is.na(predictions$pruned.labels))
[1] 23
# Change NAs to "ambiguous"
predictions$pruned.labels[which(is.na(predictions$pruned.labels))] <- "ambiguous"
# Add singleR labels to query_sce
colData(query_sce)$singleR <- predictions$pruned.labels
# Create a bar plot of number of cells per assigned cell ID
par(mar=c(13, 4, 2, 0))
barplot(table(predictions$pruned.labels), las=2)
# Make a UMAP and add the cell-type annotations
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="singleR")
Integration as a form of annotation
Another option is to integrate our query data with our reference data. Then we simply transfer the labels from the annotated reference to the neighbouring query cells in the integrated dataset. Clustering the integrated data is a common approach to transferring labels. We demonstrate how this could be done with Harmony below. But the approach would be the same for any integration tool.
Note: the SingleR reference is not single cells, but averages across many cells. Thus we convert and downsample the reference to a single cell object for demonstration purposes. For a real experiment, one would use the original single cells as the reference when integrating datasets.
set.seed(2891)
# Convert reference and query datasets to Seurat Objects
# Add a "counts" slot to the reference SingleCellExperiment object so we can convert it to a Seurat Object
assays(ref_sce)[["counts"]] <- round(2^assays(ref_sce)[["logcounts"]]) -1
colnames(ref_sce) <- paste("cell", 1:ncol(ref_sce))
# Subset both objects so both the reference and query datasets have the same genes
# First subset the reference
ref_seur <- Seurat::as.Seurat(ref_sce[rownames(ref_sce) %in% rownames(query_sce),])
ref_seur@active.ident <- factor(rep("reference", ncol(ref_seur)))
# Now subset the query
query_seur <- Seurat::as.Seurat(query_sce[rownames(query_seur) %in% rownames(ref_sce),])
query_seur@active.ident <- factor(rep("query", ncol(query_seur)))
# Downsample the reference to be similar to query in terms of total UMIs
totalUMI <- median(query_seur@meta.data$nCount_RNA)
ref_seur@assays$RNA@counts <- Seurat::SampleUMI(ref_seur@assays$RNA@counts,
max.umi=totalUMI, upsample=FALSE)
# Merge the datasets together into a single Seurat object
merged_seur <- merge(ref_seur, query_seur)
merged_seur@meta.data$source <- merged_seur@active.ident
# Normalize the combined data
merged_seur <- Seurat::NormalizeData(merged_seur)
# Rather than choosing new variable features, we will choose
# the genes that had been previously important by scmap for consistency
Seurat::VariableFeatures(merged_seur) <- scmap_feature_genes
# Scale the data and run dimensionality reduction on the combined data
merged_seur <- Seurat::ScaleData(merged_seur)
merged_seur <- Seurat::RunPCA(merged_seur)
merged_seur <- Seurat::RunUMAP(merged_seur, dims=1:15)
Seurat::DimPlot(merged_seur, reduction="umap") + ggplot2::ggtitle("Before Integration")
# Run Harmony to remove batch effects
merged_seur <- harmony::RunHarmony(merged_seur, "source", dims.use=1:15)
merged_seur <- Seurat::RunUMAP(merged_seur, dims=1:15, reduction="harmony")
# Plot the data
Seurat::DimPlot(merged_seur, reduction="umap") + ggplot2::ggtitle("After Integration")
Now that the data is integrated we will cluster the data and look at the annotations of the reference cells present in each cluster. As with all clustering, this may require manual tuning of the resolution parameters to get the best labels.
# Cluster the integrated dataset
merged_seur <- Seurat::FindNeighbors(merged_seur, reduction="harmony", dims=1:15)
merged_seur <- Seurat::FindClusters(merged_seur, resolution=0.5)
# Plot the data
Seurat::DimPlot(merged_seur, reduction="umap") + ggplot2::ggtitle("After Integration")
# Create a table of cluster labels based on integrated data
table(merged_seur@meta.data$label.fine,
merged_seur@active.ident)
0 1 2 3 4 5 6 7 8 9
B cells, naive 0 0 0 106 0 0 0 0 0 0
Monocytes, CD14+ 0 0 106 0 0 0 0 0 0 0
Monocytes, CD16+ 0 0 0 0 0 0 105 0 0 0
NK cells 0 105 0 0 0 0 0 0 0 0
T cells, CD4+, memory TREG 0 0 0 0 0 0 0 104 0 0
T cells, CD4+, naive 4 0 0 0 95 4 0 0 0 0
T cells, CD4+, naive TREG 1 0 0 0 102 0 0 1 0 0
T cells, CD4+, naive, stimulated 0 0 0 0 0 0 0 0 102 0
T cells, CD4+, TFH 96 0 0 0 5 0 0 3 0 0
T cells, CD4+, Th1 103 0 0 0 1 0 0 0 0 0
T cells, CD4+, Th1_17 104 0 0 0 0 0 0 0 0 0
T cells, CD4+, Th17 99 0 0 0 0 0 0 5 0 0
T cells, CD4+, Th2 96 0 0 0 6 0 0 2 0 0
T cells, CD8+, naive 0 0 0 0 1 103 0 0 0 0
T cells, CD8+, naive, stimulated 0 0 0 0 0 1 0 0 0 101
Here we have a table of the reference annotations (across rows) per cluster (across columns). We can manually label the clusters based on this table or we could create a rule to algorithmically label the clusters based on this table. Since there are only 11 clusters, we assign the labels manually.
cluster_labs <- c("0"="ambiguous",
"1"="Monocytes, CD14+",
"2"="B cells, naive",
"3"="T cells, CD4+, naive TREG",
"4"="T cells, CD4+, Th1_17",
"5"="NK cells",
"6"="T cells, CD8+, naive",
"7"="Monocytes, CD16+",
"8"="T cells, CD4+, memory TREG",
"9"="T cells, CD4+, naive, stimulated",
"10" = "T cells, CD8+, naive, stimulated")
# Assign cluster label to the associated query cells
# (the query cells that had been assigned the same cluster label)
merged_seur@meta.data$annotation <- cluster_labs[merged_seur@meta.data$RNA_snn_res.0.5]
# Add the results to the SingleCellExperiment Object and plot
query_sce$Harmony_lab <- merged_seur@meta.data$annotation[merged_seur@meta.data$source =="query"]
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="Harmony_lab")
2. Refining / Consensus annotations
Once we have run several tools, we can use the consensus of the labels to get a more robust annotation. In this case we will simply use the most common label across tools to assign the final automatically annotated label.
Hide
annotation_columns <- c("scmap_cluster", "scmap_cell", "singleR", "Harmony_lab")
#Optional check how consistent the labelling was.
#head(colData(query_sce)[,annotation_columns])
get_consensus_label <- function(labels){
labels <- labels[labels != "ambiguous"]
if (length(labels) == 0) {return("ambiguous")}
freq <- table(labels)
label <- names(freq)[which(freq == max(freq))]
if (length(label) > 1) {return("ambiguous")}
return(label)
}
colData(query_sce)$consensus_lab <- apply(colData(query_sce)[,annotation_columns], 1, get_consensus_label)
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="consensus_lab")
3. Marker-based automatic annotation
An alternative way for annotation of your query scRNAseq dataset is to utilize Marker-based annotation tools. SCINA is a semi-supervised annotation tool that takes in the signature genes and expression matrix and predicts the potential labels based on the prior knowledge of the cell-type-specific markers. List of markers is usually provided in the gmt format. The PBMC gene set used below have been gathered by Diaz-Mejia JJ et al.
download.file("https://zenodo.org/record/3369934/files/pbmc_22_10x.tar.bz2",
"pbmc_22_10x.tar.bz2")
untar("pbmc_22_10x.tar.bz2")
The extracted data will by located in the following file: ./MY_PAPER/SUPPLEMENTARY_DATA/pbmc_22_10x/pbmc_22_10x_cell_type_signature_gene_sets.gmt
The results from this annotation tool are not used in the above step to find consensus annotations because the lists of marker genes are not consistent with the cell types identified in the reference dataset. This is because these data come from different sources, and would not have been characterizing the exact same set of cells. If you wish for marker-based and reference-based annotation methods to be combined in the above step of automatically determining consensus annotations, you would have to make sure all of the identified cell subtypes are the same and that they are spelt the exact same way in order for R to recognize the names as identical.
# Import the marker genes as a GMT file and store as a variable
markers <- msigdb::read.gmt('./MY_PAPER/SUPPLEMENTARY_DATA/pbmc_22_10x/pbmc_22_10x_cell_type_signature_gene_sets.gmt')
# Convert the expression data from Seurat object into a matrix data structure
exprMatrix <- as.matrix(Seurat::GetAssayData(query_seur))
# Run SCINA on the query data using the marker genes to identify cell types
# Specifying rm_overlap = FALSE allows the same marker gene to specify multiple cell types which
# may be useful if identifying cell subtypes or other similar types of cells
# Specifying allow_unknown = TRUE allows cells to be labeled as "unknown" instead of being
# assigned a low-confident label
predictions.scina = SCINA::SCINA(exp = exprMatrix, signatures = markers$genesets,
rm_overlap = FALSE, allow_unknown = TRUE)
# Add SCINA annotation information to each cell in Seurat object
colData(query_sce)$SCINA <- predictions.scina$cell_labels
# Make a UMAP and add the SCINA cell-type annotations
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="SCINA") +
ggplot2::theme(legend.position = "bottom",
legend.text = ggplot2::element_text(size = 4))
4. Manual annotation
Retrieving marker genes
If you do not have an extensive list of markers per cell type, or a good quality reference dataset, it is useful to extract the top marker genes from each cluster of your query data. We can easily do this in Seurat, with the data formatted as a *Seurat object (which we created earlier and stored as the variable query_seur
). First, the data must be normalized and scaled, and the variable genes between cells must be determined.
Hide
query_seur <- Seurat::NormalizeData(query_seur) # Normalize the data
Performing log-normalization
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
query_seur <- Seurat::FindVariableFeatures(query_seur) # Determine the variable features of the dataset
Calculating gene variances
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating feature variances of standardized and clipped values
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
query_seur <- Seurat::ScaleData(query_seur) # Scale the data based on the variable features
Next, different types of dimensionality reduction must be performed on the data so that the cells can be grouped together in 2D space.
query_seur <- Seurat::RunPCA(query_seur)
query_seur <- Seurat::RunTSNE(query_seur)
# RunUMAP has already been performed on the data, so the following line of code
# does not need to be run in this case:
#query_seur <- Seurat::RunUMAP(query_seur, dims = 1:50)
From this object, we can cluster the data at a chosen resolution that can be modified later on if desired.
# Determine the "nearest neighbours" of each cell
query_seur <- Seurat::FindNeighbors(query_seur, dims = 1:50)
Computing nearest neighbor graph
Computing SNN
# Cluster the cells
query_seur <- Seurat::FindClusters(query_seur, resolution = 0.5)
Seurat::DimPlot(query_seur, reduction = "UMAP")
Now let’s extract the top marker genes, and see which ones correspond with each cluster. This can be done using the FindAllMarkers function within Seurat.
markers_seur <- Seurat::FindAllMarkers(query_seur, only.pos = TRUE)
require(dplyr)
# Retrieve the top 5 marker genes per cluster
# Use whichever genes have the highest values under the AVG_LOG column
top5 <- markers_seur %>% group_by(cluster) %>%
dplyr::slice_max(get(grep("^avg_log", colnames(markers_seur), value = TRUE)),
n = 5)
# Create the dot plot
Seurat::DotPlot(query_seur, features = unique(top5$gene)) +
ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, vjust = 1,
size = 8, hjust = 1)) +
Seurat::NoLegend()
# Create the heatmap
Seurat::DoHeatmap(query_seur, features = unique(top5$gene)) +
Seurat::NoLegend() +
ggplot2::theme(axis.text.y = ggplot2::element_text(size = 8))
The following features were omitted as they were not found in the scale.data slot for the RNA assay: NOSIP, CD3E, CD3D, IL7R, LDHB
Pathway analysis
Pathway analysis can also be done for each cluster to determine significantly up- and downregulated pathways based on known gene function. An easy way to do this is by feeding our current Seurat object into cerebroApp. cerebroApp requires that marker genes be fetched again through before performing simple pathway analysis.
# First get marker genes through cerebro
query_seur <- cerebroApp::getMarkerGenes(query_seur,
groups = c('seurat_clusters'),
assay = "RNA",
organism = "hg")
# Get enriched pathways through cerebro
query_seur <- cerebroApp::getEnrichedPathways(query_seur,
databases = c("GO_Biological_Process_2018",
"GO_Cellular_Component_2018",
"GO_Molecular_Function_2018",
"KEGG_2016",
"WikiPathways_2016",
"Reactome_2016",
"Panther_2016",
"Human_Gene_Atlas",
"Mouse_Gene_Atlas"),
adj_p_cutoff = 0.05,
max_terms = 100,
URL_API = "http://amp.pharm.mssm.edu/Enrichr/enrich")
生活很好败晴,有你更好