single cell
lixj
- About single-cell sequencing
- Analysis
- Data
- Build scater object
- Quality control
- Producing diagnostic plots for QC
- Identify confounding factors
- Normalization
- Identifying HVGs and blocking on cell cycle
- Dimensionality reduction
- Clustering cells
- Detecting marker genes between subpopulations
- Cellular trajectory
- Gene regulatory network–WGCNA+Cytoscape tool.
- For all the analyses, we can also use another pipeline
About single-cell sequencing
Bulk RNA-seq
- Measures the average expression level for each gene across a large population of input cells
- Useful for comparative transcriptomics, e.g. samples of the same tissue from different species
- Useful for quantifying expression signatures from ensembles
- Insufficient for studying heterogeneous systems, e.g. early development studies, complex tissues (brain)
- Does not provide insights into the stochastic nature of gene expression
scRNA-seq
- A new technology; the first publication was by Tang et al. (https://www.nature.com/articles/nmeth.1315)
- Measures the distribution of expression levels for each gene across a population of cells
- Allows us to study new biological questions in which cell-specific changes in the transcriptome are important, such as cell type identification, heterogeneity of cell responses, stochasticity of gene expression, and inference of gene regulatory networks across cells.
- Datasets range from 10^2 to 10^6 cells and increase in size every year
- Currently there are several different protocols in use, e.g. SMART-seq2 and Drop-seq
- There are also commercial platforms available, including the Fluidigm C1, Wafergen ICELL8 and the 10X Genomics Chromium
History
Workflow
Challenges
The main difference between bulk and single cell RNA-seq is that each sequencing library represents a single cell, instead of a population of cells. Therefore, significant attention has to be paid to comparison of the results from different cells (sequencing libraries). The main sources of discrepancy between the libraries are:
- Amplification (up to 1 million fold)
- Gene ‘dropouts’ in which a gene is observed at a moderate expression level in one cell but is not detected in another cell
In both cases the discrepancies are introduced due to low starting amounts of transcripts since the RNA comes from one cell only. Improving the transcript capture efficiency and reducing the amplification bias are currently active areas of research. However, as we shall see in this course, it is possible to alleviate some of these issues through proper normalization and corrections.
Analysis
Data
Due to the large size and sparsity of 10X data (up to 90% of the expression matrix may be zeros), it is typically stored as a sparse matrix. The default output format from CellRanger is an .mtx file, which stores this sparse matrix as a column of row coordinates, a column of column coordinates, and a column of expression values > 0. Only the coordinates are stored in the .mtx file; the names of each row and column are stored separately in the genes.tsv and barcodes.tsv files, respectively. Here we use 10X data from HSC development at day 10.
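As a minimal sketch (assuming the three CellRanger output files, conventionally matrix.mtx, genes.tsv and barcodes.tsv, sit in the working directory), the sparse matrix can also be read manually with the Matrix package instead of read10xCounts():
suppressPackageStartupMessages(library(Matrix))
counts <- readMM("matrix.mtx") # parses the .mtx triplet format into a sparse matrix
rownames(counts) <- read.delim("genes.tsv", header=FALSE)$V1 # column 1 = Ensembl IDs, column 2 = symbols
colnames(counts) <- read.delim("barcodes.tsv", header=FALSE)$V1 # one barcode per cell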
Build scater object
suppressPackageStartupMessages(library(scater))
data("sc_example_counts")
data("sc_example_cell_info")
## ----quickstart-make-sce, results='hide'-----------------------------------
gene_df <- DataFrame(Gene = rownames(sc_example_counts))
rownames(gene_df) <- gene_df$Gene
example_sce <- SingleCellExperiment(assays = list(counts = sc_example_counts),
                                    colData = sc_example_cell_info,
                                    rowData = gene_df)
Also, there is an easier way to create a scater object: use the DropletUtils package.
#suppressPackageStartupMessages(library(DropletUtils))
#sce <- read10xCounts("C:/Users/soleil/Desktop/single cell sequencing")
sce
## class: SingleCellExperiment
## dim: 33694 4890
## metadata(0):
## assays(1): counts
## rownames(33694): ENSG00000243485 ENSG00000237613 ...
## ENSG00000277475 ENSG00000268674
## rowData names(2): ID Symbol
## colnames(4890): AAACCTGAGAATCTCC-5 AAACCTGAGCTAGCCC-5 ...
## TTTGTCATCATACGGT-5 TTTGTCATCGTTACAG-5
## colData names(2): Sample Barcode
## reducedDimNames(0):
## spikeNames(0):
This shows the SingleCellExperiment class for storing single-cell sequencing data. As in bulk RNA-seq, the assays slot stores the count data, rowData stores the features (genes), and colData stores the samples (cells).
Quality control
On cell
This matrix should be examined to remove poor quality cells which were not detected in either read QC or mapping QC steps. Failure to remove low quality cells at this stage may add technical noise which has the potential to obscure the biological signals of interest in the downstream analysis.
The calculateQCMetrics function computes a number of quality control metrics for each cell and feature, stored in the colData and rowData respectively. By default, the QC metrics are computed from the count data, but this can be changed through the exprs_values argument. Control sets can be defined for features (e.g., spike-in transcripts, mitochondrial genes) or cells (e.g., empty wells, visually damaged cells). The function will subsequently compute metrics for each control set.
suppressPackageStartupMessages(library(scater))
rownames(sce) <- rowData(sce)$Symbol
mito.gene <- grep("^MT-",rownames(sce))
sce <- calculateQCMetrics(sce,feature_controls = list(Mito=mito.gene))
colnames(colData(sce))
## [1] "Sample"
## [2] "Barcode"
## [3] "is_cell_control"
## [4] "total_features_by_counts"
## [5] "log10_total_features_by_counts"
## [6] "total_counts"
## [7] "log10_total_counts"
## [8] "pct_counts_in_top_50_features"
## [9] "pct_counts_in_top_100_features"
## [10] "pct_counts_in_top_200_features"
## [11] "pct_counts_in_top_500_features"
## [12] "total_features_by_counts_endogenous"
## [13] "log10_total_features_by_counts_endogenous"
## [14] "total_counts_endogenous"
## [15] "log10_total_counts_endogenous"
## [16] "pct_counts_endogenous"
## [17] "pct_counts_in_top_50_features_endogenous"
## [18] "pct_counts_in_top_100_features_endogenous"
## [19] "pct_counts_in_top_200_features_endogenous"
## [20] "pct_counts_in_top_500_features_endogenous"
## [21] "total_features_by_counts_feature_control"
## [22] "log10_total_features_by_counts_feature_control"
## [23] "total_counts_feature_control"
## [24] "log10_total_counts_feature_control"
## [25] "pct_counts_feature_control"
## [26] "pct_counts_in_top_50_features_feature_control"
## [27] "pct_counts_in_top_100_features_feature_control"
## [28] "pct_counts_in_top_200_features_feature_control"
## [29] "pct_counts_in_top_500_features_feature_control"
## [30] "total_features_by_counts_Mito"
## [31] "log10_total_features_by_counts_Mito"
## [32] "total_counts_Mito"
## [33] "log10_total_counts_Mito"
## [34] "pct_counts_Mito"
## [35] "pct_counts_in_top_50_features_Mito"
## [36] "pct_counts_in_top_100_features_Mito"
## [37] "pct_counts_in_top_200_features_Mito"
## [38] "pct_counts_in_top_500_features_Mito"
colnames(rowData(sce))
## [1] "ID" "Symbol"
## [3] "is_feature_control" "is_feature_control_Mito"
## [5] "mean_counts" "log10_mean_counts"
## [7] "n_cells_by_counts" "pct_dropout_by_counts"
## [9] "total_counts" "log10_total_counts"
Cell-level QC metrics
- total_counts: total number of counts for the cell (library size)
- log10_total_counts: total_counts on the log10 scale
- total_counts_feature_control: total number of counts for the cell that come from (a set of user-defined) control features. Defaults to zero if no control features are indicated.
- log10_total_counts_feature_control: total number of counts from control features on the log10 scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid infinite values) if no control features are indicated.
- total_features_by_counts: the number of features for the cell that have counts above the detection limit (default of zero).
- pct_counts_X: percentage of all counts that come from the feature control set named X.
Feature-level QC metrics
- mean_counts: the mean count of the gene/feature.
- pct_dropout_by_counts: the percentage of cells with counts of zero for each gene.
- pct_counts_Y: percentage of all counts that come from the cell control set named Y.
Two common measures of cell quality are the library size and the number of expressed features in each library. The library size is defined as the total sum of counts across all features, i.e., genes and spike-in transcripts. Cells with relatively small library sizes are considered to be of low quality as the RNA has not been efficiently captured (i.e., converted into cDNA and amplified) during library preparation. The number of expressed features in each cell is defined as the number of features with non-zero counts for that cell. Any cell with very few expressed genes is likely to be of poor quality as the diverse transcript population has not been successfully captured.
par(mfrow=c(1,2))
hist(sce$total_counts/1e6, xlab="Library sizes (millions)", main="",
breaks=20, col="grey80", ylab="Number of cells")
hist(sce$total_features_by_counts, xlab="Number of expressed genes", main="",
breaks=20, col="grey80", ylab="Number of cells")
To obtain an adaptive threshold, we assume that most of the dataset consists of high-quality cells. We remove cells with log-library sizes that are more than 3 median absolute deviations (MADs) below the median log-library size. (A log-transformation improves resolution at small values, especially when the MAD of the raw values is comparable to or greater than the median.) We also remove cells where the log-transformed number of expressed genes is 3 MADs below the median.
libsize.drop <- isOutlier(sce$total_counts, nmads=3, type="lower", log=TRUE)
feature.drop <- isOutlier(sce$total_features_by_counts, nmads=3, type="lower", log=TRUE)
Another measure of quality is the proportion of reads mapped to genes in the mitochondrial genome. High proportions are indicative of poor-quality cells, possibly because of increased apoptosis and/or loss of cytoplasmic RNA from lysed cells.
hist(sce$pct_counts_feature_control, xlab="Mitochondrial proportion (%)",
ylab="Number of cells", breaks=50, main="", col="grey80")
Again, the ideal threshold for these proportions depends on the cell type and the experimental protocol. Cells with more mitochondria or more mitochondrial activity may naturally have larger mitochondrial proportions. If we assume that most cells in the dataset are of high quality, then the threshold can be set to remove any large outliers from the distribution of proportions. We use the MAD-based definition of outliers to remove putative low-quality cells from the dataset.
mito.drop <- isOutlier(sce$pct_counts_feature_control, nmads=3, type="higher")
Subsetting by column will retain only the high-quality cells that pass each filter described above. We examine the number of cells removed by each filter as well as the total number of retained cells. Removal of a substantial proportion of cells (> 10%) may be indicative of an overall issue with data quality. It may also reflect genuine biology in extreme cases (e.g., low numbers of expressed genes in erythrocytes) for which the filters described here are inappropriate.
sce <- sce[,!(libsize.drop | feature.drop | mito.drop)]
data.frame(ByLibSize=sum(libsize.drop), ByFeature=sum(feature.drop),
ByMito=sum(mito.drop), Remaining=ncol(sce))
## ByLibSize ByFeature ByMito Remaining
## 1 0 71 140 4684
The filters above are applied manually. An alternative approach to quality control is to perform a principal components analysis (PCA) based on the quality metrics for each cell, e.g., the total number of reads, the total number of detected features and the proportion of mitochondrial counts. Outliers on a PCA plot may be indicative of low-quality cells that have aberrant technical properties compared to the (presumed) majority of high-quality cells.
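A minimal sketch of this idea, assuming the colData columns created by calculateQCMetrics() above (the choice of metrics and the 3-MAD cutoff are illustrative):
qc.metrics <- as.matrix(colData(sce)[, c("log10_total_counts",
    "log10_total_features_by_counts", "pct_counts_Mito")])
qc.pca <- prcomp(qc.metrics, scale.=TRUE) # PCA on the per-cell QC metrics
# flag cells that are extreme on either of the first two components
pca.outlier <- isOutlier(qc.pca$x[,1], nmads=3) | isOutlier(qc.pca$x[,2], nmads=3)
summary(pca.outlier)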
On gene
Low-abundance genes are problematic as zero or near-zero counts do not contain enough information for reliable statistical inference. In addition, the discreteness of the counts may interfere with downstream statistical procedures, e.g., by compromising the accuracy of continuous approximations. Here, low-abundance genes are defined as those with an average count below a filter threshold (0.04 in the code below; the exact value is dataset-dependent). These genes are likely to be dominated by drop-out events, which limits their usefulness in later analyses. Removal of these genes mitigates discreteness and reduces the amount of computational work without major loss of information.
ave.counts <- calcAverage(sce)
hist(log10(ave.counts), breaks=100, main="", col="grey80",
xlab=expression(Log[10]~"average count"))
abline(v=log10(0.04), col="blue", lwd=2, lty=2)
keep <- ave.counts >= 0.04
sum(keep)
## [1] 12308
sce <- sce[keep,]
An alternative approach to gene filtering is to select genes that have non-zero counts in at least n cells. This provides some more protection against genes with outlier expression patterns, i.e., strong expression in only one or two cells. Such outliers are typically uninteresting as they can arise from amplification artifacts that are not replicable across cells. (The exception is for studies involving rare cells where the outliers may be biologically relevant.)
numcells <- nexprs(sce, byrow=TRUE)
n <- 10 # for example, require non-zero counts in at least 10 cells
alt.keep <- numcells >= n
Producing diagnostic plots for QC
Examining the most expressed features
plotHighestExprs(sce, exprs_values = "counts")
The plot shows the top 50 (by default) most-expressed features. Each row in the plot corresponds to a gene, and each bar corresponds to the expression of that gene in a single cell. The circle indicates the median expression of each gene, by which the genes are sorted.
This should generally be dominated by constitutively expressed transcripts, such as those for ribosomal or mitochondrial proteins. The presence of other classes of features may be cause for concern if they are not consistent with expected biology. For example, a top set containing many spike-in transcripts suggests that too much spike-in RNA was added during library preparation, while the absence of ribosomal proteins and/or the presence of their pseudogenes are indicative of suboptimal alignment.
Frequency of expression as a function of the mean
Another useful plot is that of the frequency of expression (i.e., number of cells with non-zero expression) against the mean. These two metrics should be positively correlated with each other for most genes.
plotExprsFreqVsMean(sce)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
And other plots.
Identify confounding factors
There is a large number of potential confounders, artifacts and biases in scRNA-seq data. One of the main challenges in analyzing scRNA-seq data stems from the fact that it is difficult to carry out a true technical replicate to distinguish biological and technical variability. Examples include batch effects, cell cycle and other factors.
Batch effect
Batch effects can arise from differences in reagents, isolation methods, the lab/experimenter who performed the experiment, and even the day/time at which the experiment was performed. We can use a PCA plot colored by batch to identify batch effects. This example dataset does not have batches, so I skip this step; if your data do have batches, identify these factors carefully. There are several methods to deal with batch effects, e.g. BASiCS (https://github.com/catavallejos/BASiCS), scLVM (https://github.com/PMBio/scLVM) and RUVg (http://bioconductor.org/packages/release/bioc/html/RUVSeq.html), each using a different noise model and a different fitting procedure.
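As a sketch of what that check would look like (not run here, since this dataset has no batch variable; "batch" is a hypothetical colData column added at import time):
# after normalization, color a PCA plot by the hypothetical batch label
# plotPCA(sce, colour_by = "batch")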
Alternatively, one can identify genes which exhibit significant variation beyond technical noise (eg. Distance to median, Highly variable genes).
Cell cycle
Cell cycle phase is usually uninteresting in studies focusing on other aspects of biology. However, the effects of cell cycle on the expression profile can mask other effects and interfere with the interpretation of the results. We can use the scran package to analyze the cell cycle.
suppressPackageStartupMessages(library(scran))
suppressPackageStartupMessages(library(org.Hs.eg.db))
hs.pairs <- readRDS(system.file("exdata", "human_cycle_markers.rds", package="scran"))
anno <- select(org.Hs.eg.db, keys=rownames(sce), keytype="SYMBOL", column="ENSEMBL")
## 'select()' returned 1:many mapping between keys and columns
ensembl <- anno$ENSEMBL[match(rownames(sce), anno$SYMBOL)]
assignments <- cyclone(sce, hs.pairs, gene.names=ensembl)
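A quick sketch for inspecting the cyclone output before using it downstream (the phases and per-phase scores returned by cyclone are what we block on below):
table(assignments$phases) # number of cells assigned to G1, S and G2M
plot(assignments$score$G1, assignments$score$G2M,
    xlab="G1 score", ylab="G2/M score", pch=16)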
Normalization
After we have identified important confounding factors and explanatory variables, we should account for these variables in subsequent statistical models or condition them out.
Library sizes vary because scRNA-seq data are often sequenced on highly multiplexed platforms, so the total number of reads derived from each cell may differ substantially. Many methods to correct for library size have been developed for bulk RNA-seq and can be applied equally to scRNA-seq (e.g. UQ, SF, CPM, RPKM, FPKM, TPM). scater allows one to account for these variables in subsequent statistical models or to condition them out using normaliseExprs(), if desired, by providing a design matrix to normaliseExprs().
See also: http://www.reibang.com/p/5888284ef011
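As a one-line sketch of the simplest library-size-style correction mentioned above (counts per million; the scran size-factor approach below is generally preferable for scRNA-seq):
assay(sce, "cpm") <- calculateCPM(sce) # store CPM values as an extra assay alongside the counts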
We can also use scran. If the data set is large, compute clusters first so that size factors are estimated within clusters.
clusters <- quickCluster(sce)
sce <- computeSumFactors(sce, cluster=clusters)
summary(sizeFactors(sce))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2676 0.6633 0.8820 1.0000 1.1978 5.1957
scran sometimes calculates negative or zero size factors, which would completely distort the normalized expression matrix, so we should check the size factors scran has computed. If you find that scran has calculated negative size factors, try increasing the cluster and pool sizes until they are all positive.
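A small sketch of that check (the pool sizes in the commented line are illustrative):
any(sizeFactors(sce) <= 0) # should be FALSE
# if TRUE, re-run with larger pools, e.g.:
# sce <- computeSumFactors(sce, cluster=clusters, sizes=seq(40, 100, 20))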
plot(sizeFactors(sce), sce$total_counts, log="xy",
ylab="Library size", xlab="Size factor")
The size factors are well correlated against the library sizes, indicating that capture efficiency and sequencing depth are the major biases.
sce <- normalize(sce)
Finally, we compute normalized log-expression values. There is no need to call computeSpikeFactors() here, as there are no spike-in transcripts available.
Identifying HVGs and blocking on cell cycle
We identify HVGs to focus on the genes that are driving heterogeneity across the population of cells. This requires estimation of the variance in expression for each gene, followed by decomposition of the variance into biological and technical components. HVGs are then identified as those genes with the largest biological components. This avoids prioritizing genes that are highly variable due to technical factors such as sampling noise during RNA capture and library preparation.
See also: http://www.reibang.com/p/9314a1209d7d
Ideally, the technical component would be estimated by fitting a mean-variance trend to the spike-in transcripts using the trendVar function. Recall that the same set of spike-ins was added in the same quantity to each cell. This means that the spike-in transcripts should exhibit no biological variability, i.e., any variance in their counts should be technical in origin. Given the mean abundance of a gene, the fitted value of the trend can be used as an estimate of the technical component for that gene. The biological component of the variance can then be calculated by subtracting the technical component from the total variance of each gene with the decomposeVar function.
However, the lack of spike-in transcripts complicates the modelling of the technical noise. One option is to assume that most genes do not exhibit strong biological variation, and to fit a trend to the variances of the endogenous genes using the use.spikes=FALSE setting. This assumes that the majority of genes are not variably expressed, such that the technical component dominates the total variance for those genes.
Cell cycle can be regressed out by using the removeBatchEffect function from the limma package.
#after removal
suppressPackageStartupMessages(library(limma))
design <- model.matrix(~ 0+G1 + G2M, assignments$score)
fit.block <- trendVar(sce, use.spikes=FALSE, loess.args=list(span=0.05), design=design)
dec.block <- decomposeVar(sce, fit.block)
top.hvgs <- dec.block[dec.block$FDR <= 0.05 & dec.block$bio >= 0.5,]
assay(sce, "corrected") <- removeBatchEffect(logcounts(sce), covariates=design)
sce$G1score <- assignments$score$G1
sce$G2Mscore <- assignments$score$G2M
sce <- runPCA(sce,exprs_values="corrected", feature_set=rownames(top.hvgs))
out <- plotReducedDim(sce,"PCA",colour_by="G1score",size_by="G2Mscore")+ggtitle("after")
## Warning: 'add_ticks' is deprecated.
## Use '+ geom_rug(...)' instead.
#before removal
fit <- trendVar(sce, use.spikes=NA, loess.args=list(span=0.05))
dec <- decomposeVar(sce, fit)
top.hvgs2 <- dec[dec$FDR <= 0.05 & dec$bio >= 0.5,]
sce <- runPCA(sce, feature_set=rownames(top.hvgs2))
out2 <- plotReducedDim(sce,"PCA",colour_by="G1score",size_by="G2Mscore")+ggtitle("before")
## Warning: 'add_ticks' is deprecated.
## Use '+ geom_rug(...)' instead.
multiplot(out2, out, cols=2)
The result of this procedure is visualized with two PCA plots. Before removal, the distribution of cells along the first two principal components is strongly associated with their G1 and G2/M scores. This is no longer the case after removal, which suggests that the cell cycle effect has been mitigated.
Also, we can examine the genes with the largest biological components and plot them, to verify that they are indeed highly variable.
top.hvgs.gene <- top.hvgs[order(top.hvgs$bio, decreasing=TRUE),]
head(top.hvgs.gene)
## DataFrame with 6 rows and 6 columns
## mean total bio tech
## <numeric> <numeric> <numeric> <numeric>
## COL1A1 4.04673436438948 11.0618145393329 8.23584148562069 2.82597305371225
## COL3A1 4.7702554248679 11.3300825483233 7.66194357216325 3.66813897616001
## LUM 3.85205740816235 9.77012141715707 7.13641320619186 2.6337082109652
## DLK1 5.23836445010075 11.4531269777926 7.11332663688297 4.33980034090959
## IGFBP3 3.22869846213162 8.18524168460537 6.09145874028751 2.09378294431786
## IGF2 5.75547524542792 11.2510477388562 6.0367773240332 5.21427041482302
## p.value FDR
## <numeric> <numeric>
## COL1A1 0 0
## COL3A1 0 0
## LUM 0 0
## DLK1 0 0
## IGFBP3 0 0
## IGF2 0 0
plotExpression(sce,features = rownames(top.hvgs.gene)[1:10])
Dimensionality reduction
When working with large datasets, it can often be beneficial to apply some sort of dimensionality reduction method. By projecting the data onto a lower-dimensional sub-space, one is often able to significantly reduce the amount of noise. An additional benefit is that it is typically much easier to visualize the data in a 2 or 3-dimensional subspace.
We use the denoisePCA() function with the assumed trend to choose the number of dimensions to retain after PCA.
chosen <- rownames(top.hvgs.gene)
sce <- denoisePCA(sce,technical=dec.block, approximate=TRUE,assay.type="corrected",subset.row=chosen)
plot(attr(reducedDim(sce), "percentVar"), xlab="PC",
ylab="Proportion of variance explained")
abline(v=ncol(reducedDim(sce, "PCA")), lty=2, col="red")
Examination of the first few PCs already reveals some strong substructure in the data.
plotPCA(sce, ncomponents=3, colour_by="log10_total_features_by_counts")
This is recapitulated with a t-SNE plot. Again, note that we set use_dimred= to perform t-SNE on the denoised expression matrix.
set.seed(100)
sce <- runTSNE(sce, use_dimred="PCA", perplexity=50) # we can change the perplexity to check that the result is robust
plotTSNE(sce, colour_by="log10_total_features_by_counts")
## Warning: 'add_ticks' is deprecated.
## Use '+ geom_rug(...)' instead.
We can see some strong structures in the plot.
Clustering cells
See also: http://www.reibang.com/p/e89ea23b8d56
One of the most promising applications of scRNA-seq is de novo discovery and annotation of cell-types based on transcription profiles. Computationally, this is a hard problem as it amounts to unsupervised clustering. That is, we need to identify groups of cells based on the similarities of the transcriptomes without any prior knowledge of the labels. Moreover, in most situations we do not even know the number of clusters a priori. The problem is made even more challenging due to the high level of noise (both technical and biological) and the large number of dimensions (i.e. genes).
Unsupervised clustering is useful in many different applications and it has been widely studied in machine learning. Some of the most popular approaches are hierarchical clustering, k-means clustering and graph-based clustering.
Hierarchical clustering
k-means clustering
Graph-based clustering (a short sketch follows the tool list below)
There are many tools for clustering: SINCERA, pcaReduce, SC3, tSNE + k-means, SNN-Cliq, Seurat clustering, myclustree, etc.
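As a sketch of the graph-based approach (using scran and igraph on the "PCA" reduced dimensions computed earlier by denoisePCA(); the main workflow below uses hierarchical clustering instead):
snn.gr <- buildSNNGraph(sce, use.dimred="PCA") # shared nearest-neighbour graph on the denoised PCs
graph.clusters <- igraph::cluster_walktrap(snn.gr)$membership # community detection
table(graph.clusters)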
The normalized and cell-cycle-adjusted log-expression values for correlated HVGs are used to cluster cells into putative subpopulations. Specifically, we perform hierarchical clustering on the Euclidean distances between cells, using Ward's criterion to minimize the total variance within each cluster. This yields a dendrogram that groups together cells with similar expression patterns across the chosen genes. An alternative approach is to cluster on a matrix of distances derived from correlations (e.g., as in quickCluster). This is more robust to noise and normalization errors, but is also less sensitive to subtle changes in the expression profiles.
chosen.exprs <- assays(sce)$corrected[chosen,]
my.dist <- dist(reducedDim(sce))
my.tree <- hclust(my.dist, method="ward.D2")
suppressPackageStartupMessages(library(dynamicTreeCut))
my.clusters <- unname(cutreeDynamic(my.tree, distM=as.matrix(my.dist), verbose=0))#modify cutheight to change clusters
table(my.clusters)
## my.clusters
## 1 2 3 4 5 6
## 1408 1118 796 684 584 94
sce$clusters <- my.clusters
plotTSNE(sce,colour_by="clusters")
## Warning: 'add_ticks' is deprecated.
## Use '+ geom_rug(...)' instead.
suppressPackageStartupMessages(library(gplots))
heat.vals <- chosen.exprs - rowMeans(chosen.exprs)
clust.col <- rainbow(max(my.clusters))
heatmap.2(heat.vals, col=bluered, symbreak=TRUE, trace='none', cexRow=0.3,
ColSideColors=clust.col[my.clusters], Colv=as.dendrogram(my.tree),
breaks=seq(-5, 5, length.out=21))
Sometimes clustering does not yield good biological results; you can try modifying the parameters or using other clustering methods.
For example, SC3. The authors compared SC3 to five other methods (tSNE + k-means, pcaReduce, SNN-Cliq, SINCERA and Seurat) on published datasets, and SC3 performed better. SC3 performs single-cell consensus clustering. Both pcaReduce and tSNE + k-means are stochastic and give different results every time they are run; to get a better overview of the solutions, we need to run those methods multiple times. SC3 is also stochastic, but thanks to the consensus step it is more robust and less likely to produce different outcomes.
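A minimal sketch of running SC3 on this object (assuming the SC3 Bioconductor package; SC3 expects counts and logcounts assays plus a feature_symbol column in rowData, and ks = 6 is chosen here only to match the number of clusters found above):
suppressPackageStartupMessages(library(SC3))
rowData(sce)$feature_symbol <- rownames(sce)
sce <- sc3(sce, ks = 6, biology = TRUE)
# consensus cluster labels are stored in colData, e.g. sce$sc3_6_clusters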
In the heatmap above, the expression patterns of cluster 1 and cluster 3 look similar and the two clusters fall within the same module; we can merge them later.
We can plot the t-SNE colored by some important marker genes. For HSCs, we can use CD34, KDR, etc.
plotTSNE(sce,colour_by="CD34")
## Warning: 'add_ticks' is deprecated.
## Use '+ geom_rug(...)' instead.
plotTSNE(sce,colour_by="KDR")
## Warning: 'add_ticks' is deprecated.
## Use '+ geom_rug(...)' instead.
plotTSNE(sce,colour_by="SPN")#SPN is CD43
## Warning: 'add_ticks' is deprecated.
## Use '+ geom_rug(...)' instead.
plotTSNE(sce,colour_by="PTPRC")#PTPRC is CD45RA,negative enrich
## Warning: 'add_ticks' is deprecated.
## Use '+ geom_rug(...)' instead.
Detecting marker genes between subpopulations
See also: http://www.reibang.com/p/26d826d7b48c
Once putative subpopulations are identified by clustering, we can identify marker genes for each cluster using the findMarkers function. This fits a linear model to the log-expression values for each gene, using methods in the limma package. The aim is to test for DE in each cluster compared to the others while blocking on uninteresting factors in the design. The top DE genes are likely to be good candidate markers as they can effectively distinguish between cells in different clusters.
markers <- findMarkers(sce, my.clusters, design=design,direction="up")
For each cluster, the DE results of the relevant comparisons are consolidated into a single output table. This allows a set of marker genes to be easily defined by taking the top DE genes from each pairwise comparison between clusters. For example, to construct a marker set for cluster 1 from the top 10 genes of each comparison, one would filter marker.set to retain rows with Top less than or equal to 10. Other statistics are also reported for each gene, including the adjusted p-values and the log-fold changes relative to every other cluster.
marker.set <- markers[["1"]]
head(marker.set, 10)
## DataFrame with 10 rows and 8 columns
## Top p.value FDR logFC.2 logFC.3
## <integer> <numeric> <numeric> <numeric> <numeric>
## CDH5 1 0 0 2.40606048779888 2.46486877457309
## KDR 1 0 0 3.05671880797832 3.24601945934808
## RAMP2 1 0 0 2.57848802090112 3.05495038287393
## VIM 1 0 0 0.0200752705673883 -0.236574166178003
## PLVAP 2 0 0 2.84541461675187 2.98323262552349
## MARCKS 2 0 0 1.46212577225737 1.02339059458477
## TFPI 3 0 0 1.61036000354054 1.75934155607113
## BST2 3 0 0 0.0508654839426717 0.642659796920468
## ICAM2 4 0 0 2.601208508031 2.7036331699331
## MDK 4 0 0 -1.39786713426209 -1.17288668996822
## logFC.4 logFC.5 logFC.6
## <numeric> <numeric> <numeric>
## CDH5 2.47543878070787 2.30905931750986 2.45773961162037
## KDR 3.10164570826467 3.21355406996164 2.59575683640317
## RAMP2 2.6578719901114 3.94019425274788 4.02984905700337
## VIM 0.0623734927852313 0.890342569384922 3.37320771090712
## PLVAP 2.91568247064559 2.62706625067622 2.9436834026177
## MARCKS 1.54192359341626 2.17510357976692 3.95019451007318
## TFPI 1.02213015920925 3.53607700398041 3.22203630397793
## BST2 0.0266559523681584 1.57002937321504 3.21241498180182
## ICAM2 2.71835561208541 2.52905976079267 1.15962140121361
## MDK -1.78406541381903 2.68186051000704 2.44837756024281
marker.set <- marker.set[marker.set$p.value < 0.05 & marker.set$FDR < 0.05,]
We visualize the expression profiles of the top candidates to verify that the DE signature is robust. The plot indicates that most of the top markers have strong and consistent up- or downregulation in cells of cluster 1 compared to some or all of the other clusters. Thus, cells from the subpopulation of interest can be identified as those that express the upregulated markers and do not express the downregulated markers.
top.markers <- rownames(marker.set[marker.set$Top <=10,])
top.exprs <- assays(sce)$corrected[top.markers,,drop=FALSE]
heat.vals <- top.exprs - rowMeans(top.exprs)
heatmap.2(heat.vals, col=bluered, symbreak=TRUE, trace='none', cexRow=0.6,
ColSideColors=clust.col[my.clusters], Colv=as.dendrogram(my.tree), dendrogram='none')
legend("bottomleft", col=clust.col, legend=sort(unique(my.clusters)), pch=16)
We can perform GO annotation to identify the enriched biological processes. The clusterProfiler package can do GO, KEGG and GSEA enrichment.
suppressPackageStartupMessages(library(clusterProfiler))
gene <- rownames(marker.set)[1:150]
gene.df<-bitr(gene, fromType = "SYMBOL",
toType = c("ENTREZID","ENSEMBL"),
OrgDb = org.Hs.eg.db)
## 'select()' returned 1:many mapping between keys and columns
## Warning in bitr(gene, fromType = "SYMBOL", toType = c("ENTREZID",
## "ENSEMBL"), : 2.67% of input gene IDs are fail to map...
ego_bp <-enrichGO(gene = gene.df$ENTREZID,
OrgDb = org.Hs.eg.db,
keyType = 'ENTREZID',
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05)
dotplot(ego_bp,font.size=6,title="The GO_BP enrichment analysis of cluster1 ")
ego_bp <- setReadable(ego_bp,OrgDb = org.Hs.eg.db)
ego_bp <- simplify(ego_bp)
emapplot(ego_bp,showCategory = 10)
cnetplot(ego_bp)
Sometimes we can merge clusters that seem alike.
Cellular trajectory
See also: http://www.reibang.com/p/5b674135d6ce
In many single-cell studies, individual cells progress through a gene expression program in an unsynchronized manner; in effect, each cell is a snapshot of the transcriptional program under study. The Monocle package provides tools for analyzing single-cell expression experiments. Monocle introduced the strategy of ordering single cells in pseudotime, placing them along a trajectory corresponding to a biological process such as cell differentiation, by taking advantage of each cell's asynchronous progression through that process. Monocle orders cells by learning an explicit principal graph from the single-cell genomics data with advanced machine learning techniques (Reversed Graph Embedding), which robustly and accurately resolves complicated biological processes.
Monocle operates on CellDataSet objects. Monocle is able to convert Seurat objects from the "Seurat" package and SCESets from the "scater" package into CellDataSet objects that Monocle can use; the same function also works with SCESets from "scran".
# Where 'data_to_be_imported' can either be a Seurat object
# or an SCESet.
importCDS(data_to_be_imported)
# We can set the parameter 'import_all' to TRUE if we'd like to
# import all the slots from our Seurat object or SCESet.
# (Default is FALSE or only keep minimal dataset)
importCDS(data_to_be_imported, import_all = TRUE)
suppressPackageStartupMessages(library(monocle))
load("D:/paopaoR/hsc/hsc1.rda")
pData(hsc)$clustersb <- as.factor(pData(hsc)$clustersb)
To work with count data, specify the negative binomial distribution as the expressionFamily argument to newCellDataSet.
#Do not run
sample <- newCellDataSet(count_matrix,
phenoData = pd,
featureData = fd,
expressionFamily=negbinomial.size())
| Family function | Data type | Notes |
| --- | --- | --- |
| negbinomial.size() | UMIs, transcript counts from experiments with spike-ins or relative2abs(), raw read counts | Negative binomial distribution with fixed variance (automatically calculated by Monocle). Recommended for most users. |
| negbinomial() | UMIs, transcript counts from experiments with spike-ins or relative2abs(), raw read counts | Slightly more accurate than negbinomial.size(), but much, much slower. Not recommended except for very small datasets. |
| tobit() | FPKM, TPM | Tobits are truncated normal distributions. Using tobit() will tell Monocle to log-transform your data where appropriate. Do not transform it yourself. |
| gaussianff() | log-transformed FPKM/TPMs, Ct values from single-cell qPCR | If you want to use Monocle on data you have already transformed to be normally distributed, you can use this function, though some Monocle features may not work well. |
Preparing
We’ll also call two functions that pre-calculate some information about the data. Size factors help us normalize for differences in mRNA recovered across cells, and “dispersion” values will help us perform differential expression analysis later.
hsc <- estimateSizeFactors(hsc)
hsc <- estimateDispersions(hsc)
## Removing 56 outliers
Choose genes that define a cell’s progress.
disp_table <- dispersionTable(hsc)
unsup_clustering_genes <- subset(disp_table,mean_expression>=0.1)
hsc <- setOrderingFilter(hsc,unsup_clustering_genes$gene_id)
Reduce data dimensionality
hsc <- reduceDimension(hsc,max_components = 2,method='DDRTree')
Pseudotime
hsc <- orderCells(hsc)
head(pData(hsc))
## X Barcode clustersb
## AAACCTGAGAATCTCC-5 AAACCTGAGAATCTCC-5 AAACCTGAGAATCTCC-5 2
## AAACCTGAGCTAGCCC-5 AAACCTGAGCTAGCCC-5 AAACCTGAGCTAGCCC-5 1
## AAACCTGCAATGGATA-5 AAACCTGCAATGGATA-5 AAACCTGCAATGGATA-5 1
## AAACCTGCACCGGAAA-5 AAACCTGCACCGGAAA-5 AAACCTGCACCGGAAA-5 1
## AAACCTGCATGGTCAT-5 AAACCTGCATGGTCAT-5 AAACCTGCATGGTCAT-5 2
## AAACCTGGTAAGCACG-5 AAACCTGGTAAGCACG-5 AAACCTGGTAAGCACG-5 1
## Size_Factor Pseudotime State
## AAACCTGAGAATCTCC-5 0.5875901 34.5646783 2
## AAACCTGAGCTAGCCC-5 1.6352472 10.6095850 1
## AAACCTGCAATGGATA-5 1.0178060 0.4699411 1
## AAACCTGCACCGGAAA-5 2.0560322 7.3187067 1
## AAACCTGCATGGTCAT-5 0.6727968 32.2241212 2
## AAACCTGGTAAGCACG-5 1.0849709 0.9665245 1
pData(hsc)$clustersb <- as.factor(pData(hsc)$clustersb)
plot_cell_trajectory(hsc)
plot_cell_trajectory(hsc,color_by = "Pseudotime")
plot_cell_trajectory(hsc,color_by = "clustersb")
Finding Genes that Change as a Function of Pseudotime
Monocle's main job is to put cells in order of progress through a biological process (such as cell differentiation) without knowing which genes to look at ahead of time. Once it has done so, you can analyze the cells to find genes that change as the cells make progress. For example, you can find genes that are significantly up- or downregulated as the cells "mature". We can test a panel of genes important for HSC development.
to_be_tested <- row.names(subset(fData(hsc),
gene_short_name %in% c("CD34","KDR","PTPRC")))
hsc_sub <- hsc[to_be_tested,]
Monocle uses the VGAM package to model a gene's expression level as a smooth, nonlinear function of pseudotime. The sm.ns function states that Monocle should fit a natural spline through the expression values to help it describe the changes in expression as a function of progress.
diff_test_res <- differentialGeneTest(hsc_sub,fullModelFormulaStr = "~sm.ns(Pseudotime)")
plot_genes_in_pseudotime(hsc_sub)
plot_genes_in_pseudotime(hsc_sub,color_by = "clustersb")
Monocle can additionally cluster genes by pseudotemporal expression pattern and analyze branches in single-cell trajectories.
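A sketch of those two follow-up steps (assuming the trajectory above contains a branch point labelled 1; the q-value cutoffs are illustrative):
# test all genes for pseudotime dependence, then cluster the significant ones
diff_all <- differentialGeneTest(hsc, fullModelFormulaStr = "~sm.ns(Pseudotime)")
sig_genes <- row.names(subset(diff_all, qval < 0.01))
plot_pseudotime_heatmap(hsc[sig_genes,], num_clusters = 3, show_rownames = FALSE)
# branch analysis: genes whose expression depends on which branch a cell takes
BEAM_res <- BEAM(hsc, branch_point = 1)
plot_genes_branched_heatmap(hsc[row.names(subset(BEAM_res, qval < 1e-4)),],
    branch_point = 1, num_clusters = 3)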
For all the analyses, we can also use another pipeline
Seurat package (QC + normalization + blocking on confounders + identifying HVGs + dimensionality reduction + clustering cells + detecting markers) + other analyses.
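For reference, a minimal sketch of that Seurat pipeline (Seurat v2-style function names; "counts" is assumed to be a raw count matrix, the parameters are illustrative, and several of these functions were renamed in Seurat v3):
suppressPackageStartupMessages(library(Seurat))
seu <- CreateSeuratObject(raw.data = counts, min.cells = 3, min.genes = 200) # QC filters
seu <- NormalizeData(seu) # normalization
seu <- FindVariableGenes(seu) # identify HVGs
seu <- ScaleData(seu, vars.to.regress = "nUMI") # block on confounders
seu <- RunPCA(seu, pc.genes = seu@var.genes) # dimensionality reduction
seu <- FindClusters(seu, dims.use = 1:10) # cluster cells
seu <- RunTSNE(seu, dims.use = 1:10)
markers <- FindAllMarkers(seu, only.pos = TRUE) # detect markers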