hello皮壁,大家好,今天給大家分享的是哪审,擬時序軌跡分析軟件中對軌跡關(guān)鍵驅(qū)動基因的下游分析蛾魄,挖掘擬時序分化軌跡開關(guān)基因的分析,這個方面大家應(yīng)該都不陌生湿滓,文章在GeneSwitches: ordering gene expression and functional events in single-cell experiments滴须,我們首先來分享一下文獻(xiàn),最后看一看參考代碼叽奥。
首先認(rèn)識一下什么是開關(guān)基因扔水,開關(guān)基因指細(xì)胞分化轉(zhuǎn)化過程中表達(dá)沉默或表達(dá)激活的基因,它可能會引起或推動發(fā)育體系朝氓,在可選擇的相關(guān)細(xì)胞途徑中進(jìn)行轉(zhuǎn)換魔市,對生物過程的發(fā)生發(fā)展有著重要的意義。
Abstract
1赵哲、Based on the similarity of gene expression profiles, many tools have been developed to generate an in silico(電腦模擬) ordering of cells in the form of pseudo-time trajectories.
2待德、these tools do not provide a means to find the ordering of critical gene expression changes over pseudo-time.(做軌跡分析做的好的生信人員都不是很多,每個軌跡分析的官方示例都是例題枫夺,需要靈活運(yùn)用将宪,能做到靈活運(yùn)用的生信人員,比例不會很高橡庞,軌跡分析有很多需要注意的地方较坛,大家可以參考文章10X單細(xì)胞軌跡分析之回顧,但這個也是簡單的總結(jié)扒最,遠(yuǎn)遠(yuǎn)不夠)燎潮。
關(guān)于軌跡轉(zhuǎn)變基因這個,以monocle2為例扼倘,主要就是計算不同state之間的差異基因确封,這些基因的代表了不同的發(fā)育軌跡除呵。(這個做法是否合理,有待商榷)爪喘。
3颜曾、We present GeneSwitches, a tool that takes any single-cell pseudo-time trajectory and determines the precise(準(zhǔn)確的,精確的) order of gene expression and functional-event changes over time.(這個很多軟件就是計算差異,不知道這個方法有什么新的思路沒有)秉剑。
4泛豪、GeneSwitches uses a statistical framework based on logistic regression to identify the order in which genes are either switched on or off along pseudo-time.(開關(guān)基因的識別,這個是我們最為關(guān)心的地方)侦鹏。
5诡曙、With this information, users can identify the order in which surface markers appear, investigate how functional ontologies are gained or lost over time and compare the ordering of switching genes from two related pseudo-temporal processes(開關(guān)基因的意義,下游分析略水,這個是后話了~~~)
Introduction 我們提煉一下
現(xiàn)有軟件(monocle2价卤、Slingshot等),A limitation of current pseudo-time methods is that extracting the underlying order of gene expression changes from these trajectories can be difficult(其實(shí)有關(guān)空間位置檢測變化基因的能力渊涝,monocle3已經(jīng)可以做到~~).
being able to interpret these gene expression changes in terms of the order that they occur would allow for a fuller understanding of the underlying biological processes.(這遠(yuǎn)不是我們簡單計算一下差異基因就能解決的問題~~~)慎璧。
To address this, here we developed GeneSwitches, a statistical framework that processes scRNA-seq data together with a pseudo-time trajectory to find the set of genes that switch during the transition.(軟件可以幫助我們找到軌跡發(fā)生過程中的開關(guān)基因)。
For each gene, we calculate a switching time and associated confidence level.(對于每個基因跨释,我們計算轉(zhuǎn)換時間和相關(guān)的置信水平胸私。,有點(diǎn)酷~~~~)鳖谈,能實(shí)現(xiàn)這樣目的的前提有兩個(i)
investigate how gene-regulatory networks or gene ontologies are gained or lost over time(這個絕對可以)(ii)
stratify selected gene sets (e.g. surface markers) by the order in which they appear(按基因出現(xiàn)的順序?qū)x定的基因集(例如表面標(biāo)記)進(jìn)行分層 ) (iii)
identify key differences in the gene expression changes in cell transitions that bifurcate over time(分化轉(zhuǎn)變基因岁疼,這個最為關(guān)鍵)。
接下來缆娃,我們來看看真實(shí)的案例(下圖)
GeneSwitches functions and examples
workflow (看看數(shù)據(jù)處理的過程)
1五续、GeneSwitches requires two inputs, namely the gene expression matrix and the pseudo-time ordering of each single cell。(矩陣和時間龄恋,分析過程如下圖)疙驾。
2、First, GeneSwitches binarizes the gene expression into either an ‘on’ or ‘off’ state to enable the identification of switching events.This binarization is performed gene-wise in a data-driven manner by fitting a mixture model of two Gaussian distributions to the gene expression)The threshold separating the ‘on’ and ‘off’ gene expression states is then determined by the intersection of the two fitted Gaussian distributions.(這個地方涉及到一些數(shù)學(xué)上的知識郭毕,大家感興趣可以查一下)它碎。
3、the pseudo-time can be partitioned to identify switching events within specific parts of the trajectory(也可以截取一段的時間序列進(jìn)行開關(guān)基因的識別)显押。
4扳肛、Ordering and visualization of switching genes,the binarized gene expression is used as a dependent variable in a logistic regression with the pseudo-time value of each cell providing the independent variable.In doing so, the probability of expression throughout pseudo-time is calculated and the quality of fit is determined by McFadden’s Pseudo R2. 這樣就找到了很多的開關(guān)基因乘碑,開關(guān)基因以0.5為界限挖息,然后進(jìn)行表達(dá)的可視化。
我們簡單來總結(jié)一下兽肤,GeneSwitches首先通過對分化軌跡中的基因進(jìn)行二值化分析套腹,篩選出了表達(dá)特性上存在on和off兩個狀態(tài)的潛在開關(guān)基因绪抛。隨后軟件對這些潛在開關(guān)基因進(jìn)行邏輯回歸分析和McFadden’s Pseudo R2擬時間相關(guān)性分析:通過邏輯回歸分析推算出每個開關(guān)基因的開關(guān)時間(Switching Time Point);通過擬時間相關(guān)性分析得到每個相關(guān)性R2值电禀,其中表達(dá)激活的開關(guān)基因與擬時序正相關(guān)(R2>0)幢码,被定義為上調(diào)型開關(guān)基因(Up-regulation),而表達(dá)沉默的開關(guān)基因與擬時序負(fù)相關(guān)(R2<0)尖飞,被定義為下調(diào)型開關(guān)基因(Down-regulation)症副。擬時序相關(guān)性越高代表基因與該軌跡進(jìn)程的關(guān)系越密切。得到了每個潛在開關(guān)基因的開關(guān)時間和相關(guān)性R2值后政基,將Top開關(guān)基因按其開關(guān)時間在擬時序上排序可視化贞铣,可以更直觀地展示分化軌跡進(jìn)程中的關(guān)鍵基因作用。同時沮明,GeneSwitches嵌入了功能分析模塊辕坝,通過基因注釋(如表面蛋白,轉(zhuǎn)錄因子或其他功能類型)和富集分析(GO, KEGG和HALLMARK)的方法幫助解讀開關(guān)基因分析結(jié)果珊擂,更好地與生物過程和疾病發(fā)生發(fā)展聯(lián)系起來圣勒。除此之外费变,該軟件不僅可以對單條軌跡進(jìn)行開關(guān)基因分析摧扇,還能比較兩條不同分化分支之間開關(guān)基因的異同,可用于尋找分支特異性的開關(guān)基因或比較相同開關(guān)基因在不同分支中開關(guān)時間的先后差異挚歧。
看一下基因的排序和可視化
Ordering and visualization of gene classes and functional groups
1扛稽、Switching genes can be used to investigate the functional nature of the pseudo-time trajectory.例如,it might be desirable to know for a set of known surface proteins at what point they are activated or deactivated during a transition in order to facilitate the identification of suitable markers on which to sort cells that are transitioning.(開關(guān)基因什么時候轉(zhuǎn)變的)滑负。
2在张、GeneSwitches can also identify the order in which functional ontologies are acquired or lost during a transition.(功能的獲得和丟失)。
3矮慕、軟件提供了可視化的功能帮匾,To visualize these changes, we provide the functionality to plot the density of switching genes from each ontological class with respect to pseudo-time in order to study when and how different functional classes are important。
4痴鳄、具體案例(單軌跡)瘟斜,首先monocle2計算矩陣的軌跡值,然后識別開關(guān)基因GeneSwitches identified that TIMP1 and VIM were early surface proteins to be activated, indicating that they might represent good candidate markers to identify cells progressing along the differentiation process more rapidly痪寻。We also observe that POU5F1 is deactivated early, whilst MYH7 is activated late(下圖d)螺句。 Functional ontology analysis showed that the cell cycle-related ontologies were down-regulated at an early time and cells acquired cardiac-related functions later in the pseudotime(功能缺失和獲得的時間關(guān)系)。
這部分我們簡單總結(jié)一下橡类,從人胚胎干細(xì)胞(hESC)分化為心肌細(xì)胞(CM)的分化軌跡開關(guān)基因分析蛇尚。通過開關(guān)分析發(fā)現(xiàn)了擬時序較為早期表達(dá)激活的開關(guān)基因VIM和TIMP1,可以作為加速分化進(jìn)程的候選基因顾画。另外他們還找到了早期表達(dá)沉默的開關(guān)基因POU5F1與后期表達(dá)激活的開關(guān)基因MYH7取劫。開關(guān)基因富集分析結(jié)果表明與細(xì)胞周期(Cell Cycle)相關(guān)的通路在擬時序早期下調(diào)匆笤,而與心臟功能相關(guān)的通路在擬時序后期上調(diào)。
雙軌跡開關(guān)基因分析案例勇凭,這個用于多個分化分支的情況
If GeneSwitches has been used to analyse two related pseudo-time trajectories, it is possible to compare the switching genes and their switching times.例如疚膊,在分化過程中,某些細(xì)胞群通常會分叉虾标,每組都處于不同的細(xì)胞狀態(tài)寓盗。 GeneSwitches can be used to compare these trajectories, looking for similarities and differences in the switching genes, as well as their switching times.(比較和尋找差異、開關(guān)基因的時間)璧函。下圖展示了用GeneSwitches分析兩條有聯(lián)系的分化分支的開關(guān)基因分析傀蚌,兩條分化軌跡分別為從人胚胎干細(xì)胞(hESC)到心肌細(xì)胞(CM)的分化分支1(Definitive CM)和到非收縮細(xì)胞的分化分支2(Non-contractile)。分析發(fā)現(xiàn)分支1上調(diào)了一些與心臟功能相關(guān)的基因如CSRP3和NKX2-5蘸吓,而分支2則上調(diào)了如DCN和COL1A2等成纖維細(xì)胞的marker基因善炫。
In summary, GeneSwitches can help identify the timing of gene expression events within a pseudo-time trajectory, which in turn allows for a fuller understanding of the order of regulatory and functional events that occur during a cellular transition.
看一看代碼
GeneSwitches
GeneSwitches 的目標(biāo)是以單細(xì)胞分辨率發(fā)現(xiàn)細(xì)胞狀態(tài)轉(zhuǎn)換期間基因表達(dá)和功能事件的順序。 它適用于任何單細(xì)胞軌跡或細(xì)胞的偽時間排序库继,以發(fā)現(xiàn)充當(dāng)細(xì)胞狀態(tài)之間開/關(guān)開關(guān)的基因箩艺,重要的是這些開關(guān)發(fā)生的順序。
Installation
Check and install required packages
Users may use following codes to check and install all the required packages.
list.of.packages <- c("SingleCellExperiment", "Biobase", "fastglm", "ggplot2", "monocle",
"plyr", "RColorBrewer", "ggrepel", "ggridges", "gridExtra", "devtools",
"mixtools")
## for package "fastglm", "ggplot2", "plyr", "RColorBrewer", "ggrepel", "ggridges", "gridExtra", "mixtools"
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
## for package "SingleCellExperiment", "Biobase"
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) BiocManager::install(new.packages)
Install GeneSwitches
The source code of GeneSwitches can be installed from GitHub with:
devtools::install_github("SGDDNB/GeneSwitches")
Input datasets
GeneSwitches 需要兩個輸入宪萄,即基因表達(dá)矩陣和每個細(xì)胞的相應(yīng)偽時間排序艺谆。 我們將這些輸入數(shù)據(jù)集轉(zhuǎn)換為一個 SingleCellExperiment
對象(Lun and Risso 2017),在下面你會發(fā)現(xiàn)一個完整的“從頭到尾”的工作流程來實(shí)現(xiàn)這個分析的潛力拜英。
## load libraries
library(GeneSwitches)
library(SingleCellExperiment)
這里的實(shí)例代碼静汤,我們將使用已發(fā)布的單細(xì)胞 RNA-seq 數(shù)據(jù)集,這些數(shù)據(jù)從人類胚胎干細(xì)胞 (hESC) 到心肌細(xì)胞 (CM) 的分化(Friedman 等人居凶,2018 年)虫给。 運(yùn)用monocle2進(jìn)行軌跡分析。 選擇這個數(shù)據(jù)集的部分原因是它顯示了心臟 hESC 分化的分叉細(xì)胞命運(yùn)侠碧,它產(chǎn)生了明確的心肌細(xì)胞 (Path1) 或非收縮性心臟衍生物 (Path2)抹估,允許應(yīng)用 GeneSwitches 的所有方面。
## Download example files to current directory
get_example_inputData()
## Load input data log-normalized gene expression
load("./logexpdata.RData")
## Load Monocle2 object with pseudo-time and dimensionality reduction
load("./cardiac_monocle2.RData")
Direct input (NOT run)
Users can input the gene expression (logexpdata
; recommend for log-normalized expression), pseudo-time (cell_pseudotime
) and dimensionality reductions (rd_PCA
; optional and only for gene expression plots) into SingleCellExperiment object as follows.
### create SingleCellExperiment object with log-normalized single cell data
#sce <- SingleCellExperiment(assays = List(expdata = logexpdata))
### add pseudo-time information
#colData(sce)$Pseudotime <- cell_pseudotime
### add dimensionality reductions, e.g. PCA, UMAP, tSNE
#pca <- prcomp(t(assays(sce)$expdata), scale. = FALSE)
#rd_PCA <- pca$x[,1:2]
#reducedDims(sce) <- SimpleList(PCA = rd_PCA)
Convert from trajectory results
Alternatively, GeneSwitches provides functions to convert Monocle2 or Slingshot results into SingleCellExperiment object directly. For Monocle2 trajectory, users need to indicate the states of the desired path, which can be checked by plotting the trajectory using Monocle2 function plot_cell_trajectory
or the following function.
## plot Monocle2 trajectory colored by State
# monocle:::plot_cell_trajectory(cardiac_monocle2, color_by = "State")
plot_monocle_State(cardiac_monocle2)
Based on the marker genes, the pseudo-time trajectory starts from State 3, which are hESC cells. Definitive CM cells are in State 1 and non-contractile cardiac derivatives are in State 5. Therefore, we focus on Path1 with cells in states 3, 2, 1 and Path2 with cells in states 3, 2, 5, and extract these two paths from Monocle2 object.
## Input log-normalized gene expression, Monocle2 pseudo-time and dimensionality reduction
## Path1 containing cells in states 3,2,1
sce_p1 <- convert_monocle2(monocle2_obj = cardiac_monocle2,
states = c(3,2,1), expdata = logexpdata)
## Path2 containing cells in states 3,2,5
sce_p2 <- convert_monocle2(monocle2_obj = cardiac_monocle2,
states = c(3,2,5), expdata = logexpdata)
If we are only interested in the trajectory within a certain range of pseudotime, function subset_pseudotime
can be used to subset the SingleCellExperiment object accordingly, followed by filtering out lowly expressed genes.
### Subset cells to pseudotime range from 10 to 25
#sce_p1_subset <- subset_pseudotime(sce_p1, min_time = 10, max_time = 25, minexp = 0, mincells = 10)
In Part I, we will apply GeneSwitches on a single trajectory, Path1, to demonstrate the general workflow and functions. Comparison of GeneSwitches results from two trajectories (Path1 & 2) will be shown in Part II.
PART I. GeneSwitches on a single trajectory
I-1. Binarize gene expression
由于我們關(guān)注的是打開或關(guān)閉的基因弄兜,因此我們首先將基因表達(dá)數(shù)據(jù)二值化為 1(on) 或 0(off) 狀態(tài)药蜻。 為了實(shí)現(xiàn)這一點(diǎn),對于每個基因挨队,我們將兩個高斯分布的混合模型擬合到輸入基因表達(dá)中谷暮,以計算用于二值化的基因特異性閾值。 在擬合之前盛垦,我們在基因表達(dá)中添加了均值為零和標(biāo)準(zhǔn)差為 0.1 的高斯噪聲湿弦,這確保了基因表達(dá)擬合的數(shù)值穩(wěn)定性。 然后去除不具有明顯雙峰“開-關(guān)”分布的基因腾夯。 對于使用 3 個內(nèi)核的 2000 個細(xì)胞颊埃,此步驟可能需要 2 分鐘蔬充。
### binarize gene expression using gene-specific thresholds
sce_p1 <- binarize_exp(sce_p1, ncores = 3)
Alternatively, we can use a global threshold for fast binarization. We plot a histogram of expression of all the genes in all cells and look for a break between the zero and expressed distributions to identify the global threshold.
### check the threshold for binarization
#h <- hist(assays(sce_p1)$expdata, breaks = 200, plot = FALSE)
#{plot(h, freq = FALSE, xlim = c(0,2), ylim = c(0,1), main = "Histogram of gene expression",
#xlab = "Gene expression", col = "darkgoldenrod2", border = "grey")
#abline(v=0.2, col="blue")}
###In this example, we choose 0.2 (blue line, also set as default) as the threshold.
# bn_cutoff <- 0.2
# sce_p1 <- binarize_exp(sce_p1, fix_cutoff = TRUE, binarize_cutoff = bn_cutoff)
I-2. Fit logistic regression & estimate switching time
Logistic regression is applied to model the binary states (on or off) of gene expression. Then the switching pseudo-time point is determined by the time at which the fitted line crosses the probability threshold 0.5. We use random downsampling of zero expressions (downsample = TRUE) to rescue the prediction of switching time for genes with high zero inflation.
## fit logistic regression and find the switching pseudo-time point for each gene
## with downsampling. This step takes less than 1 mins
sce_p1 <- find_switch_logistic_fastglm(sce_p1, downsample = TRUE, show_warning = FALSE)
I-3. Visualize ordering of switching genes
First, we filter poorly fitted genes based on zero-expression percentage (>90%), FDR (>0.05) and McFadden’s Pseudo R^2 (<0.03). We can then the number of top best fitting (high McFadden’s Pseudo R^2) genes to plot. One can also extract specific gene type(s) to plot, with provided gene type lists containing surface proteins (downloaded from here) and transcription factors (TFs, downloaded from here). Users are allowed to pass their own gene type lists as a data frame to parameter genelists, with rows as genes (non-duplicated) and two columns with name genenames
and genetypes
.
## filter top 15 best fitting switching genes among all the genes
sg_allgenes <- filter_switchgenes(sce_p1, allgenes = TRUE, topnum = 15)
## filter top 15 best fitting switching genes among surface proteins and TFs only
sg_gtypes <- filter_switchgenes(sce_p1, allgenes = FALSE, topnum = 20,
genelists = gs_genelists, genetype = c("Surface proteins", "TFs"))
## combine switching genes and remove duplicated genes from sg_allgenes
sg_vis <- rbind(sg_gtypes, sg_allgenes[setdiff(rownames(sg_allgenes), rownames(sg_gtypes)),])
Finally, plot the selected genes along the pseudo-timeline. Genes that are switched on are plotted above the line, while those switching off are below the line.
plot_timeline_ggplot(sg_vis, timedata = sce_p1$Pseudotime, txtsize = 3)
It is possible to use the dimensionality reduction provided from the user to visualise the gene expression and logistic regression fitting plots if needed.
plot_gene_exp(sce_p1, gene = "VIM", reduction = "monocleRD", downsample = F)
I-4. Order pathways along the pseudo-timeline
GeneSwitches can be used to order pathways or genesets as well. We include the pathways provided by MSigDB hallmark (Liberzon,A. et al., 2015), C2 curated and C5 gene ontology geneset collections. A Hypergeometric test is first applied to extract the pathways that are significantly overrepresented amongst those that are changing along the trajectory. The Switching time of the pathway is then determined by the median switching time of genes in that pathway.
## filter genes for pathway analysis using r^2 cutoff 0.1
sg_pw <- filter_switchgenes(sce_p1, allgenes = TRUE, r2cutoff = 0.1)
## apply hypergeometric test and determine the switching time
switch_pw <- find_switch_pathway(rowData(sce_p1), sig_FDR = 0.05,
pathways = msigdb_h_c2_c5, sg_pw)
## remove redundant pathways
switch_pw_reduce <- reduce_pathways(switch_pw, pathways = msigdb_h_c2_c5,
redundant_pw_rate = 0.8)
To better visualise the functional changes ridge plots of pathways genes show the density of switching genes along the pseudo-time. Top 10 significantly changed pathways are plotted here, ordered by the switching time.
plot_pathway_density(switch_pw_reduce[1:10,], sg_pw, orderbytime = TRUE)
#> Picking joint bandwidth of 2.49
We can also select specific pathway(s) to plot the switching genes in it. Among top 10 significantly changed pathways, we plot genes related to myogenesis and cardiac muscle tissue development.
sg_vis <- filter_switchgenes(sce_p1, topnum = 50, pathway_name = c("HALLMARK_MYOGENESIS",
"GO_CARDIAC_MUSCLE_TISSUE_DEVELOPMENT"))
plot_timeline_ggplot(sg_vis, timedata=sce_p1$Pseudotime, txtsize=3)
“Multiple” lables the genes in more than one pathways.
PART II. Comparing switching genes from two trajectories
Before comparison, we need to apply same steps in I-1 and I-2 on the cells from Path2 to identify switching pseudo-time point for each gene.
sce_p2 <- binarize_exp(sce_p2)
sce_p2 <- find_switch_logistic_fastglm(sce_p2, downsample = TRUE, show_warnings = FALSE)
And we filter out poorly fitted genes for both paths using the same cutoff.
sg_p1 <- filter_switchgenes(sce_p1, allgenes = TRUE, r2cutoff = 0.03)
sg_p2 <- filter_switchgenes(sce_p2, allgenes = TRUE, r2cutoff = 0.03)
We then plot common switching genes between two paths to compare their ordering.
sg_com <- common_genes(sg_p1, sg_p2, r2cutoff = 0.4,
path1name = "Definitive CM", path2name = "non-contractile")
common_genes_plot(sg_com, timedata = sce_p1$Pseudotime)
More importantly, we can plot the distinct switching genes of the two paths.
sg_disgs <- distinct_genes(sg_p1, sg_p2, r2cutoff = 0.52,
path1name = "Definitive CM", path2name = "non-contractile",
path1time = sce_p1$Pseudotime, path2time = sce_p2$Pseudotime)
plot_timeline_ggplot(sg_disgs, timedata = sce_p1$Pseudotime, color_by = "Paths",
iffulltml = FALSE, txtsize = 3)
We can also scale the timelines to be the same length (default number of bins is 100) so that differences are based on percentage of the trajectory covered rather than pseudo-time.
sg_disgs_scale <- distinct_genes(sg_p1, sg_p2, r2cutoff = 0.52,
path1name = "Definitive CM", path2name = "non-contractile",
path1time = sce_p1$Pseudotime, path2time = sce_p2$Pseudotime,
scale_timeline = T, bin = 100)
# timedata need to be 1 to (number of bins)
plot_timeline_ggplot(sg_disgs_scale, timedata = 1:100, color_by = "Paths",
iffulltml = FALSE, txtsize = 3)
這兩個不同轉(zhuǎn)換基因的圖只顯示了一系列發(fā)生轉(zhuǎn)換事件的偽時間線。 這個范圍實(shí)際上是在軌跡的末端班利,而共同基因大多處于早期(共同基因圖)饥漫。
Similarly, we can check the gene expression plots for the two paths.
gn <- "DCN"
p <- plot_gene_exp(sce_p1, gene = gn, reduction = "monocleRD",
downsample = FALSE, fitting = TRUE)
#> Warning: glm.fit: algorithm did not converge
p <- plot_gene_exp(sce_p2, gene = gn, reduction = "monocleRD",
downsample = FALSE, fitting = TRUE)
生活很好,等你超越