隔離的第四天盖呼,一個人的日子總是很難熬蠢挡,孤獨的生活卻總是來到我的身邊弧岳,這一篇我們來分享EcoTyper的示例代碼凳忙,幫助我們來尋找細胞特異性轉(zhuǎn)錄狀態(tài)和共關(guān)聯(lián)模式。這篇發(fā)表于cell的軟件禽炬,文獻的分享在10X單細胞空間數(shù)據(jù)分析之表征細胞狀態(tài)和生態(tài)型(EcoTyper)涧卵,我們來看看軟件的具體使用方法,當然了腹尖,我們更加關(guān)注單細胞和空間數(shù)據(jù)的部分
EcoTyper 是一個機器學習框架艺演,用于從bulk和單細胞 (scRNA-seq) 表達數(shù)據(jù)中大規(guī)模識別細胞類型特異性轉(zhuǎn)錄狀態(tài)及其共關(guān)聯(lián)模式。
軟件本身已經(jīng)定義了癌癥(Luca/Steen 等人桐臊,Cell 2021)和彌漫性大 B 細胞淋巴瘤(DLBCL)(Steen/Luca 等人,Cancer Cell 2021)中的細胞狀態(tài)和生態(tài)型晓殊。 當前版本的 EcoTyper 允許用戶在他們自己的數(shù)據(jù)中恢復(fù)這兩種腫瘤類別的細胞狀態(tài)和生態(tài)型断凶。 此外,它允許用戶在他們感興趣的系統(tǒng)中發(fā)現(xiàn)和恢復(fù)細胞狀態(tài)和生態(tài)型巫俺,包括直接從 scRNA-seq 數(shù)據(jù)中恢復(fù)认烁。
安裝EcoTyper
wget https://github.com/digitalcytometry/ecotyper/archive/refs/heads/master.zip
unzip master.zip
cd ecotyper-master
Basic resources
下面列出的 R 包是運行 EcoTyper 所必需的。 版本號表示用于開發(fā)和測試 EcoTyper 代碼的包版本介汹。 其他 R 版本也可能工作:
- R (v3.6.0 and v4.1.0).
- R packages: ComplexHeatmap (v2.2.0 and v2.8.0), NMF (v0.21.0 and v0.23.0), RColorBrewer (v1.1.2), cluster (v2.1.0 and v2.1.2)), circlize (v0.4.10 and v0.4.12), cowplot (v1.1.0 and v1.1.1), data.table (base package R v3.6.0 and v4.1.0), doParallel (v1.0.15 and v1.0.16), ggplot2 (v3.3.2, v3.3.3), grid (base package R v3.6.0 and v4.1.0), reshape2 (v1.4.4), viridis (v0.5.1 and v0.6.1), config (v0.3.1), argparse (v2.0.3), colorspace (v1.4.1 and v2.0.1), plyr (v1.8.6).
這些包與預(yù)先存儲在 EcoTyper 文件夾中的其他資源一起却嗡,允許:
- 在自己的bulk RNA-seq、微陣列和 scRNA-seq 數(shù)據(jù)中執(zhí)行先前定義的細胞狀態(tài)和生態(tài)型的恢復(fù) .
- perform cell state and ecotype discovery in scRNA-seq and pre-sorted cell type-specific profiles
當然了嘹承,我們關(guān)注單細胞和空間數(shù)據(jù)的分析
When the input is scRNA-seq or bulk-sorted cell type-specific profiles (e.g., FACS-purified), EcoTyper performs the following major steps for discovering cell states and ecotypes:
- Gene filtering: This step filters out genes that do not show cell type specificity.
- Cell state discovery: This step enables identification and quantitation of cell type-specific transcriptional states.
- Ecotype discovery: This step enables co-assignment of cell states into multicellular communities (ecotypes).
無論用于導(dǎo)出細胞狀態(tài)和生態(tài)型的輸入類型如何窗价,EcoTyper 都可以在外部表達數(shù)據(jù)集中執(zhí)行細胞狀態(tài)和生態(tài)型恢復(fù)。 可以bulk叹卷、scRNA-seq 和空間轉(zhuǎn)錄組數(shù)據(jù)進行恢復(fù)撼港。
我們來看看示例,提供的 scRNA-seq 數(shù)據(jù)中恢復(fù)細胞狀態(tài)和生態(tài)型
EcoTyper 預(yù)裝了用戶提供的 scRNA-seq 數(shù)據(jù)中先前在癌癥和淋巴瘤中定義的細胞狀態(tài)和生態(tài)型的參考指導(dǎo)恢復(fù)所需的資源骤竹。
我們先來看看主腳本EcoTyper_recovery_scRNA.R
suppressPackageStartupMessages({
library(argparse)
source("pipeline/lib/config.R")
source("pipeline/lib/multithreading.R")
})
parser <- ArgumentParser(add_help = F)
arguments = parser$add_argument_group('Arguments')
arguments$add_argument("-d", "--discovery", type="character", default="Carcinoma",
help=paste0("The name of the discovery dataset used to define cell states and ecotypes. Accepted values: ",
"'Carcinoma' will recover the cell states and ecotypes defined across carcinomas, as described in the EcoTyper carcinoma paper, ",
"'Lymphoma' will recover the cell states and ecotypes defined in diffuse large B cell lymphoma (DLBCL), as described in the EcoTyper lymphoma paper, ",
"'<MyDiscovery>' the value used in the field 'Discovery dataset name' of the config file used for running EcoTyper discovery ('EcoTyper_discovery.R') script. ",
"[default: '%(default)s']"),
metavar="<character>")
arguments$add_argument("-m", "--matrix", type = "character", metavar="<PATH>",
help="Path to a tab-delimited file containing the input scRNA-seq expression matrix, with gene names on the first column and cell ids as column names [required].")
arguments$add_argument("-a", "--annotation", type = "character", metavar="<PATH>",
help="Path to a tab-delimited annotation file containing the annotation of cells in the input matrix. This file should contain at least two columns, 'ID' with the same values as the columns of the expression matrix, and 'CellType' (case sensitive) which contains the cell type for each cell. These values are limited to the set of cell types analyzed in the discovery dataset. If the argument '-d' is set to 'Carcinoma', then the accepted values for column 'CellType' are: 'B.cells', 'CD4.T.cells', 'CD8.T.cells', 'Dendritic.cells', 'Endothelial.cells', 'Epithelial.cells', 'Fibroblasts', 'Mast.cells', 'Monocytes.and.Macrophages', 'NK.cells', 'PCs' and 'PMNs'. If the argument '-d' is set to 'Lymphoma', then the accepted values for column 'CellType' are: 'B.cells', 'Plasma.cells', 'T.cells.CD8', 'T.cells.CD4', 'T.cells.follicular.helper', 'Tregs', 'NK.cells', 'Monocytes.and.Macrophages', 'Dendritic.cells', 'Mast.cells', 'Neutrophils', 'Fibroblasts', 'Endothelial.cells'. All other values will be ignored for these two cases. Additionally, this file can contain any number of columns, that can be used for plotting color bars in the output heatmaps (see argument '-c'). [required]")
arguments$add_argument("-c", "--columns", type = "character", metavar="<character>", default="NULL",
help="A comma-spearated list of column names from the annotation file to be plotted as color bar in the output heatmaps. [default: '%(default)s']")
arguments$add_argument("-z", "--z-score", type = "character", metavar="<bool>", default="FALSE",
help="A flag indicating whether the significance quantification procedure should be run. Note that this procedure might be slow, as the NMF model is applied 30 times on the same dataset. [default: '%(default)s']")
arguments$add_argument("-s", "--subsample", type = "integer", metavar="<integer>", default=-1,
help="An integer specifying the number of cells each cell type will be downsampled to. For values <50, no downsampling will be performed. [default: '%(default)s' (no downsampling)]")
arguments$add_argument("-t", "--threads", type = "integer", metavar="<integer>", default=10,
help="Number of threads. [default: '%(default)s']")
arguments$add_argument("-o", "--output", type = "character", metavar="<PATH>", default="RecoveryOutput",
help="Output directory path. [default: '%(default)s']")
arguments$add_argument("-h", "--help", action='store_true', help="Print help message.")
args <- parser$parse_args()
#print(args)
if(is.null(args$matrix))
{
parser$print_help()
quit()
}
input_mat = normalizePath(args$matrix)
discovery = args$discovery
annotation_path = normalizePath(args$annotation)
columns = args$columns
z_flag = as.logical(args$z)
n_threads = as.numeric(args$threads)
subsample = as.integer(args$subsample)
if(!file.exists(input_mat))
{
stop("Error: Path to the input expression matrix does not exist!")
}
if(!discovery %in% c("Carcinoma", "Lymphoma"))
{
config_file = file.path("EcoTyper", discovery, "config_used.yml")
if(!file.exists(config_file))
{
stop("Error: Cannot read the config file used for the discovery of cell states and ecotypes. It should be in the following path: '", config_file, "'. Please make sure that the '--discovery (-d)' argument provided is correct!")
}
config <- config::get(file = config_file)
fractions = config$"Input"$"Cell type fractions"
if(is.null(fractions))
{
if(config$"Pipeline settings"$"Filter non cell type specific genes")
{
fractions = "Cell_type_specific_genes"
}else{
fractions = "All_genes"
}
}else{
if(!fractions %in% c("Carcinoma_Fractions", "Lymphoma_Fractions"))
{
fractions = "Custom"
}
}
}else{
fractions = paste0(discovery, "_Fractions")
}
if(!file.exists(annotation_path))
{
stop("Error: Path to the input annotation file does not exist!")
}else{
annotation = read.delim(annotation_path)
if(!all(c("ID", "CellType") %in% colnames(annotation)))
{
stop("Error: The annotation file provided, does not contain the columns 'ID' and 'CellType.")
}
if("Sample" %in% colnames(annotation))
{
recover_ecotypes = T
ecotypes = read.delim(file.path("EcoTyper", discovery, fractions, "Ecotypes", "discovery", "ecotypes.txt"))
disc_cell_types = table(ecotypes$CellType)
rec_cell_types= table(annotation$CellType)
if(length(rec_cell_types) * 2 >= length(disc_cell_types))
{
cat("The annotation file contains column 'Sample', and more than half of the cell types are present in the recovery dataset. Will perform ecotype recovery.\n")
}else{
cat("The annotation file contains column 'Sample', but less than half of the cell types are present in the recovery dataset. Will NOT perform ecotype recovery.\n")
}
}else{
recover_ecotypes = F
cat("The annotation file does not contain column 'Sample'. Will NOT perform ecotype recovery.\n")
}
if(columns != "NULL")
{
additional_columns = strsplit(columns, ",")[[1]]
if(!all(additional_columns %in% colnames(annotation)))
{
stop(paste0("The following columns are missing from the annotation file provided: ", "'",
paste(additional_columns[!additional_columns %in% colnames(annotation)], collapse = "'"), "'"))
}
}else{
additional_columns = c()
}
}
recovery = gsub(".tsv$", "", gsub(".txt$", "", basename(input_mat)))
input_dir = file.path("datasets", "scRNA", recovery)
dir.create(input_dir, recursive = T, showWarning = F)
dir.create(file.path(args$output, recovery), recursive = T, showWarnings = F)
final_output = normalizePath(file.path(args$output, recovery))
PushToJobQueue(paste0("ln -sf ", input_mat, " ", file.path(input_dir, "data.txt")))
RunJobQueue()
PushToJobQueue(paste0("ln -sf ", annotation_path, " ", file.path(input_dir, "annotation.txt")))
RunJobQueue()
start = Sys.time()
cur_dir = getwd()
setwd("pipeline")
key = read.delim(file.path("../EcoTyper", discovery, fractions, "Analysis", "rank_selection", "rank_data.txt"))
key = key[key[,1] %in% annotation$CellType,]
if(nrow(key) == 0)
{
stop("Error: The column 'CellType' of the annotation file contains invalid values!")
}
for(cell_type in key[,1])
{
n_states = key[key[,1] == cell_type, 2]
PushToJobQueue((paste("Rscript state_recovery_scRNA.R", discovery, fractions, cell_type, n_states, recovery, z_flag, subsample, paste(additional_columns, collapse = " "))))
}
RunJobQueue()
if(recover_ecotypes)
{
PushToJobQueue((paste("Rscript ecotypes_recovery_scRNA.R", discovery, fractions, recovery)))
RunJobQueue()
}
cat("\nCopying EcoTyper results to the output folder...\n")
if(file.exists(final_output) && length(list.files(final_output)) > 0)
{
old_results_folder = paste0(final_output, format(Sys.time(), " %a %b %d %X %Y"))
dir.create(old_results_folder, recursive = T, showWarnings = F)
warning(paste0("The output folder contains files from a previous run. Moving those files to: '", old_results_folder, "'"))
system(paste0("mv -f ", final_output, "/* '", old_results_folder, "'"))
}
dir.create(final_output, recursive = T, showWarnings = F)
key = read.delim(file.path("../EcoTyper", discovery, fractions, "Analysis", "rank_selection", "rank_data.txt"))
key = key[key[,1] %in% annotation$CellType,]
for(cell_type in key[,1])
{
n_clusters = key[key[,1] == cell_type, 2]
ct_output = file.path(final_output, cell_type)
dir.create(ct_output, recursive = T, showWarnings = F)
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "recovery", "scRNA", recovery, cell_type, n_clusters, "state_assignment.txt"), ct_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "recovery", "scRNA", recovery, cell_type, n_clusters, "state_assignment_heatmap.pdf"), ct_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "recovery", "scRNA", recovery, cell_type, n_clusters, "state_assignment_heatmap.png"), ct_output))
if(z_flag)
{
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "recovery", "scRNA", recovery, cell_type, n_clusters, "recovery_z_scores.txt"), ct_output))
}
}
if(recover_ecotypes)
{
ct_output = file.path(final_output, "Ecotypes")
dir.create(ct_output, recursive = T, showWarnings = F)
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Ecotypes", "recovery", recovery, "ecotype_assignment.txt"), ct_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Ecotypes", "recovery", recovery, "ecotype_abundance.txt"), ct_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Ecotypes", "recovery", recovery, "heatmap_assigned_samples_viridis.pdf"), ct_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Ecotypes", "recovery", recovery, "heatmap_assigned_samples_viridis.png"), ct_output))
}
end = Sys.time()
cat(paste0("\nEcoTyper finished succesfully! Please find the results in: '", final_output, "'.\nRun time: ", format(end - start, digits = 0), "\n"))
運行一下帝牡,
Rscript EcoTyper_recovery_scRNA.R -h
## usage: EcoTyper_recovery_scRNA.R [-d <character>] [-m <PATH>] [-a <PATH>]
## [-c <character>] [-z <bool>] [-s <integer>]
## [-t <integer>] [-o <PATH>] [-h]
##
## Arguments:
## -d <character>, --discovery <character>
## The name of the discovery dataset used to define cell
## states and ecotypes. Accepted values: 'Carcinoma' will
## recover the cell states and ecotypes defined across
## carcinomas, as described in the EcoTyper carcinoma
## paper, 'Lymphoma' will recover the cell states and
## ecotypes defined in diffuse large B cell lymphoma
## (DLBCL), as described in the EcoTyper lymphoma paper,
## '<MyDiscovery>' the value used in the field 'Discovery
## dataset name' of the config file used for running
## EcoTyper discovery ('EcoTyper_discovery.R') script.
## [default: 'Carcinoma']
## -m <PATH>, --matrix <PATH>
## Path to a tab-delimited file containing the input
## scRNA-seq expression matrix, with gene names on the
## first column and cell ids as column names [required].
## -a <PATH>, --annotation <PATH>
## Path to a tab-delimited annotation file containing the
## annotation of cells in the input matrix. This file
## should contain at least two columns, 'ID' with the
## same values as the columns of the expression matrix,
## and 'CellType' (case sensitive) which contains the
## cell type for each cell. These values are limited to
## the set of cell types analyzed in the discovery
## dataset. If the argument '-d' is set to 'Carcinoma',
## then the accepted values for column 'CellType' are:
## 'B.cells', 'CD4.T.cells', 'CD8.T.cells',
## 'Dendritic.cells', 'Endothelial.cells',
## 'Epithelial.cells', 'Fibroblasts', 'Mast.cells',
## 'Monocytes.and.Macrophages', 'NK.cells', 'PCs' and
## 'PMNs'. If the argument '-d' is set to 'Lymphoma',
## then the accepted values for column 'CellType' are:
## 'B.cells', 'Plasma.cells', 'T.cells.CD8',
## 'T.cells.CD4', 'T.cells.follicular.helper', 'Tregs',
## 'NK.cells', 'Monocytes.and.Macrophages',
## 'Dendritic.cells', 'Mast.cells', 'Neutrophils',
## 'Fibroblasts', 'Endothelial.cells'. All other values
## will be ignored for these two cases. Additionally,
## this file can contain any number of columns, that can
## be used for plotting color bars in the output heatmaps
## (see argument '-c'). [required]
## -c <character>, --columns <character>
## A comma-spearated list of column names from the
## annotation file to be plotted as color bar in the
## output heatmaps. [default: 'NULL']
## -z <bool>, --z-score <bool>
## A flag indicating whether the significance
## quantification procedure should be run. Note that this
## procedure might be slow, as the NMF model is applied
## 30 times on the same dataset. [default: 'FALSE']
## -s <integer>, --subsample <integer>
## An integer specifying the number of cells each cell
## type will be downsampled to. For values <50, no
## downsampling will be performed. [default: '-1' (no
## downsampling)]
## -t <integer>, --threads <integer>
## Number of threads. [default: '10']
## -o <PATH>, --output <PATH>
## Output directory path. [default: 'RecoveryOutput']
## -h, --help Print help message.
參數(shù)解析:
-d/–discovery:用于定義細胞狀態(tài)的發(fā)現(xiàn)數(shù)據(jù)集的名稱。 默認情況下蒙揣,唯一可接受的值是 Carcinoma 和 Lymphoma(區(qū)分大小寫)靶溜,它們將分別恢復(fù)我們已經(jīng)在癌和淋巴瘤中定義的細胞狀態(tài)。 如果用戶在自己的數(shù)據(jù)中定義了細胞狀態(tài)懒震,則發(fā)現(xiàn)數(shù)據(jù)集的名稱是用于運行細胞狀態(tài)發(fā)現(xiàn)的配置文件的 Discovery dataset name 字段中提供的值罩息。 在教程中,我們將發(fā)現(xiàn)數(shù)據(jù)集的名稱設(shè)置為 Carcinoma个扰。
-m/–matrix:輸入 scRNA-seq 矩陣的路徑扣汪。 scRNA-seq 表達矩陣應(yīng)該是一個制表符分隔的文件,第一列是基因符號锨匆,下一列是細胞崭别。 它應(yīng)該有細胞標識符(例如條形碼)作為列名冬筒,并且應(yīng)該采用 TPM、CPM茅主、FPKM 或任何其他合適的計數(shù)格式舞痰。 基因符號和細胞標識符應(yīng)該是唯一的。 此外诀姚,我們建議列名不要包含由 R 函數(shù) make.names 修改的特殊字符响牛,例如 名稱開頭有數(shù)字或包含空格、制表符或 - 等字符赫段。 本教程中使用的 CRC 癌癥 scRNA-seq 數(shù)據(jù)如下所示:
data = read.delim("example_data/scRNA_CRC_data.txt", nrow = 5)
head(data[,1:5])
## Gene SMC01.T_AAAGATGCATGGATGG SMC01.T_AAAGTAGCAAGGACAC
## 1 A1BG 0 0
## 2 A1CF 0 0
## 3 A2M 0 0
## 4 A2ML1 0 0
## 5 A3GALT2 0 0
## SMC01.T_AAATGCCAGGATCGCA SMC01.T_AACTCTTCACAACGCC
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
- -a/–annotation:制表符分隔的注釋文件的路徑呀打。該文件應(yīng)至少包含兩列:與表達式矩陣的列具有相同值的 ID,以及包含每個細胞的細胞類型的 CellType(區(qū)分大小寫)糯笙。 CellType 列中的值應(yīng)指示每個細胞的細胞類型贬丛。這些值僅限于發(fā)現(xiàn)數(shù)據(jù)集中分析的一組細胞類型。如果參數(shù) -d/–discovery 設(shè)置為 Carcinoma给涕,則接受的列 CellType 值為:'B.cells'豺憔、'CD4.T.cells'、'CD8.T 細胞”够庙、“樹突細胞”恭应、“內(nèi)皮細胞”、“上皮細胞”耘眨、“成纖維細胞”昼榛、“肥大細胞”、“單核細胞和巨噬細胞”剔难、“NK細胞”褒纲、“PCs”和“ PMN'。如果參數(shù) -d/–discovery 設(shè)置為 Lymphoma钥飞,則接受的 CellType 列值為:'B.cells'莺掠、'Plasma.cells'、'T.cells.CD8'读宙、'T.cells.CD4'彻秆、 'T.cells.follicular.helper'、'Tregs'结闸、'NK.cells'唇兑、'單核細胞和巨噬細胞'、'樹突細胞'桦锄、'肥大細胞'扎附、'中性粒細胞'、'成纖維細胞'结耀、'內(nèi)皮細胞留夜。細胞'匙铡。對于這兩種情況,所有其他值都將被忽略碍粥。注釋文件可以包含一個名為 Sample 的列逃顶。如果此列存在雁社,除了細胞狀態(tài)恢復(fù)外棍郎,還將執(zhí)行生態(tài)型恢復(fù)阅签。此外,此文件可以包含任意數(shù)量的列枕面,可用于在輸出熱圖中繪制顏色條(請參閱參數(shù) -c/–columns)愿卒。
data = read.delim("example_data/scRNA_CRC_annotation.txt")
head(data)
## Index Patient Class Sample Cell_type Cell_subtype
## 1 SMC01-T_AAAGATGCATGGATGG SMC01 Tumor SMC01-T Epithelial cells CMS2
## 2 SMC01-T_AAAGTAGCAAGGACAC SMC01 Tumor SMC01-T Epithelial cells CMS2
## 3 SMC01-T_AAATGCCAGGATCGCA SMC01 Tumor SMC01-T Epithelial cells CMS2
## 4 SMC01-T_AACTCTTCACAACGCC SMC01 Tumor SMC01-T Epithelial cells CMS2
## 5 SMC01-T_AACTTTCGTTCGGGCT SMC01 Tumor SMC01-T Epithelial cells CMS2
## 6 SMC01-T_AAGGTTCTCCAATGGT SMC01 Tumor SMC01-T Epithelial cells CMS2
## CellType ID Tissue
## 1 Epithelial.cells SMC01.T_AAAGATGCATGGATGG Tumor
## 2 Epithelial.cells SMC01.T_AAAGTAGCAAGGACAC Tumor
## 3 Epithelial.cells SMC01.T_AAATGCCAGGATCGCA Tumor
## 4 Epithelial.cells SMC01.T_AACTCTTCACAACGCC Tumor
## 5 Epithelial.cells SMC01.T_AACTTTCGTTCGGGCT Tumor
## 6 Epithelial.cells SMC01.T_AAGGTTCTCCAATGGT Tumor
-c/–columns:注釋文件中以逗號分隔的列名列表(請參閱參數(shù) -a/–annotation),在輸出熱圖中繪制為彩條潮秘。 默認情況下琼开,輸出熱圖包含每個細胞分配到的細胞狀態(tài)標簽作為顏色條。 此參數(shù)指示的列名稱將添加到該顏色條中唇跨。
-z/–z-score:指示是否應(yīng)運行顯著性量化程序的標志(默認為 FALSE)。 此過程允許用戶確定在給定數(shù)據(jù)集中是否顯著恢復(fù)了細胞狀態(tài)衬衬。 請注意买猖,此過程可能非常緩慢,因為 NMF 模型在同一數(shù)據(jù)集上應(yīng)用了 30 次滋尉。
-s/–subsample: An integer specifying the number of cells each cell type will be downsampled to. For values <50, no downsampling will be performed. Default: -1 (no downsampling).
-t/–threads: Number of threads. Default: 10.
-o/–output: Output folder. The output folder will be created if it does not exist.
運行一下單細胞數(shù)據(jù)
Rscript EcoTyper_recovery_scRNA.R -d Carcinoma -m example_data/scRNA_CRC_data.txt -a example_data/scRNA_CRC_annotation.txt -o RecoveryOutput
The outputs of this script include the following files, for each cell type provided:
- The assignment of single cells to states:
data = read.delim("RecoveryOutput/scRNA_CRC_data/Fibroblasts/state_assignment.txt")
head(data[,c("ID", "State")])
## ID State
## 1 SMC01.T_TGCGCAGTCGGATGGA S01
## 2 SMC04.T_CACAAACTCTACTATC S01
## 3 SMC15.T_GCGCGATTCATAAAGG S01
## 4 SMC17.T_GTACGTAGTGACTACT S01
## 5 SMC20.T_CTAAGACCACTGTCGG S01
## 6 SMC20.T_GTTACAGTCGCGTTTC S01
兩個熱圖:一個表示發(fā)現(xiàn)數(shù)據(jù)集中細胞狀態(tài)標記基因表達的熱圖玉控,一個表示 scRNA-seq 數(shù)據(jù)集中相同標記基因表達的熱圖,經(jīng)過平滑處理以減輕 scRNA-seq 丟失的影響:
knitr::include_graphics("RecoveryOutput/scRNA_CRC_data/Fibroblasts/state_assignment_heatmap.png")
如果應(yīng)用統(tǒng)計顯著性量化方法狮惜,則每個細胞狀態(tài)的結(jié)果 z 分數(shù)將輸出到同一目錄中:
#data = read.delim("RecoveryOutput/scRNA_CRC_data/Epithelial.cells/recovery_z_scores.txt")
#head(data[,c("State", "Z")])
The output for ecotypes includes:
- The abundance (fraction) of each ecotype in each sample:
assign = read.delim("RecoveryOutput/scRNA_CRC_data/Ecotypes/ecotype_abundance.txt")
dim(assign)
## [1] 10 33
head(assign[,1:5])
## SMC01.N SMC01.T SMC02.N SMC02.T SMC03.N
## CE1 0.03013608 0.151493627 0.10825504 0.21601181 0.02930968
## CE2 0.00000000 0.009612861 0.00000000 0.02671883 0.00000000
## CE3 0.05199290 0.077968658 0.21188075 0.07300246 0.00000000
## CE4 0.01693878 0.043794663 0.03124202 0.00000000 0.00000000
## CE5 0.06285928 0.022531963 0.01115162 0.06406829 0.03197420
## CE6 0.16508230 0.059542004 0.24436913 0.01855854 0.43445664
- The assignment of samples to the carcinoma ecotype with the highest abundance. If the cell state fractions from the dominant ecotype are not significantly higher than the other cell state fractions in a given sample, the sample is considered unassigned and filtered out from this table:
discrete_assignments = read.delim("RecoveryOutput/scRNA_CRC_data/Ecotypes/ecotype_abundance.txt")
dim(discrete_assignments)
## [1] 10 33
head(discrete_assignments[,1:5])
## SMC01.N SMC01.T SMC02.N SMC02.T SMC03.N
## CE1 0.03013608 0.151493627 0.10825504 0.21601181 0.02930968
## CE2 0.00000000 0.009612861 0.00000000 0.02671883 0.00000000
## CE3 0.05199290 0.077968658 0.21188075 0.07300246 0.00000000
## CE4 0.01693878 0.043794663 0.03124202 0.00000000 0.00000000
## CE5 0.06285928 0.022531963 0.01115162 0.06406829 0.03197420
## CE6 0.16508230 0.059542004 0.24436913 0.01855854 0.43445664
- A heatmap of cell state abundances across the samples assigned to ecotypes. Rows correspond to the cell states forming ecotypes, while columns correspond to the samples assigned to ecotypes:
knitr::include_graphics("RecoveryOutput/scRNA_CRC_data/Ecotypes/heatmap_assigned_samples_viridis.png")
示例2高诺,Recovery of Lymphoma Cell States and Ecotypes in scRNA-seq Data
Rscript EcoTyper_recovery_scRNA.R -d Lymphoma -m example_data/scRNA_lymphoma_data.txt -a example_data/scRNA_lymphoma_annotation.txt -o RecoveryOutput -c Tissue
- 輸出The assignment of single cells to states:
data = read.delim("RecoveryOutput/scRNA_lymphoma_data/B.cells/state_assignment.txt")
head(data[,c("ID", "State")])
## ID State
## 1 Cell_2 S01
## 2 Cell_3 S01
## 3 Cell_4 S01
## 4 Cell_6 S01
## 5 Cell_8 S01
## 6 Cell_9 S01
熱圖
knitr::include_graphics("RecoveryOutput/scRNA_lymphoma_data/B.cells/state_assignment_heatmap.png")
Recovery of Cell States and Ecotypes in Spatial Transcriptomics data(空間轉(zhuǎn)錄組示例),當然了碾篡,主要是10Xgenomics的空間數(shù)據(jù)
為了使 EcoTyper 在 Visium 數(shù)據(jù)中執(zhí)行細胞狀態(tài)和生態(tài)類型恢復(fù)虱而,需要提供以下資源:
- the filtered feature-barcode matrices
barcodes.tsv.gz
,features.tsv.gz
andmatrix.mtx.gz
, in the format provided by 10x Genomics, and thetissue_positions_list.csv
file produced by the run summary images pipeline, containing the spatial position of barcodes.
- the filtered feature-barcode matrices
如果要分析的系統(tǒng)中預(yù)期的主要細胞群被 EcoTyper 癌論文中分析的細胞群(B 細胞、CD4 T 細胞开泽、CD8 T 細胞牡拇、樹突細胞、內(nèi)皮細胞穆律、上皮細胞惠呼、成纖維細胞、肥大細胞峦耘、 單核細胞/巨噬細胞剔蹋、NK 細胞、漿細胞辅髓、中性粒細胞)或 EcoTyper 淋巴瘤紙(B 細胞泣崩、CD4 T 細胞少梁、CD8 T 細胞、濾泡輔助 T 細胞律想、Tregs猎莲、樹突細胞、內(nèi)皮細胞技即、成纖維細胞著洼、肥大細胞、單核細胞/ 巨噬細胞而叼、NK細胞身笤、漿細胞、中性粒細胞)葵陵,那么需要:
Docker
Docker containers for CIBERSORTx Fractions and CIBERSORTx HiRes modules, both of which can be obtained from the CIBERSORTx website. Please follow the instructions from the website to install them.
A token required for running the docker containers, which can also be obtained from the CIBERSORTx website.
如果要分析的系統(tǒng)中預(yù)期的主要細胞群沒有被 EcoTyper 癌論文中分析的細胞群(B 細胞液荸、CD4 T 細胞、CD8 T 細胞脱篙、樹突細胞娇钱、內(nèi)皮細胞、上皮細胞绊困、成纖維細胞文搂、肥大細胞)概括 、單核細胞/巨噬細胞秤朗、NK 細胞煤蹭、漿細胞、中性粒細胞)或 EcoTyper 淋巴瘤論文(B 細胞取视、CD4 T 細胞硝皂、CD8 T 細胞、濾泡輔助 T 細胞作谭、Tregs稽物、樹突狀細胞、內(nèi)皮細胞折欠、成纖維細胞姨裸、肥大細胞、單核細胞 /巨噬細胞怨酝、NK細胞傀缩、漿細胞、中性粒細胞)农猬,則用戶需要為這些群體提供自己的細胞類型比例估計:
首先看一下主腳本
suppressPackageStartupMessages({
library(config)
library(argparse)
source("pipeline/lib/config.R")
source("pipeline/lib/misc.R")
source("pipeline/lib/multithreading.R")
})
parser <- ArgumentParser(add_help = F)
arguments = parser$add_argument_group('Arguments')
arguments$add_argument("-c", "--config", type = "character", metavar="<PATH>",
help="Path to the config files [required].")
arguments$add_argument("-h", "--help", action='store_true', help="Print help message.")
args <- parser$parse_args()
if(args$h || is.null(args$config))
{
parser$print_help()
quit()
}
config_file = abspath(args$config)
config <- config::get(file = config_file)
discovery = config$Input$"Discovery dataset name"
recovery = config$Input$"Recovery dataset name"
input_path = config$Input$"Input Visium directory"
fractions = config$Input$"Recovery cell type fractions"
coo = config$Input$"Malignant cell of origin"
CSx_username = config$Input$"CIBERSORTx username"
CSx_token = config$Input$"CIBERSORTx token"
n_threads = config$"Pipeline settings"$"Number of threads"
final_output = config$"Output"$"Output folder"
suppressWarnings({
input_path = abspath(input_path)
final_output = abspath(file.path(final_output, recovery))
})
suppressWarnings({
fractions_path = abspath(fractions)
})
if(!discovery %in% c("Carcinoma", "Lymphoma"))
{
discovery_config_file = file.path("EcoTyper", discovery, "config_used.yml")
if(!file.exists(discovery_config_file))
{
stop("Error: Cannot read the configuration file used for the discovery of cell states and ecotypes. It should be in the following path: '", config_file, "'. Please make sure that the '--discovery (-d)' argument provided is correct!")
}
config <- config::get(file = discovery_config_file)
discovery_fractions = config$"Input"$"Cell type fractions"
if(!is.null(discovery_fractions) && discovery_fractions %in% c("Carcinoma_Fractions", "Lymphoma_Fractions"))
{
fractions = discovery_fractions
}else{
if(!is.null(config$"Input"$"Filter non cell type specific genes"))
{
if(config$"Input"$"Filter non cell type specific genes")
{
fractions = "Cell_type_specific_genes"
}else{
fractions = "All_genes"
}
}else{
fractions = "Custom"
}
}
}else{
fractions = paste0(discovery, "_Fractions")
}
#Starting EcoTyper
setwd("pipeline")
start = Sys.time()
cat("\nLoading visium data...\n")
PushToJobQueue(paste("Rscript spatial_load_visium_data.R", recovery, input_path))
RunJobQueue()
if(fractions %in% c("Carcinoma_Fractions", "Lymphoma_Fractions") && !file.exists(fractions_path))
{
cat("\nRunning CIBERSORTxFractions on the visium dataset...\n")
PushToJobQueue(paste("Rscript csx_fractions.R", "visium", recovery, "LM22", "B_mode", CSx_username, CSx_token, TRUE))
RunJobQueue()
PushToJobQueue(paste("Rscript csx_fractions.R", "visium", recovery, "TR4", "B_mode", CSx_username, CSx_token, TRUE))
RunJobQueue()
PushToJobQueue(paste("Rscript csx_fractions_two_tiered.R", "visium", recovery, "TR4", "B_mode", "LM22", "B_mode", fractions))
RunJobQueue()
coo = "Epithelial.cells"
if(fractions %in% "Lymphoma_Fractions")
{
coo = "B.cells"
}
}else{
cat("\nLoading user-provided cell type fractions...\n")
dir.create(file.path("../CIBERSORTx/fractions/visium", recovery, fractions), recursive = T, showWarnings = F)
PushToJobQueue(paste("cp -f", fractions_path, file.path("../CIBERSORTx/fractions/visium", recovery, fractions, "CIBERSORTx_Adjusted.txt")))
RunJobQueue()
}
cat("\nRunning cell state recovery on the visium dataset...\n")
key = read.delim(file.path("../EcoTyper", discovery, fractions, "Analysis", "rank_selection", "rank_data.txt"))
for(cell_type in key[,1])
{
n_states = key[key[,1] == cell_type, 2]
PushToJobQueue((paste("Rscript state_recovery_visium.R", discovery, fractions, cell_type, n_states, recovery, "FALSE")))
}
RunJobQueue()
cat("\nCalculating cell state abundances...\n")
print(paste("Rscript spatial_states.R", discovery, recovery, fractions, coo))
PushToJobQueue((paste("Rscript spatial_states.R", discovery, recovery, fractions, coo)))
RunJobQueue()
cat("\nCalculating ecotype abundances...\n")
PushToJobQueue((paste("Rscript spatial_ecotypes.R", discovery, recovery, fractions, coo)))
RunJobQueue()
cat("\nPlotting cell state heatmaps...\n")
key = read.delim(file.path("../EcoTyper", discovery, fractions, "Analysis", "rank_selection", "rank_data.txt"))
for(cell_type in key[,1])
{
n_states = key[key[,1] == cell_type, 2]
PushToJobQueue((paste("Rscript spatial_plot_states.R", discovery, recovery, fractions, coo, cell_type)))
}
RunJobQueue()
PushToJobQueue((paste("Rscript spatial_plot_ecotypes.R", discovery, recovery, fractions, coo)))
RunJobQueue()
cat("\nCopying EcoTyper results to the output folder!\n")
if(file.exists(final_output) && length(list.files(final_output)) > 0)
{
old_results_folder = paste0(final_output, format(Sys.time(), " %a %b %d %X %Y"))
dir.create(old_results_folder, recursive = T, showWarnings = F)
warning(paste0("The output folder contains files from a previous run. Moving those files to: '", old_results_folder, "'"))
system(paste0("mv -f ", final_output, "/* '", old_results_folder, "'"))
}
dir.create(final_output, recursive = T, showWarnings = F)
key = read.delim(file.path("../EcoTyper", discovery, fractions, "Analysis", "rank_selection", "rank_data.txt"))
for(cell_type in key[,1])
{
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "recovery", recovery, paste0(cell_type, "_spatial_heatmaps.pdf")), final_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "recovery", recovery, paste0(cell_type, "_spatial_heatmaps.png")), final_output))
}
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "recovery", recovery, "state_abundances.txt"), final_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "recovery", recovery, "ecotype_abundances.txt"), final_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "recovery", recovery, "Ecotype_spatial_heatmaps.pdf"), final_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "recovery", recovery, "Ecotype_spatial_heatmaps.png"), final_output))
end = Sys.time()
cat(paste0("\nEcoTyper finished succesfully! Please find the results in: '", final_output, "'.\nRun time: ", format(end - start, digits = 0), "\n"))
運行一下
Rscript EcoTyper_recovery_visium.R -h
## usage: EcoTyper_recovery_visium.R [-c <PATH>] [-h]
##
## Arguments:
## -c <PATH>, --config <PATH>
## Path to the config files [required].
## -h, --help Print help message.
The configuration file
This script takes as input file a configuration file in YAML format. The configuration file for this tutorial is available in config_recovery_visium.yml
:
default :
Input :
Discovery dataset name : "Carcinoma"
Recovery dataset name : "VisiumBreast"
Input Visium directory : "example_data/VisiumBreast"
#Path to a file containing the precomputed cell fractions for the visium array
Recovery cell type fractions : "NULL"
Malignant cell of origin : "Epithelial.cells"
CIBERSORTx username : "<Please use your username from the CIBERSORTx website>"
CIBERSORTx token : "<Please obtain a token from the CIBERSORTx website>"
Output :
Output folder : "VisiumOutput"
Pipeline settings :
Number of threads : 10
The configuration file has three sections, Input, Pipeline settings, and Output. We next will describe the expected content in each of these sections, and instruct the user how to set the appropriate settings in their applications.
Input section
The Input section contains settings regarding the input data.
Discovery dataset name
Discovery dataset name : "Carcinoma"
Discovery dataset name should contain the name of the discovery dataset used for defining cell states. By default, the only accepted values are Carcinoma and Lymphoma (case sensitive), which will recover the cell states that we defined across carcinomas and in lymphoma, respectively. If the user defined cell states in their own data (Tutorials 4-6), the name of the discovery dataset is the value provided in the Discovery dataset name field of the configuration file used for running discovery. For this tutorial, we set the name of the discovery dataset to Carcinoma.
Recovery dataset name
Recovery dataset name : "VisiumBreast"
Recovery dataset name is the identifier used by EcoTyper to internally save and retrieve the information about the cell states/ecotypes abundances. Any value that contains alphanumeric characters and ’_’ is accepted for this field.
Input Visium directory
Input Visium directory : "example_data/VisiumBreast"
There are 4 input files needed for recovery on the visium data:
list.files("example_data/VisiumBreast")
## [1] "barcodes.tsv.gz" "features.tsv.gz"
## [3] "matrix.mtx.gz" "tissue_positions_list.csv"
The filtered feature-barcode matrices barcodes.tsv.gz
, features.tsv.gz
and matrix.mtx.gz
, in the format provided by 10x Genomics, and the tissue_positions_list.csv
file produced by the run summary images pipeline, containing the spatial position of barcodes.
Recovery cell type fractions
Recovery cell type fractions : "NULL"
Recovery cell type fractions should contain the path to a file containing the cell type fraction estimations for each spot on the visium array. This field is ignored when the discovery dataset is Carcinoma or Lymphoma or when the discovery has been performed as described in Tutorial 4, using Carcinoma_Fractions or Lymphoma_Fractions. It is only used when users provided their own cell type fractions for deriving cell states and ecotypes in Tutorial 4. In this case, the user needs to provide a path to a tab-delimited file for this field. The file should contain in the first column the same sample names used as column names in the input expression matrix, and in the next columns, the cell type fractions for the same cell populations used for discovering cell states and ecotypes. These fractions should sum up to 1 for each row. An example of such a file is provided in:
data = read.delim("example_data/visium_fractions_example.txt", nrow = 5)
dim(data)
## [1] 5 13
data
## Mixture Fibroblasts Endothelial.cells Epithelial.cells B.cells
## 1 AAACAAGTATCTCCCA.1 0.3747796 0.016948078 0.2164860 0.04116797
## 2 AAACACCAATAACTGC.1 0.1231510 0.028426736 0.6737582 0.02209104
## 3 AAACAGAGCGACTCCT.1 0.2383718 0.085296697 0.3124031 0.03159104
## 4 AAACAGGGTCTATATT.1 0.1178922 0.053757339 0.1128586 0.11847960
## 5 AAACAGTGTTCCTGGG.1 0.3699561 0.005238928 0.5008316 0.01311040
## CD4.T.cells CD8.T.cells Dendritic.cells Mast.cells Monocytes.and.Macrophages
## 1 0.10447645 0.033647446 0.016196773 0.021842932 0.06183330
## 2 0.04376232 0.025219723 0.006647209 0.008375436 0.03087553
## 3 0.04581258 0.028235504 0.025698640 0.020992271 0.04386487
## 4 0.11018235 0.154411312 0.004780762 0.013200087 0.09278191
## 5 0.02826441 0.007037966 0.005238555 0.006979820 0.02851720
## NK.cells PCs PMNs
## 1 0.030228865 0.06911276 0.013279833
## 2 0.006960189 0.02580987 0.004922716
## 3 0.020617429 0.13918007 0.007936014
## 4 0.000000000 0.21863726 0.003018626
## 5 0.007854087 0.02484208 0.002128834
Since in this tutorial we use the Carcinoma dataset as the discovery dataset, this field is not required. However, if it needs to be provided, it can be set as follows:
Recovery cell type fractions : "example_data/visium_fractions_example.txt"
Malignant cell of origin
Malignant cell of origin : "Epithelial.cells"
The cell of origin population for the cancer type being analyzed, amongst the cell types used for discovery. This field is used for plotting a gray background in the resulting output plot, with the intensity of gray depicting the abundance of the cell of origin population in each spot. It is not used when the discovery dataset is Carcinoma or Lymphoma or when the discovery has been performed as described in Tutorials 4-6, using Carcinoma_Fractions or Lymphoma_Fractions. In these cases, the malignant cells are automatically considered to be originating from Epithelial.cells or B.cells, respectively. Otherwise, this field needs to contain a column name in the file provided in Recovery cell type fractions field, corresponding to the appropriate cell type of origin.
CIBERSORTx username and token
CIBERSORTx username : "<Please use your username from the CIBERSORTx website>"
CIBERSORTx token : "<Please obtain a token from the CIBERSORTx website>"
The fields CIBERSORTx username and CIBERSORTx token should contain the username on the CIBERSORTx website and the token necessary to run the CIBERSORTx source code. The token can be obtained from the CIBERSORTx website.
The output section
The Output section contains a single field, Output folder, which specifies the path where the final output will be saved. This folder will be created if it does not exist.
Output folder : "VisiumOutput"
Number of threads
The last section, Pipeline settings, contains only one argument, the number of threads used for performing recovery:
Number of threads : 10
The command line
After editing the configuration file (config_recovery_visium.yml
), the command line for recovering the cell states and ecotypes in Visium Spatial Gene Expression data looks as illustrated below. Please note that this script might take up to two hours to run on 10 threads. Also, since CIBERSORTx is run on each spot, the memory requirements might exceed the memory available on a typical laptop. We recommend that this tutorial is run on a server with >32GB of RAM.
Rscript EcoTyper_recovery_visium.R -c config_recovery_visium.yml
The output format
EcoTyper generates for each cell type the following outputs:
- Cell state abundances:
data = read.delim("VisiumOutput/VisiumBreast/state_abundances.txt")
dim(data)
## [1] 3813 76
head(data[,1:10])
## ID X Y Sample Malignant B.cells_S01 B.cells_S02
## 1 AAACAAGTATCTCCCA.1 50 102 VisiumBreast 0.2164860 0 0
## 2 AAACACCAATAACTGC.1 59 19 VisiumBreast 0.6737582 0 0
## 3 AAACAGAGCGACTCCT.1 14 94 VisiumBreast 0.3124031 0 0
## 4 AAACAGGGTCTATATT.1 47 13 VisiumBreast 0.1128586 1 0
## 5 AAACAGTGTTCCTGGG.1 73 43 VisiumBreast 0.5008316 0 0
## 6 AAACATTTCCCGGATT.1 61 97 VisiumBreast 0.7553180 0 0
## B.cells_S03 B.cells_S04 B.cells_S05
## 1 0 0.0000000 0.8690817
## 2 0 0.0000000 0.0000000
## 3 0 0.0000000 0.0000000
## 4 0 0.0000000 0.0000000
## 5 0 0.2767688 0.0000000
## 6 0 0.3345834 0.0000000
- Plots illustrating the cell state abundance across state from each cell type. The intensity of charcoal represents the cell state abundance. The intensity of gray represents the fraction of the cancer cell of origin population:
knitr::include_graphics("VisiumOutput/VisiumBreast/Fibroblasts_spatial_heatmaps.png")
- Ecotype abundances:
data = read.delim("VisiumOutput/VisiumBreast/ecotype_abundances.txt")
dim(data)
## [1] 3813 15
head(data[,1:10])
## ID X Y Sample Malignant CE1
## VisiumBreast.1 AAACAAGTATCTCCCA.1 50 102 VisiumBreast 0.1142685 0.93179887
## VisiumBreast.2 AAACACCAATAACTGC.1 59 19 VisiumBreast 0.7334114 0.23355315
## VisiumBreast.3 AAACAGAGCGACTCCT.1 14 94 VisiumBreast 0.2441395 0.00000000
## VisiumBreast.4 AAACAGGGTCTATATT.1 47 13 VisiumBreast 0.0000000 0.09941146
## VisiumBreast.5 AAACAGTGTTCCTGGG.1 73 43 VisiumBreast 0.4992702 0.64507665
## VisiumBreast.6 AAACATTTCCCGGATT.1 61 97 VisiumBreast 0.8438427 0.02056875
## CE2 CE3 CE4 CE5
## VisiumBreast.1 0.0000000 0.0000000 0.7497744 0
## VisiumBreast.2 0.2357562 0.0000000 0.0000000 0
## VisiumBreast.3 0.0000000 0.0000000 0.0000000 0
## VisiumBreast.4 0.0000000 0.0000000 0.0000000 0
## VisiumBreast.5 0.0000000 0.1836142 0.0000000 0
## VisiumBreast.6 0.0000000 0.0000000 0.0000000 0
Plots illustrating the ecotype abundances. The intensity of charcoal represents the cell state abundance. The intensity of gray represents the fraction of the cancer cell of origin population:
knitr::include_graphics("VisiumOutput/VisiumBreast/Ecotype_spatial_heatmaps.png")
示例3赡艰,De novo Discovery of Cell States and Ecotypes in scRNA-seq Data
在教程中,將說明如何從 scRNA-seq 表達矩陣開始對細胞狀態(tài)和生態(tài)型進行從頭識別斤葱。 出于說明目的慷垮,我們使用來自結(jié)腸直腸癌的 scRNA-seq 的下采樣版本作為發(fā)現(xiàn)數(shù)據(jù)集揖闸,可在 example_data/scRNA_CRC_data.txt 以及示例注釋文件 example_data/scRNA_CRC_annotation.txt 中獲得。
Overview of the EcoTyper workflow for discovering cell states in scRNA-seq data
EcoTyper 通過一系列步驟從 scRNA-seq 數(shù)據(jù)中獲取細胞狀態(tài)和生態(tài)型:
1料身、提取細胞類型特異性基因:去除在給定細胞類型中未特異性表達的基因汤纸,是降低識別虛假細胞狀態(tài)可能性的重要考慮因素。在 scRNA-seq 數(shù)據(jù)中執(zhí)行細胞狀態(tài)發(fā)現(xiàn)之前芹血,Ecotyper 默認應(yīng)用非細胞類型特定基因的過濾器贮泞。具體來說,它在來自給定細胞類型的細胞和所有其他細胞類型組合的細胞之間執(zhí)行差異表達幔烛。對于細胞類型的計算效率和平衡表示啃擦,此步驟僅使用每種細胞類型最多 500 個隨機選擇的細胞。從每種細胞類型中過濾掉 Q 值 > 0.05(雙邊 Wilcox 檢驗饿悬,使用 Benjamini-Hochberg 校正進行多假設(shè)校正)的基因令蛉。
2、相關(guān)矩陣上的細胞狀態(tài)發(fā)現(xiàn):EcoTyper 利用非負矩陣分解 (NMF) 從單細胞表達譜中識別轉(zhuǎn)錄定義的細胞狀態(tài)狡恬。然而珠叔,直接應(yīng)用于 scRNA-seq 表達矩陣的 NMF 可能表現(xiàn)不佳,因為 scRNA-seq 數(shù)據(jù)通常是稀疏的弟劲。因此祷安,EcoTyper 首先將 NMF 應(yīng)用于來自給定細胞類型的每對細胞之間的相關(guān)矩陣。為了提高計算效率函卒,EcoTyper 在此步驟中最多僅使用 2,500 個隨機選擇的細胞辆憔。為了滿足 NMF 的非負性要求撇眯,相關(guān)矩陣使用 posneg 變換單獨處理报嵌。此函數(shù)將相關(guān)矩陣 Vi 轉(zhuǎn)換為兩個矩陣,一個僅包含正值熊榛,另一個僅包含符號反轉(zhuǎn)的負值锚国。這兩個矩陣隨后被連接以產(chǎn)生 Vi*。
對于每種細胞類型玄坦,EcoTyper 將 NMF 應(yīng)用于一系列等級(細胞狀態(tài)數(shù))血筑,默認為 2-20 個狀態(tài)。對于每個等級煎楣,NMF 算法使用不同的起始種子多次應(yīng)用(我們建議至少 50 次)豺总,以提高魯棒性。
3择懂、選擇細胞狀態(tài)數(shù):Cluster(狀態(tài))數(shù)選擇是 NMF 應(yīng)用中的一個重要考慮因素喻喳。我們發(fā)現(xiàn),以前依賴最小化誤差度量(例如困曙,RMSE表伦、KL 散度)或優(yōu)化信息論度量的方法要么無法收斂谦去,要么依賴于估算的基因數(shù)量。相比之下蹦哼,cophenetic 系數(shù)量化了給定等級(即Cluster的數(shù)量)的分類穩(wěn)定性鳄哭,范圍從 0 到 1,其中 1 是最大穩(wěn)定的纲熏。雖然通常選擇共同系數(shù)開始下降的等級妆丘,但這種方法很難應(yīng)用于共同系數(shù)在等級間呈現(xiàn)多模態(tài)形狀的情況,正如我們在某些細胞類型中發(fā)現(xiàn)的那樣赤套。因此飘痛,我們開發(fā)了一種更適合此類設(shè)置的啟發(fā)式方法。在每種情況下容握,排名是根據(jù)在 2-20 個Clusters范圍內(nèi)評估的共生系數(shù)自動選擇的(默認情況下)宣脉。具體來說,我們確定了在 2-20 區(qū)間內(nèi)的第一次出現(xiàn)剔氏,其共相系數(shù)降至 0.95 以下(默認情況下)塑猖,至少連續(xù)兩個等級高于該水平。然后谈跛,我們選擇了緊鄰該交叉點的等級羊苟,該等級最接近 0.95(默認情況下)。
4感憾、提取細胞狀態(tài)信息:解析第 2 步產(chǎn)生的 NMF 輸出蜡励,提取細胞狀態(tài)信息用于下游分析。
5阻桅、在表達矩陣上重新發(fā)現(xiàn)細胞狀態(tài):在識別相關(guān)矩陣上的細胞狀態(tài)之后凉倚,EcoTyper 執(zhí)行差異表達以識別與每個細胞狀態(tài)最高度相關(guān)的基因。生成的標記按每個狀態(tài)的倍數(shù)變化進行排序嫂沉,并選擇跨細胞狀態(tài)排名最高的前 1000 個基因用于新一輪的 NMF稽寒。如果可用的基因少于 1000 個,則選擇所有基因趟章。在 NMF 之前杏糙,每個基因都被縮放為均值 0 和單位方差。為了滿足 NMF 的非負性要求蚓土,細胞類型特異性表達矩陣使用 posneg 轉(zhuǎn)換單獨處理宏侍。此函數(shù)將輸入表達式矩陣 Vi 轉(zhuǎn)換為兩個矩陣,一個僅包含正值蜀漆,另一個僅包含符號反轉(zhuǎn)的負值谅河。這兩個矩陣隨后被連接以產(chǎn)生 Vi*。
對于每種細胞類型,EcoTyper 僅將 NMF 應(yīng)用于步驟 3 中選擇的等級旧蛾。和以前一樣莽龟,NMF 算法應(yīng)用多次(我們建議至少 50 次)具有不同的起始種子,以實現(xiàn)穩(wěn)健性锨天。6毯盈、提取細胞狀態(tài)信息:解析第 5 步產(chǎn)生的 NMF 輸出,提取細胞狀態(tài)信息用于下游分析病袄。
7搂赋、細胞狀態(tài) QC 過濾器:雖然 posneg 變換需要滿足標準化后 NMF 的非負性約束,但它會導(dǎo)致識別由負值多于正值的特征驅(qū)動的虛假細胞狀態(tài)益缠。 為了解決這個問題脑奠,我們設(shè)計了一個自適應(yīng)誤報指數(shù)(AFI),這是一個新的指數(shù)幅慌,定義為 W 矩陣中對應(yīng)于負特征和正特征的權(quán)重之和之間的比率宋欺。 EcoTyper 自動過濾具有 AFI > = 1 的狀態(tài)。
8胰伍、生態(tài)型(細胞群落)發(fā)現(xiàn):生態(tài)型或細胞群落是通過識別樣本中細胞狀態(tài)的共現(xiàn)模式得出的齿诞。首先,EcoTyper 利用 Jaccard 指數(shù)來量化發(fā)現(xiàn)隊列中樣本中每對細胞狀態(tài)之間的重疊程度骂租。為此祷杈,它將每個細胞狀態(tài) q 離散化為長度為 l 的二進制向量 a,其中 l = 發(fā)現(xiàn)隊列中的樣本數(shù)渗饮〉總的來說,這些向量包括二進制矩陣 A互站,其行數(shù)與跨細胞類型和 l 列(樣本)的細胞狀態(tài)相同私蕾。給定樣本 s,如果狀態(tài) q 是細胞類型 i 中所有狀態(tài)中最豐富的狀態(tài)云茸,則 EcoTyper 將 A(q,s) 設(shè)置為 1是目;否則 A(q,s) ← 0谤饭。然后它計算矩陣 A 中行(狀態(tài))上的所有成對 Jaccard 索引标捺,產(chǎn)生矩陣 J。使用超幾何檢驗揉抵,它評估任何給定的細胞狀態(tài)對 q 和k 沒有重疊亡容。在超幾何 p 值 > 0.01 的情況下,J(q,k) 的 Jaccard 指數(shù)設(shè)置為 0(即沒有重疊)冤今。為了在適應(yīng)異常值的同時識別社區(qū)闺兢,更新的 Jaccard 矩陣 J' 使用具有歐幾里得距離的平均鏈接進行層次聚類(R stats 包中的 hclust)。然后通過輪廓寬度最大化來確定最佳聚類數(shù)。從進一步分析中排除具有少于 3 個細胞狀態(tài)的Cluster屋谭。
Checklist before performing cell states and ecotypes discovery in scRNA-seq data
- a user-provided scRNA-seq expression matrix, on which the discovery will be performed (a discovery cohort). For this tutorial, we will use the example data in example_data/scRNA_CRC_data.txt.
- a sample annotation file, such as the one provided in example_data/scRNA_CRC_annotation.txt, with at least three columns: ID, CellType and Sample.
Cell states and ecotypes discovery in scRNA-seq data脚囊,先看一下主腳本EcoTyper_discovery_scRNA.R
suppressPackageStartupMessages({
library(config)
library(argparse)
source("pipeline/lib/config.R")
source("pipeline/lib/misc.R")
source("pipeline/lib/multithreading.R")
})
parser <- ArgumentParser(add_help = F)
arguments = parser$add_argument_group('Arguments')
arguments$add_argument("-c", "--config", type = "character", metavar="<PATH>",
help="Path to the config files [required].")
arguments$add_argument("-h", "--help", action='store_true', help="Print help message.")
args <- parser$parse_args()
#print(args)
if(args$h || is.null(args$config))
{
parser$print_help()
quit()
}
config_file = abspath(args$config)
config <- config::get(file = config_file)
check_discovery_configuration_scRNA(config)
discovery = config$Input$"Discovery dataset name"
discovery_type = config$Input$"Expression type"
scale_column = config$Input$"Annotation file column to scale by"
additional_columns = config$Input$"Annotation file column(s) to plot"
final_output = config$"Output"$"Output folder"
n_threads = config$"Pipeline settings"$"Number of threads"
nmf_restarts = config$"Pipeline settings"$"Number of NMF restarts"
max_clusters = config$"Pipeline settings"$"Maximum number of states per cell type"
cohpenetic_cutoff = config$"Pipeline settings"$"Cophenetic coefficient cutoff"
skip_steps = config$"Pipeline settings"$"Pipeline steps to skip"
p_value_cutoff = config$"Pipeline settings"$"Jaccard matrix p-value cutoff"
suppressWarnings({
final_output = abspath(final_output)
})
#Starting EcoTyper
setwd("pipeline")
start = Sys.time()
if(config$"Pipeline settings"$"Filter non cell type specific genes")
{
fractions = "Cell_type_specific_genes"
}else{
fractions = "All_genes"
}
if(!1 %in% skip_steps & config$"Pipeline settings"$"Filter non cell type specific genes")
{
cat("\nStep 1 (extract cell type specific genes)...\n")
annotation = read.delim(file.path(file.path("../datasets/discovery", discovery, "annotation.txt")))
cell_types = unlist(levels(as.factor(as.character(annotation$CellType))))
for(cell_type in cell_types)
{
print(cell_type)
PushToJobQueue(paste("Rscript state_discovery_scRNA_filter_genes.R", discovery, fractions, cell_type, scale_column))
}
RunJobQueue()
cat("Step 1 (extract cell type specific genes) finished successfully!\n")
}else{
cat("Skipping step 1 (extract cell type specific genes)...\n")
}
if(!2 %in% skip_steps)
{
cat("\nStep 2 (cell state discovery on correrlation matrices): Calculating correlation matrices...\n")
annotation = read.delim(file.path(file.path("../datasets/discovery", discovery, "annotation.txt")))
cell_types = unlist(levels(as.factor(as.character(annotation$CellType))))
for(cell_type in cell_types)
{
filter_genes = (fractions == "Cell_type_specific_genes")
PushToJobQueue(paste("Rscript state_discovery_scRNA_distances.R", discovery, fractions, cell_type, filter_genes, scale_column))
}
RunJobQueue()
cat("Step 2 (cell state discovery on correrlation matrices): Running NMF (Warning: This step might take a long time!)...\n")
for(cell_type in cell_types)
{
if(!file.exists(file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery_cross_cor", cell_type, "expression_top_genes_scaled.txt")))
{
next
}
for(n_clusters in 2:max_clusters)
{
for(restart in 1:nmf_restarts)
{
if(!file.exists(file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery_cross_cor", cell_type, n_clusters, "restarts", restart, "estim.RData")))
{
PushToJobQueue(paste("Rscript state_discovery_NMF.R", "discovery_cross_cor", discovery, fractions, cell_type, n_clusters, restart))
}else{
cat(paste0("Warning: Skipping NMF on '", cell_type, "' (number of states = ", n_clusters, ", restart ", restart, "), as the output file '", file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery", cell_type, n_clusters, "restarts", restart, "estim.RData"), "' already exists!\n"))
}
}
}
}
RunJobQueue()
cat("Step 2 (cell state discovery on correrlation matrices): Aggregating NMF results...\n")
for(cell_type in cell_types)
{
if(!file.exists(file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery_cross_cor", cell_type, "expression_top_genes_scaled.txt")))
{
next
}
PushToJobQueue(paste("Rscript state_discovery_combine_NMF_restarts.R", "discovery_cross_cor", discovery, fractions, cell_type, max_clusters, nmf_restarts))
}
RunJobQueue()
cat("Step 2 (cell state discovery on correrlation matrices) finished successfully!\n")
}else{
cat("Skipping step 2 (cell state discovery on correrlation matrices)...\n")
}
if(!3 %in% skip_steps)
{
cat("\nStep 3 (choosing the number of cell states)...\n")
PushToJobQueue(paste("Rscript state_discovery_rank_selection.R", "discovery_cross_cor", discovery, fractions, max_clusters, cohpenetic_cutoff))
RunJobQueue()
cat("Step 3 (choosing the number of cell states) finished successfully!\n")
}else{
cat("Skipping step 3 (choosing the number of cell states)...\n")
}
if(!4 %in% skip_steps)
{
cat("\nStep 4 (extracting cell state information)...\n")
system(paste("cp -f ", config_file, file.path("../EcoTyper", discovery, "config_used.yml")))
key = read.delim(file.path("../EcoTyper", discovery, fractions, "Analysis", "rank_selection", "rank_data.txt"))
for(cell_type in key[,1])
{
cat(paste("Extracting cell states information for:", cell_type, "\n"))
n_clusters = key[key[,1] == cell_type, 2]
PushToJobQueue(paste("Rscript state_discovery_initial_plots_scRNA.R", "discovery_cross_cor", discovery, fractions, cell_type, n_clusters, "State", paste(additional_columns, collapse = " ")))
}
RunJobQueue()
cat("Step 4 (extracting cell state information) finished successfully!\n")
}else{
cat("\nSkipping step 4 (extracting cell state information)...\n")
}
if(!5 %in% skip_steps)
{
cat("\nStep 5 (cell state re-discovery in expression matrices)...\n")
key = read.delim(file.path("../EcoTyper", discovery, fractions, "Analysis", "rank_selection", "rank_data.txt"))
for(cell_type in key[,1])
{
cat(paste("Extracting marker genes for cell states defined in:", cell_type, "\n"))
n_clusters = key[key[,1] == cell_type, 2]
PushToJobQueue(paste("Rscript state_discovery_extract_features_scRNA.R", discovery, fractions, cell_type, n_clusters))
}
RunJobQueue()
cat("\nStep 5 (cell state re-discovery in expression matrices): Running NMF on expression matrix...\n")
key = read.delim(file.path("../EcoTyper", discovery, fractions, "Analysis", "rank_selection", "rank_data.txt"))
for(cell_type in key[,1])
{
if(!file.exists(file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery", cell_type, "expression_top_genes_scaled.txt")))
{
next
}
n_clusters = key[key[,1] == cell_type, 2]
for(restart in 1:nmf_restarts)
{
if(!file.exists(file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery", cell_type, n_clusters, "restarts", restart, "estim.RData")))
{
PushToJobQueue(paste("Rscript state_discovery_NMF.R", "discovery", discovery, fractions, cell_type, n_clusters, restart))
}else{
cat(paste0("Warning: Skipping NMF on '", cell_type, "' (number of states = ", n_clusters, ", restart ", restart, "), as the output file '", file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery", cell_type, n_clusters, "restarts", restart, "estim.RData"), "' already exists!\n"))
}
}
}
RunJobQueue()
cat("Step 5 (cell state re-discovery in expression matrices): Aggregating NMF results...\n")
for(cell_type in key[,1])
{
if(!file.exists(file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery", cell_type, "expression_top_genes_scaled.txt")))
{
next
}
PushToJobQueue(paste("Rscript state_discovery_combine_NMF_restarts.R", "discovery", discovery, fractions, cell_type, max_clusters, nmf_restarts))
}
RunJobQueue()
cat("Step 5 (cell state re-discovery in expression matrices) finished successfully!\n")
}else{
cat("Skipping step 5 (cell state re-discovery in expression matrices)...\n")
}
if(!6 %in% skip_steps)
{
cat("\nStep 6 (extracting information for re-discovered cell states)...\n")
key = read.delim(file.path("../EcoTyper", discovery, fractions, "Analysis", "rank_selection", "rank_data.txt"))
for(cell_type in key[,1])
{
cat(paste("Extracting cell states information for:", cell_type, "\n"))
n_clusters = key[key[,1] == cell_type, 2]
PushToJobQueue(paste("Rscript state_discovery_initial_plots.R", "discovery", discovery, fractions, cell_type, n_clusters, "State", paste(additional_columns, collapse = " ")))
}
RunJobQueue()
cat("Step 6 (extracting information for re-discovered cell states) finished successfully!\n")
}else{
cat("\nSkipping step 6 (extracting information for re-discovered cell states)...\n")
}
if(!7 %in% skip_steps)
{
cat("\nStep 7 (cell state QC filter)...\n")
key = read.delim(file.path("../EcoTyper", discovery, fractions, "Analysis", "rank_selection", "rank_data.txt"))
for(cell_type in key[,1])
{
cat(paste("Filtering low-quality cell states for:", cell_type, "\n"))
n_clusters = key[key[,1] == cell_type, 2]
PushToJobQueue(paste("Rscript state_discovery_first_filter_scRNA.R", discovery, fractions, cell_type, n_clusters, "State", paste(additional_columns, collapse = " ")))
}
RunJobQueue()
cat("Step 7 (cell state QC filter) finished successfully!\n")
}else{
cat("\nSkipping step 7 (cell state QC filter)...\n")
}
if(!8 %in% skip_steps)
{
cat("\nStep 8 (ecotype discovery)...\n")
PushToJobQueue(paste("Rscript ecotypes_scRNA.R", discovery, fractions, p_value_cutoff))
RunJobQueue()
PushToJobQueue(paste("Rscript ecotypes_assign_samples_scRNA.R", discovery, fractions, "State",paste(additional_columns, collapse = " ")))
cat("Step 8 (ecotype discovery) finished successfully!\n")
RunJobQueue()
}else{
cat("Skipping step 8 (ecotype discovery)...\n")
}
cat("\nCopying EcoTyper results to the output folder!\n")
if(file.exists(final_output) && length(list.files(final_output)) > 0)
{
old_results_folder = paste0(final_output, format(Sys.time(), " %a %b %d %X %Y"))
dir.create(old_results_folder, recursive = T, showWarnings = F)
warning(paste0("The output folder contains files from a previous run. Moving those files to: '", old_results_folder, "'"))
system(paste0("mv -f ", final_output, "/* '", old_results_folder, "'"))
}
dir.create(final_output, recursive = T, showWarnings = F)
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Analysis", "rank_selection", "rank_data.txt"), final_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Analysis", "rank_selection", "rank_plot.pdf"), final_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Analysis", "rank_selection", "rank_plot.png"), final_output))
key = read.delim(file.path("../EcoTyper", discovery, fractions, "Analysis", "rank_selection", "rank_data.txt"))
for(cell_type in key[,1])
{
n_clusters = key[key[,1] == cell_type, 2]
ct_output = file.path(final_output, cell_type)
dir.create(ct_output, recursive = T, showWarnings = F)
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery", cell_type, n_clusters, "gene_info.txt"), ct_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery", cell_type, n_clusters, "state_abundances.txt"), ct_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery", cell_type, n_clusters, "state_assignment.txt"), ct_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery", cell_type, n_clusters, "state_assignment_heatmap.pdf"), ct_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery", cell_type, n_clusters, "state_assignment_heatmap.png"), ct_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery", cell_type, n_clusters, "heatmap_data.txt"), ct_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Cell_States", "discovery", cell_type, n_clusters, "heatmap_top_ann.txt"), ct_output))
}
ct_output = file.path(final_output, "Ecotypes")
dir.create(ct_output, recursive = T, showWarnings = F)
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Ecotypes", "discovery", "ecotypes.txt"), ct_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Ecotypes", "discovery", "ecotype_assignment.txt"), ct_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Ecotypes", "discovery", "ecotype_abundance.txt"), ct_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Ecotypes", "discovery", "heatmap_assigned_samples_viridis.pdf"), ct_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Ecotypes", "discovery", "heatmap_assigned_samples_viridis.png"), ct_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Ecotypes", "discovery", "jaccard_matrix.pdf"), ct_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Ecotypes", "discovery", "jaccard_matrix.png"), ct_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Ecotypes", "discovery", "nclusters_jaccard.png"), ct_output))
system(paste("cp -f", file.path("../EcoTyper", discovery, fractions, "Ecotypes", "discovery", "nclusters_jaccard.pdf"), ct_output))
end = Sys.time()
cat(paste0("\nEcoTyper finished succesfully! Please find the results in: '", final_output, "'.\nRun time: ", format(end - start, digits = 0), "\n"))
Rscript EcoTyper_discovery_scRNA.R -h
## usage: EcoTyper_discovery_scRNA.R [-c <PATH>] [-h]
##
## Arguments:
## -c <PATH>, --config <PATH>
## Path to the config files [required].
## -h, --help Print help message.
該腳本將 YAML 格式的配置文件作為輸入文件。 本教程的配置文件位于 config_discovery_scRNA.yml 中:
default :
Input :
Discovery dataset name : "discovery_scRNA_CRC"
Expression matrix : "example_data/scRNA_CRC_data.txt"
Annotation file : "example_data/scRNA_CRC_annotation.txt"
Annotation file column to scale by : NULL
Annotation file column(s) to plot : []
Output :
Output folder : "DiscoveryOutput_scRNA"
Pipeline settings :
#Pipeline steps:
# step 1 (extract cell type specific genes)
# step 2 (cell state discovery on correrlation matrices)
# step 3 (choosing the number of cell states)
# step 4 (extracting cell state information)
# step 5 (cell state re-discovery in expression matrices)
# step 6 (extracting information for re-discovered cell states)
# step 7 (cell state QC filter)
# step 8 (ecotype discovery)
Pipeline steps to skip : []
Filter non cell type specific genes : True
Number of threads : 10
Number of NMF restarts : 5
Maximum number of states per cell type : 20
Cophenetic coefficient cutoff : 0.95
#The p-value cutoff used for filtering non-significant overlaps in the jaccard matrix used for discovering ecotypes in step 8. Default: 1 (no filtering).
Jaccard matrix p-value cutoff : 1
配置文件包含三個部分桐磁,輸入悔耘、輸出和管道設(shè)置。 接下來我擂,我們將分別描述這三個部分的預(yù)期內(nèi)容衬以,并指導(dǎo)用戶如何在其應(yīng)用程序中設(shè)置適當?shù)脑O(shè)置。
Input section
The Input section contains settings regarding the input data.
Discovery dataset name
Discovery dataset name is the identifier used by EcoTyper to internally save and retrieve the information about the cell states/ecotypes defined on this discovery dataset. It is also the name to be provided to the -d/–discovery argument of scripts EcoTyper_recovery_scRNA.R
and EcoTyper_recovery_bulk.R
, when performing cell state/ecotypes recovery. Any value that contains alphanumeric characters and ’_’ is accepted for this field.
Discovery dataset name : "discovery_scRNA_CRC"
Expression matrix
Expression matrix : "example_data/scRNA_CRC_data.txt"
Expression matrix field should contain the path to a tab-delimited file containing the expression data, with genes as rows and cells as columns. The expression matrix should be in the TPM, CPM or other suitable normalized space. It should have gene symbols on the first column and gene counts for each cell on the next columns. Column (cells) names should be unique. Also, we recommend that the column names do not contain special characters that are modified by the R function make.names, e.g. having digits at the beginning of the name or containing characters such as space, tab or -:
The expected format for the expression matrix is:
data = read.delim("example_data/scRNA_CRC_data.txt", nrow = 5)
dim(data)
## [1] 5 13781
head(data[,1:5])
## Gene SMC01.T_AAAGATGCATGGATGG SMC01.T_AAAGTAGCAAGGACAC
## 1 A1BG 0 0
## 2 A1CF 0 0
## 3 A2M 0 0
## 4 A2ML1 0 0
## 5 A3GALT2 0 0
## SMC01.T_AAATGCCAGGATCGCA SMC01.T_AACTCTTCACAACGCC
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
Annotation file
Annotation file : "example_data/scRNA_CRC_annotation.txt"
A path to an annotation file should be provided in the field Annotation file. This file should contain a column called ID with the same names (e.g. cell barcodes) as the columns of the expression matrix, a column called CellType indicating cell type for each cell, and a column called Sample indicating the sample identifier for each cell. The latter is used for ecotype discovery. This file can contain any number of additional columns. The additional columns can be used for defining sample batches (see Section Annotation file column to scale by below) and for plotting color bars in the heatmaps output (see Section Annotation file column(s) to plot below). For the current example, the annotation file has the following format:
annotation = read.delim("example_data/scRNA_CRC_annotation.txt", nrow = 5)
dim(annotation)
## [1] 5 9
head(annotation)
## Index Patient Class Sample Cell_type Cell_subtype
## 1 SMC01-T_AAAGATGCATGGATGG SMC01 Tumor SMC01-T Epithelial cells CMS2
## 2 SMC01-T_AAAGTAGCAAGGACAC SMC01 Tumor SMC01-T Epithelial cells CMS2
## 3 SMC01-T_AAATGCCAGGATCGCA SMC01 Tumor SMC01-T Epithelial cells CMS2
## 4 SMC01-T_AACTCTTCACAACGCC SMC01 Tumor SMC01-T Epithelial cells CMS2
## 5 SMC01-T_AACTTTCGTTCGGGCT SMC01 Tumor SMC01-T Epithelial cells CMS2
## CellType ID Tissue
## 1 Epithelial.cells SMC01.T_AAAGATGCATGGATGG Tumor
## 2 Epithelial.cells SMC01.T_AAAGTAGCAAGGACAC Tumor
## 3 Epithelial.cells SMC01.T_AAATGCCAGGATCGCA Tumor
## 4 Epithelial.cells SMC01.T_AACTCTTCACAACGCC Tumor
## 5 Epithelial.cells SMC01.T_AACTTTCGTTCGGGCT Tumor
Annotation file column to scale by
Annotation file column to scale by : NULL
In order to discover pan-carcinoma cell states and ecotypes in the EcoType carcinoma paper, we standardize genes to mean 0 and unit variance within each tumor type (histology). Field Annotation file column to scale by allows users to specify a column name in the annotation file, by which the cells will be grouped when performing standardization. However, this is an analytical choice, depending on the purpose of the analysis. If the users are interested in defining cell states and ecotypes regardless of tumor type-specificity, this argument can be set to NULL. In this case, the standardization will be applied across all samples in the discovery cohort. The same will happen if the annotation file is not provided.
In the current example, this field is not used and therefore set to NULL.
Annotation file column(s) to plot
Annotation file column(s) to plot : ["Histology", "Tissue"]
Annotation file column(s) to plot field specifies which columns in the annotation file will be used as color bar in the output heatmaps, in addition to the cell state label column, plotted by default.
The output section
The Output section contains a single field, Output folder, which specifies the path where the final output will be saved. This folder will be created if it does not exist.
Output folder : "DiscoveryOutput_scRNA"
Pipeline settings
The last section, Pipeline settings, contains settings controlling how EcoTyper is run.
Pipeline steps to skip
#Pipeline steps:
# step 1 (extract cell type specific genes)
# step 2 (cell state discovery on correlation matrices)
# step 3 (choosing the number of cell states)
# step 4 (extracting cell state information)
# step 5 (cell state re-discovery in expression matrices)
# step 6 (extracting information for re-discovered cell states)
# step 7 (cell state QC filter)
# step 8 (ecotype discovery)
The Pipeline steps to skip option allows user to skip some of the steps outlined in section Overview of the EcoTyper workflow for discovering cell states. Please note that this option is only intended for cases when the pipeline had already been run once, and small adjustments are made to the parameters. For example, if the Cophenetic coefficient cutoff used in step 3 needs adjusting, the user might want to skip steps 1-2 and re-run from step 3 onwards.
Filter non cell type specific genes
Filter non cell type specific genes : True
Flag indicated whether to apply the filter for cell type specific genes in step 1, outlined in section Overview of the EcoTyper workflow for discovering cell states. For best results, we do recommend applying this filter.
Number of threads
Number of threads : 10
The number of threads EcoTyper will be run on.
Number of NMF restarts
Number of NMF restarts : 5
The NMF approach used by EcoTyper (Brunet et al.), can give slightly different results, depending on the random initialization of the algorithm. To obtain a stable solution, NMF is generally run multiple times with different seeds, and the solution that best explains the discovery data is chosen. Additionally, the variation of NMF solutions across restarts with different seeds is quantified using Cophenetic coefficients and used in step 4 of EcoTyper for selecting the number of states. The parameter Number of NMF restarts specifies how many restarts with different seed should EcoTyper perform for each rank selection, in each cell type. Since this is a very time consuming process, in this example we only use 5. However, for publication-quality results, we recommend at least 50 restarts.
Maximum number of states per cell type
Maximum number of states per cell type : 20
Maximum number of states per cell type specifies the upper end of the range for the number of states possible for each cell type. The lower end is 2.
Cophenetic coefficient cutoff
Cophenetic coefficient cutoff : 0.975
This field indicates the Cophenetic coefficient cutoff, in the range [0, 1], used for automatically determining the number of states in step 4. Lower values generally lead to more clusters being identified. In this particular example, we set it to 0.975.
Jaccard matrix p-value cutoff
Jaccard matrix p-value cutoff : 1
Ecotype identification on step 8 is performed by clustering a jaccard matrix that quantifies the sample overlap between each pair of states. Prior to performing ecotype identification, the jaccard matrix values corresponding to pairs of states for which the sample overlap is not significant are set to 0, in order to mitigate the noise introduced by spurious overlaps. The value provided in this field specifies the p-value cutoff above which the overlaps are considered non-significant. When the number of samples in the scRNA-seq dataset is small, such as in the current example, we recommend this filter is disabled (p-value cutoff = 1), to avoid over-filtering the jaccard matrix. However, we encourage users to set this cutoff to lower values (e.g. 0.05), if the discovery scRNA-seq dataset contains a number of samples large enough to reliably evaluate the significance of overlaps.
The command line
After editing the configuration file (config_discovery_scRNA.yml
), the de novo discovery cell states and ecotypes can be run as is illustrated below. Please note that this script might take 24-48 hours to run on 10 threads. Also, EcoTyper cannot be run on the example data from this tutorial using a typical laptop (16GB memory). We recommend that it is run on a server with >50-100GB of RAM or a high performance cluster.
Rscript EcoTyper_discovery_scRNA.R -c config_discovery_scRNA.yml
The output format
EcoTyper generates for each cell type the following outputs:
- Plots displaying the Cophenetic coefficient calculated in step 4. The horizontal dotted line indicates the Cophenetic coefficient cutoff provided in the configuration file Cophenetic coefficient cutoff field. The vertical dotted red line indicates the number of states automatically selected based on the Cophenetic coefficient cutoff provided. We recommend that users inspect this file to make sure that the automatic selection provides sensible results. If the user wants to adjust the Cophenetic coefficient cutoff after inspecting this plot, they can rerun the discovery procedure skipping steps 1-3. Please note that these plots indicate the number of states obtained before applying the filters for low-quality states in steps 6 and 7. Therefore, the final results will probably contain fewer states.
knitr::include_graphics("DiscoveryOutput_scRNA/rank_plot.png")
對于每種細胞類型校摩,都會產(chǎn)生以下輸出看峻,此處以成纖維細胞為例:
在發(fā)現(xiàn)數(shù)據(jù)集中的樣本中,在步驟 6 和 7(如果運行)中的 QC 過濾器之后剩余的細胞狀態(tài)豐度:
data = read.delim("DiscoveryOutput_scRNA/Fibroblasts/state_abundances.txt")
dim(data)
## [1] 5 1500
head(data[,1:5])
## SMC01.T_AAAGTAGAGTGGTAGC SMC01.T_ACACCCTGTTGGTAAA SMC01.T_ACATCAGTCGCCTGAG
## S01 2.074345e-14 4.262766e-02 1.352784e-14
## S02 2.074345e-14 3.332167e-01 3.957569e-01
## S03 2.781436e-02 1.204917e-02 1.706085e-01
## S04 2.074345e-14 9.441681e-13 1.352784e-14
## S05 2.074345e-14 6.121064e-01 7.391235e-02
## SMC01.T_ACTATCTAGCTAGTCT SMC01.T_ACTGATGAGCACCGCT
## S01 5.529717e-02 1.022447e-14
## S02 2.081811e-14 1.805516e-01
## S03 3.519208e-02 6.787254e-01
## S04 2.081811e-14 1.022447e-14
## S05 2.081811e-14 7.112757e-02
Assignment of samples in the discovery dataset to the cell state with the highest abundance. Only samples assigned to the cell states remaining after the QC filters in steps 6 and 7 (if run) are included. The remaining ones are considered unassigned and removed from this table:
data = read.delim("DiscoveryOutput_scRNA/Fibroblasts/state_assignment.txt")
dim(data)
## [1] 899 3
head(data)
## ID State InitialState
## 723 SMC15.T_CATCGAAGTGACCAAG S01 IS05
## 724 SMC18.T_CTTGGCTCAGTGACAG S01 IS05
## 725 SMC24.T_TACTTACAGCGCCTTG S01 IS05
## 726 SMC01.N_CACCAGGCAATAAGCA S01 IS05
## 727 SMC02.N_AGAGCTTTCTAACCGA S01 IS05
## 728 SMC02.N_ATAACGCCAATACGCT S01 IS05
A heatmap illustrating the expression of genes used for cell state discovery, that have the highest fold-change in one of the cell states remaining after the QC filters in steps 6 and 7 (if run). In the current example, the heatmap includes in the top color bar two rows corresponding to Tissue and Histology, that have been provided in configuration file field Annotation file column(s) to plot, in addition to cell state labels always plotted:
knitr::include_graphics("DiscoveryOutput_scRNA/Fibroblasts/state_assignment_heatmap.png")
The ecotype output files include:
The cell state composition of each ecotype (the set of cell states making up each ecotype):
ecotypes = read.delim("DiscoveryOutput_scRNA/Ecotypes/ecotypes.txt")
head(ecotypes[,c("CellType", "State", "Ecotype")])
## CellType State Ecotype
## 1 B.cells S02 E1
## 2 CD4.T.cells S02 E1
## 3 CD8.T.cells S01 E1
## 4 Dendritic.cells S03 E1
## 5 Fibroblasts S05 E1
## 6 Monocytes.and.Macrophages S03 E1
The number of initial clusters obtained by clustering the Jaccard index matrix, selected using the average silhouette:
knitr::include_graphics("DiscoveryOutput_scRNA/Ecotypes/nclusters_jaccard.png")
A heatmap of the Jaccard index matrix, after filtering ecotypes with less than 3 cell states:
knitr::include_graphics("DiscoveryOutput_scRNA/Ecotypes/jaccard_matrix.png")
The abundance of each ecotype in each sample in the discovery dataset::
The abundance of each ecotype in each sample in the discovery dataset:
abundances = read.delim("DiscoveryOutput_scRNA/Ecotypes/ecotype_abundance.txt")
dim(abundances)
## [1] 9 33
head(abundances[,1:5])
## SMC01.N SMC01.T SMC02.N SMC02.T SMC03.N
## E1 0.34064095 0.07302366 0.20329837 0.02049678 0.27718758
## E2 0.06078240 0.17093342 0.02937202 0.10322721 0.05241208
## E3 0.02315383 0.34562878 0.01355632 0.36202739 0.01278497
## E4 0.13787420 0.12543986 0.14604672 0.16681631 0.06725426
## E5 0.16081886 0.10434607 0.28980392 0.11903111 0.14459666
## E6 0.00000000 0.07347385 0.03524642 0.14282270 0.00000000
The assignment of samples in the discovery dataset to ecotypes. The samples not assigned to any ecotype are filtered out from this file:
assignments = read.delim("DiscoveryOutput_scRNA/Ecotypes/ecotype_assignment.txt")
dim(assignments)
## [1] 32 6
head(assignments[,1:5])
## ID MaxEcotype AssignmentP AssignmentQ AssignedToEcotypeStates
## SMC01-N SMC01-N E1 1.938649e-04 0.0012795085 TRUE
## SMC05-N SMC05-N E1 5.000404e-03 0.0183348142 TRUE
## SMC05-T SMC05-T E1 7.568441e-02 0.1541417608 TRUE
## SMC07-N SMC07-N E1 2.928585e-03 0.0138061877 TRUE
## SMC08-N SMC08-N E1 9.015769e-05 0.0007438009 TRUE
## SMC19-T SMC19-T E1 5.936002e-03 0.0195888071 TRUE
A heatmap of cell state fractions across the samples assigned to ecotypes:
knitr::include_graphics("DiscoveryOutput_scRNA/Ecotypes/heatmap_assigned_samples_viridis.png")
當然衙吩,方法還是很不錯的互妓,如果有配套的單細胞空間數(shù)據(jù),用一下這個方法尋找細胞狀態(tài)和生態(tài)型坤塞,還是非常給力的车猬,參考網(wǎng)址在EcoTyper
生活很好,有你更好