Sarah Williams (2019). celaref: Single-cell RNAseq cell cluster labelling by reference. R package version 1.0.1.
當(dāng)我們拿到單細(xì)胞轉(zhuǎn)錄組數(shù)據(jù)的時候号杠,不管結(jié)果怎么樣,翻來覆去發(fā)現(xiàn)其實就是一個基因和細(xì)胞的矩陣而已!一般基于這個矩陣嫉嘀,會先過濾細(xì)胞均一化路狮,降維她混,聚類(最重要的一步饲漾,因為它可以發(fā)現(xiàn)細(xì)胞的異質(zhì)性)珍剑,聚類結(jié)果的差異分析(聲稱找到了每個亞群的marker基因),最后做一下數(shù)據(jù)的可視化遇西。
且慢馅精,用非監(jiān)督的方式聚出來的類(一般叫做為分的類數(shù)),既然是細(xì)胞異質(zhì)性的體現(xiàn)粱檀,那么它到底是什么細(xì)胞呢洲敢,可以叫做什么,是CD4細(xì)胞呢還是microglia細(xì)胞茄蚯?因為是沒有意義的压彭,只有起了像CD4這樣的名字,這類細(xì)胞才有故事可以講渗常。
在定義細(xì)胞類型之前壮不,需要確定就哪種聚類結(jié)果來做,是圖聚類的結(jié)果還是k-means某一類的結(jié)果皱碘。如何來確定询一?看看分群的tsne圖是一個不錯的參考。
那么呢癌椿,之前你們家大師推薦過幾個數(shù)據(jù)庫single cell marker 基因數(shù)據(jù)庫是基于某亞群細(xì)胞的marker基因健蕊,根據(jù)已經(jīng)知道m(xù)arker基因的細(xì)胞類型,這些數(shù)據(jù)庫其實就是一個細(xì)胞類型和marker基因大表踢俄。根據(jù)marker基因列表與分析出來的差異基因列表以及基因豐度的關(guān)系(往往是做一個熱圖)來推斷某個cluster屬于哪一種細(xì)胞類型缩功。
另外還有一種定義細(xì)胞類型的方法是通過每個cluster與已知細(xì)胞類型的表達(dá)譜的相似性來定義細(xì)胞類型,已有的方法為SingleR都办、celaref嫡锌。均是R包,其中celaref是2019年的新品琳钉,也是本文主要介紹的一種方法势木。
celaref 簡介
走遍示例流程
安裝celaref包的時候,就把示例數(shù)據(jù)也下載了歌懒,首先載入數(shù)據(jù)跟压,然后觀察一下示例數(shù)據(jù)是怎么組織的,以便我們以后組織自己的數(shù)據(jù)歼培。在接觸一個新包的時候震蒋,最好的學(xué)習(xí)方法就是用示例數(shù)據(jù)走一遍然后看看有哪些功能(函數(shù)以及函數(shù)的參數(shù)可以用),多用help或者躲庄?來查看細(xì)節(jié)查剖。一個R包封裝的越好也就要求我們花時間去了解他的細(xì)節(jié)。
library(celaref)
# Paths to data files.
counts_filepath.query <- system.file("extdata", "sim_query_counts.tab", package = "celaref")
cell_info_filepath.query <- system.file("extdata", "sim_query_cell_info.tab", package = "celaref")
counts_filepath.ref <- system.file("extdata", "sim_ref_counts.tab", package = "celaref")
cell_info_filepath.ref <- system.file("extdata", "sim_ref_cell_info.tab", package = "celaref")
# Load data
toy_ref_se <- load_se_from_files(counts_file=counts_filepath.ref, cell_info_file=cell_info_filepath.ref)
head(toy_ref_se)
toy_query_se <- load_se_from_files(counts_file=counts_filepath.query, cell_info_file=cell_info_filepath.query)
可見我載入了兩個數(shù)據(jù):一個是toy_ref_se噪窘,reference笋庄;一個是toy_query_se,query.都是SummarizedExperiment對象:
對數(shù)據(jù)進行過濾,help一下這個函數(shù)直砂,看看都有哪些過濾條件菌仁。
# Filter data
toy_ref_se <- trim_small_groups_and_low_expression_genes(toy_ref_se)
toy_query_se <- trim_small_groups_and_low_expression_genes(toy_query_se)
對參考和需要鑒定的表達(dá)譜分別做差異分析:
# Setup within-experiment differential expression
de_table.toy_ref <- contrast_each_group_to_the_rest(toy_ref_se, dataset_name="ref")
de_table.toy_query <- contrast_each_group_to_the_rest(toy_query_se, dataset_name="query")
分別得到差異基因結(jié)果:
繪制一組小提琴圖,顯示每個查詢組的“top”基因在參考數(shù)據(jù)庫中的分布静暂。
# Plot
make_ranking_violin_plot(de_table.test=de_table.toy_query, de_table.ref=de_table.toy_ref)
根據(jù)參考數(shù)據(jù)集的相似性济丘,在查詢數(shù)據(jù)集中構(gòu)造一些合理的標(biāo)簽或組/集群。
# And get group labels
make_ref_similarity_names(de_table.toy_query, de_table.toy_ref)
可以看到除了Group2 沒有相應(yīng)的similarity 類似的type之外洽蛀,其余的都找到了相對應(yīng)的細(xì)胞類型(基于檢驗的摹迷,不是生物學(xué)的)。這當(dāng)然可以為我們定義細(xì)胞類型做一個統(tǒng)計上的參考郊供。
準(zhǔn)備數(shù)據(jù)
那么峡碉,如何應(yīng)用我們自己的數(shù)據(jù)來做呢?官網(wǎng)也給出了數(shù)據(jù)準(zhǔn)備的方法:
在準(zhǔn)備數(shù)據(jù)時需要以下數(shù)據(jù):
- Counts Matrix Number of gene tags per gene per cell.
- Cell information A sample-sheet table of cell-level information. Only two fields are essential:
cell_sample: A unique cell identifier
group: The cluster/group to which the cell has been assigned. - Gene information Optional. A table of extra gene-level information.
ID: A unique gene identifier
輸入數(shù)據(jù):
- tab鍵分割的ounts matrix
gene | Cell1 | cell2 | cell3 | cell4 | … | cell954 |
---|---|---|---|---|---|---|
GeneA | 0 | 1 | 0 | 1 | … | 0 |
GeneB | 0 | 3 | 0 | 2 | … | 2 |
GeneC | 1 | 40 | 1 | 0 | … | 0 |
.... |
- tab鍵分割的細(xì)胞分群信息
CellId | Sample | Cluster |
---|---|---|
cell1 | Control | cluster1 |
cell2 | Control | cluster7 |
… | … | … |
cell954 | KO | cluster8 |
From 10X pipeline output
dataset_se <- load_dataset_10Xdata('~/path/to/data/10X_mydata',
dataset_genome = "GRCh38",
clustering_set = "kmeans_7_clusters")
拿一個10X的例子試試吧
#Prepare 10X query dataset
先要把數(shù)據(jù)下載好驮审,發(fā)現(xiàn)這應(yīng)用的對象還是cellranger V2 pipeline的輸出結(jié)果鲫寄,注意為了和官網(wǎng)數(shù)據(jù)格式保持一致,我把文件夾命名為了疯淫,這個前后一致就可以了地来。
+--- 10X_pbmc4k
| +--- analysis
| | +--- clustering
| | | +--- graphclust
| | | | +--- clusters.csv
| | | +--- kmeans_10_clusters
| | | | +--- clusters.csv
| | | +--- kmeans_2_clusters
···
| | +--- diffexp
| | | +--- graphclust
| | | | +--- differential_expression.csv
| | | +--- kmeans_10_clusters
···
| | +--- pca
| | | +--- 10_components
| | | | +--- components.csv
| | | | +--- dispersion.csv
| | | | +--- genes_selected.csv
| | | | +--- projection.csv
| | | | +--- variance.csv
| | +--- tsne
| | | +--- 2_components
| | | | +--- projection.csv
| +--- filtered_gene_bc_matrices
| | +--- GRCh38
| | | +--- barcodes.tsv
| | | +--- genes.tsv
| | | +--- matrix.mtx
datasets_dir <- "D:\\Users\\Administrator\\Desktop\\RStudio\\single_cell\\celaref/celaref_extra_vignette_data/datasets"
dir(datasets_dir)
dataset_se.10X_pbmc4k_k7 <- load_dataset_10Xdata(
dataset_path = file.path(datasets_dir,'10X_pbmc4k'),
dataset_genome = "GRCh38",
clustering_set = "kmeans_7_clusters",
id_to_use = "GeneSymbol")
dataset_se.10X_pbmc4k_k7 <- trim_small_groups_and_low_expression_genes(dataset_se.10X_pbmc4k_k7)
可見我選擇的是kmeans_7_clusters 的聚類結(jié)果來進行細(xì)胞定義。處理過之后:
> dataset_se.10X_pbmc4k_k7
class: SummarizedExperiment
dim: 15407 4340
metadata(0):
assays(1): ''
rownames(15407): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
rowData names(4): ensembl_ID GeneSymbol ID total_count
colnames(4340): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ... TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1
colData names(2): cell_sample group
下一步是對數(shù)據(jù)的轉(zhuǎn)化和處理峡竣,也是one-to-others的DE分析靠抑,非常的慢和消耗內(nèi)存量九,有可能跑斷的适掰。關(guān)鍵是這個函數(shù)做了是什么?其實就是把我們的矩陣文件整理成上面演示的格式荠列。
de_table.10X_pbmc4k_k7 <- contrast_each_group_to_the_rest(dataset_se.10X_pbmc4k_k7, dataset_name="10X_pbmc4k_k7", num_cores=7)
Sorry, multithreading not supported on windows. Use linux, or set num_threads = 1 to suppress this message.Running single threaded # 這個sorry 可以忽略 类浪,在Linux下才能指定運行進程數(shù)。
`fData` has no primerid. I'll make something up.
`cData` has no wellKey. I'll make something up.
Assuming data assay in position 1, with name et is log-transformed.
Done!
Combining coefficients and standard errors
Calculating log-fold changes
Calculating likelihood ratio tests
Refitting on reduced model...
Done!
`fData` has no primerid. I'll make something up.
`cData` has no wellKey. I'll make something up.
Assuming data assay in position 1, with name et is log-transformed.
Completed [=======>----------------------------------------------------------------------------------------------------------------] 7% with 0 failures
windows下居然跑了一上午
準(zhǔn)備 reference數(shù)據(jù)集
來自響應(yīng)的數(shù)據(jù)庫肌似,因為我們的是血細(xì)胞费就,選擇的是haemosphere .
A suitable PBMC reference (a ‘HaemAtlas’) has been published by . They purified populations of PBMC cell types and measured gene expression via microarray. The data used here was downloaded in a normalised table from the website (Graaf et al. 2016).
下載完之后由于是微陣列數(shù)據(jù)需要做一些特殊處理。
- Logged, normalised expression values. Any low expression or poor quality measurements should have already been removed.
- Sample information (see contrast_each_group_to_the_rest_for_norm_ma_with_limma for details)
library(readr)
this_dataset_dir <- file.path(datasets_dir, 'haemosphere_datasets','watkins')
norm_expression_file <- file.path(this_dataset_dir, "watkins_expression.txt")
samples_file <- file.path(this_dataset_dir, "watkins_samples.txt")
norm_expression_table.full <- read.table(norm_expression_file, sep="\t", header=TRUE, quote="", comment.char="", row.names=1, check.names=FALSE)
samples_table <- read_tsv(samples_file, col_types = cols())
samples_table$description <- make.names( samples_table$description) # Avoid group or extra_factor names starting with numbers, for microarrays
> samples_table$description
[1] "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult"
[8] "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X63.years.adult" "X63.years.adult"
[15] "X63.years.adult" "X63.years.adult" "X63.years.adult" "X63.years.adult" "X53.years.adult" "X53.years.adult" "X53.years.adult"
[22] "X53.years.adult" "X53.years.adult" "X53.years.adult" "X60.years.adult" "X60.years.adult" "X60.years.adult" "X60.years.adult"
[29] "X60.years.adult" "X60.years.adult" "X48.years.adult" "X48.years.adult" "X48.years.adult" "X48.years.adult" "X48.years.adult"
[36] "X48.years.adult" "X57.years.adult" "X57.years.adult" "X57.years.adult" "X57.years.adult" "X57.years.adult" "X57.years.adult"
[43] "NA." "NA." "NA." "NA." "NA." "NA." "NA."
[50] "NA."
從樣本表中可以看出川队,這個數(shù)據(jù)集包含了其他組織力细,但是作為PBMC的參考,我們只考慮外周血樣本固额。與其他數(shù)據(jù)加載函數(shù)一樣眠蚂,要從分析中刪除一個樣本(或單元格),從樣本表中刪除它就ok了斗躏。
amples_table <- samples_table[samples_table$tissue == "Peripheral Blood",]
> samples_table
# A tibble: 42 x 7
sampleId celltype cell_lineage surface_markers tissue description notes
<chr> <chr> <chr> <lgl> <chr> <chr> <lgl>
1 1674120023_A B lymphocyte B Cell Lineage NA Peripheral Blood X49.years.adult NA
2 1674120023_B granulocyte Neutrophil Lineage NA Peripheral Blood X49.years.adult NA
3 1674120023_C natural killer cell NK Cell Lineage NA Peripheral Blood X49.years.adult NA
4 1674120023_D Th lymphocyte T Cell Lineage NA Peripheral Blood X49.years.adult NA
5 1674120023_E Tc lymphocyte T Cell Lineage NA Peripheral Blood X49.years.adult NA
6 1674120023_F monocyte Macrophage Lineage NA Peripheral Blood X49.years.adult NA
7 1674120053_A B lymphocyte B Cell Lineage NA Peripheral Blood X49.years.adult NA
8 1674120053_B monocyte Macrophage Lineage NA Peripheral Blood X49.years.adult NA
9 1674120053_C granulocyte Neutrophil Lineage NA Peripheral Blood X49.years.adult NA
10 1674120053_D Tc lymphocyte T Cell Lineage NA Peripheral Blood X49.years.adult NA
# ... with 32 more rows
通常情況下逝慧,最難的部分是格式化輸入。應(yīng)該使用與查詢數(shù)據(jù)集相同的id,將微陣列表達(dá)式值作為規(guī)范化的笛臣、日志轉(zhuǎn)換的數(shù)據(jù)云稚。任何探頭或樣品級的過濾也應(yīng)事先進行。在這種情況下沈堡,從網(wǎng)站獲得的數(shù)據(jù)是正常的静陈,但仍然需要匹配id。
library("tidyverse")
library("illuminaHumanv2.db")
probes_with_gene_symbol_and_with_data <- intersect(keys(illuminaHumanv2SYMBOL),rownames(norm_expression_table.full))
# Get mappings - non NA
probe_to_symbol <- select(illuminaHumanv2.db, keys=rownames(norm_expression_table.full), columns=c("SYMBOL"), keytype="PROBEID")
probe_to_symbol <- unique(probe_to_symbol[! is.na(probe_to_symbol$SYMBOL),])
# no multimapping probes
genes_per_probe <- table(probe_to_symbol$PROBEID) # How many genes a probe is annotated against?
multimap_probes <- names(genes_per_probe)[genes_per_probe > 1]
probe_to_symbol <- probe_to_symbol[!probe_to_symbol$PROBEID %in% multimap_probes, ]
convert_expression_table_ids<- function(expression_table, the_probes_table, old_id_name, new_id_name){
the_probes_table <- the_probes_table[,c(old_id_name, new_id_name)]
colnames(the_probes_table) <- c("old_id", "new_id")
# Before DE, just pick the top expresed probe to represent the gene
# Not perfect, but this is a ranking-based analysis.
# hybridisation issues aside, would expect higher epressed probes to be more relevant to Single cell data anyway.
probe_expression_levels <- rowSums(expression_table)
the_probes_table$avgexpr <- probe_expression_levels[as.character(the_probes_table$old_id)]
the_genes_table <- the_probes_table %>%
group_by(new_id) %>%
top_n(1, avgexpr)
expression_table <- expression_table[the_genes_table$old_id,]
rownames(expression_table) <- the_genes_table$new_id
return(expression_table)
}
# Just the most highly expressed probe foreach gene.
norm_expression_table.genes <- convert_expression_table_ids(norm_expression_table.full,
probe_to_symbol, old_id_name="PROBEID", new_id_name="SYMBOL")
# Go...
de_table.Watkins2009PBMCs <- contrast_each_group_to_the_rest_for_norm_ma_with_limma(
norm_expression_table = norm_expression_table.genes,
sample_sheet_table = samples_table,
dataset_name = "Watkins2009PBMCs",
extra_factor_name = 'description',
sample_name = "sampleId",
group_name = 'celltype')
> de_table.Watkins2009PBMCs
# A tibble: 113,376 x 21
ID logFC AveExpr t P.Value adj.P.Val B ci CI_upper CI_lower ci_inner ci_outer log2FC fdr group sig sig_up
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <lgl> <lgl>
1 JCHA~ 6.86 8.65 27.5 7.13e-26 2.70e-23 49.0 0.0952 6.96 6.77 6.77 6.96 6.86 2.70e-23 B ly~ TRUE TRUE
2 FCRLA 6.41 8.77 23.7 1.13e-23 3.29e-21 44.0 0.103 6.51 6.30 6.30 6.51 6.41 3.29e-21 B ly~ TRUE TRUE
3 VPRE~ 6.18 8.66 40.2 1.20e-31 8.70e-29 62.0 0.0587 6.23 6.12 6.12 6.23 6.18 8.70e-29 B ly~ TRUE TRUE
4 TCL1A 6.09 8.52 42.8 1.33e-32 1.14e-29 64.1 0.0544 6.14 6.04 6.04 6.14 6.09 1.14e-29 B ly~ TRUE TRUE
5 CD79A 6.09 9.08 22.7 5.17e-23 1.40e-20 42.4 0.102 6.20 5.99 5.99 6.20 6.09 1.40e-20 B ly~ TRUE TRUE
6 CD19 5.90 8.44 38.4 6.09e-31 4.11e-28 60.5 0.0587 5.96 5.85 5.85 5.96 5.90 4.11e-28 B ly~ TRUE TRUE
7 HLA-~ 5.61 8.35 47.9 2.30e-34 2.56e-31 68.0 0.0447 5.66 5.57 5.57 5.66 5.61 2.56e-31 B ly~ TRUE TRUE
8 OSBP~ 5.48 7.83 53.2 5.60e-36 8.82e-33 71.4 0.0394 5.52 5.45 5.45 5.52 5.48 8.82e-33 B ly~ TRUE TRUE
9 CNTN~ 5.38 7.95 49.3 8.12e-35 9.60e-32 68.9 0.0416 5.42 5.34 5.34 5.42 5.38 9.60e-32 B ly~ TRUE TRUE
10 COBL~ 5.26 7.70 56.4 6.77e-37 1.28e-33 73.3 0.0356 5.30 5.23 5.23 5.30 5.26 1.28e-33 B ly~ TRUE TRUE
# ... with 113,366 more rows, and 4 more variables: gene_count <int>, rank <int>, rescaled_rank <dbl>, dataset <chr>
細(xì)胞類型定義(Compare 10X query PBMCs to to reference)
數(shù)據(jù)準(zhǔn)備好就和之前的示例是一樣的了踱蛀。
make_ranking_violin_plot(de_table.test=de_table.10X_pbmc4k_k7, de_table.ref=de_table.Watkins2009PBMCs)
Hmm, there’s a few clusters where different the top genes are bunched near the top for a couple of different reference cell types.
Logging the plot will be more informative at the top end for this dataset.
make_ranking_violin_plot(de_table.test=de_table.10X_pbmc4k_k7, de_table.ref=de_table.Watkins2009PBMCs, log10trans = TRUE)
可以打標(biāo)簽啦窿给。
label_table.pbmc4k_k7_vs_Watkins2009PBMCs <- make_ref_similarity_names(de_table.10X_pbmc4k_k7, de_table.Watkins2009PBMCs)
label_table.pbmc4k_k7_vs_Watkins2009PBMCs
可以看出,七個群有六個找到了與之相似的群名稱率拒,并給出了pval.
make_ranking_violin_plot(de_table.test=de_table.Watkins2009PBMCs, de_table.ref=de_table.10X_pbmc4k_k7, log10trans = TRUE)
行啦崩泡,人pbmc細(xì)胞的定義工作就完成了。官網(wǎng)也提供了小鼠細(xì)胞類型識別的教程猬膨。這里就不再復(fù)述啦角撞。
完成了細(xì)胞類型定義,單細(xì)胞的故事才剛剛開始勃痴。所以谒所,這不是結(jié)束,甚至不是結(jié)束的開始沛申,而可能是開始的結(jié)束劣领。
SingleR
celaref
Sarah Williams (2019). celaref: Single-cell RNAseq cell cluster labelling by reference. R package version 1.0.1.
SummarizedExperiment