10X單細(xì)胞（10X空間轉(zhuǎn)錄組）聚類(lèi)分析之樹(shù)形可視化（TreeCorTreat）

hello粹断，今天給大家分享一個(gè)很好的可視化軟件构回，使用基于樹(shù)的相關(guān)屏幕來(lái)分析和可視化表型和轉(zhuǎn)錄組特征以及多個(gè)細(xì)胞類(lèi)型分辨率級(jí)別的細(xì)胞類(lèi)型之間的關(guān)聯(lián)晤愧。文章在Tree-based Correlation Screen and Visualization for Exploring Phenotype-Cell Type Association in Multiple Sample Single-Cell RNA-Sequencing Experiments绍昂，可視化的部分非常nice羔飞，下面展示幾張效果圖。

圖片.png

好了挎袜，開(kāi)始我們的分享

Overview

多樣本的單細(xì)胞 RNA-seq 實(shí)驗(yàn)越來(lái)越多地用于發(fā)現(xiàn)可能影響樣本表型（例如疾餐缒簟）的細(xì)胞類(lèi)型及其分子特征肥惭。然而，分析和可視化復(fù)雜的細(xì)胞類(lèi)型-表型關(guān)聯(lián)仍然很重要紊搪。 TreeCorTreat 是一個(gè)開(kāi)源 R 包蜜葱，它通過(guò)使用基于樹(shù)的相關(guān)屏幕來(lái)分析和可視化表型和轉(zhuǎn)錄組特征以及多個(gè)細(xì)胞類(lèi)型分辨率級(jí)別的細(xì)胞類(lèi)型之間的關(guān)聯(lián)來(lái)解決這個(gè)問(wèn)題。使用 TreeCorTreat耀石，可以方便地探索和比較不同的特征類(lèi)型牵囤、表型特征、分析協(xié)議和數(shù)據(jù)集娶牌，并評(píng)估潛在混雜因素的影響奔浅。

TreeCorTreat 將基因表達(dá)矩陣（原始計(jì)數(shù)）馆纳、細(xì)胞meta數(shù)據(jù)和樣本meta數(shù)據(jù)作為輸入诗良。它提供了一個(gè)完整的pipeline來(lái)整合樣本之間的數(shù)據(jù)，識(shí)別細(xì)胞cluster及其層次結(jié)構(gòu)鲁驶，在細(xì)胞類(lèi)型比例和基因表達(dá)方面評(píng)估不同分辨率級(jí)別的樣本表型和細(xì)胞類(lèi)型之間的關(guān)聯(lián)鉴裹，并總結(jié)和可視化結(jié)果樹(shù)結(jié)構(gòu)的TreeCorTreat 圖。該pipeline由六個(gè)功能模塊組成：

Module 1: Data integration
Module 2: Define cell types at multiple resolutions
Module 3: Identify association between cell type proportion and sample phenotype
Module 4: Identify association between global gene expression and sample phenotype
Module 5: Identify differentially expressed genes
Module 6: Visualization via TreeCorTreat plot

模塊化結(jié)構(gòu)使用可以靈活地跳過(guò)某些分析步驟钥弯，并用自己的數(shù)據(jù)或分析功能代替它們径荔。

Input data and data preparation

TreeCorTreat 的輸入由三個(gè)部分組成：基因表達(dá)矩陣 E（G×C，行代表 G 基因脆霎，列代表 C 細(xì)胞）总处、細(xì)胞級(jí)meta數(shù)據(jù) M（C×K，其中 K ≥ 2）和樣本級(jí)meta數(shù)據(jù) L（S×J睛蛛，其中 J ≥ 2）鹦马。為了避免冗余并節(jié)省存儲(chǔ)空間，我們將meta數(shù)據(jù)拆分為細(xì)胞級(jí)meta數(shù)據(jù)和樣本級(jí)meta數(shù)據(jù)忆肾。細(xì)胞級(jí)meta數(shù)據(jù)主要注釋細(xì)胞級(jí)信息荸频，每個(gè)細(xì)胞必須至少包含 2 列：細(xì)胞條形碼和相應(yīng)的樣本標(biāo)識(shí)符 (ID)】透裕可以添加可選的第三列以提供單元格類(lèi)型注釋旭从。細(xì)胞條形碼用于耦合細(xì)胞級(jí)meta數(shù)據(jù) M 和基因表達(dá)矩陣 E。樣本級(jí)meta數(shù)據(jù)記錄樣本的感興趣表型（例如臨床結(jié)果）和其他相關(guān)協(xié)變量（例如年齡和性別）场仲。第一列包含唯一的樣本 ID和悦，其余列包含表型和協(xié)變量。細(xì)胞級(jí)meta數(shù)據(jù)和樣本級(jí)meta數(shù)據(jù)可以通過(guò)唯一的樣本 ID 鏈接渠缕。

# load in TreeCorTreat
options(warn=-1)
suppressMessages(library(TreeCorTreat))
suppressMessages(library(ggplot2))
suppressMessages(library(dplyr))
suppressMessages(library(tidyr))

data(raw_data)
str(raw_data)

Module 1: Data integration

Harmony 算法用于整合來(lái)自不同樣本的細(xì)胞并將它們嵌入到一個(gè)共同的低維空間中摹闽。重新編寫(xiě)了 Seurat v3 中的 RunHarmony 和 BuildClusterTree 函數(shù)，并將以下步驟包裝到一個(gè)具有可調(diào)參數(shù)的獨(dú)立 treecor_harmony 函數(shù)中：庫(kù)大小歸一化褐健、集成特征選擇付鹿、主成分分析 (PCA)澜汤、Harmony 集成、無(wú)監(jiān)督 louvain 聚類(lèi)舵匾、統(tǒng)一流形近似和投影 (UMAP)俊抵、louvain 簇的層次聚類(lèi)和差異表達(dá)分析，以識(shí)別細(xì)胞類(lèi)型標(biāo)記基因以促進(jìn)注釋坐梯。

具體而言徽诲，對(duì)于文庫(kù)大小標(biāo)準(zhǔn)化，給定細(xì)胞的基因計(jì)數(shù)除以該細(xì)胞的總計(jì)數(shù)吵血，乘以比例因子 (10⁴) 并應(yīng)用自然對(duì)數(shù)轉(zhuǎn)換谎替。整合樣本的特征是通過(guò)基于每個(gè)樣本中的方差-均值關(guān)系選擇高度可變基因 (HVG) 來(lái)獲得的（使用 SelectIntegrationFeatures 函數(shù)），并根據(jù)被識(shí)別為 HVG 的樣本數(shù)量對(duì)特征進(jìn)行排序蹋辅。默認(rèn)情況下钱贯，選擇前 2000 個(gè) HVG 并將其送入下游 PCA 程序。基于前 20 個(gè) PC 進(jìn)行和聲侦另，并獲得修正的和聲嵌入秩命。然后將前 20 個(gè)和諧坐標(biāo)用于下游 louvain 聚類(lèi)和 UMAP 分析（使用 Seurat 中的 FindClusters 和 RunUMAP 函數(shù)）。

# integration
set.seed(12345)
integration <- treecor_harmony(count = raw_data[['count']], sample_meta = raw_data[['sample_meta']], output_dir = getwd())

集成的 Seurat 對(duì)象將存儲(chǔ)在本地目錄中褒傅，過(guò)濾后的基因表達(dá)矩陣或細(xì)胞級(jí)meta數(shù)據(jù)可以通過(guò) access_data_seurat 提取弃锐，用于下游分析。

# list integration result and data extracted from Seurat object
integrated_data <- access_data_seurat(seurat_obj = integration,output_dir = getwd())

在示例中殿托，沒(méi)有細(xì)胞被過(guò)濾掉霹菊，我們將更新的細(xì)胞級(jí)meta數(shù)據(jù)（附加列包括“細(xì)胞類(lèi)型”和 UMAP 坐標(biāo)）存儲(chǔ)為integrated_cellmeta。

data(integrated_cellmeta)
head(integrated_cellmeta)
##           barcode   sample celltype    UMAP_1     UMAP_2
## 1 HD-17-Su:345974 HD-17-Su        3 -3.393676 -7.0051720
## 2 HD-17-Su:345975 HD-17-Su        2 -5.090366  6.9038575
## 3 HD-17-Su:345976 HD-17-Su        2 -6.407369  6.9599665
## 4 HD-17-Su:345977 HD-17-Su        5 -6.225506  3.5054925
## 5 HD-17-Su:345978 HD-17-Su        1  8.189426  0.8951219
## 6 HD-17-Su:345979 HD-17-Su        9  9.156926 -4.1967079
# data for downstream analysis
sample_meta <- raw_data[['sample_meta']]
count <- raw_data[['count']]
cell_meta <- integrated_cellmeta

Module 2: Define cell types at multiple resolutions

前面的步驟通過(guò) louvain 聚類(lèi)或基于用戶(hù)提供的細(xì)胞級(jí)meta數(shù)據(jù)（即細(xì)胞meta數(shù)據(jù)中的可選列 3）定義了細(xì)胞cluster支竹。用戶(hù)可以添加文本標(biāo)簽旋廷，根據(jù)最高差異表達(dá)基因或已知細(xì)胞類(lèi)型標(biāo)記基因來(lái)注釋每個(gè)細(xì)胞cluster的細(xì)胞類(lèi)型。在這里唾戚，我們?cè)?UMAP 上疊加了幾個(gè)基因標(biāo)記（例如 CD3D柳洋、CD19 和 CD68）以粗略地注釋細(xì)胞cluster。

# cell clusters
label_text <- integrated_cellmeta %>% group_by(celltype) %>% summarise(UMAP_1 = median(UMAP_1),UMAP_2 = median(UMAP_2))

ggplot(integrated_cellmeta,aes(x=UMAP_1,y=UMAP_2,color = celltype)) + 
  geom_point(size = 0.1) +
  geom_text(data = label_text,aes(x=UMAP_1,y=UMAP_2,label = celltype),color = 'black',size = 5) +
  theme_classic() +
  guides(color = guide_legend(override.aes = list(size = 3)))

圖片.png

# gene markers
genes <- c('CD3D','CD14','CD19','NCAM1','CD4','CD8A',
           'FCGR3A','CD1C','CD68','CD79A','CSF3R',
           'CD33','CCR7','CD38','CD27','KLRD1')

rc <- Matrix::colSums(count)
sub_count <- count[genes,] %>% as.matrix
sub_norm <- log2(t(t(sub_count)/rc*1e6 + 1)) %>% as.matrix

df_marker <- t(sub_norm) %>% data.frame %>% mutate(barcode = rownames(.)) %>% gather(gene,expr,-barcode) %>% inner_join(cell_meta %>% select(barcode,UMAP_1,UMAP_2))

ggplot(df_marker,aes(x = UMAP_1,y = UMAP_2,col = expr)) +
  geom_point(size = 0.01, shape = ".") +
  scale_colour_viridis_c(option = 'C',direction = 1) +
  facet_wrap(~ gene) +
  theme_classic(base_size = 12) +
  theme(legend.position = 'bottom')

圖片.png

We roughly categorize 16 cell clusters into 3 large cell-types based on gene markers: B cells (CD19), T cells (CD3D) and Monocytes (CD14).

# new celltype annotation
new_label <- data.frame(old = sort(factor(unique(cell_meta$celltype,levels = 1:16)))) %>% 
  mutate(large_celltype = ifelse(old %in% c(1,9,11,12,13,16),'Mono',
                                 ifelse(old %in% c(2:6,8,10,14,15),'T','B')),
         new = paste0(large_celltype,'_c',old)) %>% select(-large_celltype)
##    old      new
## 1    1  Mono_c1
## 2   10    T_c10
## 3   11 Mono_c11
## 4   12 Mono_c12
## 5   13 Mono_c13
## 6   14    T_c14
## 7   15    T_c15
## 8   16 Mono_c16
## 9    2     T_c2
## 10   3     T_c3
## 11   4     T_c4
## 12   5     T_c5
## 13   6     T_c6
## 14   7     B_c7
## 15   8     T_c8
## 16   9  Mono_c9

Modify cell type annotation

Users can modify the cell type label via modify_label function to update cell type annotations in cell_meta:

# modify cell type names in `cell_meta`
cell_meta <- modify_label(new_label,hierarchy_list = NULL,cell_meta)$cell_meta
## Modify label in cell_meta
head(cell_meta[,c('barcode','sample','celltype')])
##           barcode   sample celltype
## 1 HD-17-Su:345974 HD-17-Su     T_c3
## 2 HD-17-Su:345975 HD-17-Su     T_c2
## 3 HD-17-Su:345976 HD-17-Su     T_c2
## 4 HD-17-Su:345977 HD-17-Su     T_c5
## 5 HD-17-Su:345978 HD-17-Su  Mono_c1
## 6 HD-17-Su:345979 HD-17-Su  Mono_c9

同樣叹坦，也可以使用此功能更新層次樹(shù)結(jié)構(gòu)中的細(xì)胞類(lèi)型注釋(i.e. modify_label(new_label,hierarchy_list = hierarchy_list,cell_meta = NULL)$hierarchy_list).

Construct hierarchical tree structure

為了便于在多個(gè)分辨率下進(jìn)行關(guān)聯(lián)分析熊镣，細(xì)胞cluster被進(jìn)一步分層聚類(lèi)。可以使用數(shù)據(jù)驅(qū)動(dòng)的方法或基于知識(shí)的方法來(lái)導(dǎo)出樹(shù)募书。

該樹(shù)可以由用戶(hù)基于先驗(yàn)知識(shí)（基于知識(shí)）提供绪囱，也可以使用層次聚類(lèi)從數(shù)據(jù)中導(dǎo)出（數(shù)據(jù)驅(qū)動(dòng)）。對(duì)于數(shù)據(jù)驅(qū)動(dòng)方法莹捡，通過(guò)應(yīng)用 treecor_harmony 函數(shù)（或 Seurat R 包中的 BuildClusterTree）在 PC 空間上構(gòu)建系統(tǒng)發(fā)育樹(shù)鬼吵。

Data-driven approach

在數(shù)據(jù)驅(qū)動(dòng)的方法中，樹(shù)是通過(guò)應(yīng)用 treecor_harmony 函數(shù)（或 Seurat R 包中的 BuildClusterTree ）通過(guò) louvain 集群的層次聚類(lèi)生成的篮赢。具體來(lái)說(shuō)齿椅，首先對(duì)每個(gè)細(xì)胞cluster內(nèi)的細(xì)胞求平均 PC 分?jǐn)?shù)琉挖，然后構(gòu)建分層樹(shù)。數(shù)據(jù)驅(qū)動(dòng)的方法可以提供一種無(wú)偏的方式來(lái)推斷底層樹(shù)結(jié)構(gòu)涣脚，但注釋每個(gè)中間樹(shù)節(jié)點(diǎn)可能具有挑戰(zhàn)性示辈。

Knowledge-based approach

在基于知識(shí)的方法中，可以通過(guò)提供一個(gè)字符串來(lái)描述不同粒度級(jí)別的集群的層次關(guān)系遣蚀，從而根據(jù)他們的先驗(yàn)知識(shí)來(lái)指定樹(shù)矾麻。樹(shù)中的葉節(jié)點(diǎn)對(duì)應(yīng)于從集成步驟中的 louvain 聚類(lèi)或來(lái)自細(xì)胞meta數(shù)據(jù)的“細(xì)胞類(lèi)型”列中獲得的細(xì)胞cluster。

# specify string
input_string <- '@All(@B(B_c7),@T(T_c15,T_c14,T_c10,T_c8,T_c6,T_c5,T_c4,T_c3,T_c2),@Monocyte(Mono_c16,Mono_c13,Mono_c12,Mono_c11,Mono_c9,Mono_c1))'

# extract hierarchy from string
hierarchy_structure <- extract_hrchy_string(input_string,special_character = '@', plot = T)

圖片.png

For demonstration purpose, we will use knowledge-based hierarchical structure in subsequent analysis.

Module 3: Identify association between cell type proportion and sample phenotype

細(xì)胞類(lèi)型比例的分析被包裝到一個(gè)獨(dú)立的函數(shù) treecor_ctprop 中芭梯。這里我們首先為每個(gè)樹(shù)節(jié)點(diǎn)定義特征险耀，然后評(píng)估特征與樣本表型之間的關(guān)聯(lián)。對(duì)于葉節(jié)點(diǎn)細(xì)胞簇玖喘，特征定義為樣本中落入該節(jié)點(diǎn)的細(xì)胞比例甩牺。對(duì)于中間父節(jié)點(diǎn)或根節(jié)點(diǎn)，可以選擇在聚合（默認(rèn)設(shè)置）芒涡、連接葉節(jié)點(diǎn)或連接直接子節(jié)點(diǎn)之一中定義特征柴灯。一旦定義了每個(gè)節(jié)點(diǎn)的特征卖漫，將遍歷每個(gè)樹(shù)節(jié)點(diǎn)以檢查其特征與樣本表型之間的相關(guān)性（或其他匯總統(tǒng)計(jì)量）费尽。計(jì)算匯總統(tǒng)計(jì)量的方法取決于表型是單變量還是多變量：

Univariate phenotype/Multivariate phenotype analyzed separately:
Pearson correlation (default) with correlation sign
Spearman correlation with correlation sign
Canonical correlation
Chi-squared statistic (via likelihood ratio test between a full model (with phenotype as explanatory variables) and a reduced model (intercept-only)) with coefficient sign
Multivariate phenotype analyzed jointly:
Canonical correlation
Chi-squared statistic (via likelihood ratio test)

The permutation-based p-value can be obtained and adjusted p-value is computed using Benjamini & Yekutieli procedure.

We can first use a kernel density plot to visualize cell density stratified by disease severity.

# density plot on UMAP embeddings
treecor_celldensityplot(cell_meta,
                        sample_meta,
                        row_variable = 'study',
                        col_variable = 'severity',
                        row_combined = F)

圖片.png

Next, we can assess the association between cell type proportion and disease severity using treecor_ctprop() pipeline:

# cell type prop pipeline (default)
res_ctprop_full <- treecor_ctprop(hierarchy_structure,
                                  cell_meta,
                                  sample_meta,
                                  response_variable = 'severity',
                                  method = "aggregate",
                                  analysis_type = "pearson",
                                  num_permutations = 100)
names(res_ctprop_full)

第一個(gè)元素是一個(gè)計(jì)算匯總統(tǒng)計(jì)量（例如皮爾遜相關(guān)性）以及 p 值和調(diào)整后的 p 值的表格。第二個(gè)元素是每個(gè)細(xì)胞類(lèi)型的 PC 矩陣列表羊始。但是旱幼，在默認(rèn)設(shè)置（聚合）中，比例聚合為向量（一維）突委，因此在這種情況下不進(jìn)行 PCA柏卤。

# extract Pearson correlation
res_ctprop <- res_ctprop_full[[1]] %>% mutate(severity.absolute_cor = abs(severity.pearson))
head(res_ctprop)
##   id severity.method severity.analysis_type severity.pearson severity.p
## 1  1       aggregate                pearson               NA         NA
## 2  2       aggregate                pearson        0.1147058 0.82828283
## 3  3       aggregate                pearson       -0.7614817 0.08080808
## 4  4       aggregate                pearson        0.7603030 0.08080808
## 5  5       aggregate                pearson        0.1147058 0.82828283
## 6  6       aggregate                pearson        0.5039696 0.11111111
##   severity.adjp severity.direction severity.p.sign severity.adjp.sign     x y
## 1            NA               <NA>            <NA>               <NA>  6.25 2
## 2     1.0000000                  +              ns                 ns  0.00 1
## 3     0.7781478                  -              ns                 ns  5.00 1
## 4     0.7781478                  +              ns                 ns 12.50 1
## 5     1.0000000                  +              ns                 ns  0.00 0
## 6     0.8321858                  +              ns                 ns  1.00 0
##      label  leaf severity.absolute_cor
## 1      All FALSE                    NA
## 2        B FALSE             0.1147058
## 3        T FALSE             0.7614817
## 4 Monocyte FALSE             0.7603030
## 5     B_c7  TRUE             0.1147058
## 6    T_c15  TRUE             0.5039696
colnames(res_ctprop)
##  [1] "id"                     "severity.method"        "severity.analysis_type"
##  [4] "severity.pearson"       "severity.p"             "severity.adjp"         
##  [7] "severity.direction"     "severity.p.sign"        "severity.adjp.sign"    
## [10] "x"                      "y"                      "label"                 
## [13] "leaf"                   "severity.absolute_cor"

然后可以使用 TreeCorTreat 圖來(lái)可視化結(jié)果。可以為不同的美學(xué)指定變量（例如顏色匀油、大小缘缚、alpha）。后面將詳細(xì)討論 TreeCorTreat 圖敌蚜。

# visualize
treecortreatplot(hierarchy_structure,
                 annotated_df = res_ctprop,
                 response_variable = 'severity',
                 color_variable = 'direction',
                 size_variable = 'absolute_cor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2)

圖片.png

Module 4: Identify association between global gene expression and sample phenotype

全局基因表達(dá)的分析被包裝到一個(gè)獨(dú)立的函數(shù) treecor_expr 中桥滨。在這里，還首先為每個(gè)樹(shù)節(jié)點(diǎn)定義特征弛车，然后評(píng)估全局基因表達(dá)-樣本表型關(guān)聯(lián)齐媒。對(duì)于葉子節(jié)點(diǎn)，首先將節(jié)點(diǎn)中的所有細(xì)胞池化以獲得節(jié)點(diǎn)的偽批量配置文件纷跛。然后喻括，使用 R 中的局部加權(quán)散點(diǎn)圖平滑 (LOWESS) 程序，使用來(lái)自所有樣本的這些偽批量配置文件來(lái)選擇高度可變的基因贫奠。每個(gè)樣本中所選高變量基因的偽批量配置文件用作節(jié)點(diǎn)的特征向量唬血。這個(gè)特征向量是多元的望蜡。對(duì)于非葉節(jié)點(diǎn)，可以選擇在聚合（默認(rèn)設(shè)置）、連接葉節(jié)點(diǎn)或連接直接子節(jié)點(diǎn)之一中定義特征。一旦定義了每個(gè)節(jié)點(diǎn)的特征韧骗，將遍歷每個(gè)樹(shù)節(jié)點(diǎn)以檢查其特征與樣本表型之間的相關(guān)性（或其他匯總統(tǒng)計(jì)量）扎附。可以選擇規(guī)范相關(guān)或 F 統(tǒng)計(jì)量（Pillai-Bartlett枢舶，將完整模型（以表型作為解釋變量）與簡(jiǎn)化模型（僅截取）進(jìn)行比較）。

The following code illustrates an example of association between gene expression and disease severity, using default setting (aggregate).

# gene expression pipeline
res_expr_full <- treecor_expr(count,
                              hierarchy_structure,
                              cell_meta,
                              sample_meta,
                              response_variable = 'severity',
                              method = 'aggregate',
                              analysis_type = 'cancor',
                              num_permutations = 100)

By running treecor_expr, it results in two elements: the first element is a summary table of computed summary statistic (e.g. canonical correlation) along with p-value and adjusted p-value for each cell type; and the second element is a list of PC matrices for each cell type.

# extract canonical correlation
res_expr <- res_expr_full[[1]]
head(res_expr)
##   id severity.cancor severity.p severity.adjp severity.direction
## 1  1       0.8942939 0.03030303     0.4360897                  +
## 2  2       0.6428883 0.15151515     0.9911129                  +
## 3  3       0.9297493 0.03030303     0.4360897                  +
## 4  4       0.8324347 0.03030303     0.4360897                  +
## 5  5       0.6428883 0.15151515     0.9911129                  +
## 6  6       0.8807827 0.04040404     0.4845441                  +
##   severity.p.sign severity.adjp.sign     x y    label  leaf
## 1             sig                 ns  6.25 2      All FALSE
## 2              ns                 ns  0.00 1        B FALSE
## 3             sig                 ns  5.00 1        T FALSE
## 4             sig                 ns 12.50 1 Monocyte FALSE
## 5              ns                 ns  0.00 0     B_c7  TRUE
## 6             sig                 ns  1.00 0    T_c15  TRUE

The summary table is used to generate TreeCorTreat plot:
# visualize
treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'p.sign',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2)

圖片.png

除了匯總表膏孟，還可以通過(guò)使用來(lái)自基因表達(dá)分析的 PC 矩陣分別評(píng)估每種細(xì)胞類(lèi)型。

# extract PCA list
pc_expr <- res_expr_full[[2]]
names(pc_expr)

##  [1] "All"      "B"        "T"        "Monocyte" "B_c7"     "T_c15"   
##  [7] "T_c14"    "T_c10"    "T_c8"     "T_c6"     "T_c5"     "T_c4"    
## [13] "T_c3"     "T_c2"     "Mono_c16" "Mono_c13" "Mono_c12" "Mono_c11"
## [19] "Mono_c9"  "Mono_c1"

# extract PCA matrix corresponding to 'T' tree node
pc_expr[['T']]

##                    PC1        PC2
## HD-17-Su    -44.684639   2.823131
## HD-19-Su    -51.774059   4.264000
## HD-23-Su    -51.352355 -23.472410
## HD-30-Su    -39.098933 -37.230307
## Se-137.1-Su  73.681369  47.520847
## Se-178.1-Su  11.966558  20.281313
## Se-180.1-Su   6.457736  53.349695
## Se-181.1-Su  94.804324 -67.536270

PC 矩陣列表由細(xì)胞級(jí)meta數(shù)據(jù)的細(xì)胞類(lèi)型列中指定的細(xì)胞類(lèi)型名稱(chēng)命名拌汇。因此柒桑，可以提取給定細(xì)胞類(lèi)型（例如 T 細(xì)胞類(lèi)別）的每個(gè)樣本的 PC 坐標(biāo)，并使用 treecor_samplepcaplot 可視化樣本級(jí) PCA 圖噪舀。

# PCA plot (using `T` node)
treecor_samplepcaplot(sample_meta,
                      pca_matrix = pc_expr[['T']],
                      response_variable = 'severity',
                      font_size = 10,
                      point_size = 3)

圖片.png

可以觀察到樣本的嚴(yán)重性沿著虛線指示的方向從健康到嚴(yán)重變化魁淳，這對(duì)應(yīng)于從 CCA 推斷出的最佳軸，該軸最大化了嵌入空間中嚴(yán)重性和樣本坐標(biāo)之間的相關(guān)性与倡。

Module 5: Identify differentially expressed genes

差異表達(dá)基因 (DEG) 的分析包含在函數(shù) treecor_deg() 中界逛。在這里，我們首先像以前一樣使用“聚合”方法計(jì)算每個(gè)樹(shù)節(jié)點(diǎn)的假體基因表達(dá)譜纺座。換句話說(shuō)息拜，對(duì)于每個(gè)節(jié)點(diǎn)，來(lái)自其descendants的所有細(xì)胞都被匯集并聚合到每個(gè)樣本中的偽批量配置文件中净响。 Pseudobulk 配置文件按庫(kù)大小標(biāo)準(zhǔn)化少欺。然后使用 Limma 進(jìn)行差異表達(dá)分析：

單獨(dú)分析單變量表型/多變量表型：通過(guò)使用協(xié)變量調(diào)整對(duì)表型的基因表達(dá)進(jìn)行回歸來(lái)擬合 limma 模型。將為每個(gè)節(jié)點(diǎn)報(bào)告通過(guò)用戶(hù)指定的錯(cuò)誤發(fā)現(xiàn)率 (FDR) 截?cái)嘀担J(rèn)截?cái)嘀?= 0.05）的 DEG馋贤，并將其與日志折疊更改赞别、t 統(tǒng)計(jì)、p 值和 FDR 一起保存到 csv 文件中配乓。
聯(lián)合分析多變量表型：使用數(shù)據(jù)驅(qū)動(dòng)方法 (PC1) 或用戶(hù)指定的權(quán)重（使用權(quán)重的多個(gè)性狀的線性組合）將多個(gè)性狀組合成單變量變量仿滔。然后將聚合的性狀作為單變量表型進(jìn)行分析。

# DEG pipeline
res_deg <- treecor_deg(count,
                       hierarchy_structure,
                       cell_meta,
                       sample_meta,
                       response_variable = 'severity')
names(res_deg)

## [1] "dge.summary"   "dge.ls"        "pseudobulk.ls"

# 1st: summary of number of DEGs 
head(res_deg$dge.summary) # or head(res_deg[[1]])

##      label severity.num_deg     x y id  leaf
## 1      All                0  6.25 2  1 FALSE
## 2        B                0  0.00 1  2 FALSE
## 3        T              993  5.00 1  3 FALSE
## 4 Monocyte                2 12.50 1  4 FALSE
## 5     B_c7                0  0.00 0  5  TRUE
## 6    T_c15              456  1.00 0  6  TRUE

# 2nd: extract DEGs for a specific tree node using cell type name
# (1) check phenotypes
names(res_deg$dge.ls)

## [1] "severity"

# (2) choose a phenotype
severity.dge.ls <- res_deg$dge.ls[['severity']]
# (3) extract DEGs of T cell category
head(severity.dge.ls[['T']]) 

##          severitySe.logFC severitySe.t severitySe.p severitySe.fdr
## FADS1            2.221299    13.884431 2.351480e-07    0.004165177
## JCHAIN           3.036823    12.451043 5.964215e-07    0.005282207
## INPPL1           1.755148    10.250902 3.058248e-06    0.017470015
## APOBEC3H         2.250345     9.908317 4.052855e-06    0.017470015
## GPR68            1.693644     9.515158 5.657245e-06    0.017470015
## CORO1C           1.813891     9.220954 7.316652e-06    0.017470015

# 3rd: extract sample-level pseudobulk for a specific tree node using cell type name
names(res_deg$pseudobulk.ls) # or names(res_deg[[3]])

##  [1] "All"      "B"        "T"        "Monocyte" "B_c7"     "T_c15"   
##  [7] "T_c14"    "T_c10"    "T_c8"     "T_c6"     "T_c5"     "T_c4"    
## [13] "T_c3"     "T_c2"     "Mono_c16" "Mono_c13" "Mono_c12" "Mono_c11"
## [19] "Mono_c9"  "Mono_c1"

head(res_deg$pseudobulk.ls[['T']]) 

##          HD-17-Su HD-19-Su  HD-23-Su HD-30-Su Se-137.1-Su Se-178.1-Su
## A1BG     5.493462 5.506674 5.6802740 5.876471   4.7791551   5.3461643
## A1BG-AS1 2.652148 1.854222 2.3063366 2.420313   2.9822341   2.2278498
## A2M      0.000000 0.000000 0.1968054 0.000000   0.6555113   0.3599841
## A2M-AS1  1.814409 1.413839 1.2995822 1.852688   3.5050187   2.9638223
## A4GALT   0.000000 0.000000 0.3699580 0.000000   0.0000000   0.0000000
## AAAS     4.653329 4.235223 4.1063058 4.665398   5.1026623   4.7283047
##          Se-180.1-Su Se-181.1-Su
## A1BG        5.516765    4.809994
## A1BG-AS1    0.000000    2.002053
## A2M         0.000000    1.001369
## A2M-AS1     3.578594    3.809896
## A4GALT      0.000000    0.000000
## AAAS        4.315328    4.250520

treecor_deg 返回一個(gè)包含三個(gè)元素的列表：第一個(gè)元素總結(jié)了每種細(xì)胞類(lèi)型的 DEG 數(shù)量扰付，第二個(gè)元素對(duì)應(yīng)于每個(gè)表型（response_variable）的每個(gè)細(xì)胞cluster中識(shí)別的特定 DEG堤撵，第三個(gè)元素存儲(chǔ)樣本級(jí)假體基因表達(dá)矩陣（對(duì)于每種細(xì)胞類(lèi)型，每一行是一個(gè)基因羽莺，每一列是一個(gè)樣本）实昨。

根據(jù)是單獨(dú)分析還是聯(lián)合分析多變量表型，匯總表中會(huì)有額外的列（即第一個(gè)元素 - dge.summary）盐固，記錄關(guān)于不同表型的 DEG 數(shù)量和 DEG 列表中的其他元素（即第二個(gè)元素 - dge.ls）荒给。

We can visualize the number of DEGs using TreeCorTreat plot:

# visualize
treecortreatplot(hierarchy_structure,
                 annotated_df = res_deg$dge.summary,
                 response_variable = 'severity',
                 color_variable = 'num_deg',
                 size_variable = 'num_deg',
                 alpha_variable = NULL,
                 annotate_number = T,
                 annotate_number_column = 'num_deg',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2)

圖片.png

此外丈挟，可以使用 treecor_degheatmap 通過(guò)指定 n 和對(duì)數(shù)折疊變化的方向/符號(hào)來(lái)探索前 n 個(gè) DEG 的基因表達(dá)模式：

# heatmap for top 10 DEGs with positive log fold change (enriched in Severe)
treecor_degheatmap(sample_meta,
                   pseudobulk = res_deg$pseudobulk.ls[['T']],
                   deg_result = res_deg$dge.ls$severity[['T']],
                   top_n = 10,
                   deg_logFC = "positive",
                   annotation_col = c('severity','sex','age'),
                   annotation_colors = list(severity = c('HD' = 'green','Se' = 'red'),
                                            sex = c('F' = 'purple','M' = 'blue')))

圖片.png

# heatmap for top 10 DEGs with negative log fold change (enriched in Healthy)
treecor_degheatmap(sample_meta,
                   pseudobulk = res_deg$pseudobulk.ls[['T']],
                   deg_result = res_deg$dge.ls$severity[['T']],
                   top_n = 10,
                   deg_logFC = "negative",
                   annotation_col = c('severity','sex','age'),
                   annotation_colors = list(severity = c('HD' = 'green','Se' = 'red'),
                                            sex = c('F' = 'purple','M' = 'blue')))

圖片.png

# Combined two plots above
treecor_degheatmap(sample_meta,
                   pseudobulk = res_deg$pseudobulk.ls[['T']],
                   deg_result = res_deg$dge.ls$severity[['T']],
                   top_n = 10,
                   deg_logFC = "both",
                   annotation_col = c('severity','sex','age'),
                   annotation_colors = list(severity = c('HD' = 'green','Se' = 'red'),
                                            sex = c('F' = 'purple','M' = 'blue')))

圖片.png

Module 6: Visualization

TreeCorTreat 提供了多種函數(shù)來(lái)可視化中間和最終結(jié)果。例如志电，可以使用 treecor_celldensityplot 來(lái)顯示不同樣本組的細(xì)胞類(lèi)型比例（模塊 3）曙咽。 treecor_samplepcaplot 將樣本投影到共享的低維嵌入并找到最佳軸（模塊 4）。treecor_degheatmap 顯示給定細(xì)胞類(lèi)型的最高差異表達(dá)基因（模塊 5）挑辆。特別是例朱，為了總結(jié)最終結(jié)果，開(kāi)發(fā)了一個(gè)多功能函數(shù) treecortreatplot 以我們稱(chēng)為“TreeCorTreat plot”的格式顯示多分辨率細(xì)胞類(lèi)型-表型關(guān)聯(lián)鱼蝉。該圖由顯示細(xì)胞類(lèi)型層次結(jié)構(gòu)的樹(shù)狀圖和顯示每個(gè)樹(shù)節(jié)點(diǎn)分析結(jié)果的信息層組成洒嗤。

Tree skeleton representation

為了顯示樹(shù)，可以使用直線魁亦、經(jīng)典的角度彎曲表示或二次貝塞爾曲線表示來(lái)連接節(jié)點(diǎn)渔隶。可以根據(jù)自己的喜好選擇線條美感，例如線型（例如實(shí)線或虛線）和線條顏色洁奈。

Edge type

Straight line

treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'p.sign',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2,
                 ## parameter: edge_path_type
                 edge_path_type = "link")

圖片.png

Classical angle bend

treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'p.sign',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.6,
                 nonleaf_point_gap = 0.3,
                 ## parameter: edge_path_type
                 edge_path_type = "elbow")

圖片.png

Quadratic Bezier curves

treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'p.sign',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2,
                 ## parameter: edge_path_type
                 edge_path_type = "diagonal")

圖片.png

Line aesthetics: line type and color

treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'p.sign',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2,
                 ## parameter: line_type
                 line_type = 'dashed',
                 ## parameter: line_color
                 line_color = 'darkgreen')

image.png

Modify label or circle position

In some cases where label length (cell type name) is long or circle size is quite large, default label or circle position on the non-leaf part (left) may not work well. Therefore, we introduce nonleaf_label_pos and nonleaf_point_gap parameters which allow users to manually fix the label position or gap between multivariate phenotype. A tip is that if one have k phenotypes to be analyzed separately, one can first choose a proper gap between points (i.e. nonleaf_point_gap = 0.1) and let nonleaf_label_pos = $knonleaf_point_gap + small_number$ (i.e. let k=3, nonleaf_label_pos = 0.13+0.2 = 0.5).In the above examples, we specify gap between points at 0.2 and label position at 0.4.

Leaf representation

For external node configuration, TreeCorTreat plot aligns cell types in the rows and phenotypes in the columns. There are three different ways to visualize the results: balloon plot, heatmap and bar plot. Users can use color (color_variable), size (size_variable) and transparency (alpha_variable) to encode different information.

Balloon plot

treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'p.sign',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2,
                 ## parameter: plot_type
                 plot_type = 'balloon')

圖片.png

In the above balloon plot, the size of the balloon reflects the magnitude of canonical correlation between global gene expression profile and disease severity and color as well as transparency correspond to p-value significance.

Alternatively, we can use color to represent canonical correlation and size/transparency to represent p-value significance using the following code:

treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'cancor',
                 size_variable = 'p.sign',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2,
                 ## parameter: plot_type
                 plot_type = 'balloon')

圖片.png

Heatmap

treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'cancor',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2,
                 ## parameter: plot_type
                 plot_type = 'heatmap')

圖片.png

Barplot

treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'cancor',
                 size_variable = 'p.sign',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2,
                 ## parameter: plot_type
                 plot_type = 'bar')

圖片.png

More advanced plotting figures

Annotate numbers

Users can specify which variable/color to be used in annotating TreeCorTreat plot via annotate_number, annotate_number_column and annotate_number_color arguments.The following code demonstrates an example of overlaying canonical correlation on top of TreeCorTreat:

treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'p.sign',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2,
                 plot_type = 'balloon',
                 annotate_number = T,
                 annotate_number_column = 'cancor',
                 annotate_number_color = 'black')

圖片.png

Modify legend title

If we want to modify legend title, one way is to modify column names of annotated_df directly before running treecortreatplot. The column names for summary statistic and statistical significance are usually defined in response_variable.colname format (e.g. severity.cancor, severity.direction (only exists in cell type proportion analysis), severity.p, severity.p.sign, etc). Thus, it’s crucial to check the column names are in the correct format before generating TreeCorTreat plot.

Suppose we hope to modify names for both response variables (‘severity’) and legends (i.e. ‘cancor’ and ‘p.sign’) by capitalizing the first letter in ‘severity’ and ‘cancor’ and replace ‘p.sign’ by ‘P-value significance’.

## modify column names
colnames(res_expr)

##  [1] "id"                 "severity.cancor"    "severity.p"        
##  [4] "severity.adjp"      "severity.direction" "severity.p.sign"   
##  [7] "severity.adjp.sign" "x"                  "y"                 
## [10] "label"              "leaf"

res_expr_new <- res_expr
colnames(res_expr_new) <- gsub('severity','Severity',colnames(res_expr_new))
colnames(res_expr_new) <- gsub('cancor','Cancor',colnames(res_expr_new))
colnames(res_expr_new) <- gsub('\\.p.sign','\\.P-value significance',colnames(res_expr_new))

## check modified column names
colnames(res_expr_new)

##  [1] "id"                            "Severity.Cancor"              
##  [3] "Severity.p"                    "Severity.adjp"                
##  [5] "Severity.direction"            "Severity.P-value significance"
##  [7] "Severity.adjp.sign"            "x"                            
##  [9] "y"                             "label"                        
## [11] "leaf"

## visualize
treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr_new,
                 response_variable = 'Severity',
                 color_variable = 'P-value significance',
                 size_variable = 'Cancor',
                 alpha_variable = 'P-value significance',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2,
                 plot_type = 'balloon')

圖片.png

Modify label colors

Users can also modify label colors to highlight cell types of interest. The advanced setting can be configured by advanced_list argument by providing label_info with fixed column names: ‘label’ (cell type label) and ‘label.color’ (a color vector). For example, we can highlight cell types that have significant global gene expression-disease severity correlation in red:

label_info <- data.frame(label = res_expr$label,
                         label.color = ifelse(res_expr$severity.p.sign == 'ns','black','red'))
treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'cancor',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2,
                 plot_type = 'balloon',
                 advanced_list = list(label_info = label_info))

圖片.png

Create your own mapping from levels to aesthetic values

Users can customize mapping from levels in the data (e.g. categorical variables) to certain aesthetic values (e.g. color/size/transparency) via breaks and values (alike scale_manual() in ggplot2). This can be also achieved by providing color_info or size_info or alpha_info in advanced_list argument. The following code illustrates an example of changing color for statistical significance as blue (non-significant) and red (significant) and modifying transparency:

advanced_list <- list(color_info = data.frame(breaks = c('ns','sig'), values = c('blue','red')),
                      alpha_info = data.frame(breaks = c('ns','sig'), values = c(0.5,1)))
treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'cancor',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2,
                 plot_type = 'balloon',
                 advanced_list = advanced_list)

圖片.png

Choose a different color palette

For categorical variable, color palette can be specified by color_info in advanced options; For continuous variable, we provide different palettes from RColorBrewer package (default is ‘Spectral’) and can be modified via palette as a element in the advanced_list.

treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'cancor',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2,
                 plot_type = 'balloon',
                 advanced_list = list(palette = 'PiYG'))

圖片.png

Configure layout

The layout of TreeCorTreat plot can be modified via layout_widths in the advanced_list parameter, which specifies the relative left-to-right ratio (or non-leaf: leaf ratio).

treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'cancor',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2,
                 plot_type = 'balloon',
                 advanced_list = list(layout_widths = c(2,1)))

圖片.png

Save TreeCorTreat plots

TreecorTreat plots can be saved into pdf, png or other graphical formats.

png('TreeCorTreat_plot.png', height = 8, width = 15, res = 300, units = 'in')
treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'cancor',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2,
                 plot_type = 'balloon')
dev.off()

TreeCorTreat supports versatile multi-resolution analysis

為了在多個(gè)分辨率下分析細(xì)胞類(lèi)型-表型關(guān)聯(lián)间唉，細(xì)粒度細(xì)胞cluster（children）逐漸合并為更大的細(xì)胞cluster（parents）。當(dāng)通過(guò)合并兩個(gè)或多個(gè)子節(jié)點(diǎn)創(chuàng)建父節(jié)點(diǎn)時(shí)利术，來(lái)自子節(jié)點(diǎn)的信息可以以多種不同方式組合呈野，具有不同的含義。組合子節(jié)點(diǎn)的兩種基本方法是“聚合”和“串聯(lián)”氯哮。 “聚合”方法將來(lái)自子節(jié)點(diǎn)的所有細(xì)胞匯集到一個(gè)用于表示父節(jié)點(diǎn)的偽批量樣本中际跪。在“串聯(lián)”方法中商佛，首先將每個(gè)子節(jié)點(diǎn)內(nèi)的細(xì)胞池化喉钢，以形成該節(jié)點(diǎn)的偽批量樣本。一個(gè)使用從其偽批量樣本（例如良姆，高度可變的基因）中提取的特征或特征向量來(lái)表示每個(gè)子節(jié)點(diǎn)肠虽。然后通過(guò)將來(lái)自所有子節(jié)點(diǎn)的特征向量連接成一個(gè)更長(zhǎng)的向量來(lái)表示父節(jié)點(diǎn)。

To support different needs of users, TreeCorTreat provides three different ways to combine child nodes: ‘a(chǎn)ggregate’ (default), ‘concatenate leaf nodes’, and ‘concatenate immediate children’.

Aggregate

The default setting for both treecor_expr and treecor_ctprop is aggregation (see previous examples), where we aggregate raw read counts for gene expression or number of cells for cell composition for non-leaf nodes.

Concatenate leaf nodes

In the ‘concatenate leaf nodes’ approach, feature vectors of all terminal leaf nodes derived from a node are concatenated into a long vector to serve as the feature vector of the node. For global gene expression analysis, this will result in a concatenated vector consisting of pseudobulk expression of highly variable genes obtained from each leaf node. For cell type proportion analysis, this will result in a vector of cell type proportions of the leaf nodes.

# gene expression pipeline; concat_leaf
res_expr_concatLeaf <- treecor_expr(count,
                                    hierarchy_structure,
                                    cell_meta,
                                    sample_meta,
                                    response_variable = 'severity',
                                    method = 'concat_leaf',
                                    num_permutations = 100)$canonical_corr
head(res_expr_concatLeaf)

##   id severity.cancor severity.p severity.adjp severity.direction
## 1  1       0.7803956 0.03030303     0.4360897                  +
## 2  2       0.6428883 0.15151515     0.9911129                  +
## 3  3       0.8556124 0.03030303     0.4360897                  +
## 4  4       0.6743079 0.03030303     0.4360897                  +
## 5  5       0.6428883 0.15151515     0.9911129                  +
## 6  6       0.8807827 0.04040404     0.4845441                  +
##   severity.p.sign severity.adjp.sign     x y    label  leaf
## 1             sig                 ns  6.25 2      All FALSE
## 2              ns                 ns  0.00 1        B FALSE
## 3             sig                 ns  5.00 1        T FALSE
## 4             sig                 ns 12.50 1 Monocyte FALSE
## 5              ns                 ns  0.00 0     B_c7  TRUE
## 6             sig                 ns  1.00 0    T_c15  TRUE

Concatenate immediate children

In the ‘concatenate immediate children’ approach, a node’s immediate children are first identified. The feature vector of each immediate child is obtained using the ‘a(chǎn)ggregate’ approach. Then feature vector of target node is obtained by concatenating the feature vectors of its immediate children.

gene expression pipeline; concat_immediate_children

res_expr_concatImmChild <- treecor_expr(count,
                                        hierarchy_structure,
                                        cell_meta,
                                        sample_meta,
                                        response_variable = 'severity',
                                        method = 'concat_immediate_children',
                                        num_permutations = 100)$canonical_corr
head(res_expr_concatImmChild)

##   id severity.cancor severity.p severity.adjp severity.direction
## 1  1       0.8504976 0.03030303     0.4360897                  +
## 2  2       0.6428883 0.15151515     0.9911129                  +
## 3  3       0.8556124 0.03030303     0.4360897                  +
## 4  4       0.6743079 0.03030303     0.4360897                  +
## 5  5       0.6428883 0.15151515     0.9911129                  +
## 6  6       0.8807827 0.04040404     0.4845441                  +
##   severity.p.sign severity.adjp.sign     x y    label  leaf
## 1             sig                 ns  6.25 2      All FALSE
## 2              ns                 ns  0.00 1        B FALSE
## 3             sig                 ns  5.00 1        T FALSE
## 4             sig                 ns 12.50 1 Monocyte FALSE
## 5              ns                 ns  0.00 0     B_c7  TRUE
## 6             sig                 ns  1.00 0    T_c15  TRUE

# compare 3 methods (non-leaf nodes)
summary_ls <- list(res_expr %>% filter(!leaf) %>% select(id,label,severity.cancor) %>% rename(aggregate = severity.cancor),
                   res_expr_concatLeaf %>% filter(!leaf) %>% select(id,label,severity.cancor) %>% rename(concatLeaf = severity.cancor),
                   res_expr_concatImmChild %>% filter(!leaf) %>% select(id,label,severity.cancor) %>% rename(concatImmChild = severity.cancor))
summary <- Reduce(inner_join,summary_ls) 
summary

##   id    label aggregate concatLeaf concatImmChild
## 1  1      All 0.8942939  0.7803956      0.8504976
## 2  2        B 0.6428883  0.6428883      0.6428883
## 3  3        T 0.9297493  0.8556124      0.8556124
## 4  4 Monocyte 0.8324347  0.6743079      0.6743079

For some broad cell categories like B, T and Monocytes, the canonical correlation of leaf concatenation and immediate children concatenation are the same because the immediate children set is same as its corresponding leaf nodes. Depending on different tree hierarchies, three approaches may have different results and implications. This can also be visualized using TreeCorTreat plot:

# aggregate
colnames(res_expr) <- gsub('severity','Severity\n(Agg)',colnames(res_expr))
# concat_leaf
colnames(res_expr_concatLeaf) <- gsub('severity','Severity\n(Leaf)',colnames(res_expr_concatLeaf))
# concat_immediate_children
colnames(res_expr_concatImmChild) <- gsub('severity','Severity\n(ImmChild)',colnames(res_expr_concatImmChild))

# combine three approaches
res_combined <- Reduce(inner_join,list(res_expr,res_expr_concatLeaf,res_expr_concatImmChild))

# plot
treecortreatplot(hierarchy_structure,
                 annotated_df = res_combined,
                 response_variable = c('Severity\n(Agg)','Severity\n(ImmChild)','Severity\n(Leaf)'),
                 color_variable = 'p.sign',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.8,
                 nonleaf_point_gap = 0.2,
                 plot_type = 'balloon')

圖片.png

Analysis of multivariate outcomes

當(dāng)有多個(gè)表型性狀時(shí)玛追，可以將每個(gè)性狀作為單變量表型單獨(dú)分析税课，也可以作為多變量表型聯(lián)合分析。為了分析多變量表型與細(xì)胞類(lèi)型比例或全局基因表達(dá)之間的關(guān)聯(lián)痊剖，CCA 用于計(jì)算表型與細(xì)胞類(lèi)型特征（即細(xì)胞類(lèi)型比例或基因表達(dá)）之間的典型相關(guān)性韩玩。

Separate evaluation

# individually
multi_expr_sep <- treecor_expr(count,
                               hierarchy_structure,
                               cell_meta,
                               sample_meta,
                               response_variable = c('severity','age'),
                               num_permutations = 100)$canonical_corr
# visualize
treecortreatplot(hierarchy_structure,
                 annotated_df = multi_expr_sep,
                 response_variable = c('severity','age'),
                 separate = T,
                 color_variable = 'p.sign',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.5,
                 nonleaf_point_gap = 0.2)

圖片.png

Joint evaluation

# jointly
multi_expr_joint <- treecor_expr(count,
                                 hierarchy_structure,
                                 cell_meta,
                                 sample_meta,
                                 response_variable = c('severity','age'),
                                 separate = F,
                                 num_permutations = 100)$canonical_corr
# visualize
treecortreatplot(hierarchy_structure,
                 annotated_df = multi_expr_joint,
                 response_variable = c('severity','age'),
                 separate = F,
                 color_variable = 'p.sign',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.2,
                 nonleaf_point_gap = 0.1)

圖片.png

為了分析差異基因表達(dá)，首先使用主成分分析或使用指定權(quán)重的性狀線性組合將多變量表型轉(zhuǎn)換為單變量表型陆馁。轉(zhuǎn)化后的單變量表型結(jié)合了來(lái)自多個(gè)性狀的信息找颓，然后用于運(yùn)行差異表達(dá)分析。

# jointly
multi_deg_joint <- treecor_deg(count,
                               hierarchy_structure,
                               cell_meta,
                               sample_meta %>% mutate(severity = ifelse(severity == "HD", 0,1)),
                               response_variable = c('severity','age'),
                               separate = F,
                               save_as_csv = F)$dge.summary
head(multi_deg_joint)

##      label combined_phenotype.num_deg     x y id  leaf
## 1      All                          1  6.25 2  1 FALSE
## 2        B                          0  0.00 1  2 FALSE
## 3        T                        700  5.00 1  3 FALSE
## 4 Monocyte                          0 12.50 1  4 FALSE
## 5     B_c7                          0  0.00 0  5  TRUE
## 6    T_c15                          0  1.00 0  6  TRUE

Adjusting for covariates in TreeCorTreat analysis

TreeCorTreat is capable of handling covariates in the analysis, allowing users to adjust for potential confounders or unwanted technical variation. For example, instead of viewing age as a phenotype, one can also view it as a covariate and ask which cell type features are associated with disease severity after accounting for age.

# adjust for age
expr_adjusted <- treecor_expr(count,
                              hierarchy_structure,
                              cell_meta,
                              sample_meta,
                              response_variable = 'severity',
                              formula = '~age',
                              num_permutations = 100)$canonical_corr
# visualize
treecortreatplot(hierarchy_structure,
                 annotated_df = expr_adjusted,
                 response_variable = 'severity',
                 color_variable = 'p.sign',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.2,
                 nonleaf_point_gap = 0.1)

圖片.png

可視化相當(dāng)不錯(cuò)叮贩，值得大家借鑒

生活很好击狮，有你更好

最后編輯于：2021.11.09 16:07:27

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者

禁止轉(zhuǎn)載佛析，如需轉(zhuǎn)載請(qǐng)通過(guò)簡(jiǎn)信或評(píng)論聯(lián)系作者。

人面猴
序言：七十年代末彪蓬，一起剝皮案震驚了整個(gè)濱河市寸莫，隨后出現(xiàn)的幾起案子，更是在濱河造成了極大的恐慌档冬，老刑警劉巖膘茎，帶你破解...
沈念sama閱讀 206,013評(píng)論 6贊 481
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件，死亡現(xiàn)場(chǎng)離奇詭異酷誓，居然都是意外死亡辽狈，警方通過(guò)查閱死者的電腦和手機(jī)，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 88,205評(píng)論 2贊 382
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進(jìn)店門(mén)呛牲，熙熙樓的掌柜王于貴愁眉苦臉地迎上來(lái)刮萌，“玉大人，你說(shuō)我怎么就攤上這事娘扩∽湃祝” “怎么了？”我有些...
開(kāi)封第一講書(shū)人閱讀 152,370評(píng)論 0贊 342
道士緝兇錄：失蹤的賣(mài)姜人
文/不壞的土叔我叫張陵琐旁，是天一觀的道長(zhǎng)涮阔。經(jīng)常有香客問(wèn)我，道長(zhǎng)灰殴，這世上最難降的妖魔是什么敬特？我笑而不...
開(kāi)封第一講書(shū)人閱讀 55,168評(píng)論 1贊 278
?港島之戀（遺憾婚禮）
正文為了忘掉前任，我火速辦了婚禮牺陶，結(jié)果婚禮上伟阔，老公的妹妹穿的比我還像新娘。我一直安慰自己掰伸，他們只是感情好皱炉，可當(dāng)我...
茶點(diǎn)故事閱讀 64,153評(píng)論 5贊 371
惡毒庶女頂嫁案：這布局不是一般人想出來(lái)的
文/花漫我一把揭開(kāi)白布。她就那樣靜靜地躺著狮鸭，像睡著了一般合搅。火紅的嫁衣襯著肌膚如雪。梳的紋絲不亂的頭發(fā)上歧蕉，一...
開(kāi)封第一講書(shū)人閱讀 48,954評(píng)論 1贊 283
城市分裂傳說(shuō)
那天灾部，我揣著相機(jī)與錄音，去河邊找鬼惯退。笑死赌髓，一個(gè)胖子當(dāng)著我的面吹牛，可吹牛的內(nèi)容都是我干的。我是一名探鬼主播春弥，決...
沈念sama閱讀 38,271評(píng)論 3贊 399
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開(kāi)眼呛哟，長(zhǎng)吁一口氣：“原來(lái)是場(chǎng)噩夢(mèng)啊……” “哼！你這毒婦竟也來(lái)了匿沛？” 一聲冷哼從身側(cè)響起扫责，我...
開(kāi)封第一講書(shū)人閱讀 36,916評(píng)論 0贊 259
萬(wàn)榮殺人案實(shí)錄
序言：老撾萬(wàn)榮一對(duì)情侶失蹤，失蹤者是張志新（化名）和其女友劉穎逃呼，沒(méi)想到半個(gè)月后鳖孤，有當(dāng)?shù)厝嗽跇?shù)林里發(fā)現(xiàn)了一具尸體，經(jīng)...
沈念sama閱讀 43,382評(píng)論 1贊 300
?護(hù)林員之死
正文獨(dú)居荒郊野嶺守林人離奇死亡抡笼，尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛以下內(nèi)容為張勛視角年9月15日...
茶點(diǎn)故事閱讀 35,877評(píng)論 2贊 323
?白月光啟示錄
正文我和宋清朗相戀三年苏揣，在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了。大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片推姻。...
茶點(diǎn)故事閱讀 37,989評(píng)論 1贊 333
活死人
序言：一個(gè)原本活蹦亂跳的男人離奇死亡平匈，死狀恐怖，靈堂內(nèi)的尸體忽然破棺而出藏古，到底是詐尸還是另有隱情增炭，我是刑警寧澤，帶...
沈念sama閱讀 33,624評(píng)論 4贊 322
?日本核電站爆炸內(nèi)幕
正文年R本政府宣布拧晕，位于F島的核電站隙姿，受9級(jí)特大地震影響，放射性物質(zhì)發(fā)生泄漏厂捞。R本人自食惡果不足惜输玷，卻給世界環(huán)境...
茶點(diǎn)故事閱讀 39,209評(píng)論 3贊 307
男人毒藥：我在死后第九天來(lái)索命
文/蒙蒙一、第九天我趴在偏房一處隱蔽的房頂上張望靡馁。院中可真熱鬧欲鹏，春花似錦、人聲如沸奈嘿。這莊子的主人今日做“春日...
開(kāi)封第一講書(shū)人閱讀 30,199評(píng)論 0贊 19
一樁弒父案，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽(yáng)裙犹。三九已至，卻和暖如春衔憨，著一層夾襖步出監(jiān)牢的瞬間叶圃，已是汗流浹背。一陣腳步聲響...
開(kāi)封第一講書(shū)人閱讀 31,418評(píng)論 1贊 260
情欲美人皮
我被黑心中介騙來(lái)泰國(guó)打工践图，沒(méi)想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留掺冠，地道東北人。一個(gè)月前我還...
沈念sama閱讀 45,401評(píng)論 2贊 352
代替公主和親
正文我出身青樓，卻偏偏與公主長(zhǎng)得像德崭，于是被迫代替她去往敵國(guó)和親斥黑。傳聞我的和親對(duì)象是個(gè)殘疾皇子，可洞房花燭夜當(dāng)晚...
茶點(diǎn)故事閱讀 42,700評(píng)論 2贊 345

10X單細(xì)胞（10X空間轉(zhuǎn)錄組）聚類(lèi)分析之樹(shù)形可視化（TreeCorTreat）

好了挎袜，開(kāi)始我們的分享

Overview

模塊化結(jié)構(gòu)使用可以靈活地跳過(guò)某些分析步驟钥弯，并用自己的數(shù)據(jù)或分析功能代替它們径荔。

Input data and data preparation

Module 1: Data integration

集成的 Seurat 對(duì)象將存儲(chǔ)在本地目錄中褒傅，過(guò)濾后的基因表達(dá)矩陣或細(xì)胞級(jí)meta數(shù)據(jù)可以通過(guò) access_data_seurat 提取弃锐，用于下游分析。

在示例中殿托，沒(méi)有細(xì)胞被過(guò)濾掉霹菊，我們將更新的細(xì)胞級(jí)meta數(shù)據(jù)（附加列包括“細(xì)胞類(lèi)型”和 UMAP 坐標(biāo)）存儲(chǔ)為integrated_cellmeta。

Module 2: Define cell types at multiple resolutions

We roughly categorize 16 cell clusters into 3 large cell-types based on gene markers: B cells (CD19), T cells (CD3D) and Monocytes (CD14).

Modify cell type annotation

Users can modify the cell type label via modify_label function to update cell type annotations in cell_meta:

同樣叹坦，也可以使用此功能更新層次樹(shù)結(jié)構(gòu)中的細(xì)胞類(lèi)型注釋(i.e. modify_label(new_label,hierarchy_list = hierarchy_list,cell_meta = NULL)$hierarchy_list).

Construct hierarchical tree structure

為了便于在多個(gè)分辨率下進(jìn)行關(guān)聯(lián)分析熊镣，細(xì)胞cluster被進(jìn)一步分層聚類(lèi)。 可以使用數(shù)據(jù)驅(qū)動(dòng)的方法或基于知識(shí)的方法來(lái)導(dǎo)出樹(shù)募书。

Data-driven approach

Knowledge-based approach

For demonstration purpose, we will use knowledge-based hierarchical structure in subsequent analysis.

Module 3: Identify association between cell type proportion and sample phenotype

The permutation-based p-value can be obtained and adjusted p-value is computed using Benjamini & Yekutieli procedure.

We can first use a kernel density plot to visualize cell density stratified by disease severity.

Next, we can assess the association between cell type proportion and disease severity using treecor_ctprop() pipeline:

然后可以使用 TreeCorTreat 圖來(lái)可視化結(jié)果。 可以為不同的美學(xué)指定變量（例如顏色匀油、大小缘缚、alpha）。 后面將詳細(xì)討論 TreeCorTreat 圖敌蚜。

Module 4: Identify association between global gene expression and sample phenotype

The following code illustrates an example of association between gene expression and disease severity, using default setting (aggregate).

By running treecor_expr, it results in two elements: the first element is a summary table of computed summary statistic (e.g. canonical correlation) along with p-value and adjusted p-value for each cell type; and the second element is a list of PC matrices for each cell type.

除了匯總表膏孟，還可以通過(guò)使用來(lái)自基因表達(dá)分析的 PC 矩陣分別評(píng)估每種細(xì)胞類(lèi)型。

可以觀察到樣本的嚴(yán)重性沿著虛線指示的方向從健康到嚴(yán)重變化魁淳，這對(duì)應(yīng)于從 CCA 推斷出的最佳軸，該軸最大化了嵌入空間中嚴(yán)重性和樣本坐標(biāo)之間的相關(guān)性与倡。

Module 5: Identify differentially expressed genes

根據(jù)是單獨(dú)分析還是聯(lián)合分析多變量表型，匯總表中會(huì)有額外的列（即第一個(gè)元素 - dge.summary）盐固，記錄關(guān)于不同表型的 DEG 數(shù)量和 DEG 列表中的其他元素（即第二個(gè)元素 - dge.ls）荒给。

We can visualize the number of DEGs using TreeCorTreat plot:

此外丈挟，可以使用 treecor_degheatmap 通過(guò)指定 n 和對(duì)數(shù)折疊變化的方向/符號(hào)來(lái)探索前 n 個(gè) DEG 的基因表達(dá)模式：

Module 6: Visualization

Tree skeleton representation

為了顯示樹(shù)，可以使用直線魁亦、經(jīng)典的角度彎曲表示或二次貝塞爾曲線表示來(lái)連接節(jié)點(diǎn)渔隶。 可以根據(jù)自己的喜好選擇線條美感，例如線型（例如實(shí)線或虛線）和線條顏色洁奈。

Edge type

Line aesthetics: line type and color

Modify label or circle position

Leaf representation

Balloon plot

In the above balloon plot, the size of the balloon reflects the magnitude of canonical correlation between global gene expression profile and disease severity and color as well as transparency correspond to p-value significance.

Alternatively, we can use color to represent canonical correlation and size/transparency to represent p-value significance using the following code:

Heatmap

Barplot

More advanced plotting figures

Annotate numbers

Users can specify which variable/color to be used in annotating TreeCorTreat plot via annotate_number, annotate_number_column and annotate_number_color arguments.The following code demonstrates an example of overlaying canonical correlation on top of TreeCorTreat:

Modify legend title

Suppose we hope to modify names for both response variables (‘severity’) and legends (i.e. ‘cancor’ and ‘p.sign’) by capitalizing the first letter in ‘severity’ and ‘cancor’ and replace ‘p.sign’ by ‘P-value significance’.

Modify label colors

Create your own mapping from levels to aesthetic values

Choose a different color palette

For categorical variable, color palette can be specified by color_info in advanced options; For continuous variable, we provide different palettes from RColorBrewer package (default is ‘Spectral’) and can be modified via palette as a element in the advanced_list.

Configure layout

The layout of TreeCorTreat plot can be modified via layout_widths in the advanced_list parameter, which specifies the relative left-to-right ratio (or non-leaf: leaf ratio).

Save TreeCorTreat plots

TreecorTreat plots can be saved into pdf, png or other graphical formats.

TreeCorTreat supports versatile multi-resolution analysis

To support different needs of users, TreeCorTreat provides three different ways to combine child nodes: ‘a(chǎn)ggregate’ (default), ‘concatenate leaf nodes’, and ‘concatenate immediate children’.

Aggregate

The default setting for both treecor_expr and treecor_ctprop is aggregation (see previous examples), where we aggregate raw read counts for gene expression or number of cells for cell composition for non-leaf nodes.

Concatenate leaf nodes

Concatenate immediate children

gene expression pipeline; concat_immediate_children

Analysis of multivariate outcomes

Separate evaluation

Joint evaluation

Adjusting for covariates in TreeCorTreat analysis

可視化相當(dāng)不錯(cuò)叮贩，值得大家借鑒

推薦閱讀更多精彩內(nèi)容

為了便于在多個(gè)分辨率下進(jìn)行關(guān)聯(lián)分析熊镣，細(xì)胞cluster被進(jìn)一步分層聚類(lèi)。可以使用數(shù)據(jù)驅(qū)動(dòng)的方法或基于知識(shí)的方法來(lái)導(dǎo)出樹(shù)募书。

然后可以使用 TreeCorTreat 圖來(lái)可視化結(jié)果。可以為不同的美學(xué)指定變量（例如顏色匀油、大小缘缚、alpha）。后面將詳細(xì)討論 TreeCorTreat 圖敌蚜。

為了顯示樹(shù)，可以使用直線魁亦、經(jīng)典的角度彎曲表示或二次貝塞爾曲線表示來(lái)連接節(jié)點(diǎn)渔隶。可以根據(jù)自己的喜好選擇線條美感，例如線型（例如實(shí)線或虛線）和線條顏色洁奈。