10X單細(xì)胞（10X空間轉(zhuǎn)錄組）BCR（TCR）數(shù)據(jù)分析之(10)changeo

hello县钥，這一篇我們來對軟件changeo進行補充利职，很多細(xì)節(jié)的內(nèi)容，我們需要深入知道一下。

首先第一個AIRR Community Rearrangement揖膜，這個直譯過來就是gene重排，我們來看一下

One of the primary initiatives of the Adaptive Immune Receptor Repertoire (AIRR) Community has been to develop a set of metadata standards for the submission of AIRR sequencing datasets. （說白了梅桩，獲得性免疫的主要結(jié)構(gòu)是什么）

This work has been carried out by the AIRR Community Minimal Standards Working Group. In order to support reproducibility, standard quality control, and data deposition in a common repository, the AIRR Community has agreed to six high-level data sets that will guide the publication, curation and sharing of AIRR-Seq data and metadata: Study and subject, sample collection, sample processing and sequencing, raw sequences, processing of sequence data, and processed AIRR sequences. The detailed data elements within these sets are defined here (Download as TSV).

圖片.png

當(dāng)然次氨，這個地方還是需要更多的注釋。我們簡單了解一下即可摘投。

第二部分煮寡，BCR聚類閾值的選擇

首先是Distance to nearest neighbor

估計用于劃分克隆相關(guān)序列的最佳距離閾值是通過計算從數(shù)據(jù)集中每個序列到其最近鄰居的距離并在所得雙峰分布中找到將克隆相關(guān)序列與無關(guān)序列分開的斷點來完成的。這是通過以下步驟完成的：

Calculating of the nearest neighbor distances for each sequence.
Generating a histogram of the nearest neighbor distances followed by either manual inspect for the threshold separating the two modes or automated threshold detection.

我們看一下distance是怎么計算的

Calculating the nearest neighbor distances requires the following fields (columns) to be present in the table:

sequence_id
v_call
j_call
junction
junction_length

# Subset example data to one sample
library(shazam)
data(ExampleDb, package="alakazam")

Calculating nearest neighbor distances (heavy chain sequences)(最為核心的部分)

By default, distToNearest, the function for calculating distance between every sequence and its nearest neighbor, assumes that it is running under non-single-cell mode and that every input sequence is a heavy chain sequence and will be used for calculation（計算距離只采用了重鏈序列犀呼，TCR也是只采用了β鏈的CDR3）幸撕，需要一些參數(shù)來調(diào)整distance的measure方式。如果使用 `tigger` 包（也是一個很經(jīng)典的BCR分析軟件外臂，這個我們下一個分享）中的方法推斷出基因型坐儿，并且已將 `v_call_genotyped` 添加到數(shù)據(jù)庫中，則可以通過指定 `vCallColumn` 參數(shù)使用此列代替默認(rèn)的 `v_call` 列宋光。這將允許使用來自 `tigger` 的更準(zhǔn)確的 V call來對序列進行分組貌矿。此外，為了更“寬松”地處理不明確的 V(D)J 段調(diào)用罪佳，可以將參數(shù) `first` 設(shè)置為 `FALSE`逛漫。設(shè)置 `first=FALSE` 將使用所有可能基因的聯(lián)合來分組序列，而不是 field中的第一個基因赘艳。`modal`參數(shù)確定使用哪個基礎(chǔ) SHM 模型來計算距離酌毡。The default model is single nucleotide Hamming distance with gaps considered as a match to any nucleotide (ham). 其他選項包括類似于轉(zhuǎn)換/顛換模型 (hh_s1f) 的人類 Ig 特異性單核苷酸模型和來自 Yaari 等人，2013 年 (hh_s5f) 的相應(yīng) 5 聚體上下文模型蕾管，來自 Cui 等人的一對類似的小鼠特異性模型, 2016 (mk_rs1nf 和 mk_rs5nf)枷踏，以及氨基酸漢明距離 (aa)。

Note: Human and mouse distance measures that are backward compatible with SHazaM v0.1.4 and Change-O v0.3.3 are also provided as hs1f_compat and m1n_compat, respectively.

對于不對稱的模型（例如掰曾，A 到 B 的距離不等于 B 到 A 的距離）旭蠕，有一個對稱參數(shù)允許用戶指定是使用兩個距離的平均值還是最小值來確定總距離。

# Use nucleotide Hamming distance and normalize by junction length
dist_ham <- distToNearest(ExampleDb, sequenceColumn="junction", 
                          vCallColumn="v_call_genotyped", jCallColumn="j_call",
                          model="ham", normalize="len", nproc=1)

# Use genotyped V assignments, a 5-mer model and no normalization
dist_s5f <- distToNearest(ExampleDb, sequenceColumn="junction", 
                          vCallColumn="v_call_genotyped", jCallColumn="j_call",
                          model="hh_s5f", normalize="none", nproc=1)

Calculating nearest neighbor distances (single-cell paired heavy and light chain sequences)

`distToNearest`函數(shù)還支持在單細(xì)胞模式下運行旷坦，其中提供了包含單細(xì)胞配對 IGH:IGK/IGL掏熬、TRB:TRA 或 TRD:TRG 鏈序列的輸入 Example10x。在這種情況下塞蹭，默認(rèn)情況下孽江，細(xì)胞首先被分成包含相同重/長鏈（IGH、TRB番电、TRD）V 基因和 J 基因（如果指定岗屏，連接長度）和相同輕/短鏈（IGK , IGL, TRA, TRG) V 基因和 J 基因（如果指定辆琅，連接長度）。然后这刷，僅使用重鏈序列來計算最近鄰距離婉烟。

在單細(xì)胞模式下，輸入 `Example10x` 的每一行都應(yīng)該代表一個序列/鏈暇屋。來自同一cell的序列/鏈由 `cellIdColumn` 列中的cell ID 鏈接似袁。請注意，一個cell應(yīng)該恰好有一個 IGH 序列 (BCR) 或 TRB/TRD (TCR)咐刨。 `locusColumn` 列中的值必須是 IGH昙衅、IGI、IGK 或 IGL (BCR) 或 TRA定鸟、TRB而涉、TRD 或 TRG (TCR) 之一。要調(diào)用單細(xì)胞模式联予，必須指定 `cellIdColumn` 并且 `locusColumn` 必須正確啼县。

可以選擇分組是作為一個階段的過程還是兩個階段的過程來完成。這可以通過 `VJthenLen` 指定沸久。在一個階段的過程中（VJthenLen=FALSE）季眷，細(xì)胞被分成包含相同重/長鏈V基因、J基因和連接長度（V-J-長度組合）和相同輕鏈V-J-長度組合的分區(qū)卷胯。在兩階段過程中（VJthenLen=TRUE）子刮，cells are first divided by heavy/long chain V gene and J gene (V-J combination), and light/short chain V-J combination；然后通過相應(yīng)的junction lengths诵竭。

There is also a choice of whether grouping should be done using IGH (BCR) or TRB/TRD (TCR) sequences only, or using both IGH and IGK/IGL (BCR) or TRB/TRD and TRA/TRG (TCR) sequences. This is governed by onlyHeavy.(這個單鏈做更好一點)

# Single-cell mode 
# Group cells in a one-stage process (VJthenLen=FALSE) and using
# both heavy and light chain sequences (onlyHeavy=FALSE)

data(Example10x, package="alakazam")
dist_sc <- distToNearest(Example10x, cellIdColumn="cell_id", locusColumn="locus", 
                         VJthenLen=FALSE, onlyHeavy=FALSE)

無論是僅使用重鏈序列進行分組话告，還是使用重鏈和輕鏈序列進行分組，都將僅使用重鏈序列來計算最近鄰距離卵慰。因此，在單細(xì)胞模式下佛呻，返回的 `data.frame` 中輕鏈序列對應(yīng)的行將在 `dist_nearest` 字段中具有 `NA`裳朋。 (這個不重要)。

Using nearest neighbor distances to determine clonal assignment thresholds（選擇閾值）

SHazaM 中距離最近計算的主要用途是使用 Change-O 中的 `DefineClones`工具確定克隆分配的最佳閾值吓著。定義閾值依賴于區(qū)分克隆相關(guān)序列（由具有近鄰的序列表示）和單例（沒有近鄰的序列）鲤嫡，后者在最近鄰距離直方圖中顯示為兩種模式。

可以通過檢查最近鄰直方圖或使用 findThreshold 函數(shù)提供的自動閾值檢測算法之一來手動確定閾值绑莺。可用的方法是密度（平滑密度）和 gmm（伽馬/高斯混合模型）暖眼，并通過 findThreshold 的方法參數(shù)進行選擇。

通過人工檢查確定閾值

手動閾值檢測只涉及為 distToNearest 輸出的 dist_nearest 列中的值生成直方圖纺裁，并在兩種模式之間的valley內(nèi)選擇合適的值诫肠。

# Generate Hamming distance histogram
library(ggplot2)
p1 <- ggplot(subset(dist_ham, !is.na(dist_nearest)),
             aes(x=dist_nearest)) + 
    theme_bw() + 
    xlab("Hamming distance") + 
    ylab("Count") +
    scale_x_continuous(breaks=seq(0, 1, 0.1)) +
    geom_histogram(color="white", binwidth=0.02) +
    geom_vline(xintercept=0.12, color="firebrick", linetype=2)
plot(p1)

圖片.png

By manual inspection, the length normalized ham model distance threshold would be set to a value near 0.12 in the above example.

# Generate HH_S5F distance histogram
p2 <- ggplot(subset(dist_s5f, !is.na(dist_nearest)),
             aes(x=dist_nearest)) + 
    theme_bw() + 
    xlab("HH_S5F distance") + 
    ylab("Count") +
    scale_x_continuous(breaks=seq(0, 50, 5)) +
    geom_histogram(color="white", binwidth=1) +
    geom_vline(xintercept=7, color="firebrick", linetype=2)
plot(p2)

圖片.png

In this example, the unnormalized hh_s5f model distance threshold would be set to a value near 7

Automated threshold detection via smoothed density(機器選擇)

The `density` method will look for the minimum in the valley between two modes of a smoothed distribution based on the input vector (`distances`), which will generally be the `dist_nearest` column from the `distToNearest` output. Below is an example of using the `density`method for threshold detection.

# Find threshold using density method
output <- findThreshold(dist_ham$dist_nearest, method="density")
threshold <- output@threshold

# Plot distance histogram, density estimate and optimum threshold
plot(output, title="Density Method")

圖片.png

# Print threshold
print(output)
## [1] 0.1738391

Automated threshold detection via a mixture model

findThreshold 函數(shù)包括用于自動確定克隆分配閾值的方法司澎。 findThreshold (method="gmm") 的“gmm”方法（伽馬/高斯混合方法）使用單變量密度分布函數(shù)的四種組合之一對最近距離分布執(zhí)行最大似然擬合程序：“norm- norm”（兩個高斯分布）、“norm-gamma”（下高斯和上伽馬分布）栋豫、“gamma-norm”（下伽馬和上高斯分布）和“gamma-gamma”（兩個伽馬分布）挤安。默認(rèn)情況下，將通過計算靈敏度和特異性的平均值達到其最大值的距離來選擇閾值（cutoff="optimal"）丧鸯。還提供了替代閾值選擇標(biāo)準(zhǔn)蛤铜，包括曲線交點 (cutoff="intersect")、用戶定義的靈敏度 (cutoff="user", sen=x) 或用戶定義的特異性 (cutoff="user", spc=x)

在下面的示例中丛肢，混合模型方法 (method="gmm") 用于通過擬合兩個伽馬分布 (model="gamma-gamma") 來找到用于分離克隆相關(guān)序列的最佳閾值围肥。下圖中的紅色虛線定義了靈敏度和特異性的平均值達到最大值的距離。

# Find threshold using gmm method
output <- findThreshold(dist_ham$dist_nearest, method="gmm", model="gamma-gamma")

# Plot distance histogram, Gaussian fits, and optimum threshold
plot(output, binwidth=0.02, title="GMM Method: gamma-gamma")

圖片.png

# Print threshold
print(output)
## [1] 0.1221371

注意：由 plotGmmThreshold 繪制的直方圖的形狀由 binwidth 參數(shù)控制蜂怎。意思是穆刻，bin 大小的任何變化都會改變分布的形式，而 gmm 方法完全獨立于 bin 大小派敷，只涉及實際輸入數(shù)據(jù)蛹批。

Calculating nearest neighbor distances independently for subsets of data(相當(dāng)于亞群分析)

The `fields` argument to `distToNearest` will split the input `data.frame` into groups based on values in the specified fields (columns) and will treat them independently. For example, if the input data has multiple samples, then `fields="sample_id"` would allow each sample to be analyzed separately.

In the previous examples we used a subset of the original example data. In the following example, we will use the two available samples,`-1h` and `+7d`, and will set fields="sample_id". This will reproduce previous results for sample -1h and add results for sample +7d.

dist_fields <- distToNearest(ExampleDb, model="ham", normalize="len", 
                             fields="sample_id", nproc=1)

We can plot the nearest neighbor distances for the two samples:

# Generate grouped histograms
p4 <- ggplot(subset(dist_fields, !is.na(dist_nearest)), 
             aes(x=dist_nearest)) + 
    theme_bw() + 
    xlab("Grouped Hamming distance") + 
    ylab("Count") +
    geom_histogram(color="white", binwidth=0.02) +
    geom_vline(xintercept=0.12, color="firebrick", linetype=2) +
    facet_grid(sample_id ~ ., scales="free_y")
plot(p4)

圖片.png

In this case, the threshold selected for -1h seems to work well for +7d as well.

Calculating nearest neighbor distances across groups rather than within a groups(相當(dāng)于多樣本聯(lián)合分析)

指定 `distToNearest` 的`cross`參數(shù)會強制跨組執(zhí)行距離計算(多樣本一起計算)，這樣每個序列的最近鄰居將始終是不同組中的序列篮愉。在下面的示例中腐芍，我們設(shè)置 cross="sample"，它將數(shù)據(jù)分組為 -1h 和 +7d 樣本子集试躏。因此猪勇，樣本-1h 中序列的最近鄰距離將被限制為樣本+7d 中的最近序列，反之亦然颠蕴。

dist_cross <- distToNearest(ExampleDb, sequenceColumn="junction", 
                            vCallColumn="v_call_genotyped", jCallColumn="j_call",
                            model="ham", first=FALSE, 
                            normalize="len", cross="sample_id", nproc=1)
# Generate cross sample histograms
p5 <- ggplot(subset(dist_cross, !is.na(cross_dist_nearest)), 
             aes(x=cross_dist_nearest)) + 
    theme_bw() + 
    xlab("Cross-sample Hamming distance") + 
    ylab("Count") +
    geom_histogram(color="white", binwidth=0.02) +
    geom_vline(xintercept=0.12, color="firebrick", linetype=2) +
    facet_grid(sample_id ~ ., scales="free_y")
plot(p5)

圖片.png

This can provide a sense of overlap between samples or a way to compare within-sample variation to cross-sample variation.

大樣本數(shù)據(jù)的處理泣刹，Speeding up pairwise-distance-matrix calculations with subsampling

The `subsample` option in `distToNearest` allows to speed up calculations and reduce memory usage.

If there are very large groups of sequences that share V call, J call and junction length, `distToNearest` will need a lot of memory and it will take a long time to calculate all the distances. Without subsampling, in a large group of n=70,000 sequences `distToNearest` calculates a nn distance matrix. With subsampling, e.g. to s=15,000, the distance matrix for the same group has size sn, and for each sequence in db, the distance value is calculated by comparing the sequence to the subsampled sequences from the same V-J-junction length group.

# Explore V-J-junction length groups sizes to use subsample
# Show the size of the largest groups
library(dplyr)
library(alakazam)
top_10_sizes <- ExampleDb %>%
     group_by(junction_length) %>% # Group by junction length
     do(alakazam::groupGenes(., first=TRUE)) %>% # Group by V and J call
     mutate(GROUP_ID=paste(junction_length, vj_group, sep="_")) %>% # Create group ids
     ungroup() %>%
     group_by(GROUP_ID) %>% # Group by GROUP_ID
     distinct(junction) %>% # Vount unique junctions per group
     summarize(SIZE=n()) %>% # Get the size of the group
     arrange(desc(SIZE)) %>% # Sort by decreasing size
     select(SIZE) %>% 
     top_n(10) # Filter to the top 10

# Use 30 to subsample
# NOTE: This is a toy example. Subsampling to 30 sequence with real data is unwise
dist <- distToNearest(ExampleDb, sequenceColumn="junction", 
                      vCallColumn="v_call_genotyped", jCallColumn="j_call",
                      model="ham", 
                      first=FALSE, normalize="len",
                      subsample=30)

第三，就是上面提到的SHM模型犀被，SHM targeting models（目標(biāo)突變導(dǎo)致的距離度量）

靶向模型是特定突變的背景可能性椅您，基于周圍的序列上下文以及突變本身。該模型是從數(shù)據(jù)中觀察到的突變推斷出來的寡键。然后可以將模型轉(zhuǎn)換為距離函數(shù)掀泳，以根據(jù)觀察到的突變的可能性比較給定數(shù)據(jù)集的 Ig 序列。這是通過以下步驟完成的：

Infer a substitution model, which is the likelihood of a base mutating to each other base given the microsequence context.
Infer a mutability model, which is likelihood of a given base being mutated given the microsequence context and substitution model.
Visualize the mutability model to identify hot and cold spots.
Calculate a nucleotide distance matrix based on the underlying SHM models.

看看示例

A small example AIRR Rearrangement database is included in the alakazam package. Inferring a targeting model requires the following fields (columns) to be present in the table:

sequence_id
sequence_alignment
germline_alignment_d_mask
v_call

# Load example data
library(shazam)
data(ExampleDb, package="alakazam")

# Subset to IGHG for faster usage demonstration
db <- subset(ExampleDb, c_call == "IGHG")

Infer targeting model (substitution and mutability)

用于推斷替代率的函數(shù) (createSubstitutionMatrix) 計算從給定堿基到所有其他堿基的突變數(shù)西轩，這些突變發(fā)生在數(shù)據(jù)集中所有 5 聚體基序的中心位置员舵。 createSubstitutionMatrix 的模型參數(shù)允許用戶指定是計算所有突變，還是僅計算無聲突變以推斷模型藕畔。如果樣本序列马僻、種系序列和 V 調(diào)用的列名與 Change-O 默認(rèn)值不同，它們也可以作為參數(shù)傳入注服。此外韭邓，multipleMutation 參數(shù)決定了對單個 5-mer 中多個突變的處理：independent 獨立處理每個突變措近，并完全忽略具有多個突變的 5-mers。

# Create substitution model using silent mutations
sub_model <- createSubstitutionMatrix(db, model="s", 
                                      sequenceColumn="sequence_alignment",
                                      germlineColumn="germline_alignment_d_mask",
                                      vCallColumn="v_call")

用于推斷可變性模型的函數(shù) (createMutabilityMatrix) 計算數(shù)據(jù)集所有 5 聚體基序中的突變數(shù)量仍秤，并取決于推斷的替代率熄诡。與可用于推斷替代率的參數(shù)類似的參數(shù)可用于調(diào)整此函數(shù)。

# Create mutability model using silent mutations
mut_model <- createMutabilityMatrix(db, sub_model, model="s",
                                    sequenceColumn="sequence_alignment",
                                    germlineColumn="germline_alignment_d_mask",
                                    vCallColumn="v_call")

createMutabilityMatrix 創(chuàng)建一個 MutabilityModel 類的對象诗力，該對象包含一個 1024 歸一化可變性的命名數(shù)字向量凰浮。用于估計 5 聚體變異性的沉默和替代突變的數(shù)量分別記錄在 numMutS 和 numMutR 槽中。率苇本。源槽包含一個命名向量袜茧，指示是否推斷或測量了每個 5 聚體的可變性。具有可變性值和派生源的 data.frame瓣窄。

# Number of silent mutations used for estimating 5-mer mutabilities
mut_model@numMutS
# Number of replacement mutations used for estimating 5-mer mutabilities
mut_model@numMutR
# Mutability and source as a data.frame
head(as.data.frame(mut_model))

上述函數(shù)返回的推斷替換和可變性模型僅解釋了明確的 5 聚體笛厦。但是，在某些情況下俺夕，用戶可能需要具有模糊字符的 5 聚體突變的可能性(motif)裳凸。上述每個函數(shù)都有一個相應(yīng)的函數(shù)（extendSubstitutionMatrix 和 extendMutabilityMatrix）來擴展模型，通過對所有對應(yīng)的明確 5-mers 求平均值來推斷 5-mers 和 Ns劝贸。

# Extend models to include ambiguous 5-mers
sub_model <- extendSubstitutionMatrix(sub_model)
mut_model <- extendMutabilityMatrix(mut_model)

這些擴展的替換和可變性模型可用于創(chuàng)建整體 SHM 目標(biāo)矩陣（createTargetingMatrix）姨谷，它是可變性和替換的組合概率。

# Create targeting model matrix from substitution and mutability models
tar_matrix <- createTargetingMatrix(sub_model, mut_model)

通過使用單個函數(shù) createTargetingModel 直接從數(shù)據(jù)集中推斷 TargetingModel 對象映九，可以組合上述所有步驟梦湘。同樣，用于估計 5 聚體可變性的沉默和替代突變的數(shù)量也分別記錄在 numMutS 和 numMutR slots中件甥。

Additionally, it is generally appropriate to consider the mutations within a clone only once. Consensus sequences for each clone can be generated using the collapseClones function.

# Collapse sequences into clonal consensus
clone_db <- collapseClones(db, cloneColumn="clone_id", 
                           sequenceColumn="sequence_alignment",
                           germlineColumn="germline_alignment_d_mask",
                           nproc=1)
# Create targeting model in one step using only silent mutations
# Use consensus sequence input and germline columns
model <- createTargetingModel(clone_db, model="s", sequenceColumn="clonal_sequence", 
                              germlineColumn="clonal_germline", vCallColumn="v_call")

Visualize targeting model

數(shù)據(jù)集底層 SHM 可變性模型的可視化可用于研究熱點和冷點基序捌议。可變率圖上條形的長度對應(yīng)于給定 5 聚體中給定堿基發(fā)生突變的可能性。繪圖函數(shù) plotMutability 具有參數(shù)樣式引有，用于指定 5 聚體可變率的刺猬圖（圓形）或條形圖顯示瓣颅。如果僅需要特定堿基的可變性，則可以通過核苷酸參數(shù)指定譬正。

# Generate hedgehog plot of mutability model
plotMutability(model, nucleotides="A", style="hedgehog")

圖片.png

plotMutability(model, nucleotides="C", style="hedgehog")

圖片.png

# Generate bar plot of mutability model
plotMutability(model, nucleotides="G", style="bar")

圖片.png

plotMutability(model, nucleotides="T", style="bar")

圖片.png

最后弄捕，Calculate targeting distance matrix

在 Change-O pipeline中，hs5f 克隆方法依賴于推斷的目標(biāo)模型导帝。如果用戶希望使用從他們的數(shù)據(jù)中推斷出的靶向模型來分配序列之間的距離以進行克隆分組，則必須將觀察到的 SHM 靶向率轉(zhuǎn)換為距離穿铆。 calcTargetingDistance 函數(shù)返回每個 5 聚體與中心堿基的每個相應(yīng)突變之間的距離矩陣您单。也可以使用函數(shù) writeTargetingDistance 生成此矩陣并將其直接寫入文件。

# Calculate distance matrix
dist <- calcTargetingDistance(model)

真的內(nèi)容太多了荞雏，生活很好虐秦，有你更好

最后編輯于：2021.09.14 17:04:24

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者

禁止轉(zhuǎn)載平酿，如需轉(zhuǎn)載請通過簡信或評論聯(lián)系作者。

人面猴
序言：七十年代末悦陋，一起剝皮案震驚了整個濱河市蜈彼，隨后出現(xiàn)的幾起案子，更是在濱河造成了極大的恐慌俺驶，老刑警劉巖幸逆，帶你破解...
沈念sama閱讀 206,013評論 6贊 481
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件，死亡現(xiàn)場離奇詭異暮现，居然都是意外死亡还绘，警方通過查閱死者的電腦和手機，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 88,205評論 2贊 382
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進店門栖袋，熙熙樓的掌柜王于貴愁眉苦臉地迎上來拍顷，“玉大人，你說我怎么就攤上這事塘幅∥舭福” “怎么了？”我有些...
開封第一講書人閱讀 152,370評論 0贊 342
道士緝兇錄：失蹤的賣姜人
文/不壞的土叔我叫張陵电媳，是天一觀的道長踏揣。經(jīng)常有香客問我，道長匆背，這世上最難降的妖魔是什么呼伸？我笑而不...
開封第一講書人閱讀 55,168評論 1贊 278
?港島之戀（遺憾婚禮）
正文為了忘掉前任，我火速辦了婚禮钝尸，結(jié)果婚禮上括享，老公的妹妹穿的比我還像新娘。我一直安慰自己珍促，他們只是感情好铃辖，可當(dāng)我...
茶點故事閱讀 64,153評論 5贊 371
惡毒庶女頂嫁案：這布局不是一般人想出來的
文/花漫我一把揭開白布。她就那樣靜靜地躺著猪叙，像睡著了一般娇斩。火紅的嫁衣襯著肌膚如雪。梳的紋絲不亂的頭發(fā)上穴翩，一...
開封第一講書人閱讀 48,954評論 1贊 283
城市分裂傳說
那天犬第，我揣著相機與錄音，去河邊找鬼芒帕。笑死歉嗓，一個胖子當(dāng)著我的面吹牛，可吹牛的內(nèi)容都是我干的背蟆。我是一名探鬼主播鉴分，決...
沈念sama閱讀 38,271評論 3贊 399
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開眼哮幢，長吁一口氣：“原來是場噩夢啊……” “哼！你這毒婦竟也來了志珍？” 一聲冷哼從身側(cè)響起橙垢，我...
開封第一講書人閱讀 36,916評論 0贊 259
萬榮殺人案實錄
序言：老撾萬榮一對情侶失蹤，失蹤者是張志新（化名）和其女友劉穎伦糯，沒想到半個月后柜某，有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體，經(jīng)...
沈念sama閱讀 43,382評論 1贊 300
?護林員之死
正文獨居荒郊野嶺守林人離奇死亡舔株，尸身上長有42處帶血的膿包…… 初始之章·張勛以下內(nèi)容為張勛視角年9月15日...
茶點故事閱讀 35,877評論 2贊 323
?白月光啟示錄
正文我和宋清朗相戀三年莺琳，在試婚紗的時候發(fā)現(xiàn)自己被綠了。大學(xué)時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片载慈。...
茶點故事閱讀 37,989評論 1贊 333
活死人
序言：一個原本活蹦亂跳的男人離奇死亡惭等，死狀恐怖，靈堂內(nèi)的尸體忽然破棺而出办铡，到底是詐尸還是另有隱情辞做，我是刑警寧澤，帶...
沈念sama閱讀 33,624評論 4贊 322
?日本核電站爆炸內(nèi)幕
正文年R本政府宣布寡具，位于F島的核電站秤茅，受9級特大地震影響，放射性物質(zhì)發(fā)生泄漏童叠。R本人自食惡果不足惜框喳，卻給世界環(huán)境...
茶點故事閱讀 39,209評論 3贊 307
男人毒藥：我在死后第九天來索命
文/蒙蒙一、第九天我趴在偏房一處隱蔽的房頂上張望厦坛。院中可真熱鬧五垮，春花似錦、人聲如沸杜秸。這莊子的主人今日做“春日...
開封第一講書人閱讀 30,199評論 0贊 19
一樁弒父案，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽撬碟。三九已至诞挨，卻和暖如春，著一層夾襖步出監(jiān)牢的瞬間呢蛤，已是汗流浹背惶傻。一陣腳步聲響...
開封第一講書人閱讀 31,418評論 1贊 260
情欲美人皮
我被黑心中介騙來泰國打工，沒想到剛下飛機就差點兒被人妖公主榨干…… 1. 我叫王不留其障，地道東北人达罗。一個月前我還...
沈念sama閱讀 45,401評論 2贊 352
代替公主和親
正文我出身青樓，卻偏偏與公主長得像，于是被迫代替她去往敵國和親粮揉。傳聞我的和親對象是個殘疾皇子，可洞房花燭夜當(dāng)晚...
茶點故事閱讀 42,700評論 2贊 345

10X單細(xì)胞（10X空間轉(zhuǎn)錄組）BCR（TCR）數(shù)據(jù)分析之(10)changeo

hello县钥，這一篇我們來對軟件changeo進行補充利职，很多細(xì)節(jié)的內(nèi)容，我們需要深入知道一下。

首先第一個AIRR Community Rearrangement揖膜，這個直譯過來就是gene重排，我們來看一下

One of the primary initiatives of the Adaptive Immune Receptor Repertoire (AIRR) Community has been to develop a set of metadata standards for the submission of AIRR sequencing datasets. （說白了梅桩，獲得性免疫的主要結(jié)構(gòu)是什么）

當(dāng)然次氨，這個地方還是需要更多的注釋。我們簡單了解一下即可摘投。

第二部分煮寡，BCR聚類閾值的選擇

首先是Distance to nearest neighbor

估計用于劃分克隆相關(guān)序列的最佳距離閾值是通過計算從數(shù)據(jù)集中每個序列到其最近鄰居的距離并在所得雙峰分布中找到將克隆相關(guān)序列與無關(guān)序列分開的斷點來完成的。 這是通過以下步驟完成的：

我們看一下distance是怎么計算的

Calculating the nearest neighbor distances requires the following fields (columns) to be present in the table:

Calculating nearest neighbor distances (heavy chain sequences)(最為核心的部分)

Note: Human and mouse distance measures that are backward compatible with SHazaM v0.1.4 and Change-O v0.3.3 are also provided as hs1f_compat and m1n_compat, respectively.

對于不對稱的模型（例如掰曾，A 到 B 的距離不等于 B 到 A 的距離）旭蠕，有一個對稱參數(shù)允許用戶指定是使用兩個距離的平均值還是最小值來確定 總距離。

Calculating nearest neighbor distances (single-cell paired heavy and light chain sequences)

There is also a choice of whether grouping should be done using IGH (BCR) or TRB/TRD (TCR) sequences only, or using both IGH and IGK/IGL (BCR) or TRB/TRD and TRA/TRG (TCR) sequences. This is governed by onlyHeavy.(這個單鏈做更好一點)

Using nearest neighbor distances to determine clonal assignment thresholds（選擇閾值）

可以通過檢查最近鄰直方圖或使用 findThreshold 函數(shù)提供的自動閾值檢測算法之一來手動確定閾值绑莺。 可用的方法是密度（平滑密度）和 gmm（伽馬/高斯混合模型）暖眼，并通過 findThreshold 的方法參數(shù)進行選擇。

通過人工檢查確定閾值

手動閾值檢測只涉及為 distToNearest 輸出的 dist_nearest 列中的值生成直方圖纺裁，并在兩種模式之間的valley內(nèi)選擇合適的值诫肠。

By manual inspection, the length normalized ham model distance threshold would be set to a value near 0.12 in the above example.

In this example, the unnormalized hh_s5f model distance threshold would be set to a value near 7

Automated threshold detection via smoothed density(機器選擇)

The density method will look for the minimum in the valley between two modes of a smoothed distribution based on the input vector (distances), which will generally be the dist_nearest column from the distToNearest output. Below is an example of using the densitymethod for threshold detection.

Automated threshold detection via a mixture model

在下面的示例中丛肢，混合模型方法 (method="gmm") 用于通過擬合兩個伽馬分布 (model="gamma-gamma") 來找到用于分離克隆相關(guān)序列的最佳閾值围肥。 下圖中的紅色虛線定義了靈敏度和特異性的平均值達到最大值的距離。

注意：由 plotGmmThreshold 繪制的直方圖的形狀由 binwidth 參數(shù)控制蜂怎。 意思是穆刻，bin 大小的任何變化都會改變分布的形式，而 gmm 方法完全獨立于 bin 大小派敷，只涉及實際輸入數(shù)據(jù)蛹批。

Calculating nearest neighbor distances independently for subsets of data(相當(dāng)于亞群分析)

The fields argument to distToNearest will split the input data.frame into groups based on values in the specified fields (columns) and will treat them independently. For example, if the input data has multiple samples, then fields="sample_id" would allow each sample to be analyzed separately.

In the previous examples we used a subset of the original example data. In the following example, we will use the two available samples,-1h and +7d, and will set fields="sample_id". This will reproduce previous results for sample -1h and add results for sample +7d.

We can plot the nearest neighbor distances for the two samples:

In this case, the threshold selected for -1h seems to work well for +7d as well.

Calculating nearest neighbor distances across groups rather than within a groups(相當(dāng)于多樣本聯(lián)合分析)

This can provide a sense of overlap between samples or a way to compare within-sample variation to cross-sample variation.

大樣本數(shù)據(jù)的處理泣刹，Speeding up pairwise-distance-matrix calculations with subsampling

The subsample option in distToNearest allows to speed up calculations and reduce memory usage.

第三，就是上面提到的SHM模型犀被，SHM targeting models（目標(biāo)突變導(dǎo)致的距離度量）

看看示例

A small example AIRR Rearrangement database is included in the alakazam package. Inferring a targeting model requires the following fields (columns) to be present in the table:

Infer targeting model (substitution and mutability)

用于推斷可變性模型的函數(shù) (createMutabilityMatrix) 計算數(shù)據(jù)集所有 5 聚體基序中的突變數(shù)量仍秤，并取決于推斷的替代率熄诡。 與可用于推斷替代率的參數(shù)類似的參數(shù)可用于調(diào)整此函數(shù)。

這些擴展的替換和可變性模型可用于創(chuàng)建整體 SHM 目標(biāo)矩陣（createTargetingMatrix）姨谷，它是可變性和替換的組合概率。

通過使用單個函數(shù) createTargetingModel 直接從數(shù)據(jù)集中推斷 TargetingModel 對象映九，可以組合上述所有步驟梦湘。 同樣，用于估計 5 聚體可變性的沉默和替代突變的數(shù)量也分別記錄在 numMutS 和 numMutR slots中件甥。

Additionally, it is generally appropriate to consider the mutations within a clone only once. Consensus sequences for each clone can be generated using the collapseClones function.

Visualize targeting model

最后弄捕，Calculate targeting distance matrix

真的內(nèi)容太多了荞雏，生活很好虐秦，有你更好

推薦閱讀更多精彩內(nèi)容

估計用于劃分克隆相關(guān)序列的最佳距離閾值是通過計算從數(shù)據(jù)集中每個序列到其最近鄰居的距離并在所得雙峰分布中找到將克隆相關(guān)序列與無關(guān)序列分開的斷點來完成的。這是通過以下步驟完成的：

對于不對稱的模型（例如掰曾，A 到 B 的距離不等于 B 到 A 的距離）旭蠕，有一個對稱參數(shù)允許用戶指定是使用兩個距離的平均值還是最小值來確定總距離。

可以通過檢查最近鄰直方圖或使用 findThreshold 函數(shù)提供的自動閾值檢測算法之一來手動確定閾值绑莺。可用的方法是密度（平滑密度）和 gmm（伽馬/高斯混合模型）暖眼，并通過 findThreshold 的方法參數(shù)進行選擇。

在下面的示例中丛肢，混合模型方法 (method="gmm") 用于通過擬合兩個伽馬分布 (model="gamma-gamma") 來找到用于分離克隆相關(guān)序列的最佳閾值围肥。下圖中的紅色虛線定義了靈敏度和特異性的平均值達到最大值的距離。

注意：由 plotGmmThreshold 繪制的直方圖的形狀由 binwidth 參數(shù)控制蜂怎。意思是穆刻，bin 大小的任何變化都會改變分布的形式，而 gmm 方法完全獨立于 bin 大小派敷，只涉及實際輸入數(shù)據(jù)蛹批。

In the previous examples we used a subset of the original example data. In the following example, we will use the two available samples,`-1h` and `+7d`, and will set fields="sample_id". This will reproduce previous results for sample -1h and add results for sample +7d.

The `subsample` option in `distToNearest` allows to speed up calculations and reduce memory usage.

用于推斷可變性模型的函數(shù) (createMutabilityMatrix) 計算數(shù)據(jù)集所有 5 聚體基序中的突變數(shù)量仍秤，并取決于推斷的替代率熄诡。與可用于推斷替代率的參數(shù)類似的參數(shù)可用于調(diào)整此函數(shù)。

通過使用單個函數(shù) createTargetingModel 直接從數(shù)據(jù)集中推斷 TargetingModel 對象映九，可以組合上述所有步驟梦湘。同樣，用于估計 5 聚體可變性的沉默和替代突變的數(shù)量也分別記錄在 numMutS 和 numMutR slots中件甥。