hello县钥,這一篇我們來對軟件changeo進行補充利职,很多細(xì)節(jié)的內(nèi)容,我們需要深入知道一下。
One of the primary initiatives of the Adaptive Immune Receptor Repertoire (AIRR) Community has been to develop a set of metadata standards for the submission of AIRR sequencing datasets. (說白了梅桩,獲得性免疫的主要結(jié)構(gòu)是什么)
This work has been carried out by the AIRR Community Minimal Standards Working Group. In order to support reproducibility, standard quality control, and data deposition in a common repository, the AIRR Community has agreed to six high-level data sets that will guide the publication, curation and sharing of AIRR-Seq data and metadata: Study and subject, sample collection, sample processing and sequencing, raw sequences, processing of sequence data, and processed AIRR sequences. The detailed data elements within these sets are defined here (Download as TSV).
當(dāng)然次氨,這個地方還是需要更多的注釋。我們簡單了解一下即可摘投。
第二部分煮寡,BCR聚類閾值的選擇
首先是Distance to nearest neighbor
估計用于劃分克隆相關(guān)序列的最佳距離閾值是通過計算從數(shù)據(jù)集中每個序列到其最近鄰居的距離并在所得雙峰分布中找到將克隆相關(guān)序列與無關(guān)序列分開的斷點來完成的。 這是通過以下步驟完成的:
- Calculating of the nearest neighbor distances for each sequence.
- Generating a histogram of the nearest neighbor distances followed by either manual inspect for the threshold separating the two modes or automated threshold detection.
我們看一下distance是怎么計算的
Calculating the nearest neighbor distances requires the following fields (columns) to be present in the table:
- sequence_id
- v_call
- j_call
- junction
- junction_length
# Subset example data to one sample
library(shazam)
data(ExampleDb, package="alakazam")
Calculating nearest neighbor distances (heavy chain sequences)(最為核心的部分)
By default, distToNearest, the function for calculating distance between every sequence and its nearest neighbor, assumes that it is running under non-single-cell mode and that every input sequence is a heavy chain sequence and will be used for calculation(計算距離只采用了重鏈序列犀呼,TCR也是只采用了β鏈的CDR3)幸撕,需要一些參數(shù)來調(diào)整distance的measure方式。如果使用 tigger
包(也是一個很經(jīng)典的BCR分析軟件外臂,這個我們下一個分享)中的方法推斷出基因型坐儿,并且已將 v_call_genotyped
添加到數(shù)據(jù)庫中,則可以通過指定 vCallColumn
參數(shù)使用此列代替默認(rèn)的 v_call
列宋光。這將允許使用來自 tigger
的更準(zhǔn)確的 V call來對序列進行分組貌矿。此外,為了更“寬松”地處理不明確的 V(D)J 段調(diào)用罪佳,可以將參數(shù) first
設(shè)置為 FALSE
逛漫。設(shè)置 first=FALSE
將使用所有可能基因的聯(lián)合來分組序列,而不是 field中的第一個基因赘艳。modal
參數(shù)確定使用哪個基礎(chǔ) SHM 模型來計算距離酌毡。The default model is single nucleotide Hamming distance with gaps considered as a match to any nucleotide (ham). 其他選項包括類似于轉(zhuǎn)換/顛換模型 (hh_s1f) 的人類 Ig 特異性單核苷酸模型和來自 Yaari 等人,2013 年 (hh_s5f) 的相應(yīng) 5 聚體上下文模型蕾管,來自 Cui 等人的一對類似的小鼠特異性模型, 2016 (mk_rs1nf 和 mk_rs5nf)枷踏,以及氨基酸漢明距離 (aa)。
Note: Human and mouse distance measures that are backward compatible with SHazaM v0.1.4 and Change-O v0.3.3 are also provided as hs1f_compat and m1n_compat, respectively.
對于不對稱的模型(例如掰曾,A 到 B 的距離不等于 B 到 A 的距離)旭蠕,有一個對稱參數(shù)允許用戶指定是使用兩個距離的平均值還是最小值來確定 總距離。
# Use nucleotide Hamming distance and normalize by junction length
dist_ham <- distToNearest(ExampleDb, sequenceColumn="junction",
vCallColumn="v_call_genotyped", jCallColumn="j_call",
model="ham", normalize="len", nproc=1)
# Use genotyped V assignments, a 5-mer model and no normalization
dist_s5f <- distToNearest(ExampleDb, sequenceColumn="junction",
vCallColumn="v_call_genotyped", jCallColumn="j_call",
model="hh_s5f", normalize="none", nproc=1)
Calculating nearest neighbor distances (single-cell paired heavy and light chain sequences)
distToNearest
函數(shù)還支持在單細(xì)胞模式下運行旷坦,其中提供了包含單細(xì)胞配對 IGH:IGK/IGL掏熬、TRB:TRA 或 TRD:TRG 鏈序列的輸入 Example10x。 在這種情況下塞蹭,默認(rèn)情況下孽江,細(xì)胞首先被分成包含相同重/長鏈(IGH、TRB番电、TRD)V 基因和 J 基因(如果指定岗屏,連接長度)和相同輕/短鏈(IGK , IGL, TRA, TRG) V 基因和 J 基因(如果指定辆琅,連接長度)。 然后这刷,僅使用重鏈序列來計算最近鄰距離婉烟。
在單細(xì)胞模式下,輸入 Example10x
的每一行都應(yīng)該代表一個序列/鏈暇屋。 來自同一cell的序列/鏈由 cellIdColumn
列中的cell ID 鏈接似袁。 請注意,一個cell應(yīng)該恰好有一個 IGH 序列 (BCR) 或 TRB/TRD (TCR)咐刨。 locusColumn
列中的值必須是 IGH昙衅、IGI、IGK 或 IGL (BCR) 或 TRA定鸟、TRB而涉、TRD 或 TRG (TCR) 之一。 要調(diào)用單細(xì)胞模式联予,必須指定 cellIdColumn
并且 locusColumn
必須正確啼县。
可以選擇分組是作為一個階段的過程還是兩個階段的過程來完成。 這可以通過 VJthenLen
指定沸久。 在一個階段的過程中(VJthenLen=FALSE)季眷,細(xì)胞被分成包含相同重/長鏈V基因、J基因和連接長度(V-J-長度組合)和相同輕鏈V-J-長度組合的分區(qū)卷胯。 在兩階段過程中(VJthenLen=TRUE)子刮,cells are first divided by heavy/long chain V gene and J gene (V-J combination), and light/short chain V-J combination; 然后通過相應(yīng)的junction lengths诵竭。
There is also a choice of whether grouping should be done using IGH (BCR) or TRB/TRD (TCR) sequences only, or using both IGH and IGK/IGL (BCR) or TRB/TRD and TRA/TRG (TCR) sequences. This is governed by onlyHeavy.(這個單鏈做更好一點)
# Single-cell mode
# Group cells in a one-stage process (VJthenLen=FALSE) and using
# both heavy and light chain sequences (onlyHeavy=FALSE)
data(Example10x, package="alakazam")
dist_sc <- distToNearest(Example10x, cellIdColumn="cell_id", locusColumn="locus",
VJthenLen=FALSE, onlyHeavy=FALSE)
無論是僅使用重鏈序列進行分組话告,還是使用重鏈和輕鏈序列進行分組,都將僅使用重鏈序列來計算最近鄰距離卵慰。 因此,在單細(xì)胞模式下佛呻,返回的 data.frame
中輕鏈序列對應(yīng)的行將在 dist_nearest
字段中具有 NA
裳朋。 (這個不重要)。
Using nearest neighbor distances to determine clonal assignment thresholds(選擇閾值)
SHazaM 中距離最近計算的主要用途是使用 Change-O 中的 DefineClones
工具確定克隆分配的最佳閾值吓著。 定義閾值依賴于區(qū)分克隆相關(guān)序列(由具有近鄰的序列表示)和單例(沒有近鄰的序列)鲤嫡,后者在最近鄰距離直方圖中顯示為兩種模式。
可以通過檢查最近鄰直方圖或使用 findThreshold 函數(shù)提供的自動閾值檢測算法之一來手動確定閾值绑莺。 可用的方法是密度(平滑密度)和 gmm(伽馬/高斯混合模型)暖眼,并通過 findThreshold 的方法參數(shù)進行選擇。
通過人工檢查確定閾值
手動閾值檢測只涉及為 distToNearest 輸出的 dist_nearest 列中的值生成直方圖纺裁,并在兩種模式之間的valley內(nèi)選擇合適的值诫肠。
# Generate Hamming distance histogram
library(ggplot2)
p1 <- ggplot(subset(dist_ham, !is.na(dist_nearest)),
aes(x=dist_nearest)) +
theme_bw() +
xlab("Hamming distance") +
ylab("Count") +
scale_x_continuous(breaks=seq(0, 1, 0.1)) +
geom_histogram(color="white", binwidth=0.02) +
geom_vline(xintercept=0.12, color="firebrick", linetype=2)
plot(p1)
By manual inspection, the length normalized ham model distance threshold would be set to a value near 0.12 in the above example.
# Generate HH_S5F distance histogram
p2 <- ggplot(subset(dist_s5f, !is.na(dist_nearest)),
aes(x=dist_nearest)) +
theme_bw() +
xlab("HH_S5F distance") +
ylab("Count") +
scale_x_continuous(breaks=seq(0, 50, 5)) +
geom_histogram(color="white", binwidth=1) +
geom_vline(xintercept=7, color="firebrick", linetype=2)
plot(p2)
In this example, the unnormalized hh_s5f model distance threshold would be set to a value near 7
Automated threshold detection via smoothed density(機器選擇)
The density
method will look for the minimum in the valley between two modes of a smoothed distribution based on the input vector (distances
), which will generally be the dist_nearest
column from the distToNearest
output. Below is an example of using the density
method for threshold detection.
# Find threshold using density method
output <- findThreshold(dist_ham$dist_nearest, method="density")
threshold <- output@threshold
# Plot distance histogram, density estimate and optimum threshold
plot(output, title="Density Method")
# Print threshold
print(output)
## [1] 0.1738391
Automated threshold detection via a mixture model
findThreshold 函數(shù)包括用于自動確定克隆分配閾值的方法司澎。 findThreshold (method="gmm") 的“gmm”方法(伽馬/高斯混合方法)使用單變量密度分布函數(shù)的四種組合之一對最近距離分布執(zhí)行最大似然擬合程序:“norm- norm”(兩個高斯分布)、“norm-gamma”(下高斯和上伽馬分布)栋豫、“gamma-norm”(下伽馬和上高斯分布)和“gamma-gamma”(兩個伽馬分布)挤安。 默認(rèn)情況下,將通過計算靈敏度和特異性的平均值達到其最大值的距離來選擇閾值(cutoff="optimal")丧鸯。 還提供了替代閾值選擇標(biāo)準(zhǔn)蛤铜,包括曲線交點 (cutoff="intersect")、用戶定義的靈敏度 (cutoff="user", sen=x) 或用戶定義的特異性 (cutoff="user", spc=x)
在下面的示例中丛肢,混合模型方法 (method="gmm") 用于通過擬合兩個伽馬分布 (model="gamma-gamma") 來找到用于分離克隆相關(guān)序列的最佳閾值围肥。 下圖中的紅色虛線定義了靈敏度和特異性的平均值達到最大值的距離。
# Find threshold using gmm method
output <- findThreshold(dist_ham$dist_nearest, method="gmm", model="gamma-gamma")
# Plot distance histogram, Gaussian fits, and optimum threshold
plot(output, binwidth=0.02, title="GMM Method: gamma-gamma")
# Print threshold
print(output)
## [1] 0.1221371
注意:由 plotGmmThreshold 繪制的直方圖的形狀由 binwidth 參數(shù)控制蜂怎。 意思是穆刻,bin 大小的任何變化都會改變分布的形式,而 gmm 方法完全獨立于 bin 大小派敷,只涉及實際輸入數(shù)據(jù)蛹批。
Calculating nearest neighbor distances independently for subsets of data(相當(dāng)于亞群分析)
The fields
argument to distToNearest
will split the input data.frame
into groups based on values in the specified fields (columns) and will treat them independently. For example, if the input data has multiple samples, then fields="sample_id"
would allow each sample to be analyzed separately.
In the previous examples we used a subset of the original example data. In the following example, we will use the two available samples,-1h
and +7d
, and will set fields="sample_id". This will reproduce previous results for sample -1h and add results for sample +7d.
dist_fields <- distToNearest(ExampleDb, model="ham", normalize="len",
fields="sample_id", nproc=1)
We can plot the nearest neighbor distances for the two samples:
# Generate grouped histograms
p4 <- ggplot(subset(dist_fields, !is.na(dist_nearest)),
aes(x=dist_nearest)) +
theme_bw() +
xlab("Grouped Hamming distance") +
ylab("Count") +
geom_histogram(color="white", binwidth=0.02) +
geom_vline(xintercept=0.12, color="firebrick", linetype=2) +
facet_grid(sample_id ~ ., scales="free_y")
plot(p4)
In this case, the threshold selected for -1h seems to work well for +7d as well.
Calculating nearest neighbor distances across groups rather than within a groups(相當(dāng)于多樣本聯(lián)合分析)
指定 distToNearest
的cross
參數(shù)會強制跨組執(zhí)行距離計算(多樣本一起計算),這樣每個序列的最近鄰居將始終是不同組中的序列篮愉。 在下面的示例中腐芍,我們設(shè)置 cross="sample",它將數(shù)據(jù)分組為 -1h 和 +7d 樣本子集试躏。 因此猪勇,樣本-1h 中序列的最近鄰距離將被限制為樣本+7d 中的最近序列,反之亦然颠蕴。
dist_cross <- distToNearest(ExampleDb, sequenceColumn="junction",
vCallColumn="v_call_genotyped", jCallColumn="j_call",
model="ham", first=FALSE,
normalize="len", cross="sample_id", nproc=1)
# Generate cross sample histograms
p5 <- ggplot(subset(dist_cross, !is.na(cross_dist_nearest)),
aes(x=cross_dist_nearest)) +
theme_bw() +
xlab("Cross-sample Hamming distance") +
ylab("Count") +
geom_histogram(color="white", binwidth=0.02) +
geom_vline(xintercept=0.12, color="firebrick", linetype=2) +
facet_grid(sample_id ~ ., scales="free_y")
plot(p5)
This can provide a sense of overlap between samples or a way to compare within-sample variation to cross-sample variation.
大樣本數(shù)據(jù)的處理泣刹,Speeding up pairwise-distance-matrix calculations with subsampling
The subsample
option in distToNearest
allows to speed up calculations and reduce memory usage.
If there are very large groups of sequences that share V call, J call and junction length, distToNearest
will need a lot of memory and it will take a long time to calculate all the distances. Without subsampling, in a large group of n=70,000 sequences distToNearest
calculates a nn distance matrix. With subsampling, e.g. to s=15,000, the distance matrix for the same group has size sn, and for each sequence in db, the distance value is calculated by comparing the sequence to the subsampled sequences from the same V-J-junction length group.
# Explore V-J-junction length groups sizes to use subsample
# Show the size of the largest groups
library(dplyr)
library(alakazam)
top_10_sizes <- ExampleDb %>%
group_by(junction_length) %>% # Group by junction length
do(alakazam::groupGenes(., first=TRUE)) %>% # Group by V and J call
mutate(GROUP_ID=paste(junction_length, vj_group, sep="_")) %>% # Create group ids
ungroup() %>%
group_by(GROUP_ID) %>% # Group by GROUP_ID
distinct(junction) %>% # Vount unique junctions per group
summarize(SIZE=n()) %>% # Get the size of the group
arrange(desc(SIZE)) %>% # Sort by decreasing size
select(SIZE) %>%
top_n(10) # Filter to the top 10
# Use 30 to subsample
# NOTE: This is a toy example. Subsampling to 30 sequence with real data is unwise
dist <- distToNearest(ExampleDb, sequenceColumn="junction",
vCallColumn="v_call_genotyped", jCallColumn="j_call",
model="ham",
first=FALSE, normalize="len",
subsample=30)
第三,就是上面提到的SHM模型犀被,SHM targeting models(目標(biāo)突變導(dǎo)致的距離度量)
靶向模型是特定突變的背景可能性椅您,基于周圍的序列上下文以及突變本身。 該模型是從數(shù)據(jù)中觀察到的突變推斷出來的寡键。 然后可以將模型轉(zhuǎn)換為距離函數(shù)掀泳,以根據(jù)觀察到的突變的可能性比較給定數(shù)據(jù)集的 Ig 序列。 這是通過以下步驟完成的:
- Infer a substitution model, which is the likelihood of a base mutating to each other base given the microsequence context.
- Infer a mutability model, which is likelihood of a given base being mutated given the microsequence context and substitution model.
- Visualize the mutability model to identify hot and cold spots.
- Calculate a nucleotide distance matrix based on the underlying SHM models.
看看示例
A small example AIRR Rearrangement database is included in the alakazam package. Inferring a targeting model requires the following fields (columns) to be present in the table:
- sequence_id
- sequence_alignment
- germline_alignment_d_mask
- v_call
# Load example data
library(shazam)
data(ExampleDb, package="alakazam")
# Subset to IGHG for faster usage demonstration
db <- subset(ExampleDb, c_call == "IGHG")
Infer targeting model (substitution and mutability)
用于推斷替代率的函數(shù) (createSubstitutionMatrix) 計算從給定堿基到所有其他堿基的突變數(shù)西轩,這些突變發(fā)生在數(shù)據(jù)集中所有 5 聚體基序的中心位置员舵。 createSubstitutionMatrix 的模型參數(shù)允許用戶指定是計算所有突變,還是僅計算無聲突變以推斷模型藕畔。 如果樣本序列马僻、種系序列和 V 調(diào)用的列名與 Change-O 默認(rèn)值不同,它們也可以作為參數(shù)傳入注服。 此外韭邓,multipleMutation 參數(shù)決定了對單個 5-mer 中多個突變的處理:independent 獨立處理每個突變措近,并完全忽略具有多個突變的 5-mers。
# Create substitution model using silent mutations
sub_model <- createSubstitutionMatrix(db, model="s",
sequenceColumn="sequence_alignment",
germlineColumn="germline_alignment_d_mask",
vCallColumn="v_call")
用于推斷可變性模型的函數(shù) (createMutabilityMatrix) 計算數(shù)據(jù)集所有 5 聚體基序中的突變數(shù)量仍秤,并取決于推斷的替代率熄诡。 與可用于推斷替代率的參數(shù)類似的參數(shù)可用于調(diào)整此函數(shù)。
# Create mutability model using silent mutations
mut_model <- createMutabilityMatrix(db, sub_model, model="s",
sequenceColumn="sequence_alignment",
germlineColumn="germline_alignment_d_mask",
vCallColumn="v_call")
createMutabilityMatrix 創(chuàng)建一個 MutabilityModel 類的對象诗力,該對象包含一個 1024 歸一化可變性的命名數(shù)字向量凰浮。 用于估計 5 聚體變異性的沉默和替代突變的數(shù)量分別記錄在 numMutS 和 numMutR 槽中。 率苇本。 源槽包含一個命名向量袜茧,指示是否推斷或測量了每個 5 聚體的可變性。 具有可變性值和派生源的 data.frame瓣窄。
# Number of silent mutations used for estimating 5-mer mutabilities
mut_model@numMutS
# Number of replacement mutations used for estimating 5-mer mutabilities
mut_model@numMutR
# Mutability and source as a data.frame
head(as.data.frame(mut_model))
上述函數(shù)返回的推斷替換和可變性模型僅解釋了明確的 5 聚體笛厦。 但是,在某些情況下俺夕,用戶可能需要具有模糊字符的 5 聚體突變的可能性(motif)裳凸。 上述每個函數(shù)都有一個相應(yīng)的函數(shù)(extendSubstitutionMatrix 和 extendMutabilityMatrix)來擴展模型,通過對所有對應(yīng)的明確 5-mers 求平均值來推斷 5-mers 和 Ns劝贸。
# Extend models to include ambiguous 5-mers
sub_model <- extendSubstitutionMatrix(sub_model)
mut_model <- extendMutabilityMatrix(mut_model)
這些擴展的替換和可變性模型可用于創(chuàng)建整體 SHM 目標(biāo)矩陣(createTargetingMatrix)姨谷,它是可變性和替換的組合概率。
# Create targeting model matrix from substitution and mutability models
tar_matrix <- createTargetingMatrix(sub_model, mut_model)
通過使用單個函數(shù) createTargetingModel 直接從數(shù)據(jù)集中推斷 TargetingModel 對象映九,可以組合上述所有步驟梦湘。 同樣,用于估計 5 聚體可變性的沉默和替代突變的數(shù)量也分別記錄在 numMutS 和 numMutR slots中件甥。
Additionally, it is generally appropriate to consider the mutations within a clone only once. Consensus sequences for each clone can be generated using the collapseClones function.
# Collapse sequences into clonal consensus
clone_db <- collapseClones(db, cloneColumn="clone_id",
sequenceColumn="sequence_alignment",
germlineColumn="germline_alignment_d_mask",
nproc=1)
# Create targeting model in one step using only silent mutations
# Use consensus sequence input and germline columns
model <- createTargetingModel(clone_db, model="s", sequenceColumn="clonal_sequence",
germlineColumn="clonal_germline", vCallColumn="v_call")
Visualize targeting model
數(shù)據(jù)集底層 SHM 可變性模型的可視化可用于研究熱點和冷點基序捌议。 可變率圖上條形的長度對應(yīng)于給定 5 聚體中給定堿基發(fā)生突變的可能性。 繪圖函數(shù) plotMutability 具有參數(shù)樣式引有,用于指定 5 聚體可變率的刺猬圖(圓形)或條形圖顯示瓣颅。 如果僅需要特定堿基的可變性,則可以通過 核苷酸參數(shù)指定譬正。
# Generate hedgehog plot of mutability model
plotMutability(model, nucleotides="A", style="hedgehog")
plotMutability(model, nucleotides="C", style="hedgehog")
# Generate bar plot of mutability model
plotMutability(model, nucleotides="G", style="bar")
plotMutability(model, nucleotides="T", style="bar")
最后弄捕,Calculate targeting distance matrix
在 Change-O pipeline中,hs5f 克隆方法依賴于推斷的目標(biāo)模型导帝。 如果用戶希望使用從他們的數(shù)據(jù)中推斷出的靶向模型來分配序列之間的距離以進行克隆分組,則必須將觀察到的 SHM 靶向率轉(zhuǎn)換為距離穿铆。 calcTargetingDistance 函數(shù)返回每個 5 聚體與中心堿基的每個相應(yīng)突變之間的距離矩陣您单。 也可以使用函數(shù) writeTargetingDistance 生成此矩陣并將其直接寫入文件。
# Calculate distance matrix
dist <- calcTargetingDistance(model)
真的內(nèi)容太多了荞雏,生活很好虐秦,有你更好