hello赋元,大家好,前面分享了很多有關(guān)TCR的數(shù)據(jù)分析反砌,今天我們來繼續(xù)這個專題,今天的分享都是為后續(xù)最終的目的做準(zhǔn)備,現(xiàn)在還都只是中間過程萌朱,好了宴树,我們今天我分享的軟件是changeo,大家可以參考文章Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data晶疼,2015年便發(fā)表了酒贬,而我們分享這個的目的在于集成這個軟件的算法,達(dá)到我們最后這個專題的目的翠霍,這個軟件主要關(guān)注在BCR的數(shù)據(jù)分析锭吨,我們來看看這個軟件的算法和原理。
背景
高通量測序技術(shù)的進(jìn)步現(xiàn)在允許對 B 細(xì)胞受體 (BCR) 和 T 細(xì)胞受體 (TCR) 庫進(jìn)行大規(guī)模表征寒匙。 適應(yīng)性免疫受體庫 (AIRR) 的高種系和體細(xì)胞多樣性對具有生物學(xué)意義的分析提出了挑戰(zhàn)——需要開發(fā)專門的計算方法零如。
Imcantation 框架為高通量 AIRR-seq 數(shù)據(jù)集提供了一個從頭到尾的分析生態(tài)系統(tǒng)。 從原始讀取開始,提供 Python 和 R 包用于預(yù)處理考蕾、群體結(jié)構(gòu)確定和repertoire分析祸憋。 (這個我們當(dāng)然喜聞樂見)。
核心軟件(軟件非常的多肖卧,這個專題我們也要持續(xù)很久)夺衍。
Contributed Packages
當(dāng)然,時至今日喜命,我們只關(guān)心軟件對于10X數(shù)據(jù)的分析(限于個人背景)沟沙,我們來看看軟件的處理和結(jié)果。
1壁榕、Converting 10X V(D)J data into the AIRR Community standardized format
To process 10X V(D)J data, a combination of AssignGenes.py and MakeDb.py(這是兩個集成的分析模塊矛紫,作者已經(jīng)封裝好,這會兒我們只需要知道其作用牌里,后續(xù)會做詳細(xì)的分享) can be used to generate a TSV file compliant with the AIRR Community Rearrangement(當(dāng)然這里又是一個集成的算法颊咬,我們也需要了解,后續(xù)我們詳細(xì)的分享牡辽,先了解一下概念喳篇,重排是描述重排的適應(yīng)性免疫受體鏈(例如,抗體重鏈或 TCR β 鏈)以及大量注釋的序列态辛。 這些注釋由 AIRR Rearrangement 模式定義麸澜,包括八個類別。 ) schema that incorporates annotation information provided by the Cell Ranger pipeline. The --10x filtered_contig_annotations.csv
specifies the path of the contig annotations file generated by cellranger vdj
, which can be found in the outs
directory.免疫真的很難奏黑,需要耐心和時間炊邦。
Generate AIRR Rearrangement data from the 10X V(D)J FASTA files using the steps below:
AssignGenes.py igblast -s filtered_contig.fasta -b ~/share/igblast \
--organism human --loci ig --format blast
MakeDb.py igblast -i filtered_contig_igblast.fmt7 -s filtered_contig.fasta \
-r IMGT_Human_*.fasta --10x filtered_contig_annotations.csv --extended
- 注:all_contig.fasta can be exchanged for filtered_contig.fasta, and all_contig_annotations.csv can be exchanged for filtered_contig_annotations.csv.
一些參數(shù)的轉(zhuǎn)換
- To process mouse data and/or TCR data alter the
--organism
and--loci
arguments to AssignGenes.py accordingly (e.g.,--organism mouse
,--loci tcr
) and use the appropriate V, D and J IMGT reference databases (e.g.,IMGT_Mouse_TR*.fasta
)
2、 Identifying clones from B cells in AIRR formatted 10X V(D)J data
Splitting into separate light and heavy chain files(這個地方跟之前TCR的處理類似熟史,但是方式不一樣)馁害。
要將 B 細(xì)胞從 AIRR 重排數(shù)據(jù)分組為克隆,必須將 MakeDb.py 的輸出解析為輕鏈文件和重鏈文件:
ParseDb.py select -d 10x_igblast_db-pass.tsv -f locus -u "IGH" \
--logic all --regex --outname heavy
ParseDb.py select -d 10x_igblast_db-pass.tsv -f locus -u "IG[LK]" \
--logic all --regex --outname light
接下來就是要進(jìn)行數(shù)據(jù)的過濾了蹂匹,這個過濾和我們的普通轉(zhuǎn)錄組差別巨大碘菜。
The ParseDb.py(又是一個集成模塊) tool provides a basic set of operations for manipulating Change-O database files from the commandline, including removing or updating rows and columns.
Removing non-productive sequences
After building a Change-O database from either IMGT/HighV-QUEST or IgBLAST output, you may wish to subset your data to only productive sequences. This can be done in one of two roughly equivalent ways using the ParseDb.py tool:(挑選有用的數(shù)據(jù))
ParseDb.py select -d HD13M_db-pass.tsv -f productive -u T
ParseDb.py split -d HD13M_db-pass.tsv -f productive
上面的第一行使用 select 子命令輸出標(biāo)記為 parse-select 的單個文件,該文件僅包含productive
列 (-f productive) 中值為 T (-u T) 的記錄限寞。
或者忍啸,上面的第二行使用 split 子命令輸出多個文件,每個文件包含具有productive
列中找到的值之一的記錄 (-f productive)昆烁。 這將生成標(biāo)記為productive-T
和productive-F
的兩個文件吊骤。
消除 C 區(qū)引物和參考比對之間的分歧
If you have data that includes both heavy and light chains in the same library, the V-segment and J-segment alignments from IMGT/HighV-QUEST or IgBLAST may not always agree with the isotype assignments from the C-region primers. In these cases, you can filter out such reads with the select subcommand of ParseDb.py. An example function call using an imaginary file db.tsv
is provided below:
ParseDb.py select -d db.tsv -f v_call j_call c_call -u "IGH" \
--logic all --regex --outname heavy
ParseDb.py select -d db.tsv -f v_call j_call c_call -u "IG[LK]" \
--logic all --regex --outname light
這些命令將要求所有 v_call缎岗、j_call 和 c_call 區(qū)域(-f v_call j_call c_call 和 --logic all)包含字符串 IGH(第 1-2 行)或 IGK 或 IGL(第 3-4 行)之一静尼。 --regex
參數(shù)允許對正則表達(dá)式進(jìn)行部分匹配和解釋。 這兩個命令的輸出是兩個文件,一個只包含重鏈(heavy_parse-select.tsv)鼠渺,一個只包含輕鏈(light_parse-select.tsv)鸭巴。
Exporting records to FASTA files
You may want to use external tools, or tools from pRESTO, on your Change-O result files. The ConvertDb.py tool provides two options for exporting data from tab-delimited files to FASTA format.
Standard FASTA
ConvertDb.py fasta -d HD13M_db-pass.tsv --if sequence_id \
--sf sequence_alignment --mf v_call duplicate_count
Where the column containing the sequence identifier is specified by --if sequence_id, the nucleotide sequence column is specified by --sf sequence_id, and additional annotations to be added to the sequence header are specified by --mf v_call duplicate_count.(我們一般也就需要轉(zhuǎn)換成fasta)
Assign clonal groups to the heavy chain data(最為關(guān)鍵的聚類)
The heavy chain file must then be clonally clustered separately.(重鏈需要單獨(dú)聚類,那到底如何做呢拦盹?)
Clustering sequences into clonal groups
首先第一步:Determining a clustering threshold
Before running DefineClones.py(集成模塊鹃祖,我們暫時先知道是做聚類的), it is important to determine an appropriate threshold for trimming the hierarchical clustering into B cell clones.(看來這里的聚類是層次聚類)。The distToNearest function in the SHazaM R package calculates the distance between each sequence in the data and its nearest-neighbor.(利用了R包SHazaM來計算序列的distance普舆,這里就像是之前分享的利用TCRdist計算TCR之間的序列距離),結(jié)果分布應(yīng)該是雙峰的恬口,第一種模式表示數(shù)據(jù)集中具有克隆親屬的序列,第二種模式表示singletons(這個單詞大家意會一下)沼侣。The ideal threshold for separating clonal groups is the value that separates the two modes of this distribution and can be found using the findThreshold function in the SHazaM R package.(閾值的選擇既有人工也有軟件識別),The distToNearest function allows selection of all parameters that are available in DefineClones.py. Using the length normalization parameter ensures that mutations are weighted equally regardless of junction sequence length.The distance to nearest-neighbor distribution for the example data is shown below. The threshold is approximately 0.16 - indicated by the red dotted line.(看來最好還是人工選擇祖能,當(dāng)然關(guān)于閾值的劃分我們還會再分享一篇文章)。
第二步蛾洛、克隆聚類(主要針對B細(xì)胞)
There are several parameter choices when grouping Ig sequences into B cell clones. The argument --act
set accounts for ambiguous V gene and J gene calls when grouping similar sequences. The distance metric --model ham
is nucleotide Hamming distance. Because the threshold was generated using length normalized distances, the --norm len
argument is selected with the previously determined threshold --dist 0.16:
DefineClones.py -d HD13M_db-pass.tsv --act set --model ham \
--norm len --dist 0.16
Correct clonal groups based on light chain data
DefineClones.py currently does not support light chain cloning. However, cloning can be performed after heavy chain cloning using light_cluster.py provided on the Immcantation Bitbucket repository in the scripts
directory:
light_cluster.py -d heavy_select-pass_clone-pass.tsv -e light_select-pass.tsv \
-o 10X_clone-pass.tsv
Here, heavy_select-pass_clone-pass.tsv
refers to the cloned heavy chain AIRR Rearrangement file, light_select-pass.tsv
refers to the light chain file, and 10X_clone-pass.tsv
is the resulting output file.
算法優(yōu)勢
- remove cells associated with more than one heavy chain
-
correct heavy chain clone definitions based on an analysis of the light chain partners associated with the heavy chain clone.(相互輔助的結(jié)果)
接下來就是Reconstructing germline sequences(重建種系序列)
Adding germline sequences to the database
The CreateGermlines.py tool is used to reconstruct the germline V(D)J sequence, from which the Ig lineage and mutations can be inferred(推斷ig的譜系和突變)养铸。 In addition to the alignment information parsed by MakeDb.py to generate the initial database, CreateGermlines.py also requires the set of germline sequences that were used for the alignment passed to the -r
argument. In the case of V-segment germlines, the reference sequences must be IMGT-gapped. Because the D-segment call for B cell receptor alignments is often low confidence, the default germline format (-g dmask) places Ns in the N/P and D-segments of the junction region rather than using the D-segment assigned during reference alignment; this can be modified to generate a complete germline (-g full) or a V-segment only germline (-g vonly) if you wish. The command below adds the germline sequence to the germline_alignment_d_mask column of the output database:(這一部分應(yīng)該是可選的,但是譜系示蹤確實非常重要)
CreateGermlines.py -d HD13M_db-pass.tsv -g dmask \
-r IMGT_Human_IGHV.fasta IMGT_Human_IGHD.fasta IMGT_Human_IGHJ.fasta
Alternatively, if you have run the clonal assignment task prior to invoking CreateGermlines.py, then adding the --cloned
argument is recommended, as this will generate a single germline of consensus length for each clone:(當(dāng)然轧膘,會涉及到很多其他軟件的適用钞螟,我們慢慢來,欲速則不達(dá))谎碍。
CreateGermlines.py -d HD13M_db-pass_clone-pass.tsv -g dmask --cloned \
-r IMGT_Human_IGHV.fasta IMGT_Human_IGHD.fasta IMGT_Human_IGHJ.fasta
IgPhyML lineage tree analysis(這也是這個軟件的最終目的)
IgPhyML 是一個程序鳞滨,旨在構(gòu)建系統(tǒng)發(fā)育樹并測試有關(guān) B 細(xì)胞親和力成熟的進(jìn)化假設(shè)。
B 細(xì)胞體細(xì)胞超突變 (SHM) 的生物學(xué)違反了大多數(shù)標(biāo)準(zhǔn)系統(tǒng)發(fā)育替代模型中的重要假設(shè)蟆淀; 此外太援,雖然大多數(shù)系統(tǒng)發(fā)育程序旨在分析單個譜系,但 B 細(xì)胞庫通常包含數(shù)千個譜系扳碍。 IgPhyML 通過實施替代模型來解決這兩個問題提岔,這些模型糾正了 SHM 的上下文敏感性質(zhì),并結(jié)合了來自多個譜系的信息笋敞,以提供更精確估計的全庫模型參數(shù)估計碱蒙。
An in-depth description of IgPhyML installation and usage can be found at the IgPhyML website.(這個軟件我們也需要深入分析和分享了,真的太多了).
分析開始
Once installed, IgPhyML can be run through BuildTrees by specifying the --igphyml
option. IgPhyML is easiest to run through the Immcantation Docker image. If this is not possible, these instructions require Change-O 0.4.6 or higher, Alakazam 0.3.0 or higher, and IgPhyML to be installed, with the executable in your PATH
variable.
The following commands should work as a first pass on many reasonably sized datasets, but if you really want to understand what’s going on or make sure what you’re doing makes sense, please check out the IgPhyML website.(后續(xù)分享這個軟件)夯巷。
Build trees and estimate model parameters
# Clone IgPhyML repository to get example files
git clone https://bitbucket.org/kleinstein/igphyml
# Move to examples directory
cd igphyml/examples
# Run BuildTrees
BuildTrees.py -d example.tsv --outname ex --log ex.log --collapse \
--sample 3000 --igphyml --clean all --nproc 1
此命令處理 BCR 序列的 AIRR 格式數(shù)據(jù)集赛惩,該數(shù)據(jù)集已與重建的生殖系進(jìn)行克隆聚類。 然后使用 GY94 模型快速構(gòu)建樹趁餐,并使用這些固定拓?fù)涔烙?HLP19 模型參數(shù)喷兼。 這可以通過增加 --nproc 選項來加速。 使用 --sample 選項進(jìn)行子采樣并不是絕對必要的后雷,但 IgPhyML 在應(yīng)用于大型數(shù)據(jù)集時會運(yùn)行緩慢季惯。 在這里吠各, --collapse 標(biāo)志用于折疊相同的序列。 強(qiáng)烈建議這樣做勉抓,因為相同的序列會減慢計算速度贾漏,而不會影響 IgPhyML 中的似然值。
可視化
可以使用 Alakazam 的 readIgphyml 函數(shù)讀取上述命令的輸出文件(又是一個集成模塊)藕筋。 在示例子文件夾中打開 R 會話后纵散,輸入以下命令。 請注意隐圾,使用 Docker 容器時伍掀,您需要在繪制樹后運(yùn)行 dev.off() 以在示例目錄中創(chuàng)建 pdf 繪圖:
library(alakazam)
library(igraph)
db = readIgphyml("ex_igphyml-pass.tab")
# Plot largest lineage tree
plot(db$trees[[1]],layout=layout_as_tree)
# Show HLP10 parameters
print(t(db$param[1,]))
CLONE "REPERTOIRE"
NSEQ "4"
NSITE "107"
TREE_LENGTH "0.286"
LHOOD "-290.7928"
KAPPA_MLE "2.266"
OMEGA_FWR_MLE "0.5284"
OMEGA_CDR_MLE "2.3324"
WRC_2_MLE "4.8019"
GYW_0_MLE "3.4464"
WA_1_MLE "5.972"
TW_0_MLE "0.8131"
SYC_2_MLE "-0.99"
GRS_0_MLE "0.2583"
To visualize a larger dataset with bigger trees, and bifurcating tree topologies, again open an R session in the examples directory:
library(alakazam)
library(ape)
db = readIgphyml("sample1_igphyml-pass.tab",format="phylo")
# Plot largest lineage tree
plot(ladderize(db$trees[[1]]),cex=0.7,no.margin=TRUE)
真的是又多又難,生活很好暇藏,有你更好