Hello everyone. We continue our series on TCR data analysis. This topic covers a lot of material, so we will work through it step by step. The paper is "Quantifiable predictive features define epitope-specific T cell receptor repertoires" (Nature, impact factor 49). Today's task is again to learn some basic concepts and algorithms.
TCRs from T cells that recognize the same pMHC epitope often share conserved sequence features, suggesting that it may be possible to predictively model epitope specificity. (For background on gene rearrangement and epitopes, see my earlier post "10X single-cell (10X spatial transcriptomics) TCR data analysis: the TCR-intrinsic regulatory potential (TiRP) system".) The point here is that, for the same pMHC, the enriched TCR sequences tend to carry the same motif. Many experiments have confirmed this, which suggests that predictive modelling of epitope specificity may be possible. (That is also the ultimate goal of this series.)
This builds on the previous post. To model the TCRs enriched for an antigen, we first need a distance measure on the space of TCRs that permits clustering and visualization (the clustering and visualization here differ from those used in single-cell transcriptomics), a robust repertoire diversity metric that accommodates the low number of paired public receptors observed when compared to single-chain analyses (that is, the metric must still work when very few paired receptors are exactly shared between individuals), and a distance-based classifier (classifiers are very common in machine learning) that can assign previously unobserved TCRs to characterized repertoires with robust sensitivity and specificity.
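Naturally, the set of TCR sequences enriched for a given epitope contains a clustered group of receptors that share core sequence similarities, together with a dispersed set of diverse 'outlier' sequences (this makes sense: the similar sequences presumably share a motif and therefore bind the epitope specifically). By identifying the shared motifs in the core sequences, we can highlight the key conserved residues that drive the essential elements of TCR recognition. (So the sequences here are amino acid sequences.)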
For the sequenced TCRs, the features we need to summarize and analyse include length, charge, and hydrophobicity of the CDR3 regions, clonal diversity (within individuals), and amino acid sequence sharing (across individuals) following well-established approaches to repertoire analysis. (We will introduce those established approaches later. In short, many metrics deserve deeper analysis beyond the gene sequence alone, and our single-cell TCR analysis needs an upgrade.)
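As a rough illustration of the three CDR3 features named above, here is a minimal sketch. It is not the paper's implementation: the Kyte-Doolittle hydropathy scale, the integer charge rule (K/R = +1, D/E = -1, histidine neutral), the function name `cdr3_features`, and the example sequences are all my own assumptions.

```python
# Kyte-Doolittle hydropathy values; choosing this scale is an assumption.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Crude integer side-chain charges near neutral pH (an assumption).
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def cdr3_features(cdr3):
    """Return length, net charge and mean hydropathy of one CDR3 sequence."""
    seq = cdr3.upper()
    length = len(seq)
    charge = sum(CHARGE.get(aa, 0) for aa in seq)
    hydropathy = sum(KYTE_DOOLITTLE[aa] for aa in seq) / length
    return {"length": length, "charge": charge, "hydropathy": hydropathy}


# Hypothetical CDR3 sequences, for illustration only.
for seq in ["CASSIRSSYEQYF", "CAVRDDKIIF"]:
    print(seq, cdr3_features(seq))
```

Averaging these per-receptor values over a repertoire gives the kind of per-epitope mean that the text compares across epitopes.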
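Mean values for CDR3 length, charge, and hydrophobicity tightly clustered for the majority of the epitopes, and all CDR3 features showed substantially overlapping ranges. (So these bulk features alone do not separate the epitopes, and we need to look for the functional motifs in the antigen-enriched sequences.)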
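A brief recap of the authors' findings. (1) They found negative correlations between CDR3 charge and peptide charge, and between CDR3 length and peptide length. This suggests that charge and length complementarity may play a role in pMHC recognition for some epitopes (background knowledge; it is enough to be aware of it). (2) Whereas substantial levels of sharing or publicity were observed for individual chains (compared chain by chain, many are identical), lower levels of sharing between individuals were observed when paired αβ receptors were considered (an interesting point: single chains are shared widely, yet identical paired receptors are rare).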
What single-cell TCR sequencing adds: By using paired single-cell TCRαβ sequencing, we were able to determine whether V and J segment usage was correlated both within a chain (for example, Vα–Jα, Vβ–Jβ) and across chains (for example, Vα–Vβ, Vα–Jβ). (In other words, looking for correlations.)
Compared with TCR sequences that were not enriched for an epitope, the TCR sequences recognizing viral epitopes showed varying degrees of dominance of single and pairwise gene associations. (This is also as expected.)
- Figure legend: V and J gene segment usage and covariation in epitope-specific responses. a, Gene segment usage and gene–gene pairing landscapes are illustrated using four vertical stacks (one for each V and J segment) connected by curved paths whose thickness is proportional to the number of TCR clones with the respective gene pairing (in other words, a Sankey diagram) (each panel is labelled with the four gene segments atop their respective colour stacks and the epitope identifier in the top middle). Genes are coloured by frequency within the repertoire with a fixed colour sequence used throughout the manuscript which begins red (most frequent), green (second most frequent), blue, cyan, magenta, and black. The enrichment of gene segments relative to background frequencies is indicated by up or down arrows with an arrowhead number equal to the log2 of the fold change. b, Jensen–Shannon divergence (for background on JS divergence, see the article on KL divergence, JS divergence and Wasserstein distance) between the observed gene frequency distributions and background frequencies, normalized by the mean Shannon entropy of the two distributions (higher values reflect stronger gene preferences). c, Adjusted mutual information of gene usage correlations between regions (higher values indicate more strongly covarying gene usage). The lower limits of the colour ranges in b and c were chosen to highlight significant changes. A summary of the number of subjects, total number of TCR sequences
- Figure legend: Gene segment usage and gene–gene pairing landscapes are illustrated graphically using four vertical stacks (one for each V and J segment) connected by curved segments with thickness proportional to the number of TCRs with the respective gene pairing (each panel is labelled with the four gene segments atop their respective colour stacks and the epitope identifier in the top middle). Genes are coloured by frequency within the repertoire with a fixed colour sequence used throughout the manuscript which begins red (most frequent), green (second most frequent), blue, cyan, magenta, and black. Clonally expanded TCRs were reduced to a single data point for this analysis. The number of clones is indicated to the left of each panel. The enrichment of gene segments relative to background frequencies is indicated by up or down arrows, with each successive arrowhead corresponding to an additional twofold deviation (for example, one arrowhead = twofold enrichment, two arrowheads = fourfold enrichment). (Same presentation as the figure above.)
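Each epitope-specific response is characterized by over-representation of individual genes and by marked gene-pairing preferences, which gives us a rationale for modelling each epitope separately when searching for motifs. The Jensen–Shannon divergence between each epitope-specific gene frequency distribution and the background distribution is used to quantify the overall magnitude of gene preference (this requires a little algorithmic background). We quantified the degree of gene usage covariation between pairs of segments using the adjusted mutual information score (also an important step).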
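The gene-preference score can be sketched as follows: a Jensen–Shannon divergence between observed and background gene frequencies, divided by the mean Shannon entropy of the two distributions. This is a minimal sketch under my own assumptions, not the authors' code: the function names, the base-2 logarithm, and the gene frequencies in the example are all hypothetical.

```python
import math


def shannon_entropy(p):
    """Shannon entropy in bits of a frequency dict {gene: probability}."""
    return -sum(x * math.log2(x) for x in p.values() if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: H(M) - (H(P) + H(Q)) / 2."""
    genes = set(p) | set(q)
    # M is the midpoint mixture of the two distributions.
    m = {g: 0.5 * (p.get(g, 0.0) + q.get(g, 0.0)) for g in genes}
    return shannon_entropy(m) - 0.5 * (shannon_entropy(p) + shannon_entropy(q))


def normalized_jsd(p, q):
    """JSD divided by the mean Shannon entropy of the two distributions."""
    mean_h = 0.5 * (shannon_entropy(p) + shannon_entropy(q))
    if mean_h == 0:
        raise ValueError("both distributions have zero entropy")
    return js_divergence(p, q) / mean_h


# Hypothetical V-gene frequencies, for illustration only.
observed = {"TRBV13-1": 0.6, "TRBV19": 0.3, "TRBV29": 0.1}
background = {"TRBV13-1": 1 / 3, "TRBV19": 1 / 3, "TRBV29": 1 / 3}
print(normalized_jsd(observed, background))
```

Identical distributions score 0, and the score grows as the epitope-specific repertoire departs from the background. The adjusted mutual information score is not reproduced here.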
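To find motifs, the definition of TCR distance now comes into play. (The concept and the way it is calculated were covered in the previous post; differences in CDR3 are penalized more heavily.)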
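To make the idea of "CDR3 is penalized more heavily" concrete, here is a deliberately simplified toy distance. It is not the TCRdist algorithm: there is no substitution matrix and no alignment, and the loop names, the weights, the gap cost, and the sequences are hypothetical choices for illustration only.

```python
def loop_distance(a, b, mismatch=1, gap=1):
    """Toy distance between two CDR loops.

    Counts mismatches position by position over the shared length and
    adds a gap cost for every extra residue. This is not an alignment.
    """
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatch * mismatches + gap * abs(len(a) - len(b))


# Hypothetical weights: CDR3 counts three times as much as the other loops.
WEIGHTS = {"cdr1": 1, "cdr2": 1, "cdr3": 3}


def toy_tcr_distance(tcr_a, tcr_b, weights=WEIGHTS):
    """Weighted sum of per-loop toy distances for one chain."""
    return sum(w * loop_distance(tcr_a[loop], tcr_b[loop])
               for loop, w in weights.items())


# Made-up receptors that differ by one CDR3 residue.
tcr_a = {"cdr1": "SGHNS", "cdr2": "FNNNVP", "cdr3": "CASSIRSSYEQYF"}
tcr_b = {"cdr1": "SGHNS", "cdr2": "FNNNVP", "cdr3": "CASSIRSAYEQYF"}
print(toy_tcr_distance(tcr_a, tcr_b))  # 3
```

With these weights a single CDR3 substitution costs as much as three CDR1 substitutions, which is the qualitative behaviour described in the text.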
- Figure legend: 2D kernel principal components analysis (PCA) projection of the TCRdist landscape coloured by Vα (left panel) and Vβ (right panel) gene usage. Three groups of receptors that correspond to TCR logos and clusters depicted in c are indicated with dashed ellipses. (A method that is also very common in single-cell analysis.)
- Figure legend: Epitope-specific TCR landscapes were projected into two dimensions (2D) using kernel PCA analysis applied to the TCRdist distance matrix: TCRs with small TCRdist values tend to project to nearby points in 2D. The same 2D projection is shown in the four panels of each row, coloured by Vα, Jα, Vβ and Jβ gene segment usage (left to right, respectively). The colours are based on gene frequency in the projected repertoire and follow the same sequence used throughout the manuscript: in decreasing order, 1, red; 2, green; 3, blue; 4, cyan; 5, magenta; 6, black; followed by assorted colours for rare frequencies. A summary of number of subjects,
To complement these landscape projections, we performed TCRdist-based clustering of the epitope-specific receptors and constructed hierarchical distance trees (TCRdist is a very good analysis tool). (It is important to note that clonal expansions are not reflected in these repertoire landscape analyses, as each unique receptor is included only once; in other words, duplicates are not counted.) The authors also developed a TCR logo representation that summarizes the gene frequencies, CDR3 amino acid sequences, and inferred rearrangement structures of a set of TCRs as a tool to further annotate these clusters (this is worth noting; anyone who has done biochemistry experiments will recognize this kind of logo). Each repertoire consists mainly of one cluster, and the other sequences have a similar structure, which makes the motif search easier. Beyond the core cluster of similar receptors, each repertoire also contains multiple regions of receptors that are clearly different from one another.
- Figure legend: Average-linkage dendrogram of TCRdist receptor clusters coloured by generation probability, with TCR logos for selected receptor subsets (the branches enclosed in dashed boxes labelled with size of the TCR clusters). Each logo depicts the V- (left side) and J- (right side) gene frequencies, CDR3 amino acid sequences (middle), and inferred rearrangement structure (bottom bars coloured by source region, light grey for the V-region, dark grey for J, black for D, and red for N-insertions) of the grouped receptors. (n = 13 mice, 291 TCR clones.)
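For readers unfamiliar with the average-linkage trees mentioned in the legend, here is a small pure-Python sketch of agglomerative clustering on a precomputed distance matrix. It is only an illustration of the linkage rule, not the TCRdist software; the function name and the toy matrix are hypothetical, and a real analysis would more likely use a library routine.

```python
def average_linkage(dist):
    """Agglomerative clustering with average linkage.

    dist: symmetric square matrix (list of lists) of pairwise distances.
    Returns the merges in order as (members_a, members_b, height) tuples.
    """
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                # Average linkage: mean distance over all cross-cluster pairs.
                d = sum(dist[x][y] for x in a for y in b) / (len(a) * len(b))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Toy distance matrix: receptors 0 and 1 are close, as are 2 and 3.
toy = [
    [0, 2, 10, 12],
    [2, 0, 11, 9],
    [10, 11, 0, 3],
    [12, 9, 3, 0],
]
for left, right, height in average_linkage(toy):
    print(left, right, height)
```

The merge heights are what a dendrogram draws on its distance axis: the two tight pairs join first, and the two pairs join each other last at a much larger height.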
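Although CDR3 sequence conservation is evident in the TCRdist clusters, many of these shared CDR3 residues come directly from the genomic sequence of the V and J regions and therefore reflect the observed gene usage biases. To find the CDR3 motifs, the authors used a recursive search algorithm and identified sequence patterns that occur significantly more often in the observed receptors than in two V- and J-gene-matched background sets of receptor sequences (this calls for structural biology knowledge, of which I have too little, I'm afraid).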
- Note: Enriched CDR3 sequence motifs define key features of epitope specificity. The top-scoring CDR3α (left TCR logo) and CDR3β (right TCR logo) sequence motifs are shown for each repertoire. The motif sequence logo is shown at full height (top) and scaled (bottom) by per-column relative entropy to background frequencies derived from TCRs with matching gene-segment composition in order to highlight motif positions under selection. For three epitopes with solved ternary TCR–pMHC structures, the enriched motif positions are mapped onto the 3D structure: motif positions shown in green sticks; peptide in magenta; alpha (beta) chain in yellow (blue) cartoons; selected hydrogen bonds shown as dotted green lines.
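The authors propose that these statistically enriched, non-germline-encoded motifs have a critical role in mediating TCR recognition (that seems right), and structural analysis of the TCR proteins supports this. So sequence analysis of TCRs can identify the key conserved residues that drive the essential elements of TCR recognition of antigen. This analysis is extremely important.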
Next, the TCRdist measure is applied to quantitatively assess receptor diversity and density within epitope-specific repertoires. The authors use a new diversity metric (TCRdiv) that generalizes Simpson's diversity index (search for it online if it is unfamiliar) by capturing similarity among receptors in addition to exact identity, as Simpson's diversity index is highly sensitive to sampling noise because of the relative rarity of observing identical αβ pairs among individuals.
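To see what "generalizing Simpson's index with similarity" can look like, here is a sketch under stated assumptions. Simpson's index is the probability that two receptors drawn at random are identical. In the smoothed version below, each pair contributes a similarity between 0 and 1 computed from its distance. The Gaussian kernel, the width parameter `sigma`, taking the inverse as the diversity score, and all function names are my assumptions for illustration, not a statement of how TCRdiv is defined in the paper.

```python
import math
from itertools import combinations


def simpson_index(labels):
    """Probability that two receptors drawn without replacement are identical."""
    pairs = list(combinations(labels, 2))
    return sum(1 for a, b in pairs if a == b) / len(pairs)


def smoothed_simpson(dist, sigma):
    """Like simpson_index, but each pair contributes exp(-(d / sigma)^2).

    dist: symmetric matrix of pairwise distances; sigma: kernel width
    (a hypothetical parameter for this sketch).
    """
    pairs = list(combinations(range(len(dist)), 2))
    total = sum(math.exp(-(dist[i][j] / sigma) ** 2) for i, j in pairs)
    return total / len(pairs)


def similarity_diversity(dist, sigma):
    """Inverse of the smoothed index: larger values mean a more diverse set."""
    return 1.0 / smoothed_simpson(dist, sigma)


# Toy distance matrix: two near-identical receptors plus one distant one.
toy = [
    [0, 5, 80],
    [5, 0, 85],
    [80, 85, 0],
]
print(simpson_index(["a", "b", "c"]))        # exact identity sees no sharing
print(smoothed_simpson(toy, sigma=20.0))     # similarity still registers
print(similarity_diversity(toy, sigma=20.0))
```

The design point is that near-identical receptors count as partly shared, so the score is less dependent on the rare event of observing exactly identical αβ pairs.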
Examination of TCRdiv scores for the analysed repertoires for single chains as well as paired receptors clarified trends seen in the earlier analyses (for example, the PB1 repertoire exhibited low diversity in the α-chain and high β-chain diversity).
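As described above, our landscape analysis shows that each repertoire consists of one or more clusters of receptors that share similar sequence features, together with a more diverse set of outlier receptors. To account for both the clustered and the dispersed TCRs, the authors developed a repertoire-specific nearest-neighbour score (NN-distance) that captures the density of receptors around each receptor (computed as the average TCRdist between a receptor and its nearest-neighbour receptors in the repertoire). Although variation across repertoires was apparent in the NN-distance distributions, most epitopes showed an approximately bimodal distribution: one peak of receptors with low NN-distance represents the main, densely sampled cluster or clusters, and a second peak of receptors with larger NN-distance reflects the outlier receptors.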
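The NN-distance idea can be sketched in a few lines: for each receptor, average its distances to its closest neighbours in the repertoire. This is a minimal sketch, not the published scoring; in particular the function name and the neighbour count `k` are hypothetical choices of mine, and the matrix is a toy example.

```python
def nn_distance(dist, index, k=3):
    """Mean distance from receptor `index` to its k nearest other receptors.

    dist: symmetric matrix of pairwise distances within one repertoire.
    k: number of neighbours to average over (hypothetical default).
    """
    others = sorted(d for j, d in enumerate(dist[index]) if j != index)
    nearest = others[:k]
    return sum(nearest) / len(nearest)


# Toy repertoire: receptors 0-2 form a dense cluster, receptor 3 is an outlier.
toy = [
    [0, 1, 2, 50],
    [1, 0, 1, 51],
    [2, 1, 0, 49],
    [50, 51, 49, 0],
]
scores = [nn_distance(toy, i, k=2) for i in range(len(toy))]
print(scores)  # [1.5, 1.0, 1.5, 49.5]
```

Plotting a histogram of such scores over a whole repertoire is one way to see the two peaks described above: low values for receptors inside a dense cluster and high values for outliers.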
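To confirm the antigen specificity of these non-clustered receptors, receptors from both peaks were selected and tested experimentally for their ability to bind the specific tetramer (that is, to recognize the corresponding antigen). Reactivity was confirmed in every case, indicating that at least some of these diverse outlier receptors are legitimate, if unconventional, solutions to the problem of epitope specificity, which partly explains this phenomenon.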
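The software also includes a classifier, which helps us identify the motifs of particular T cell populations, such as TCR sequences from tumour-infiltrating T cells, and that is very valuable. That is all for today's basics. In the next post we will cover the algorithm and code of the TCRdist software.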
Life is good, and even better with you.