劉小澤寫于19.7.8-第二單元第八講:乳腺癌領(lǐng)域之PAM50分類
筆記目的:根據(jù)生信技能樹的單細(xì)胞轉(zhuǎn)錄組課程探索smart-seq2技術(shù)相關(guān)的分析技術(shù)
課程鏈接在:http://jm.grazy.cn/index/mulitcourse/detail.html?cid=53
什么是PAM50
首次接觸這個名詞肯定很蒙尝艘,因為它是乳腺癌領(lǐng)域的分類名詞查辩,需要看許多資料才能了解烤芦,我也一樣,看了一些推文蝗蛙、英文資料、文章,才做了一些總結(jié)
PAM50的意思是Prediction Analysis of Microarray 50
毡惜,可以看到是芯片時代的產(chǎn)物了瓤狐,它是2009年由Parker提出的瞬铸,原文在:https://ascopubs.org/doi/full/10.1200/JCO.2008.18.1370,目前接近3000引用量础锐。
使用的芯片是Agilent human 1Av2 microarrays or custom-designed Agilent human 22k arrays嗓节,數(shù)據(jù)在GSE10886,它研究了189個prototype samples皆警,得到了一個50個分類基因與5個對照基因的RT-qPCR定量結(jié)果拦宣,得到4個gene expression–based “intrinsic” subtypes:Luminal A, Luminal B, HER2-enriched and Basal-like(詳見:https://genome.unc.edu/pubsup/breastGEO/pam50_centroids.txt)。
關(guān)于這幾種分子亞型的介紹:https://www.breastcancer.org/symptoms/types/molecular-subtypes
Luminal A:hormone-receptor positive (estrogen-receptor and/or progesterone-receptor positive), HER2 negative, low levels of the protein Ki-67 => grow slowly and have the best prognosis.
Luminal B:hormone-receptor positive (estrogen-receptor and/or progesterone-receptor positive), either HER2 positive or HER2 negative信姓,high levels of Ki-67 => grow slightly faster than luminal A & prognosis is slightly worse
-
Triple-negative/basal-like: hormone-receptor negative (estrogen-receptor and progesterone-receptor negative) , HER2 negative
More common with BRCA1 gene mutations among younger and African-American women..
-
HER2-enriched: hormone-receptor negative (estrogen-receptor and progesterone-receptor negative), HER2 positive => grow faster than luminal cancers & worse prognosis
BUT often successfully treated with targeted therapies aimed at the HER2 protein (e.g. Herceptin, Perjeta, Tykerb, Nerlynx, Kadcyla)
Normal-like: similar to luminal A => prognosis is slightly worse than luminal A but also good
乳腺癌發(fā)育來自兩種不同的細(xì)胞:基體細(xì)胞和管腔細(xì)胞鸵隧,還有不同的激素類型(ER/PR、HER2受體)意推,之前在臨床上都是根據(jù)一些IHC marker來進(jìn)行分類
The most common immunohistochemical breast cancerprognostic and therapeutic markers used include: estrogen receptor, human epidermal growth factor receptor-2, Ki-67, progesterone receptor, and p53. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4127609/)
乳腺癌是一個高度異質(zhì)性的疾病豆瘫,即使臨床分期和病理分級相同,患者對治療的反應(yīng)和預(yù)后也是不同的菊值。目前仍然是根據(jù)臨床病理特點如HER2表達(dá)外驱、雌激素受體狀態(tài)、腫瘤大小腻窒、分級和淋巴結(jié)轉(zhuǎn)移等選擇輔助治療略步,包括化療,內(nèi)分泌治療定页,抗HER2治療等趟薄。為了指導(dǎo)預(yù)后,常常采用TNM分期典徊、臨床病理指標(biāo)杭煎,后來由于高通量數(shù)據(jù)的產(chǎn)生恩够,多基因預(yù)測成為了一個新途徑。
舉個例子:可以看表達(dá)量羡铲,比如有50個基因蜂桶,有10個特定基因高它們就表示Luminal A,有其他10個基因高就是Luminal B也切,這就是一個模式扑媚;我們只需要比較我們的表達(dá)矩陣和這個模式進(jìn)行對應(yīng)
多基因檢測有兩項已經(jīng)通過了FDA的批準(zhǔn):
- 21-gene OncotypeDx assay (Genome Health Inc, Redwood City, CA):risk stratify early-stage estrogen receptor (ER) –positive breast cancer
- 70-gene MammaPrint (Agendia, Huntington Beach, CA):ER-positive and ER-negative early-stage node-negative breast cancer.
另外前人的研究還有:
- Single Sample Predictor (SSP) :SSP2003 、SSP2006雷恃、PAM50
- Subtype Classification Model (SCM):SCMOD1疆股、SCMOD2 、simple three-gene model (SCMGENE)
利用genefu包來熟悉PAM50分類器
這個是Bioconductor的包倒槐,使用正確的方式安裝好
官方教程在:https://rdrr.io/bioc/genefu/f/inst/doc/genefu.pdf
用包需知 1
自帶了5個乳腺癌芯片數(shù)據(jù)集(breastCancerMAINZ=》GSE11121旬痹、breastCancerTRANSBIG=》GSE7390、breastCancerUPP=》GSE3494讨越、breastCancerUNT=》GSE2990两残、breastCancerNKI=》數(shù)據(jù)集沒有上傳到GEO):https://vip.biotrainee.com/d/689-5
breastCancerMAINZ=》GSE11121
文章:The humoral immune system has a key prognostic impact in node-negative breast cancer. Cancer Res 2008 Jul 1;68(13):5405-13.
Sci-hub: https://sci-hub.tw/10.1158/0008-5472.can-07-5206
方法:GPL96(HG-U133A) Affymetrix Human Genome U133A Array芯片,其中包含了200 tumors of patients who were not treated by systemic therapy after surgery using a discovery approach.
臨床信息:biological process of proliferation把跨、steroid hormone receptor expression人弓、B cell and T cell infiltration
breastCancerTRANSBIG=》GSE7390
文章:Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series. Clin Cancer Res 2007 Jun 1;13(11):3207-14.
Sci-hub: https://sci-hub.tw/10.1158/1078-0432.ccr-06-2765
方法: GPL96 (HG-U133A) Affymetrix Human Genome U133A Array 芯片,frozen samples from 198 N- systemically untreated patients
breastCancerUPP=》GSE3494
文章:An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc Natl Acad Sci U S A 2005 Sep 20;102(38):13550-5.
Sci-hub:https://sci-hub.tw/10.2307/3376671
方法: GPL96 (HG-U133A) Affymetrix Human Genome U133A Array 芯片着逐,freshly frozen breast tumors from a population-based cohort of 315 women representing 65% of all breast cancers resected in Uppsala County, Sweden, from January 1, 1987 to December 31, 1989.
breastCancerUNT =》GSE2990
文章:Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J Natl Cancer Inst 2006 Feb 15;98(4):262-72
Sci-hub:https://sci-hub.tw/10.1093/jnci/djj052
方法: GPL96 (HG-U133A) Affymetrix Human Genome U133A Array 芯片票从, 189 invasive breast carcinomas and from three published gene expression datasets from breast carcinomas.
最后一個breastCancerNKI使用的是Agilent公司芯片
用包需知 2
這個R包除了包裝了PAM50分類,還加入了其他許多分類標(biāo)準(zhǔn)滨嘱,詳見https://rdrr.io/bioc/genefu/man/峰鄙,使用PAM50是因為它的引用量很高,認(rèn)可度較高
開始用包
# 加載數(shù)據(jù)
rm(list = ls())
options(stringsAsFactors = F)
load(file = '../input.Rdata')
a[1:4,1:4]
head(df)
# 檢查行名(基因名)
> head(rownames(dat))
[1] "0610007P14Rik" "0610009B22Rik" "0610009L18Rik" "0610009O20Rik"
[5] "0610010F05Rik" "0610010K14Rik"
除了很多不像常規(guī)基因名的基因以外太雨,還有很多基因大小寫不一致吟榴,這是因為這個數(shù)據(jù)是小鼠的,而小鼠的基因名與人類的不同在于:首字母大寫囊扳,其余小寫
首先就是將這里的dat
基因名全變?yōu)榇髮?/p>
rownames(dat)=toupper(rownames(dat))
當(dāng)然吩翻,最好直接使用小鼠的分類器,但是目前沒有锥咸,因此只能使用人類的狭瞎,不是很準(zhǔn)確,但是這個分類是可以借鑒的
# 加載genefu
library(genefu)
# 可以看到會加載很多依賴包搏予,包含機器學(xué)習(xí)熊锭、并行、分類法
Loading required package: limma
Loading required package: biomaRt
Loading required package: iC10
Loading required package: pamr
Loading required package: cluster
Loading required package: impute
Loading required package: iC10TrainingData
Loading required package: AIMS
Loading required package: e1071
Loading required package: Biobase
Loading required package: BiocGenerics
Loading required package: parallel
這個包也需要轉(zhuǎn)置后的表達(dá)矩陣(基因為列)
> ddata=t(dat)
> ddata[1:4,1:4]
0610007P14Rik 0610009B22Rik 0610009L18Rik 0610009O20Rik
SS2_15_0048_A3 0.000000 0 0 0.000000
SS2_15_0048_A6 0.000000 0 0 0.000000
SS2_15_0048_A5 6.459884 0 0 2.544699
SS2_15_0048_A4 6.313884 0 0 3.025273
> dim(ddata)
[1] 768 12198
> s=colnames(ddata);head(s);tail(s) ##獲得基因名
[1] "0610007P14Rik" "0610009B22Rik" "0610009L18Rik" "0610009O20Rik"
[5] "0610010F05Rik" "0610010K14Rik"
[1] "ERCC-00160" "ERCC-00162" "ERCC-00163" "ERCC-00165" "ERCC-00170"
[6] "ERCC-00171"
## 發(fā)現(xiàn)有的基因名是不符合常規(guī)認(rèn)知的,因此需要進(jìn)行基因名轉(zhuǎn)換
# 看下人類這個基因注釋包中都包含哪些碗殷,發(fā)現(xiàn)有org.Hs.egSYMBOL精绎,應(yīng)該就是需要的
ls("package:org.Hs.eg.db")
# 這個注釋信息是Bimap格式的,需要先轉(zhuǎn)換成數(shù)據(jù)框锌妻,利用toTable函數(shù)
> class(org.Hs.egSYMBOL)
[1] "AnnDbBimap"
> s2g=toTable(org.Hs.egSYMBOL)
# 求小鼠的基因與人類的基因的交集代乃,利用match函數(shù),返回位置信息(如果沒有對應(yīng)仿粹,就返回NA)搁吓。存在NA的原因就是:小鼠有的對應(yīng)不上人類基因名,并且人類的基因也有未知的
> g=s2g[match(s,s2g$symbol),1]
# 然后做成一個數(shù)據(jù)框
> dannot=data.frame(probe=s,
"Gene.Symbol" =s,
"EntrezGene.ID"=g)
# 下面去掉ddata和dannot中NA的行
> ddata=ddata[,!is.na(dannot$EntrezGene.ID)] #ID轉(zhuǎn)換
> dim(ddata)
[1] 768 10487 # 相比之前大約去掉2000個基因
> dannot=dannot[!is.na(dannot$EntrezGene.ID),]
# 看下去除NA后的基因注釋和表達(dá)矩陣吭历,必須保證注釋的基因ID和表達(dá)矩陣的基因ID一一對應(yīng)
> head(dannot)
probe Gene.Symbol EntrezGene.ID
372 A4GALT A4GALT 53947
393 AAAS AAAS 8086
394 AACS AACS 65985
396 AAGAB AAGAB 79719
397 AAK1 AAK1 22848
398 AAMDC AAMDC 28971
> ddata[1:4,1:4]
A4GALT AAAS AACS AAGAB
SS2_15_0048_A3 8.516383 0 0.000000 0
SS2_15_0048_A6 7.111928 0 0.000000 0
SS2_15_0048_A5 3.415452 0 0.000000 0
SS2_15_0048_A4 6.848774 0 7.168196 0
可以進(jìn)行g(shù)enefu分析了堕仔,分型就是使用molecular.subtyping
函數(shù)
s<-molecular.subtyping(sbt.model = "pam50",data=ddata,
annot=dannot,do.mapping=TRUE)
# 結(jié)果就是將768個細(xì)胞
> table(s$subtype)
Basal Her2 LumB LumA Normal
42 58 46 543 79
# 可以利用原始的樣本信息數(shù)據(jù)框df進(jìn)行clust分組與分子分型之間關(guān)系的探索
> df$subtypes=subtypes
> table(df[,c(1,5)])
subtypes
g Basal Her2 LumA LumB Normal
1 36 30 205 13 28
2 3 25 217 31 24
3 1 2 102 1 15
4 2 1 19 1 12
注意:雖然這里可以實現(xiàn)分類,但是PAM50是針對乳腺癌患者進(jìn)行分類的毒涧,而我們這里是針對單細(xì)胞贮预;而且細(xì)胞也不是癌細(xì)胞贝室,是CAFs(cancer associated fiberblast)
不管是什么細(xì)胞契讲,最后都能得到一個表達(dá)矩陣,算法是不會考慮矩陣來源的滑频,因此即便是正常細(xì)胞的矩陣捡偏,也可以分類成5種乳腺癌亞型,所以分類的前提還是自己熟悉數(shù)據(jù)的生物學(xué)背景
探索PAM50
看一下pam50峡迷,它是一個列表
> str(pam50)
List of 7
$ method.cor : chr "spearman"
$ method.centroids: chr "mean"
$ std : chr "none"
$ rescale.q : num 0.05
$ mins : num 5
$ centroids : num [1:50, 1:5] 0.718 0.537 -0.575 -0.119 0.3 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:50] "ACTR3B" "ANLN" "BAG1" "BCL2" ...
.. ..$ : chr [1:5] "Basal" "Her2" "LumA" "LumB" ...
$ centroids.map :'data.frame': 50 obs. of 3 variables:
..$ probe : chr [1:50] "ACTR3B" "ANLN" "BAG1" "BCL2" ...
..$ probe.centroids: chr [1:50] "ACTR3B" "ANLN" "BAG1" "BCL2" ...
..$ EntrezGene.ID : int [1:50] 57180 54443 573 596 332 644 891 898 991 990 ...
然后取出基因名银伟,存儲在centroids
中:
pam50genes=pam50$centroids.map[c(1,3)]
# 發(fā)現(xiàn)有的基因已經(jīng)不是標(biāo)準(zhǔn)的symbol了,PAM50是2009年的基因名绘搞,因此需要進(jìn)行修改
pam50genes[pam50genes$probe=='CDCA1',1]='NUF2'
pam50genes[pam50genes$probe=='KNTC2',1]='NDC80'
pam50genes[pam50genes$probe=='ORC6L',1]='ORC6'
以第一個基因為例:https://www.genecards.org/cgi-bin/carddisp.pl?gene=NUF2&keywords=NUF2
> x=dat
# 找到pam50在原始表達(dá)矩陣行名中的基因彤避,發(fā)現(xiàn)一共有38個
> pam50genes$probe[pam50genes$probe %in% rownames(x)]
[1] "ANLN" "BAG1" "BCL2" "BIRC5" "BLVRA" "CCNB1"
[7] "CCNE1" "CDC20" "CDC6" "NUF2" "CDH3" "CENPF"
[13] "CEP55" "CXXC5" "EGFR" "ERBB2" "ESR1" "FOXC1"
[19] "KIF2C" "NDC80" "MAPT" "MDM2" "MELK" "MIA"
[25] "MKI67" "MLPH" "MMP11" "MYBL2" "MYC" "ORC6"
[31] "PHGDH" "PTTG1" "RRM2" "SFRP1" "SLC39A6" "TYMS"
[37] "UBE2C" "UBE2T"
> x=x[pam50genes$probe[pam50genes$probe %in% rownames(x)] ,]
下面進(jìn)行熱圖可視化
# 在原來group_list基礎(chǔ)上,添加亞型信息夯辖,為了下面pheatmap中的anno_col設(shè)置
tmp=data.frame(group=group_list,
subtypes=subtypes)
rownames(tmp)=colnames(x)
# 畫熱圖
library(pheatmap)
pheatmap(x,show_rownames = T,show_colnames = F,
annotation_col = tmp)
圖片本身不重要琉预,因為這里數(shù)據(jù)的使用是不合適的≥锕樱可以看到圆米,大部分基因都是luminal A
如果要繼續(xù)歸一化就是:
x=t(scale(t(x)))
x[x>1.6]=1.6
x[x< -1.6]= -1.6
pheatmap(x,show_rownames = T,show_colnames = F,
annotation_col = tmp)