隔離的第八天履澳,孤獨(dú)的情緒漸漸消散了缓屠,音樂(lè)敌完,朋友羊初,親情长赞,還有游戲得哆,果然讓心情會(huì)變好很多,有位圣賢說(shuō)過(guò)栋操,其實(shí)人過(guò)的快不快樂(lè)饱亮,并不是說(shuō)我們擁有了房子近上,車子,還有和喜歡的姑娘在一起葱绒,而是良好的人際關(guān)系地淀,好了拒迅,這一篇我來(lái)詳細(xì)分享一下harmony的各種用法璧微,包括harmony與Seurat 的聯(lián)合使用,harmony與NMF的聯(lián)合使用胞得,以及python版本的harmonypy的運(yùn)用阶剑,我們一一來(lái)分享,希望大家有所收益素邪。
oytdd2r83q.png
1兔朦、R版本沽甥,harmony and Seurat乏奥,這個(gè)應(yīng)該是運(yùn)用的最多的,示例代碼如下:
library(Seurat)
library(harmony)
............前面的處理過(guò)程大家自己看一下就好
numsap=1
for (each in samples){
pbmc <- readRDS(paste0(path,'/',each,'_QC.rds'))
colnames(pbmc@assays$RNA@counts) <- str_replace_all(colnames(pbmc@assays$RNA@counts), '-1',paste0('-',numsap))
ob <- CreateSeuratObject(counts =pbmc@assays$RNA@counts,project =each,min.cells = min_cells)
ob$stim <-each
ob <- NormalizeData(ob)
# ob <- FindVariableFeatures(ob, selection.method = "vst",nfeatures = Nfeatures)
numsap=numsap+1
ob.list[[each]] <- ob
}
outdir = myoutdir
seurat.obj = merge(x=ob.list[[1]], y=ob.list[[2]])
if(length(ob.list) > 2){
for (j in 3:length(ob.list)){
seurat.obj = merge(x=seurat.obj, y=ob.list[[j]])
}
}
#Nfeatures=length(rownames(x =seurat.obj)) 盡量不要用所有基因跑
seurat.obj <- FindVariableFeatures(seurat.obj, selection.method = "vst",nfeatures = Nfeatures)
seurat.obj<-ScaleData(seurat.obj,verbose = FALSE,split.by = 'orig.idents')
seurat.obj<-RunPCA(seurat.obj,pc.genes = seurat.obj@var.genes, npcs = cca_use, verbose = FALSE)
sample = compare
# run harmany batch correction
seurat.obj <- RunHarmony(seurat.obj,"stim", plot_convergence = TRUE)
harmony_embeddings <- Embeddings(seurat.obj, 'harmony')
harmony.data<-cbind(rownames(harmony_embeddings),harmony_embeddings)
colnames(harmony.data)[1]<-"Barcode"
write.table(harmony.data,file=paste0(outdir,'/Anchors/',prefix,'_harmony.csv'),row.names=F,col.names=T,sep=',',quote=F)
## run TSNE
seurat.obj = RunTSNE(seurat.obj, dims.use=1:cca_use, do.fast=TRUE, reduction.use="harmony")
# run UMAP
seurat.obj <- RunUMAP(seurat.obj, reduction = "harmony", dims = 1:cca_use,umap.method = "umap-learn",metric = "correlation")
seurat.obj <- FindNeighbors(seurat.obj, reduction = "harmony", dims = 1:cca_use)
seurat.obj <- FindClusters(seurat.obj, reduction.type = "harmony",resolution = resolution)
#saveRDS(seurat.obj,file=paste0(outdir,'/',prefix,'_combined.rds'))
需要注意的地方
- 讀取的時(shí)候我這里寫(xiě)的是QC.rds,大家也可以讀矩陣,10X的結(jié)果文件等等痕鳍,不要拘泥于我寫(xiě)的方法笼呆。
- RunHarmony的第二個(gè)參數(shù)我寫(xiě)了stim诗赌,其實(shí)就是樣本信息秸弛,這個(gè)分析因人而異递览,你的樣本信息保存在哪一列,你就寫(xiě)哪一列绞铃。
- 示例代碼有一些參數(shù)用變量替代镜雨,如果大家做過(guò)單細(xì)胞,應(yīng)該都很容易理解儿捧。
好了荚坞,我們的第二部分挑宠,NMF and harmony
前段時(shí)間有位朋友說(shuō)NMF和harmony能不能聯(lián)合使用,答案是可以的颓影,這里展示一下示例代碼各淀,但是大家要知道哪些單細(xì)胞處理的分析軟件用到了NMF,當(dāng)然诡挂,我們分析一般使用基礎(chǔ)的NMF包就可以了,我們先來(lái)看看NMF如何與Seurat無(wú)縫銜接(這部分大家需要掌握):
library(Seurat)
library(tidyverse)
library(NMF)
rm(list = ls())
## 創(chuàng)建seurat對(duì)象
pbmc <- Read10X_h5("pbmc.h5")
pbmc <- CreateSeuratObject(pbmc, project = "pbmc", min.cells = 3, min.features = 500)
pbmc$percent.mt <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc <- subset(pbmc, percent.mt<20)
pbmc <- NormalizeData(pbmc) %>% FindVariableFeatures() %>% ScaleData(do.center = F)
## 高變基因表達(dá)矩陣的分解
# pbmc大體可分成T咆畏,B南捂,NK,CD14+Mono旧找,CD16+Mono溺健,DC,Platelet等類型钮蛛,考慮冗余后設(shè)置rank=10
vm <- pbmc@assays$RNA@scale.data
res <- nmf(vm, 10, method = "snmf/r", seed = 'nndsvd')
runtime(res)
# 用戶 系統(tǒng) 流逝
#1063.147 78.019 1139.831
## 分解結(jié)果返回suerat對(duì)象
pbmc@reductions$nmf <- pbmc@reductions$pca
pbmc@reductions$nmf@cell.embeddings <- t(coef(res))
pbmc@reductions$nmf@feature.loadings <- basis(res)
## 使用nmf的分解結(jié)果降維聚類
set.seed(219)
pbmc.nmf <- RunUMAP(pbmc, reduction = 'nmf', dims = 1:10) %>%
FindNeighbors(reduction = 'nmf', dims = 1:10) %>% FindClusters()
其中有個(gè)問(wèn)題鞭缭,為什么要用NMF?魏颓?
- 對(duì)比PCA分析的結(jié)果岭辣,NMF雖然毫不遜色,但是它的運(yùn)行時(shí)間更長(zhǎng)甸饱,我們?yōu)槭裁匆肗MF呢沦童?一個(gè)很重要的原因是NMF的因子可解釋性更強(qiáng),每個(gè)因子貢獻(xiàn)度最大的基因基本代表了某種或某個(gè)狀態(tài)細(xì)胞的表達(dá)模式叹话,相比差異分析得到marker基因更有代表性偷遗。
接下來(lái)就是NMF 與 harmony的聯(lián)合使用,相信大家都已經(jīng)會(huì)寫(xiě)了驼壶,示例代碼放在這里:
library(Seurat)
library(tidyverse)
library(NMF)
rm(list = ls())
## 創(chuàng)建seurat對(duì)象
pbmc <- Read10X_h5("pbmc.h5")
pbmc <- CreateSeuratObject(pbmc, project = "pbmc", min.cells = 3, min.features = 500)
pbmc$percent.mt <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc <- subset(pbmc, percent.mt<20)
pbmc <- NormalizeData(pbmc) %>% FindVariableFeatures() %>% ScaleData(do.center = F)
## 高變基因表達(dá)矩陣的分解
# pbmc大體可分成T氏豌,B,NK热凹,CD14+Mono泵喘,CD16+Mono,DC般妙,Platelet等類型纪铺,考慮冗余后設(shè)置rank=10
vm <- pbmc@assays$RNA@scale.data
res <- nmf(vm, 10, method = "snmf/r", seed = 'nndsvd')
runtime(res)
# 用戶 系統(tǒng) 流逝
#1063.147 78.019 1139.831
## 分解結(jié)果返回suerat對(duì)象
pbmc@reductions$nmf <- pbmc@reductions$pca
pbmc@reductions$nmf@cell.embeddings <- t(coef(res))
pbmc@reductions$nmf@feature.loadings <- basis(res)
seurat.obj <- RunHarmony(seurat.obj,"stim", plot_convergence = TRUE)
harmony_embeddings <- Embeddings(seurat.obj, 'harmony')
harmony.data<-cbind(rownames(harmony_embeddings),harmony_embeddings)
## 使用nmf的分解結(jié)果降維聚類
set.seed(219)
pbmc.nmf <- RunUMAP(pbmc, reduction = 'harmony', dims = 1:10) %>%
FindNeighbors(reduction = 'nmf', dims = 1:10) %>% FindClusters()
####下游的處理都一樣了
但是這里注意一個(gè)問(wèn)題,harmony默認(rèn)是識(shí)別pca的股冗,如果出來(lái)slot報(bào)錯(cuò)霹陡,就用NMF的分析結(jié)果覆蓋掉PCA。
python版本的harmonypy
安裝的話很簡(jiǎn)單
pip install harmonypy
示例代碼
import pandas as pd
import numpy as np
from scipy.cluster.vq import kmeans
from scipy.stats.stats import pearsonr
import harmonypy as hm
meta_data = pd.read_csv("data/meta.tsv.gz", sep = "\t")
data_mat = pd.read_csv("data/pcs.tsv.gz", sep = "\t")
data_mat = np.array(data_mat)
vars_use = ['dataset']
# meta_data
#
# cell_id dataset nGene percent_mito cell_type
# 0 half_TGAAATTGGTCTAG half 3664 0.017722 jurkat
# 1 half_GCGATATGCTGATG half 3858 0.029228 t293
# 2 half_ATTTCTCTCACTAG half 4049 0.015966 jurkat
# 3 half_CGTAACGACGAGAG half 3443 0.020379 jurkat
# 4 half_ACGCCTTGTTTACC half 2813 0.024774 t293
# .. ... ... ... ... ...
# 295 t293_TTACGTACGACACT t293 4152 0.033997 t293
# 296 t293_TAGAATTGTTGGTG t293 3097 0.021769 t293
# 297 t293_CGGATAACACCACA t293 3157 0.020411 t293
# 298 t293_GGTACTGAGTCGAT t293 2685 0.027846 t293
# 299 t293_ACGCTGCTTCTTAC t293 3513 0.021240 t293
# [300 rows x 5 columns]
# data_mat[:5,:5]
#
# array([[ 0.0071695 , -0.00552724, -0.0036281 , -0.00798025, 0.00028931],
# [-0.011333 , 0.00022233, -0.00073589, -0.00192452, 0.0032624 ],
# [ 0.0091214 , -0.00940727, -0.00106816, -0.0042749 , -0.00029096],
# [ 0.00866286, -0.00514987, -0.0008989 , -0.00821785, -0.00126997],
# [-0.00953977, 0.00222714, -0.00374373, -0.00028554, 0.00063737]])
ho = hm.run_harmony(data_mat, meta_data, vars_use)
# Write the adjusted PCs to a new file.
res = pd.DataFrame(ho.Z_corr)
res.columns = ['X{}'.format(i + 1) for i in range(res.shape[1])]
res.to_csv("data/adj.tsv.gz", sep = "\t", index = False)
# Test 2
########################################################################
import pandas as pd
import numpy as np
from scipy.cluster.vq import kmeans
from scipy.stats.stats import pearsonr
import harmonypy as hm
meta_data = pd.read_csv("data/pbmc_3500_meta.tsv.gz", sep = "\t")
data_mat = pd.read_csv("data/pbmc_3500_pcs.tsv.gz", sep = "\t")
from time import time
start = time()
ho = hm.run_harmony(data_mat, meta_data, ['donor'])
end = time()
print("elapsed {:.2f} seconds".format(end - start)) # 24 seconds for python, 5 seconds for Rcpp
res = pd.DataFrame(ho.Z_corr).T
res.columns = ['PC{}'.format(i + 1) for i in range(res.shape[1])]
res.to_csv("data/pbmc_3500_pcs_harmonized_python.tsv.gz", sep = "\t", index = False)
harm = pd.read_csv("data/pbmc_3500_pcs_harmonized.tsv.gz", sep = "\t")
cors = []
for i in range(res.shape[1]):
cors.append(pearsonr(res.iloc[:,i].values, harm.iloc[:,i].values))
print([np.round(x[0], 3) for x in cors])
python版本的harmony是為了scanpy而準(zhǔn)備的,所以也是和scanpy無(wú)縫銜接烹棉,寫(xiě)法跟上面的R版本很一致攒霹,把scanpy對(duì)應(yīng)的信息給到harmony就可以了
import harmonypy as hm
import pandas as pd
import scanpy as sc
adata = adata = sc.read_10x_mtx(
'data/filtered_gene_bc_matrices/hg19/', # the directory with the `.mtx` file
var_names='gene_symbols', # use gene symbols for the variable names (variables-axis index)
cache=True)
###或者讀取其他格式的文件,包括h5ad浆洗,loom催束,csv,h5等伏社,看大家的需求抠刺,前面的質(zhì)控處理就不多說(shuō)了,直接做完pca摘昌,然后下一步:
ho = hm.run_harmony(adata.X, adata.obs, ['Sample'])
res = pd.DataFrame(ho.Z_corr)
res.columns = ['X{}'.format(i + 1) for i in range(res.shape[1])]
res.to_csv("data/adj.tsv.gz", sep = "\t", index = False)
###然后一樣速妖,把得到的信息添加到scanpy的分析對(duì)象就可以了
最后是python版本的NMF與harmony的聯(lián)合使用,就不多介紹了聪黎,大家應(yīng)該都會(huì)寫(xiě)了罕容。
生活很好,有你更好