Github官網(wǎng)里的介紹:
SPRING ( https://doi.org/10.1093/bioinformatics/btx792 ) 是預(yù)處理腳本的集合和基于 Web 瀏覽器的工具,用于可視化高維數(shù)據(jù)并與之交互。在此處查看示例數(shù)據(jù)集改淑。SPRING 是為單細(xì)胞 RNA-Seq 數(shù)據(jù)開發(fā)的碍岔,但可以更廣泛地應(yīng)用。最小輸入是高維數(shù)據(jù)點(diǎn)(細(xì)胞)矩陣和維名稱(基因)列表
高維數(shù)據(jù)的低維可視化通常是不完美的朵夏。SPRING 不是試圖呈現(xiàn)單個(gè)細(xì)胞數(shù)據(jù)的單一“確定性”視圖蔼啦,而是允許探索多種可視化,以開發(fā)數(shù)據(jù)結(jié)構(gòu)的直覺仰猖。SPRING 的核心是創(chuàng)建數(shù)據(jù)點(diǎn)的 k 最近鄰 (kNN) 圖捏肢,并使用力導(dǎo)向布局在 2D 中可視化該圖。
環(huán)境配置
conda create python=2.7 -n SPRING
source /home/miniconda3/bin/activate SPRING
conda install scikit-learn numpy scipy matplotlib h5py
pip install networkx fa2 python-louvain
分析使用
參考數(shù)據(jù)集和腳本
下載一個(gè)案例看看:
wget https://github.com/AllonKleinLab/SPRING_dev/blob/master/data_prep/spring_example_pbmc4k.ipynb
主目錄應(yīng)包含以下文件:
matrix.mtx
counts_norm_sparse_cells.hdf5
counts_norm_sparse_genes.hdf5
genes.txt
以及每個(gè) SPRING 圖的一個(gè)子目錄:
categorical_coloring_data.json
cell_filter.npy
cell_filter.txt
color_data_gene_sets.csv
color_stats.json
coordinates.txt
edges.csv
graph_data.json
run_info.json
安裝jupyter notebook
conda install jupyter
下載SPRING的github包
git clone https://git@github.com/AllonKleinLab/SPRING.git
sudo chmod -R a+w SPRING
cd SPRING
測(cè)試是否完成部署
- 通過輸入啟動(dòng)本地服務(wù)器
python -m SimpleHTTPServer 8000 &
- 然后在網(wǎng)絡(luò)瀏覽器(最好是 Chrome)中轉(zhuǎn)到 http://192.168.3.33:8000/springViewer.html?datasets/centroids
說明環(huán)境搭建基本完成
然后在R中饥侵,用自己Seurat對(duì)象的數(shù)據(jù)矩陣鸵赫、便簽進(jìn)行預(yù)處理轉(zhuǎn)換成輸入文件
# 保存矩陣
library(data.table)
setwd('./SPRING/example_inputs/project/')
tt<-data.frame(seu@assays$RNA@counts)
tt[1:10, 1:10]
fwrite(tt, file = 'seu_mtx.xls', quote = F, sep = '\t', row.names = T, col.names = T)
# 保存Seurat分群、樣本來源躏升、細(xì)胞類型標(biāo)簽
write.table(data.frame(orig.ident=as.character(seu$orig.ident),
cluster=as.character(seu$seurat_clusters)),
file = 'orig.ident_list.xls', sep = '\t',
quote = F, col.names = T, row.names = F)
數(shù)據(jù)預(yù)處理
啟動(dòng)Jupyer Notebook
jupyter notebook --ip 192.168.3.33 --port 8889 &
打開瀏覽器輸入并用Token登錄:
192.168.3.33:8889
運(yùn)行Python代碼:
import pickle, numpy as np
# Import SPRING helper functions
from preprocessing_python import *
#矩陣導(dǎo)入
import numpy as np
import pandas as pd
mtx = pd.read_csv('./example_inputs/project/seu_mtx.xls',
delimiter='\t', encoding='utf-8')
#矩陣調(diào)整格式
mtx.index = mtx.iloc[:,:1]['Unnamed: 0']
mtx.index
mtx = mtx.iloc[:, 1:]
mtx.head()
gene_list = list(mtx.index)
gene_list[1:10]
#矩陣轉(zhuǎn)置
mtx = mtx.transpose()
# 轉(zhuǎn)換為numpy數(shù)組
mtx = mtx.to_numpy()
# 數(shù)據(jù)過濾辩棒、PCA降維, kNN聚類
# Filter out cells with fewer than 1000 UMIs
print 'Filtering cells'
mtx,cell_filter = filter_cells(mtx, 1000)
# Normalize gene expression data
# Only use genes that make up <
print 'Row-normalizing'
mtx = row_normalize(mtx)
# Filter genes with mean expression < 0.1 and fano factor < 3
print 'Filtering genes'
_,gene_filter = filter_genes(mtx,0.1,3)
# Z-score the gene-filtered expression matrix and do PCA with 20 pcs
print 'Zscoring and PCA'
Epca = get_PCA(Zscore(mtx[:,gene_filter]),20)
# get euclidean distances in the PC space
print 'Getting distance matrix'
D = get_distance_matrix(Epca)
# 導(dǎo)入標(biāo)簽信息
tt = pd.read_csv('./example_inputs/project/orig.ident_list.xls', delimiter='\t')
orig_ident = list(tt['orig.ident'])
cluster_list = list(tt['cluster'])
celltype=list(tt['celltype'])
# 整合成字典
cell_groupings = dict() #'cluster':cluster_list, 'timepoint':orig_ident
cell_groupings['cluster'] = cluster_list
cell_groupings['timepoint'] = orig_ident
cell_groupings['celltype'] = celltype
# 生成網(wǎng)頁可視化的輸入文件夾里面的文件
# load additional data (gene_list, cell_groupings, custom_colors)
# gene_list is a list of genes with length E.shape[1]
# cell_groupings is a dict of the form: { <grouping_name> : [<cell1_label>, <cell2_label>,...] }
# a "grouping" could be the sample id, cluster label, or any other categorical variable
# custom_colors is a dict of the form { <color_track_name> : [<cell1_value>, <cell2_value>,...] }
# a "custom color" is any continuous variable that you would like to use for coloring cels.
# gene_list, cell_groupings, custom_colors = pickle.load(open('example_inputs/python_data.p'))
# save a SPRING plots with k=5 edges per node in the directory "datasets/frog_python/"
# coarse graining can also be performed using the optional coarse_grain_X parameter
print 'Saving SPRING plot'
save_spring_dir(mtx,D,5,gene_list,'datasets/project',
cell_groupings=cell_groupings,
# custom_colors=custom_colors,
coarse_grain_X=1)
在瀏覽器中可視化數(shù)據(jù)并進(jìn)行下載
打開瀏覽器,輸入:http://192.168.3.33:8000/springViewer.html?datasets/project
后面就可以進(jìn)行對(duì)單個(gè)基因表達(dá)量的可視化
以及把自己的標(biāo)簽映射上去
然后是下載保存圖片
總結(jié)
- 個(gè)人認(rèn)為這個(gè)SPRING算法效果不是特別好膨疏,尤其針對(duì)幾個(gè)樣本合并的情況
- 即使原始的輸入數(shù)據(jù)不變一睁,但是每一次運(yùn)行出來的圖像樣子都是不一樣的,沒辦法復(fù)現(xiàn)(按作者的說法是“Low-dimensional visualizations of high-dimensional data are usually imperfect. Rather than attempting to present a single ‘definitive’ view of single cell data, SPRING allows exploration of multiple visualizations in order to develop an intuition for data structure. ”)
- 更多功能和坑后面再摸索
==================================================================
后續(xù)dev開發(fā)板踩坑
環(huán)境搭建差不多佃却,遇到?jīng)]有的庫pip安裝即可
官網(wǎng):https://github.com/AllonKleinLab/SPRING_dev
https://github.com/AllonKleinLab/SPRING_dev.git
#創(chuàng)建conda環(huán)境后啟動(dòng)環(huán)境
source path/bin/activate SPRING_dev
cd ./SPRING_dev
#啟動(dòng)Jupyter
jupyter notebook --ip 192.168.3.33 --port 8899 &
打開瀏覽器輸入 192.168.3.33:8899
數(shù)據(jù)預(yù)處理
#import modules, define some functions for loading, saving and processing a gene-barcode matrix
%matplotlib inline
import pkg_resources
import collections
import numpy as np
import pandas as pd
import scipy.sparse as sp_sparse
import tables
import matplotlib
import matplotlib.pyplot as plt
import h5py
np.random.seed(0)
GeneBCMatrix = collections.namedtuple('GeneBCMatrix', ['gene_ids', 'gene_names', 'barcodes', 'matrix'])
def get_matrix_from_h5(filename, genome):
with tables.open_file(filename, 'r') as f:
try:
dsets = {}
for node in f.walk_nodes('/' + genome, 'Array'):
dsets[node.name] = node.read()
matrix = sp_sparse.csc_matrix((dsets['data'], dsets['indices'], dsets['indptr']), shape=dsets['shape'])
if 'id' in dsets.keys():
return GeneBCMatrix(dsets['id'], dsets['name'], dsets['barcodes'], matrix)
else:
return GeneBCMatrix(dsets['genes'], dsets['gene_names'], dsets['barcodes'], matrix)
except tables.NoSuchNodeError:
raise Exception("Genome %s does not exist in this file." % genome)
except KeyError:
raise Exception("File is missing one or more required datasets.")
def save_matrix_to_h5(gbm, filename, genome):
flt = tables.Filters(complevel=1)
with tables.open_file(filename, 'w', filters=flt) as f:
try:
group = f.create_group(f.root, genome)
f.create_carray(group, 'genes', obj=gbm.gene_ids)
f.create_carray(group, 'gene_names', obj=gbm.gene_names)
f.create_carray(group, 'barcodes', obj=gbm.barcodes)
f.create_carray(group, 'data', obj=gbm.matrix.data)
f.create_carray(group, 'indices', obj=gbm.matrix.indices)
f.create_carray(group, 'indptr', obj=gbm.matrix.indptr)
f.create_carray(group, 'shape', obj=gbm.matrix.shape)
except:
raise Exception("Failed to write H5 file.")
def subsample_matrix(gbm, barcode_indices):
return GeneBCMatrix(gbm.gene_ids, gbm.gene_names, gbm.barcodes[barcode_indices], gbm.matrix[:, barcode_indices])
def get_expression(gbm, gene_name):
gene_indices = np.where(gbm.gene_names == gene_name)[0]
if len(gene_indices) == 0:
raise Exception("%s was not found in list of gene names." % gene_name)
return gbm.matrix[gene_indices[0], :].toarray().squeeze()
def save_count_matrix_to_h5(E, gene_list, filename):
with h5py.File(filename, 'w') as hf:
hf.create_dataset("indptr" , data= E.indptr)
hf.create_dataset("indices", data= E.indices)
hf.create_dataset("data" , data= E.data)
hf.create_dataset("shape" , data= E.shape)
hf.create_dataset("genes" , data= gene_list)
%pylab inline
from helper_functions import *
from collections import defaultdict
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.sans-serif'] = 'DejaVu Sans'
plt.rc('font', size=14)
plt.rcParams['pdf.fonttype'] = 42
sample_list = ["sample1", "sample2"]
# D stores all the data; one entry per library
D = {}
for j, s in enumerate(sample_list):
print(j, s)
# D stores all the data; one entry per library
D = {}
for j, s in enumerate(sample_list):
filename = './Project/'+ s +'/outs/raw_feature_bc_matrix.h5'
raw_matrix_h5 = filename
print raw_matrix_h5
if filename == './Project/'+ s +'/outs/raw_feature_bc_matrix.h5':
genome = "matrix"
else:
genome = 'GRCh38'
D[s] = {}
D[s]['meta'] = {}
gbm = get_matrix_from_h5(raw_matrix_h5, genome)
D[s]['E'] = transpose(gbm.matrix)
D[s]['meta']['gene_list']=gbm.gene_names
D[s]['meta']['gene_id']=gbm.gene_ids
print D[s]['E'].shape
D[s]['cell_index']=gbm.barcodes
print (len(D[s]['cell_index']))
gene_list = D[s]['meta']['gene_list']
gene_id = D[s]['meta']['gene_id']
Filter cells by total counts and number of genes detected
# plot total counts histograms - don't actually filter out any barcodes yet
# adjust total counts thresholds
for j,s in enumerate(sample_list):
D[s]['total_counts'] = np.sum(D[s]['E'], axis=1).A[:,0]
D[s]['genes_detected'] = np.sum(D[s]['E']>0, axis=1).A[:,0]
D[s]['meta']['min_tot']= np.mean(D[s]['total_counts'])+np.std(D[s]['total_counts'])
D[s]['meta']['min_genes_detected']=200;
fig3 = plt.figure()
ax3 = fig3.add_subplot(111)
ax3.set_xlabel('Transcripts per barcode (log10)')
ax3.set_ylabel('genes detected')
ax3.scatter(log(1 + D[s]['total_counts']),D[s]['genes_detected']);
ax3.plot(ax3.get_xlim(),[D[s]['meta']['min_genes_detected'],D[s]['meta']['min_genes_detected']]);
title(s)
D_orig=D
for j,s in enumerate(sample_list):
ix = (D[s]['genes_detected'] > 200) & (D[s]['total_counts'] > 500)
if np.sum(ix) > 0:
print s, np.sum(ix), '/', D[s]['E'].shape[0]
# Actually filter out low-count barcodes
for j,s in enumerate(sample_list):
print '--- %s ---' %s
print 'Pre-filter: %i barcodes' %D[s]['E'].shape[0]
tmpfilt = np.nonzero((D[s]['genes_detected'] > 200) & (D[s]['total_counts'] > 500))[0]
D[s] = filter_dict(D[s], tmpfilt)
print 'Post-filter: %i barcodes' %D[s]['E'].shape[0]
Filter cells by mito fraction
# get mitochondrial genes
mt_ix = [i for i,g in enumerate(gene_list) if g.startswith('MT-')]
print [gene_list[i] for i in mt_ix]
# plot mito-gene frac histograms - don't actually filter out any cells yet
# set mito-gene frac threshold
for j,s in enumerate(sample_list):
D[s]['meta']['max_mt'] = 0.2
for j,s in enumerate(sample_list):
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, xscale='linear', yscale='linear',
xlabel='MT frac.', ylabel='no. cells')
D[s]['mito_frac'] = np.sum(D[s]['E'][:,mt_ix], axis=1).A[:,0] / np.sum(D[s]['E'], axis=1,dtype=float).A[:,0]
ax.hist(D[s]['mito_frac'], cumulative=False,
bins=np.linspace(0, 1, 30))
ax.plot([D[s]['meta']['max_mt'],D[s]['meta']['max_mt']],ax.get_ylim());
title(s)
print D[s]['E'].shape[0], np.sum(D[s]['mito_frac'] <= D[s]['meta']['max_mt'])
# Actually filter out mito-high cells
D_filt=D
for s in sample_list:
print '--- %s ---' %s
print 'Pre-filter: %i barcodes' %D[s]['E'].shape[0]
tmpfilt = np.nonzero(D[s]['mito_frac'] <= D[s]['meta']['max_mt'])[0]
D[s] = filter_dict(D[s], tmpfilt)
print 'Post-filter: %i barcodes' %D[s]['E'].shape[0]
Merge data, normalize
# create master dataset (all SPRING subsets will refer back to this)
samp_lookup = {}
samp_id_flat = np.array([],dtype=str)
for s in D.keys():
samp_id_flat = np.append(samp_id_flat, [s] * D[s]['E'].shape[0])
E = scipy.sparse.lil_matrix((len(samp_id_flat), len(gene_list)), dtype=int)
total_counts = np.zeros(len(samp_id_flat), dtype=int)
mito_frac = np.zeros(len(samp_id_flat), dtype=float)
genes_detected = np.zeros(len(samp_id_flat), dtype=int)
for s in D.keys():
print s
E[samp_id_flat == s, :] = D[s]['E']
total_counts[samp_id_flat == s] = D[s]['total_counts']
mito_frac[samp_id_flat == s] = D[s]['mito_frac']
genes_detected[samp_id_flat == s] = D[s]['genes_detected']
E = E.tocsc()
E_full=E
# remove genes that are not expressed by any cells
# optionally remove mito and rps genes
gene_list = D[sample_list[0]]['meta']['gene_list']
gene_id = D[sample_list[0]]['meta']['gene_id']
remove_crapgenes = 1
if remove_crapgenes:
import re
mt_ix = [i for i,g in enumerate(gene_list) if g.startswith('MT-')]
rp_ix = [i for i,g in enumerate(gene_list) if g.startswith('RP1')
or g.startswith('RP2') or g.startswith('RP3')
or g.startswith('RP4') or g.startswith('RP5')
or g.startswith('RP6') or g.startswith('RP7')
or g.startswith('RP8') or g.startswith('RP9')
or g.startswith('RPL') or g.startswith('RPS')
]
#or g.startswith('RP4','RP5','RP6') or g.startswith('RP7','RP8','RP9')]
keep_genes = (E.sum(0) > 0).A.squeeze() #* rp_ix * mt_ix
keep_genes[rp_ix] = 0
keep_genes[mt_ix] = 0
print sum(keep_genes), '/', len(keep_genes)
else:
keep_genes = (E.sum(0) > 0).A.squeeze()
print sum(keep_genes), '/', len(keep_genes)
# normalize by total counts
E = E[:,keep_genes]
E = tot_counts_norm_sparse(E)[0]
print shape(E)
gene_list = gene_list[keep_genes]
gene_id = gene_id[keep_genes]
merged_list=[x+"_"+gene_id[i] for i,x in enumerate(gene_list)]
Save base directory files
main_spring_dir=os.getcwd()
if not os.path.exists(main_spring_dir):
os.makedirs(main_spring_dir)
# Option to save the barcode information for later use
BCR = 1
if BCR:
barcode_dir = main_spring_dir + "/barcodes/"
if not os.path.exists(barcode_dir):
os.makedirs(barcode_dir)
for s in sample_list:
print '--- %s ---' %s
np.savetxt(barcode_dir + s, D[s]['cell_index'], fmt='%s')
# option to save mitochondrial percentages for later use
MT = 1
if MT:
MT_dir = main_spring_dir + "/mito/"
if not os.path.exists(MT_dir):
os.makedirs(MT_dir)
for s in sample_list:
print '--- %s ---' %s
np.savetxt(MT_dir + s, D[s]['mito_frac'], fmt='%s')
gene_list_new=np.array(merged_list)
np.savetxt(main_spring_dir + '/genes.txt', merged_list, fmt='%s')
# save master expression matrix in hdf5 format
import h5py
print 'Saving hdf5 file for fast gene loading...'
E = E.tocsc()
hf = h5py.File(main_spring_dir + '/counts_norm_sparse_genes.hdf5', 'w')
counts_group = hf.create_group('counts')
cix_group = hf.create_group('cell_ix')
hf.attrs['ncells'] = E.shape[0]
hf.attrs['ngenes'] = E.shape[1]
for iG, g in enumerate(merged_list):
if iG % 3000 == 0:
print g, iG, '/', len(merged_list)
counts = E[:,iG].A.squeeze()
cell_ix = np.nonzero(counts)[0]
counts = counts[cell_ix]
counts_group.create_dataset(g, data = counts)
cix_group.create_dataset(g, data = cell_ix)
hf.close()
##############
print 'Saving hdf5 file for fast cell loading...'
E = E.tocsr()
hf = h5py.File(main_spring_dir + '/counts_norm_sparse_cells.hdf5', 'w')
counts_group = hf.create_group('counts')
gix_group = hf.create_group('gene_ix')
hf.attrs['ncells'] = E.shape[0]
hf.attrs['ngenes'] = E.shape[1]
for iC in range(E.shape[0]):
if iC % 3000 == 0:
print iC, '/', E.shape[0]
counts = E[iC,:].A.squeeze()
gene_ix = np.nonzero(counts)[0]
counts = counts[gene_ix]
counts_group.create_dataset(str(iC), data = counts)
gix_group.create_dataset(str(iC), data = gene_ix)
hf.close()
寫入矩陣
scipy.io.mmwrite(main_spring_dir + '/matrix.mtx', E)
Save merged samples
merge_setup = {'/Dataset_v1':sample_list}
for s, smerge in merge_setup.items():
print smerge
t0 = time.time()
print t0
print '________________', s
cell_ix = np.in1d(samp_id_flat, smerge)
run_all_spring_1_6(E[cell_ix,:],
list(merged_list), s, main_spring_dir, normalize=False, tot_counts_final = total_counts[cell_ix],
min_counts = 3, min_cells = 3, min_vscore_pctl = 90,
show_vscore_plot = True, num_pc = 60, pca_method = '', k_neigh=4, use_approxnn = False,
output_spring = True, num_force_iter = 2000,
cell_groupings = {'Sample': list(samp_id_flat[cell_ix])})
print time.time() - t0
np.save(main_spring_dir + s + '/cell_filter.npy', np.nonzero(cell_ix)[0])
np.savetxt(main_spring_dir + s + '/cell_filter.txt', np.nonzero(cell_ix)[0], fmt='%i')
print time.time() - t0
SignacX包進(jìn)行分群和標(biāo)簽生成
#install.packages('SignacX')
library(SignacX)
# dir is the subdirectory generated by the Jupyter notebook; it is the directory that contains the 'categorical_coloring_data.json' file.
dir = "./Dataset_v1/"
# load the expression data
E = CID.LoadData(dir)
# generate cellular phenotype labels
labels = Signac(E, spring.dir = dir, num.cores = 14)
# alternatively, if you're in a hurry, use:
# labels = SignacFast(E, spring.dir = dir, num.cores = 4)
celltypes = GenerateLabels(labels, E = E, spring.dir = dir)
# write cell types and Louvain clusters to SPRING
dat <- CID.writeJSON(celltypes, spring.dir = dir)
添加自己的標(biāo)簽(略)
網(wǎng)頁可視化
python -m SimpleHTTPServer 8080 &