scanpy 是一個(gè)用于分析單細(xì)胞轉(zhuǎn)錄組(single cell rna sequencing)數(shù)據(jù)的python庫(kù)治专,文章2018發(fā)表在Genome Biology沼撕。其實(shí)它的許多分析思路借鑒了以seurat為中心的R語(yǔ)言單細(xì)胞轉(zhuǎn)錄數(shù)據(jù)分析生態(tài)的儡嘶,scanpy以一己之力在python生態(tài)構(gòu)建了單細(xì)胞轉(zhuǎn)錄組數(shù)據(jù)分析框架喇聊。我相信借助python的工業(yè)應(yīng)用實(shí)力,其擴(kuò)展性大于R語(yǔ)言分析工具蹦狂。當(dāng)然誓篱,選擇走一遍scanpy的原因,不是因?yàn)樗膹?qiáng)大凯楔,只是因?yàn)橄矚g窜骄。
在我搬這塊磚之前,中文世界關(guān)于scanpy的介紹已經(jīng)很多了摆屯,這當(dāng)然是好事邻遏。不知道誰(shuí)會(huì)以怎樣的方式遇見(jiàn)誰(shuí),所以虐骑,還是讓我們開(kāi)始吧准验。
所做的第一步就是配置好python環(huán)境,我建議是用conda來(lái)構(gòu)建廷没,這樣軟件管理起來(lái)很方便糊饱。然后是安裝scanpy這個(gè)庫(kù),當(dāng)然可能會(huì)遇到一些問(wèn)題颠黎,但是花點(diǎn)時(shí)間總是可以Google掉的另锋。在Windows、mac狭归、linux平臺(tái)scanpy都是可以運(yùn)行的夭坪。
在學(xué)習(xí)新的庫(kù)時(shí),文檔是不可不看的过椎。有統(tǒng)計(jì)表明室梅,程序員讀代碼的時(shí)間一般三倍于寫(xiě)代碼的時(shí)間。所以這基本上是一次閱讀體驗(yàn)。我們假裝可愛(ài)的讀者朋友竞惋,已經(jīng)配置好scanpy的環(huán)境柜去,也許花了兩三天的時(shí)間。會(huì)不會(huì)python沒(méi)關(guān)系拆宛,英語(yǔ)不好沒(méi)關(guān)系嗓奢,安裝個(gè)某道詞典就可以了。
三天后浑厚。股耽。。
我們打開(kāi)scanpy的教程官網(wǎng):https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html 钳幅,對(duì)物蝙,就是這么直白。開(kāi)始在我們的編輯器里復(fù)制粘貼教程代碼敢艰,并嘗試運(yùn)行诬乞,并理解結(jié)果。
首先我們導(dǎo)入需要的庫(kù)钠导,讀入10X標(biāo)準(zhǔn)的數(shù)據(jù)filtered_feature_bc_matrix震嫉,當(dāng)然也可以是h5格式的文件,或者一個(gè)二維的表達(dá)譜牡属。
讀入數(shù)據(jù)
# -*- coding: utf-8 -*-
"""
Spyder Editor
https://scanpy.readthedocs.io/en/stable/
This is a temporary script file.
"""
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.settings.set_figure_params(dpi=80)
sc.logging.print_versions()
scanpy==1.4.5.1
anndata==0.7.1
umap==0.3.10
numpy==1.16.5
scipy==1.3.1
pandas==0.25.1
scikit-learn==0.21.3
statsmodels==0.10.1
python-igraph==0.8.0
不用擔(dān)心這些庫(kù)票堵,我們都會(huì)慢慢熟悉的,這里python-igraph要注意一下逮栅,不是igraph悴势。這些庫(kù)在《利用python進(jìn)行數(shù)據(jù)分析》這本書(shū)里面都有介紹,特別是pandas和numpy措伐,以后的日子里我們會(huì)看到它們幾乎是構(gòu)成scanpy數(shù)據(jù)結(jié)構(gòu)的核心特纤。
results_file = 'E:\learnscanpy\write\pbmc3k.h5ad'
help(sc.read_10x_mtx)
adata = sc.read_10x_mtx(
'E:/learnscanpy/data/filtered_feature_bc_matrix', # the directory with the `.mtx` file
var_names='gene_symbols', # use gene symbols for the variable names (variables-axis index)
cache=True)
adata.var_names_make_unique() # this is unnecessary if using `var_names='gene_ids'` in `sc.read_10x_mtx`
這樣就把數(shù)據(jù)讀進(jìn)來(lái)了,他的名字叫做adata废士,那么這是個(gè)什么東西呢叫潦?我認(rèn)為搞清楚這個(gè)基本上學(xué)會(huì)一半scanpy了。
adata
Out[21]:
AnnData object with n_obs × n_vars = 5025 × 33694
var: 'gene_ids', 'feature_types'
adata.var["gene_ids"]
Out[22]:
RP11-34P13.3 ENSG00000243485
FAM138A ENSG00000237613
OR4F5 ENSG00000186092
RP11-34P13.7 ENSG00000238009
RP11-34P13.8 ENSG00000239945
AC233755.2 ENSG00000277856
AC233755.1 ENSG00000275063
AC240274.1 ENSG00000271254
AC213203.1 ENSG00000277475
FAM231B ENSG00000268674
Name: gene_ids, Length: 33694, dtype: object
adata.var["feature_types"]
Out[23]:
RP11-34P13.3 Gene Expression
FAM138A Gene Expression
OR4F5 Gene Expression
RP11-34P13.7 Gene Expression
RP11-34P13.8 Gene Expression
AC233755.2 Gene Expression
AC233755.1 Gene Expression
AC240274.1 Gene Expression
AC213203.1 Gene Expression
FAM231B Gene Expression
Name: feature_types, Length: 33694, dtype: object
adata.to_df()
Out[24]:
RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
AAACCCAAGCGTATGG-1 0.0 0.0 ... 0.0 0.0
AAACCCAGTCCTACAA-1 0.0 0.0 ... 0.0 0.0
AAACCCATCACCTCAC-1 0.0 0.0 ... 0.0 0.0
AAACGCTAGGGCATGT-1 0.0 0.0 ... 0.0 0.0
AAACGCTGTAGGTACG-1 0.0 0.0 ... 0.0 0.0
... ... ... ... ...
TTTGTTGCAGGTACGA-1 0.0 0.0 ... 0.0 0.0
TTTGTTGCAGTCTCTC-1 0.0 0.0 ... 0.0 0.0
TTTGTTGGTAATTAGG-1 0.0 0.0 ... 0.0 0.0
TTTGTTGTCCTTGGAA-1 0.0 0.0 ... 0.0 0.0
TTTGTTGTCGCACGAC-1 0.0 0.0 ... 0.0 0.0
[5025 rows x 33694 columns]
第一句AnnData object with n_obs × n_vars = 5025 × 33694 ,告訴我們adata是一個(gè)AnnData 對(duì)象,就像seurat也是一個(gè)對(duì)象一樣丁存。什么叫對(duì)象呢椰棘?對(duì)象就是一個(gè)實(shí)體、物體,它是一種存在而不是一種動(dòng)作。當(dāng)然,我們可以對(duì)它做一些操作卿操,一個(gè)對(duì)象可以通過(guò)具體的屬性為人們感知警检。
具體地,anndata是這樣一種構(gòu)象:
單細(xì)胞轉(zhuǎn)錄組的核心就是一個(gè)cell X gene的二維表害淤,但是分群后要存儲(chǔ)cell的分群結(jié)果扇雕,特征選擇是要記錄gene的信息,降維后要存儲(chǔ)降維后的結(jié)果窥摄。所以镶奉,這張表.X的對(duì)象cell相關(guān)的信息記錄在.obs中,屬性gene的信息記錄在.var中崭放,其他的信息在.uns中哨苛。那么每一部分是什么呢?
type(adata.var["gene_ids"])
Out[205]: pandas.core.series.Series
哦币砂,原來(lái)是pandas的Series建峭,下面是這個(gè)數(shù)據(jù)結(jié)構(gòu)的詳細(xì)介紹,這個(gè)數(shù)據(jù)結(jié)構(gòu)能存儲(chǔ)的信息一點(diǎn)也不亞于seurat啊决摧。
Type: AnnData
String form:
AnnData object with n_obs × n_vars = 5025 × 33694
var: 'gene_ids', 'feature_types'
Length: 5025
File: d:\program files (x86)\anconda\lib\site-packages\anndata\_core\anndata.py
Docstring:
An annotated data matrix.
:class:`~anndata.AnnData` stores a data matrix :attr:`X` together with annotations
of observations :attr:`obs` (:attr:`obsm`, :attr:`obsp`),
variables :attr:`var` (:attr:`varm`, :attr:`varp`),
and unstructured annotations :attr:`uns`.
.. figure:: https://falexwolf.de/img/scanpy/anndata.svg
:width: 350px
An :class:`~anndata.AnnData` object `adata` can be sliced like a
:class:`~pandas.DataFrame`,
for instance `adata_subset = adata[:, list_of_variable_names]`.
:class:`~anndata.AnnData`’s basic structure is similar to R’s ExpressionSet
[Huber15]_. If setting an `.h5ad`-formatted HDF5 backing file `.filename`,
data remains on the disk but is automatically loaded into memory if needed.
See this `blog post`_ for more details.
.. _blog post: http://falexwolf.de/blog/171223_AnnData_indexing_views_HDF5-backing/
Parameters
----------
X
A #observations × #variables data matrix. A view of the data is used if the
data type matches, otherwise, a copy is made.
obs
Key-indexed one-dimensional observations annotation of length #observations.
var
Key-indexed one-dimensional variables annotation of length #variables.
uns
Key-indexed unstructured annotation.
obsm
Key-indexed multi-dimensional observations annotation of length #observations.
If passing a :class:`~numpy.ndarray`, it needs to have a structured datatype.
varm
Key-indexed multi-dimensional variables annotation of length #variables.
If passing a :class:`~numpy.ndarray`, it needs to have a structured datatype.
layers
Key-indexed multi-dimensional arrays aligned to dimensions of `X`.
dtype
Data type used for storage.
shape
Shape tuple (#observations, #variables). Can only be provided if `X` is `None`.
filename
Name of backing file. See :class:`File`.
filemode
Open mode of backing file. See :class:`File`.
See Also
--------
read_h5ad
read_csv
read_excel
read_hdf
read_loom
read_zarr
read_mtx
read_text
read_umi_tools
Notes
-----
:class:`~anndata.AnnData` stores observations (samples) of variables/features
in the rows of a matrix.
This is the convention of the modern classics of statistics [Hastie09]_
and machine learning [Murphy12]_,
the convention of dataframes both in R and Python and the established statistics
and machine learning packages in Python (statsmodels_, scikit-learn_).
Single dimensional annotations of the observation and variables are stored
in the :attr:`obs` and :attr:`var` attributes as :class:`~pandas.DataFrame`\ s.
This is intended for metrics calculated over their axes.
Multi-dimensional annotations are stored in :attr:`obsm` and :attr:`varm`,
which are aligned to the objects observation and variable dimensions respectively.
Square matrices representing graphs are stored in :attr:`obsp` and :attr:`varp`,
with both of their own dimensions aligned to their associated axis.
Additional measurements across both observations and variables are stored in
:attr:`layers`.
Indexing into an AnnData object can be performed by relative position
with numeric indices (like pandas’ :attr:`~pandas.DataFrame.iloc`),
or by labels (like :attr:`~pandas.DataFrame.loc`).
To avoid ambiguity with numeric indexing into observations or variables,
indexes of the AnnData object are converted to strings by the constructor.
Subsetting an AnnData object by indexing into it will also subset its elements
according to the dimensions they were aligned to.
This means an operation like `adata[list_of_obs, :]` will also subset :attr:`obs`,
:attr:`obsm`, and :attr:`layers`.
Subsetting an AnnData object returns a view into the original object,
meaning very little additional memory is used upon subsetting.
This is achieved lazily, meaning that the constituent arrays are subset on access.
Copying a view causes an equivalent “real” AnnData object to be generated.
Attempting to modify a view (at any attribute except X) is handled
in a copy-on-modify manner, meaning the object is initialized in place.
Here’s an example::
batch1 = adata[adata.obs["batch"] == "batch1", :]
batch1.obs["value"] = 0 # This makes batch1 a “real” AnnData object
At the end of this snippet: `adata` was not modified,
and `batch1` is its own AnnData object with its own data.
Similar to Bioconductor’s `ExpressionSet` and :mod:`scipy.sparse` matrices,
subsetting an AnnData object retains the dimensionality of its constituent arrays.
Therefore, unlike with the classes exposed by :mod:`pandas`, :mod:`numpy`,
and `xarray`, there is no concept of a one dimensional AnnData object.
AnnDatas always have two inherent dimensions, :attr:`obs` and :attr:`var`.
Additionally, maintaining the dimensionality of the AnnData object allows for
consistent handling of :mod:`scipy.sparse` matrices and :mod:`numpy` arrays.
.. _statsmodels: http://www.statsmodels.org/stable/index.html
.. _scikit-learn: http://scikit-learn.org/
繪制高表達(dá)的基因:
sc.pl.highest_expr_genes(adata, n_top=20)
本著本質(zhì)的好奇心亿蒸,我們不禁要問(wèn)一句:這張圖是用什么畫(huà)的?一定要養(yǎng)成閱讀文檔的習(xí)慣:
help(sc.pl.highest_expr_genes)
Help on function highest_expr_genes in module scanpy.plotting._qc:
highest_expr_genes(adata: anndata._core.anndata.AnnData, n_top: int = 30, show: Union[bool, NoneType] = None, save: Union[str, bool, NoneType] = None, ax: Union[matplotlib.axes._axes.Axes, NoneType] = None, gene_symbols: Union[str, NoneType] = None, log: bool = False, **kwds)
Fraction of counts assigned to each gene over all cells.
Computes, for each gene, the fraction of counts assigned to that gene within
a cell. The `n_top` genes with the highest mean fraction over all cells are
plotted as boxplots.
This plot is similar to the `scater` package function `plotHighestExprs(type
= "highest-expression")`, see `here
<https://bioconductor.org/packages/devel/bioc/vignettes/scater/inst/doc/vignette-qc.html>`__. Quoting
from there:
*We expect to see the “usual suspects”, i.e., mitochondrial genes, actin,
ribosomal protein, MALAT1. A few spike-in transcripts may also be
present here, though if all of the spike-ins are in the top 50, it
suggests that too much spike-in RNA was added. A large number of
pseudo-genes or predicted genes may indicate problems with alignment.*
-- Davis McCarthy and Aaron Lun
Parameters
----------
adata
Annotated data matrix.
n_top
Number of top
show
Show the plot, do not return axis.
save
If `True` or a `str`, save the figure.
A string is appended to the default filename.
Infer the filetype if ending on {`'.pdf'`, `'.png'`, `'.svg'`}.
ax
A matplotlib axes object. Only works if plotting a single component.
gene_symbols
Key for field in .var that stores gene symbols if you do not want to use .var_names.
log
Plot x-axis in log scale
**kwds
Are passed to :func:`~seaborn.boxplot`.
Returns
-------
If `show==False` a :class:`~matplotlib.axes.Axes`.
可以看出這個(gè)功能借鑒了知名R包scater的一個(gè)函數(shù):plotHighestExprs蜜徽,繪圖用的是python里面兩個(gè)響當(dāng)當(dāng)?shù)膕eaborn和matplotlib庫(kù)祝懂。那我們不禁要自己微調(diào)一下了。
current_palette = sns.color_palette("dark")
sns.palplot(current_palette)
sc.pl.highest_expr_genes(adata, n_top=20,palette="dark" )
normalizing counts per cell
finished (0:00:00)
借助seaborn的繪圖系統(tǒng)圖我們還可以這樣設(shè)置顏色:
with sns.color_palette("PuBuGn_d"):
sc.pl.highest_expr_genes(adata, n_top=20, )
help(sns.set_style)
sns.set_style("white")
sc.pl.highest_expr_genes(adata, n_top=20, )
cell QC
提到cell QC 拘鞋,我們需要明白在10的體系里面什么是cell。有的同學(xué)說(shuō)了矢门,我上機(jī)6000個(gè)細(xì)胞盆色,為什么cellranger檢出的細(xì)胞不是6000,有時(shí)候差的還不小祟剔。因?yàn)樵赾ellranger的邏輯里隔躲,cell不過(guò)是每個(gè)umi對(duì)barcode的支持程度。它說(shuō)的cell是基于序列比對(duì)的物延,盡管這個(gè)理應(yīng)和上機(jī)的cell保持一致宣旱,但是這種一致性是通過(guò)序列比對(duì)來(lái)保持的。其實(shí)我們說(shuō)的cell QC指的是對(duì)barcode的質(zhì)控叛薯,使留下的barcode能表征下游分析的cell浑吟。
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=0)
mito_genes = adata.var_names.str.startswith('MT-')
# for each cell compute fraction of counts in mito genes vs. all genes
# the `.A1` is only necessary as X is sparse (to transform to a dense array after summing)
adata.obs['percent_mito'] = np.sum(
adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1
# add the total counts per cell as observations-annotation to adata
adata.obs['n_counts'] = adata.X.sum(axis=1).A1
參數(shù)的意思就是字面的意思,有不明白的可以直接help相應(yīng)的函數(shù)耗溜,看它的說(shuō)明文檔组力。我基本上就是靠說(shuō)明文檔來(lái)學(xué)習(xí)的。下面讓我們看看過(guò)濾后的adata發(fā)生了哪些變化抖拴。
adata
Out[9]:
AnnData object with n_obs × n_vars = 4960 × 33694
obs: 'n_genes', 'percent_mito', 'n_counts'
var: 'gene_ids', 'feature_types', 'n_cells'
adata.obs['percent_mito']
Out[11]:
AAACCCAAGCGTATGG-1 0.106588
AAACCCAGTCCTACAA-1 0.056101
AAACCCATCACCTCAC-1 0.532847
AAACGCTAGGGCATGT-1 0.106913
AAACGCTGTAGGTACG-1 0.078654
TTTGTTGCAGGTACGA-1 0.071226
TTTGTTGCAGTCTCTC-1 0.055394
TTTGTTGGTAATTAGG-1 0.183441
TTTGTTGTCCTTGGAA-1 0.081037
TTTGTTGTCGCACGAC-1 0.070987
Name: percent_mito, Length: 4960, dtype: float32
adata.var['n_cells']
Out[12]:
RP11-34P13.3 0
FAM138A 0
OR4F5 0
RP11-34P13.7 43
RP11-34P13.8 0
..
AC233755.2 1
AC233755.1 1
AC240274.1 83
AC213203.1 0
FAM231B 0
Name: n_cells, Length: 33694, dtype: int32
細(xì)胞屬性obs中有了percent_mito燎字;有了基因?qū)傩缘膙ar。
像seurat一樣,我們看看cell三個(gè)屬性的小提琴圖:
sc.pl.violin(adata, ['n_genes', 'n_counts', 'percent_mito'],
jitter=0.4, multi_panel=True)
sc.pl.scatter(adata, x='n_counts', y='percent_mito')
sc.pl.scatter(adata, x='n_counts', y='n_genes')
顯然這是分別畫(huà)的兩張圖候衍,我們能不能把它們組合到一起呢笼蛛?顯然是可以的:
help(sns.FacetGrid)
adata.obs['percent_mito1'] = adata.obs['percent_mito']*10000
adata.obs.melt(id_vars=['n_counts','percent_mito'], var_name='myVariable', value_name='myValue')
import matplotlib.pyplot as plt
sns.set(style="ticks")
g = sns.FacetGrid(adata.obs.melt(id_vars=['n_counts','percent_mito'], var_name='myVariable', value_name='myValue'),col = 'myVariable')
g.map(plt.scatter,'n_counts',"myValue", alpha=.7)
g.add_legend();
有不少人問(wèn)這兩張圖怎么看。就拿n_counts與n_genes的散點(diǎn)圖(scatter)來(lái)說(shuō)吧:每個(gè)點(diǎn)代表一個(gè)細(xì)胞蛉鹿,斜率代表隨著count的增加gene的增加程度伐弹。count和gene一般呈現(xiàn)線性關(guān)系,斜率越大也就是較少的count就可以檢出較多的gene榨为,說(shuō)明這批細(xì)胞基因的豐度偏低(普遍貧窮)惨好;反之,斜率較小随闺,說(shuō)明這批細(xì)胞基因豐度較高(少數(shù)富有)日川。有的時(shí)候不是一條可擬合的線,或者是兩條可擬合的線矩乐,也反映出細(xì)胞的異質(zhì)性龄句。總之散罕,他就是一個(gè)散點(diǎn)圖分歇,描述的是兩個(gè)變量的關(guān)系。
根據(jù)統(tǒng)計(jì)的n_genes 和線粒體基因含量percent_mito 進(jìn)一步過(guò)濾:
adata = adata[adata.obs.n_genes < 2500, :]
adata = adata[adata.obs.percent_mito < 0.1, :]
過(guò)濾掉基因數(shù)大于2500欧漱,線粒體基因含量大于0.1的barcode职抡。這個(gè)過(guò)濾方法就像filter一樣,操作的是adata.obs這張表误甚。
adata.obs
Out[82]:
n_genes percent_mito n_counts percent_mito1
AAACGCTGTGTGATGG-1 2348 0.099821 6131.0 998.205811
AAACGCTTCTAGGCAT-1 2470 0.065564 8465.0 655.640869
AAAGAACCACCGTCTT-1 420 0.021944 638.0 219.435730
AAAGAACTCACCTGGG-1 2479 0.074067 10801.0 740.672119
AAAGAACTCGGAACTT-1 2318 0.079713 6837.0 797.133240
... ... ... ...
TTTGGTTTCCGGTAAT-1 2277 0.079324 9291.0 793.240723
TTTGTTGCAGGTACGA-1 1986 0.071226 8101.0 712.257751
TTTGTTGCAGTCTCTC-1 2485 0.055394 9604.0 553.935852
TTTGTTGTCCTTGGAA-1 1998 0.081037 5479.0 810.366882
TTTGTTGTCGCACGAC-1 2468 0.070987 7438.0 709.868225
[2223 rows x 4 columns]
特征提取
cellQC可以看做是對(duì)細(xì)胞的過(guò)濾缚甩,即保留可用的細(xì)胞。但是每個(gè)細(xì)胞又有成千的基因(細(xì)胞的特征)窑邦,所以一般需要做特征的選擇和提壬猛(也就是降維)。什么意思呢冈钦?就是比如每個(gè)人都有很多特征郊丛,年齡、性別瞧筛、身高厉熟、國(guó)籍、學(xué)歷驾窟、生日庆猫、愛(ài)好、血型绅络。月培。嘁字。我要對(duì)一百個(gè)人分類(lèi),所有的特征都拿來(lái)比較杉畜,可能得不到準(zhǔn)確的劃分纪蜒。而用幾個(gè)關(guān)鍵的特征可以分出某個(gè)國(guó)家某個(gè)年齡段的人。特征的選擇是選擇一部分特征此叠,比如選擇高變基因纯续;特征提取是提取主要的特征結(jié)構(gòu),如主成分分析灭袁。
在特征提取之前要保證細(xì)胞之間是有可比性的猬错,一般用的是歸一化的方法,得到高變基因之后茸歧,為了使同一個(gè)基因在不同細(xì)胞之間具有可比性采用標(biāo)準(zhǔn)化倦炒。
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata
adata
AnnData object with n_obs × n_vars = 2223 × 33694
obs: 'n_genes', 'percent_mito', 'n_counts', 'percent_mito1'
var: 'gene_ids', 'feature_types', 'n_cells'
uns: 'log1p'
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)
adata = adata[:, adata.var.highly_variable]
#時(shí)刻關(guān)注我們數(shù)據(jù)的變化
adata
Out[117]:
View of AnnData object with n_obs × n_vars = 2223 × 2208
obs: 'n_genes', 'percent_mito', 'n_counts', 'percent_mito1'
var: 'gene_ids', 'feature_types', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'log1p'
回歸每個(gè)細(xì)胞的總計(jì)數(shù)和線粒體基因表達(dá)百分比的影響,將數(shù)據(jù)縮放到單位方差软瞎。
sc.pp.regress_out(adata, ['n_counts', 'percent_mito'])
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver='arpack')
adata
Out[121]:
AnnData object with n_obs × n_vars = 2223 × 2208
obs: 'n_genes', 'percent_mito', 'n_counts', 'percent_mito1'
var: 'gene_ids', 'feature_types', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'log1p', 'pca'
obsm: 'X_pca'
varm: 'PCs
adata.obsm['X_pca']
adata.varm['PCs']
sc.pl.pca(adata, color='CST3')
sc.pl.pca_variance_ratio(adata, log=True)
adata.write(results_file)
特征選擇我們用高變基因逢唤,特征提取我們用pca。這里就有一個(gè)都的問(wèn)題:如何選擇高變基因以及pc的維度涤浇?
聚類(lèi)
scanpy提供leiden和louvain兩種圖聚類(lèi)算法鳖藕,圖聚類(lèi)在開(kāi)始之前要先找鄰居:
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
adata
Out[128]:
AnnData object with n_obs × n_vars = 2223 × 2208
obs: 'n_genes', 'percent_mito', 'n_counts', 'percent_mito1'
var: 'gene_ids', 'feature_types', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'log1p', 'pca', 'neighbors'
obsm: 'X_pca'
varm: 'PCs'
sc.tl.leiden(adata)
adata.obs['leiden']
Out[188]:
AAACGCTGTGTGATGG-1 10
AAACGCTTCTAGGCAT-1 6
AAAGAACCACCGTCTT-1 4
AAAGAACTCACCTGGG-1 2
AAAGAACTCGGAACTT-1 6
..
TTTGGTTTCCGGTAAT-1 3
TTTGTTGCAGGTACGA-1 1
TTTGTTGCAGTCTCTC-1 3
TTTGTTGTCCTTGGAA-1 7
TTTGTTGTCGCACGAC-1 3
Name: leiden, Length: 2223, dtype: category
Categories (12, object): [0, 1, 2, 3, ..., 8, 9, 10, 11]
#sc.tl.louvain(adata)
#sc.tl.paga(adata)
和Seurat等人一樣,scanpy推薦Traag *等人(2018)提出的Leiden graph-clustering方法(基于優(yōu)化模塊化的社區(qū)檢測(cè))只锭。注意著恩,Leiden集群直接對(duì)cell的鄰域圖進(jìn)行聚類(lèi),我們?cè)趕c.pp.neighbors已經(jīng)計(jì)算過(guò)了纹烹。
讀一下leiden的說(shuō)明文檔:
help(sc.tl.leiden)
Help on function leiden in module scanpy.tools._leiden:
leiden(adata: anndata._core.anndata.AnnData, resolution: float = 1, *, restrict_to: Union[Tuple[str, Sequence[str]], NoneType] = None, random_state: Union[int, mtrand.RandomState, NoneType] = 0, key_added: str = 'leiden', adjacency: Union[scipy.sparse.base.spmatrix, NoneType] = None, directed: bool = True, use_weights: bool = True, n_iterations: int = -1, partition_type: Union[Type[leidenalg.VertexPartition.MutableVertexPartition], NoneType] = None, copy: bool = False, **partition_kwargs) -> Union[anndata._core.anndata.AnnData, NoneType]
Cluster cells into subgroups [Traag18]_.
Cluster cells using the Leiden algorithm [Traag18]_,
an improved version of the Louvain algorithm [Blondel08]_.
It has been proposed for single-cell analysis by [Levine15]_.
This requires having ran :func:`~scanpy.pp.neighbors` or
:func:`~scanpy.external.pp.bbknn` first.
Parameters
----------
adata
The annotated data matrix.
resolution
A parameter value controlling the coarseness of the clustering.
Higher values lead to more clusters.
Set to `None` if overriding `partition_type`
to one that doesn’t accept a `resolution_parameter`.
random_state
Change the initialization of the optimization.
restrict_to
Restrict the clustering to the categories within the key for sample
annotation, tuple needs to contain `(obs_key, list_of_categories)`.
key_added
`adata.obs` key under which to add the cluster labels.
adjacency
Sparse adjacency matrix of the graph, defaults to
`adata.uns['neighbors']['connectivities']`.
directed
Whether to treat the graph as directed or undirected.
use_weights
If `True`, edge weights from the graph are used in the computation
(placing more emphasis on stronger edges).
n_iterations
How many iterations of the Leiden clustering algorithm to perform.
Positive values above 2 define the total number of iterations to perform,
-1 has the algorithm run until it reaches its optimal clustering.
partition_type
Type of partition to use.
Defaults to :class:`~leidenalg.RBConfigurationVertexPartition`.
For the available options, consult the documentation for
:func:`~leidenalg.find_partition`.
copy
Whether to copy `adata` or modify it inplace.
**partition_kwargs
Any further arguments to pass to `~leidenalg.find_partition`
(which in turn passes arguments to the `partition_type`).
Returns
-------
`adata.obs[key_added]`
Array of dim (number of samples) that stores the subgroup id
(`'0'`, `'1'`, ...) for each cell.
`adata.uns['leiden']['params']`
A dict with the values for the parameters `resolution`, `random_state`,
and `n_iterations`.
seurat的findclusters中聚類(lèi)的算法亦是有Leiden的页滚,關(guān)于leiden和louvain兩者的區(qū)別,可以看看From Louvain to Leiden: guaranteeing well-connected communities
algorithm
Algorithm for modularity optimization (
1 = original Louvain algorithm;
2 = Louvain algorithm with multilevel refinement;
3 = SLM algorithm;
4 = Leiden algorithm). Leiden requires the leidenalg python.
為了可視化聚類(lèi)的結(jié)果我們還是先做一下umap降維吧铺呵,然后看看分群結(jié)果。
sc.tl.umap(adata)
sc.pl.umap(adata, color=['leiden', 'CST3', 'NKG7'])
adata
Out[139]:
AnnData object with n_obs × n_vars = 2223 × 2208
obs: 'n_genes', 'percent_mito', 'n_counts', 'percent_mito1', 'leiden'
var: 'gene_ids', 'feature_types', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'log1p', 'pca', 'neighbors', 'leiden', 'paga', 'leiden_sizes', 'umap', 'leiden_colors'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
差異分析
分群后我們得到以數(shù)值為標(biāo)識(shí)的亞群編號(hào)隧熙,我們急于知道每一個(gè)數(shù)值背后的含義片挂,意義還得從基因上找。差異分析就是為了這目的的贞盯。
sc.tl.rank_genes_groups(adata, 'leiden', method='t-test')
ranking genes
D:\Program Files (x86)\anconda\lib\site-packages\scanpy\tools\_rank_genes_groups.py:252: RuntimeWarning: invalid value encountered in log2
rankings_gene_logfoldchanges.append(np.log2(foldchanges[global_indices]))
finished: added to `.uns['rank_genes_groups']`
'names', sorted np.recarray to be indexed by group ids
'scores', sorted np.recarray to be indexed by group ids
'logfoldchanges', sorted np.recarray to be indexed by group ids
'pvals', sorted np.recarray to be indexed by group ids
'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:00)
為了使我們的文章圖文并茂一些音念,來(lái)看看't-test'檢驗(yàn)的每個(gè)亞群的差異基因排序:
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
sc.settings.verbosity = 2 # reduce the verbosity
help(sc.tl.rank_genes_groups)
Help on function rank_genes_groups in module scanpy.tools._rank_genes_groups:
rank_genes_groups(adata: anndata._core.anndata.AnnData, groupby: str, use_raw: bool = True, groups: Union[scanpy._compat.Literal_, Iterable[str]] = 'all', reference: str = 'rest', n_genes: int = 100, rankby_abs: bool = False, key_added: Union[str, NoneType] = None, copy: bool = False, method: scanpy._compat.Literal_ = 't-test_overestim_var', corr_method: scanpy._compat.Literal_ = 'benjamini-hochberg', layer: Union[str, NoneType] = None, **kwds) -> Union[anndata._core.anndata.AnnData, NoneType]
Rank genes for characterizing groups.
Parameters
----------
adata
Annotated data matrix.
groupby
The key of the observations grouping to consider.
use_raw
Use `raw` attribute of `adata` if present.
layer
Key from `adata.layers` whose value will be used to perform tests on.
groups
Subset of groups, e.g. [`'g1'`, `'g2'`, `'g3'`], to which comparison
shall be restricted, or `'all'` (default), for all groups.
reference
If `'rest'`, compare each group to the union of the rest of the group.
If a group identifier, compare with respect to this group.
n_genes
The number of genes that appear in the returned tables.
method
The default 't-test_overestim_var' overestimates variance of each group,
`'t-test'` uses t-test, `'wilcoxon'` uses Wilcoxon rank-sum,
`'logreg'` uses logistic regression. See [Ntranos18]_,
`here <https://github.com/theislab/scanpy/issues/95>`__ and `here
<http://www.nxn.se/valent/2018/3/5/actionable-scrna-seq-clusters>`__,
for why this is meaningful.
corr_method
p-value correction method.
Used only for `'t-test'`, `'t-test_overestim_var'`, and `'wilcoxon'`.
rankby_abs
Rank genes by the absolute value of the score, not by the
score. The returned scores are never the absolute values.
key_added
The key in `adata.uns` information is saved to.
**kwds
Are passed to test methods. Currently this affects only parameters that
are passed to :class:`sklearn.linear_model.LogisticRegression`.
For instance, you can pass `penalty='l1'` to try to come up with a
minimal set of genes that are good predictors (sparse solution meaning
few non-zero fitted coefficients).
Returns
-------
**names** : structured `np.ndarray` (`.uns['rank_genes_groups']`)
Structured array to be indexed by group id storing the gene
names. Ordered according to scores.
**scores** : structured `np.ndarray` (`.uns['rank_genes_groups']`)
Structured array to be indexed by group id storing the z-score
underlying the computation of a p-value for each gene for each
group. Ordered according to scores.
**logfoldchanges** : structured `np.ndarray` (`.uns['rank_genes_groups']`)
Structured array to be indexed by group id storing the log2
fold change for each gene for each group. Ordered according to
scores. Only provided if method is 't-test' like.
Note: this is an approximation calculated from mean-log values.
**pvals** : structured `np.ndarray` (`.uns['rank_genes_groups']`)
p-values.
**pvals_adj** : structured `np.ndarray` (`.uns['rank_genes_groups']`)
Corrected p-values.
Notes
-----
There are slight inconsistencies depending on whether sparse
or dense data are passed. See `here <https://github.com/theislab/scanpy/blob/master/scanpy/tests/test_rank_genes_groups.py>`__.
Examples
--------
import scanpy as sc
adata = sc.datasets.pbmc68k_reduced()
sc.tl.rank_genes_groups(adata, 'bulk_labels', method='wilcoxon')
# to visualize the results
sc.pl.rank_genes_groups(adata)
scanpy差異分析的方法沒(méi)有seurat豐富了,除了t-test躏敢,還有wilcoxon和logreg闷愤。
查看差異分析的結(jié)果:
sc.get.rank_genes_groups_df(adata, group="0")
Out[148]:
scores names logfoldchanges pvals pvals_adj
0 15.179968 IL7R NaN 1.472705e-43 1.711438e-42
1 13.268551 RGS10 NaN 6.462721e-35 5.595956e-34
2 11.459023 LRRN3 NaN 9.208903e-27 5.465930e-26
3 10.671810 CD7 NaN 3.213348e-24 1.665510e-23
4 8.615908 INTS6 NaN 1.053249e-16 3.856673e-16
.. ... ... ... ... ...
95 1.755937 DUSP16 NaN 7.978160e-02 1.094148e-01
96 1.724632 GALM NaN 8.523017e-02 1.163811e-01
97 1.713933 KCNQ1OT1 NaN 8.719264e-02 1.188403e-01
98 1.706563 PASK NaN 8.851861e-02 1.202764e-01
99 1.696579 SUCO NaN 9.047560e-02 1.227089e-01
[100 rows x 5 columns]
指定哪兩組進(jìn)行差異分析:
sc.tl.rank_genes_groups(adata, 'leiden', groups=['0'], reference='1', method='wilcoxon')
sc.pl.rank_genes_groups(adata, groups=['0'], n_genes=20)
sc.get.rank_genes_groups_df(adata, group="0")
Out[169]:
scores names logfoldchanges pvals pvals_adj
0 19.284122 EFNB2 2.265986 7.301753e-83 4.030568e-81
1 19.205927 C11orf96 NaN 3.301754e-82 1.350050e-80
2 19.158936 NPDC1 -2.627818 8.152369e-82 2.686632e-80
3 18.702019 EGR4 NaN 4.766577e-78 6.190942e-77
4 18.405581 FBXL8 -0.640331 1.185097e-75 1.104091e-74
.. ... ... ... ... ...
95 11.469543 AP4S1 -3.958649 1.876464e-30 3.019849e-30
96 11.459700 FBXO33 3.057788 2.102432e-30 3.378580e-30
97 11.423295 LDLRAD4 0.052481 3.198732e-30 5.121683e-30
98 11.368316 H1F0 NaN 6.013658e-30 9.587116e-30
99 11.310737 ZCWPW1 0.455119 1.161100e-29 1.836468e-29
[100 rows x 5 columns]
差異基因的可視化
marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']
pd.DataFrame(adata.uns['rank_genes_groups']['names']).head(5)
result = adata.uns['rank_genes_groups']
groups = result['names'].dtype.names
pd.DataFrame(
{group + '_' + key[:1]: result[key][group]
for group in groups for key in ['names', 'pvals']}).head(5)
0_n 0_p
0 EFNB2 7.301753e-83
1 C11orf96 3.301754e-82
2 NPDC1 8.152369e-82
3 EGR4 4.766577e-78
4 FBXL8 1.185097e-75
sc.pl.violin(adata, ['CST3', 'NKG7', 'PPBP'], groupby='leiden')
new_cluster_names = [
'CD4 T', 'CD14 Monocytes',
'B', 'CD8 T',
'NK', 'FCGR3A Monocytes',
'Dendritic', 'Megakaryocytes',
'NewCluster1','NewCluster2','NewCluster3','NewCluster4' ]
adata.rename_categories('leiden', new_cluster_names)
\#Omitting rank_genes_groups/scores as old categories do not match.
\#Omitting rank_genes_groups/names as old categories do not match.
\#Omitting rank_genes_groups/logfoldchanges as old categories do not match.
\#Omitting rank_genes_groups/pvals as old categories do not match.
\#Omitting rank_genes_groups/pvals_adj as old categories do not match.
ax = sc.pl.dotplot(adata, marker_genes, groupby='leiden')
ax = sc.pl.stacked_violin(adata, marker_genes, groupby='leiden', rotation=90)
像seurat一樣,scanpy也有豐富的數(shù)據(jù)可視化的方式件余,太多以至于需要單獨(dú)再寫(xiě)另一篇文章讥脐,這里就不介紹了遭居。
保存分析結(jié)果以供下一次調(diào)用
adata.write(results_file, compression='gzip') # `compression='gzip'` saves disk space, but slows down writing and subsequent reading
adata.X = None
adata.write('E:\learnscanpy\write\pbmc3k_withoutX.h5ad', compression='gzip')
adata.obs[['n_counts', 'leiden']].to_csv(
'E:\learnscanpy\write\pbmc3k_corrected_lei_groups.csv')
\# Export single fields of the annotation of observations
\# adata.obs[['n_counts', 'louvain_groups']].to_csv(
\# './write/pbmc3k_corrected_louvain_groups.csv')
\# Export single columns of the multidimensional annotation
\# adata.obsm.to_df()[['X_pca1', 'X_pca2']].to_csv(
\# './write/pbmc3k_corrected_X_pca.csv')
\# Or export everything except the data using `.write_csvs`.
\# Set `skip_data=False` if you also want to export the data.
\# adata.write_csvs(results_file[:-5], )
今天,我們學(xué)習(xí)了scanpy的一般流程旬渠,我們發(fā)現(xiàn)不管工具如何變,單細(xì)胞轉(zhuǎn)錄組的數(shù)據(jù)分析的大框架是沒(méi)有變化的俱萍,幾個(gè)分析的工具也是相互借鑒的。但是python的就不值得一學(xué)了嗎告丢?
SCANPY: large-scale single-cell gene expression data analysis