單細(xì)胞轉(zhuǎn)錄組數(shù)據(jù)分析|| scanpy教程:預(yù)處理與聚類(lèi)

scanpy 是一個(gè)用于分析單細(xì)胞轉(zhuǎn)錄組(single cell rna sequencing)數(shù)據(jù)的python庫(kù)治专,文章2018發(fā)表在Genome Biology沼撕。其實(shí)它的許多分析思路借鑒了以seurat為中心的R語(yǔ)言單細(xì)胞轉(zhuǎn)錄數(shù)據(jù)分析生態(tài)的儡嘶,scanpy以一己之力在python生態(tài)構(gòu)建了單細(xì)胞轉(zhuǎn)錄組數(shù)據(jù)分析框架喇聊。我相信借助python的工業(yè)應(yīng)用實(shí)力,其擴(kuò)展性大于R語(yǔ)言分析工具蹦狂。當(dāng)然誓篱,選擇走一遍scanpy的原因,不是因?yàn)樗膹?qiáng)大凯楔,只是因?yàn)橄矚g窜骄。

在我搬這塊磚之前,中文世界關(guān)于scanpy的介紹已經(jīng)很多了摆屯,這當(dāng)然是好事邻遏。不知道誰(shuí)會(huì)以怎樣的方式遇見(jiàn)誰(shuí),所以虐骑,還是讓我們開(kāi)始吧准验。

所做的第一步就是配置好python環(huán)境,我建議是用conda來(lái)構(gòu)建廷没,這樣軟件管理起來(lái)很方便糊饱。然后是安裝scanpy這個(gè)庫(kù),當(dāng)然可能會(huì)遇到一些問(wèn)題颠黎,但是花點(diǎn)時(shí)間總是可以Google掉的另锋。在Windows、mac狭归、linux平臺(tái)scanpy都是可以運(yùn)行的夭坪。

在學(xué)習(xí)新的庫(kù)時(shí),文檔是不可不看的过椎。有統(tǒng)計(jì)表明室梅,程序員讀代碼的時(shí)間一般三倍于寫(xiě)代碼的時(shí)間。所以這基本上是一次閱讀體驗(yàn)。我們假裝可愛(ài)的讀者朋友竞惋,已經(jīng)配置好scanpy的環(huán)境柜去,也許花了兩三天的時(shí)間。會(huì)不會(huì)python沒(méi)關(guān)系拆宛,英語(yǔ)不好沒(méi)關(guān)系嗓奢,安裝個(gè)某道詞典就可以了。

三天后浑厚。股耽。。

我們打開(kāi)scanpy的教程官網(wǎng):https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html 钳幅,對(duì)物蝙,就是這么直白。開(kāi)始在我們的編輯器里復(fù)制粘貼教程代碼敢艰,并嘗試運(yùn)行诬乞,并理解結(jié)果。

首先我們導(dǎo)入需要的庫(kù)钠导,讀入10X標(biāo)準(zhǔn)的數(shù)據(jù)filtered_feature_bc_matrix震嫉,當(dāng)然也可以是h5格式的文件,或者一個(gè)二維的表達(dá)譜牡属。

讀入數(shù)據(jù)
# -*- coding: utf-8 -*-
"""
Spyder Editor
https://scanpy.readthedocs.io/en/stable/
This is a temporary script file.
"""
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns 

sc.settings.verbosity = 3             # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.settings.set_figure_params(dpi=80)
sc.logging.print_versions()
scanpy==1.4.5.1 
anndata==0.7.1 
umap==0.3.10 
numpy==1.16.5 
scipy==1.3.1 
pandas==0.25.1 
scikit-learn==0.21.3 
statsmodels==0.10.1 
python-igraph==0.8.0

不用擔(dān)心這些庫(kù)票堵,我們都會(huì)慢慢熟悉的,這里python-igraph要注意一下逮栅,不是igraph悴势。這些庫(kù)在《利用python進(jìn)行數(shù)據(jù)分析》這本書(shū)里面都有介紹,特別是pandas和numpy措伐,以后的日子里我們會(huì)看到它們幾乎是構(gòu)成scanpy數(shù)據(jù)結(jié)構(gòu)的核心特纤。

results_file = 'E:\learnscanpy\write\pbmc3k.h5ad' 
help(sc.read_10x_mtx)
adata = sc.read_10x_mtx(
    'E:/learnscanpy/data/filtered_feature_bc_matrix',  # the directory with the `.mtx` file
    var_names='gene_symbols',                  # use gene symbols for the variable names (variables-axis index)
    cache=True) 
adata.var_names_make_unique()  # this is unnecessary if using `var_names='gene_ids'` in `sc.read_10x_mtx`

這樣就把數(shù)據(jù)讀進(jìn)來(lái)了,他的名字叫做adata废士,那么這是個(gè)什么東西呢叫潦?我認(rèn)為搞清楚這個(gè)基本上學(xué)會(huì)一半scanpy了。

adata
Out[21]: 
AnnData object with n_obs × n_vars = 5025 × 33694 
    var: 'gene_ids', 'feature_types'

adata.var["gene_ids"]
Out[22]: 
RP11-34P13.3    ENSG00000243485
FAM138A         ENSG00000237613
OR4F5           ENSG00000186092
RP11-34P13.7    ENSG00000238009
RP11-34P13.8    ENSG00000239945
      
AC233755.2      ENSG00000277856
AC233755.1      ENSG00000275063
AC240274.1      ENSG00000271254
AC213203.1      ENSG00000277475
FAM231B         ENSG00000268674
Name: gene_ids, Length: 33694, dtype: object

adata.var["feature_types"]
Out[23]: 
RP11-34P13.3    Gene Expression
FAM138A         Gene Expression
OR4F5           Gene Expression
RP11-34P13.7    Gene Expression
RP11-34P13.8    Gene Expression
      
AC233755.2      Gene Expression
AC233755.1      Gene Expression
AC240274.1      Gene Expression
AC213203.1      Gene Expression
FAM231B         Gene Expression
Name: feature_types, Length: 33694, dtype: object

adata.to_df()
Out[24]: 
                    RP11-34P13.3  FAM138A  ...  AC213203.1  FAM231B
AAACCCAAGCGTATGG-1           0.0      0.0  ...         0.0      0.0
AAACCCAGTCCTACAA-1           0.0      0.0  ...         0.0      0.0
AAACCCATCACCTCAC-1           0.0      0.0  ...         0.0      0.0
AAACGCTAGGGCATGT-1           0.0      0.0  ...         0.0      0.0
AAACGCTGTAGGTACG-1           0.0      0.0  ...         0.0      0.0
                         ...      ...  ...         ...      ...
TTTGTTGCAGGTACGA-1           0.0      0.0  ...         0.0      0.0
TTTGTTGCAGTCTCTC-1           0.0      0.0  ...         0.0      0.0
TTTGTTGGTAATTAGG-1           0.0      0.0  ...         0.0      0.0
TTTGTTGTCCTTGGAA-1           0.0      0.0  ...         0.0      0.0
TTTGTTGTCGCACGAC-1           0.0      0.0  ...         0.0      0.0

[5025 rows x 33694 columns]

第一句AnnData object with n_obs × n_vars = 5025 × 33694 ,告訴我們adata是一個(gè)AnnData 對(duì)象,就像seurat也是一個(gè)對(duì)象一樣丁存。什么叫對(duì)象呢椰棘?對(duì)象就是一個(gè)實(shí)體、物體,它是一種存在而不是一種動(dòng)作。當(dāng)然,我們可以對(duì)它做一些操作卿操,一個(gè)對(duì)象可以通過(guò)具體的屬性為人們感知警检。

具體地,anndata是這樣一種構(gòu)象:

單細(xì)胞轉(zhuǎn)錄組的核心就是一個(gè)cell X gene的二維表害淤,但是分群后要存儲(chǔ)cell的分群結(jié)果扇雕,特征選擇是要記錄gene的信息,降維后要存儲(chǔ)降維后的結(jié)果窥摄。所以镶奉,這張表.X的對(duì)象cell相關(guān)的信息記錄在.obs中,屬性gene的信息記錄在.var中崭放,其他的信息在.uns中哨苛。那么每一部分是什么呢?

type(adata.var["gene_ids"])
Out[205]: pandas.core.series.Series

哦币砂,原來(lái)是pandas的Series建峭,下面是這個(gè)數(shù)據(jù)結(jié)構(gòu)的詳細(xì)介紹,這個(gè)數(shù)據(jù)結(jié)構(gòu)能存儲(chǔ)的信息一點(diǎn)也不亞于seurat啊决摧。

Type:        AnnData
String form:
AnnData object with n_obs × n_vars = 5025 × 33694 
    var: 'gene_ids', 'feature_types'
Length:      5025
File:        d:\program files (x86)\anconda\lib\site-packages\anndata\_core\anndata.py
Docstring:  
An annotated data matrix.

:class:`~anndata.AnnData` stores a data matrix :attr:`X` together with annotations
of observations :attr:`obs` (:attr:`obsm`, :attr:`obsp`),
variables :attr:`var` (:attr:`varm`, :attr:`varp`),
and unstructured annotations :attr:`uns`.

.. figure:: https://falexwolf.de/img/scanpy/anndata.svg
   :width: 350px

An :class:`~anndata.AnnData` object `adata` can be sliced like a
:class:`~pandas.DataFrame`,
for instance `adata_subset = adata[:, list_of_variable_names]`.
:class:`~anndata.AnnData`’s basic structure is similar to R’s ExpressionSet
[Huber15]_. If setting an `.h5ad`-formatted HDF5 backing file `.filename`,
data remains on the disk but is automatically loaded into memory if needed.
See this `blog post`_ for more details.

.. _blog post: http://falexwolf.de/blog/171223_AnnData_indexing_views_HDF5-backing/

Parameters
----------
X
    A #observations × #variables data matrix. A view of the data is used if the
    data type matches, otherwise, a copy is made.
obs
    Key-indexed one-dimensional observations annotation of length #observations.
var
    Key-indexed one-dimensional variables annotation of length #variables.
uns
    Key-indexed unstructured annotation.
obsm
    Key-indexed multi-dimensional observations annotation of length #observations.
    If passing a :class:`~numpy.ndarray`, it needs to have a structured datatype.
varm
    Key-indexed multi-dimensional variables annotation of length #variables.
    If passing a :class:`~numpy.ndarray`, it needs to have a structured datatype.
layers
    Key-indexed multi-dimensional arrays aligned to dimensions of `X`.
dtype
    Data type used for storage.
shape
    Shape tuple (#observations, #variables). Can only be provided if `X` is `None`.
filename
    Name of backing file. See :class:`File`.
filemode
    Open mode of backing file. See :class:`File`.

See Also
--------
read_h5ad
read_csv
read_excel
read_hdf
read_loom
read_zarr
read_mtx
read_text
read_umi_tools

Notes
-----
:class:`~anndata.AnnData` stores observations (samples) of variables/features
in the rows of a matrix.
This is the convention of the modern classics of statistics [Hastie09]_
and machine learning [Murphy12]_,
the convention of dataframes both in R and Python and the established statistics
and machine learning packages in Python (statsmodels_, scikit-learn_).

Single dimensional annotations of the observation and variables are stored
in the :attr:`obs` and :attr:`var` attributes as :class:`~pandas.DataFrame`\ s.
This is intended for metrics calculated over their axes.
Multi-dimensional annotations are stored in :attr:`obsm` and :attr:`varm`,
which are aligned to the objects observation and variable dimensions respectively.
Square matrices representing graphs are stored in :attr:`obsp` and :attr:`varp`,
with both of their own dimensions aligned to their associated axis.
Additional measurements across both observations and variables are stored in
:attr:`layers`.

Indexing into an AnnData object can be performed by relative position
with numeric indices (like pandas’ :attr:`~pandas.DataFrame.iloc`),
or by labels (like :attr:`~pandas.DataFrame.loc`).
To avoid ambiguity with numeric indexing into observations or variables,
indexes of the AnnData object are converted to strings by the constructor.

Subsetting an AnnData object by indexing into it will also subset its elements
according to the dimensions they were aligned to.
This means an operation like `adata[list_of_obs, :]` will also subset :attr:`obs`,
:attr:`obsm`, and :attr:`layers`.

Subsetting an AnnData object returns a view into the original object,
meaning very little additional memory is used upon subsetting.
This is achieved lazily, meaning that the constituent arrays are subset on access.
Copying a view causes an equivalent “real” AnnData object to be generated.
Attempting to modify a view (at any attribute except X) is handled
in a copy-on-modify manner, meaning the object is initialized in place.
Here’s an example::

    batch1 = adata[adata.obs["batch"] == "batch1", :]
    batch1.obs["value"] = 0  # This makes batch1 a “real” AnnData object

At the end of this snippet: `adata` was not modified,
and `batch1` is its own AnnData object with its own data.

Similar to Bioconductor’s `ExpressionSet` and :mod:`scipy.sparse` matrices,
subsetting an AnnData object retains the dimensionality of its constituent arrays.
Therefore, unlike with the classes exposed by :mod:`pandas`, :mod:`numpy`,
and `xarray`, there is no concept of a one dimensional AnnData object.
AnnDatas always have two inherent dimensions, :attr:`obs` and :attr:`var`.
Additionally, maintaining the dimensionality of the AnnData object allows for
consistent handling of :mod:`scipy.sparse` matrices and :mod:`numpy` arrays.

.. _statsmodels: http://www.statsmodels.org/stable/index.html
.. _scikit-learn: http://scikit-learn.org/

繪制高表達(dá)的基因:

sc.pl.highest_expr_genes(adata, n_top=20)

本著本質(zhì)的好奇心亿蒸,我們不禁要問(wèn)一句:這張圖是用什么畫(huà)的?一定要養(yǎng)成閱讀文檔的習(xí)慣:

help(sc.pl.highest_expr_genes)
Help on function highest_expr_genes in module scanpy.plotting._qc:

highest_expr_genes(adata: anndata._core.anndata.AnnData, n_top: int = 30, show: Union[bool, NoneType] = None, save: Union[str, bool, NoneType] = None, ax: Union[matplotlib.axes._axes.Axes, NoneType] = None, gene_symbols: Union[str, NoneType] = None, log: bool = False, **kwds)
   Fraction of counts assigned to each gene over all cells.
   
   Computes, for each gene, the fraction of counts assigned to that gene within
   a cell. The `n_top` genes with the highest mean fraction over all cells are
   plotted as boxplots.
   
   This plot is similar to the `scater` package function `plotHighestExprs(type
   = "highest-expression")`, see `here
   <https://bioconductor.org/packages/devel/bioc/vignettes/scater/inst/doc/vignette-qc.html>`__. Quoting
   from there:
   
       *We expect to see the “usual suspects”, i.e., mitochondrial genes, actin,
       ribosomal protein, MALAT1. A few spike-in transcripts may also be
       present here, though if all of the spike-ins are in the top 50, it
       suggests that too much spike-in RNA was added. A large number of
       pseudo-genes or predicted genes may indicate problems with alignment.*
       -- Davis McCarthy and Aaron Lun
   
   Parameters
   ----------
   adata
       Annotated data matrix.
   n_top
       Number of top
   show
        Show the plot, do not return axis.
   save
       If `True` or a `str`, save the figure.
       A string is appended to the default filename.
       Infer the filetype if ending on {`'.pdf'`, `'.png'`, `'.svg'`}.
   ax
       A matplotlib axes object. Only works if plotting a single component.
   gene_symbols
       Key for field in .var that stores gene symbols if you do not want to use .var_names.
   log
       Plot x-axis in log scale
   **kwds
       Are passed to :func:`~seaborn.boxplot`.
   
   Returns
   -------
   If `show==False` a :class:`~matplotlib.axes.Axes`.

可以看出這個(gè)功能借鑒了知名R包scater的一個(gè)函數(shù):plotHighestExprs蜜徽,繪圖用的是python里面兩個(gè)響當(dāng)當(dāng)?shù)膕eaborn和matplotlib庫(kù)祝懂。那我們不禁要自己微調(diào)一下了。

current_palette = sns.color_palette("dark")
sns.palplot(current_palette)
 sc.pl.highest_expr_genes(adata, n_top=20,palette="dark" )
normalizing counts per cell
    finished (0:00:00)

借助seaborn的繪圖系統(tǒng)圖我們還可以這樣設(shè)置顏色:

with sns.color_palette("PuBuGn_d"):
    sc.pl.highest_expr_genes(adata, n_top=20, )

help(sns.set_style)
sns.set_style("white")
sc.pl.highest_expr_genes(adata, n_top=20, )
cell QC

提到cell QC 拘鞋,我們需要明白在10的體系里面什么是cell。有的同學(xué)說(shuō)了矢门,我上機(jī)6000個(gè)細(xì)胞盆色,為什么cellranger檢出的細(xì)胞不是6000,有時(shí)候差的還不小祟剔。因?yàn)樵赾ellranger的邏輯里隔躲,cell不過(guò)是每個(gè)umi對(duì)barcode的支持程度。它說(shuō)的cell是基于序列比對(duì)的物延,盡管這個(gè)理應(yīng)和上機(jī)的cell保持一致宣旱,但是這種一致性是通過(guò)序列比對(duì)來(lái)保持的。其實(shí)我們說(shuō)的cell QC指的是對(duì)barcode的質(zhì)控叛薯,使留下的barcode能表征下游分析的cell浑吟。

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=0)
mito_genes = adata.var_names.str.startswith('MT-')
# for each cell compute fraction of counts in mito genes vs. all genes
# the `.A1` is only necessary as X is sparse (to transform to a dense array after summing)
adata.obs['percent_mito'] = np.sum(
    adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1
# add the total counts per cell as observations-annotation to adata
adata.obs['n_counts'] = adata.X.sum(axis=1).A1

參數(shù)的意思就是字面的意思,有不明白的可以直接help相應(yīng)的函數(shù)耗溜,看它的說(shuō)明文檔组力。我基本上就是靠說(shuō)明文檔來(lái)學(xué)習(xí)的。下面讓我們看看過(guò)濾后的adata發(fā)生了哪些變化抖拴。

adata
Out[9]: 
AnnData object with n_obs × n_vars = 4960 × 33694 
    obs: 'n_genes', 'percent_mito', 'n_counts'
    var: 'gene_ids', 'feature_types', 'n_cells'

 adata.obs['percent_mito']
Out[11]: 
AAACCCAAGCGTATGG-1    0.106588
AAACCCAGTCCTACAA-1    0.056101
AAACCCATCACCTCAC-1    0.532847
AAACGCTAGGGCATGT-1    0.106913
AAACGCTGTAGGTACG-1    0.078654
  
TTTGTTGCAGGTACGA-1    0.071226
TTTGTTGCAGTCTCTC-1    0.055394
TTTGTTGGTAATTAGG-1    0.183441
TTTGTTGTCCTTGGAA-1    0.081037
TTTGTTGTCGCACGAC-1    0.070987
Name: percent_mito, Length: 4960, dtype: float32

adata.var['n_cells']
Out[12]: 
RP11-34P13.3     0
FAM138A          0
OR4F5            0
RP11-34P13.7    43
RP11-34P13.8     0
                ..
AC233755.2       1
AC233755.1       1
AC240274.1      83
AC213203.1       0
FAM231B          0
Name: n_cells, Length: 33694, dtype: int32

細(xì)胞屬性obs中有了percent_mito燎字;有了基因?qū)傩缘膙ar。

像seurat一樣,我們看看cell三個(gè)屬性的小提琴圖:

sc.pl.violin(adata, ['n_genes', 'n_counts', 'percent_mito'],
             jitter=0.4, multi_panel=True)
sc.pl.scatter(adata, x='n_counts', y='percent_mito')
sc.pl.scatter(adata, x='n_counts', y='n_genes')

顯然這是分別畫(huà)的兩張圖候衍,我們能不能把它們組合到一起呢笼蛛?顯然是可以的:

help(sns.FacetGrid)
adata.obs['percent_mito1'] = adata.obs['percent_mito']*10000
adata.obs.melt(id_vars=['n_counts','percent_mito'], var_name='myVariable', value_name='myValue')
import matplotlib.pyplot as plt
sns.set(style="ticks")
g = sns.FacetGrid(adata.obs.melt(id_vars=['n_counts','percent_mito'], var_name='myVariable', value_name='myValue'),col = 'myVariable')
g.map(plt.scatter,'n_counts',"myValue", alpha=.7)
g.add_legend();

有不少人問(wèn)這兩張圖怎么看。就拿n_counts與n_genes的散點(diǎn)圖(scatter)來(lái)說(shuō)吧:每個(gè)點(diǎn)代表一個(gè)細(xì)胞蛉鹿,斜率代表隨著count的增加gene的增加程度伐弹。count和gene一般呈現(xiàn)線性關(guān)系,斜率越大也就是較少的count就可以檢出較多的gene榨为,說(shuō)明這批細(xì)胞基因的豐度偏低(普遍貧窮)惨好;反之,斜率較小随闺,說(shuō)明這批細(xì)胞基因豐度較高(少數(shù)富有)日川。有的時(shí)候不是一條可擬合的線,或者是兩條可擬合的線矩乐,也反映出細(xì)胞的異質(zhì)性龄句。總之散罕,他就是一個(gè)散點(diǎn)圖分歇,描述的是兩個(gè)變量的關(guān)系。

根據(jù)統(tǒng)計(jì)的n_genes 和線粒體基因含量percent_mito 進(jìn)一步過(guò)濾:

adata = adata[adata.obs.n_genes < 2500, :]
adata = adata[adata.obs.percent_mito < 0.1, :]

過(guò)濾掉基因數(shù)大于2500欧漱,線粒體基因含量大于0.1的barcode职抡。這個(gè)過(guò)濾方法就像filter一樣,操作的是adata.obs這張表误甚。

adata.obs
Out[82]: 
                    n_genes  percent_mito  n_counts  percent_mito1
AAACGCTGTGTGATGG-1     2348      0.099821    6131.0     998.205811
AAACGCTTCTAGGCAT-1     2470      0.065564    8465.0     655.640869
AAAGAACCACCGTCTT-1      420      0.021944     638.0     219.435730
AAAGAACTCACCTGGG-1     2479      0.074067   10801.0     740.672119
AAAGAACTCGGAACTT-1     2318      0.079713    6837.0     797.133240
                    ...           ...       ...            ...
TTTGGTTTCCGGTAAT-1     2277      0.079324    9291.0     793.240723
TTTGTTGCAGGTACGA-1     1986      0.071226    8101.0     712.257751
TTTGTTGCAGTCTCTC-1     2485      0.055394    9604.0     553.935852
TTTGTTGTCCTTGGAA-1     1998      0.081037    5479.0     810.366882
TTTGTTGTCGCACGAC-1     2468      0.070987    7438.0     709.868225

[2223 rows x 4 columns]
特征提取

cellQC可以看做是對(duì)細(xì)胞的過(guò)濾缚甩,即保留可用的細(xì)胞。但是每個(gè)細(xì)胞又有成千的基因(細(xì)胞的特征)窑邦,所以一般需要做特征的選擇和提壬猛(也就是降維)。什么意思呢冈钦?就是比如每個(gè)人都有很多特征郊丛,年齡、性別瞧筛、身高厉熟、國(guó)籍、學(xué)歷驾窟、生日庆猫、愛(ài)好、血型绅络。月培。嘁字。我要對(duì)一百個(gè)人分類(lèi),所有的特征都拿來(lái)比較杉畜,可能得不到準(zhǔn)確的劃分纪蜒。而用幾個(gè)關(guān)鍵的特征可以分出某個(gè)國(guó)家某個(gè)年齡段的人。特征的選擇是選擇一部分特征此叠,比如選擇高變基因纯续;特征提取是提取主要的特征結(jié)構(gòu),如主成分分析灭袁。

在特征提取之前要保證細(xì)胞之間是有可比性的猬错,一般用的是歸一化的方法,得到高變基因之后茸歧,為了使同一個(gè)基因在不同細(xì)胞之間具有可比性采用標(biāo)準(zhǔn)化倦炒。

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata

adata
AnnData object with n_obs × n_vars = 2223 × 33694 
    obs: 'n_genes', 'percent_mito', 'n_counts', 'percent_mito1'
    var: 'gene_ids', 'feature_types', 'n_cells'
    uns: 'log1p'
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)
adata = adata[:, adata.var.highly_variable]

#時(shí)刻關(guān)注我們數(shù)據(jù)的變化
adata
Out[117]: 
View of AnnData object with n_obs × n_vars = 2223 × 2208 
    obs: 'n_genes', 'percent_mito', 'n_counts', 'percent_mito1'
    var: 'gene_ids', 'feature_types', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p'

回歸每個(gè)細(xì)胞的總計(jì)數(shù)和線粒體基因表達(dá)百分比的影響,將數(shù)據(jù)縮放到單位方差软瞎。

sc.pp.regress_out(adata, ['n_counts', 'percent_mito'])
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver='arpack')


adata
Out[121]: 
AnnData object with n_obs × n_vars = 2223 × 2208 
    obs: 'n_genes', 'percent_mito', 'n_counts', 'percent_mito1'
    var: 'gene_ids', 'feature_types', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'pca'
    obsm: 'X_pca'
    varm: 'PCs

adata.obsm['X_pca']
adata.varm['PCs']
sc.pl.pca(adata, color='CST3')
sc.pl.pca_variance_ratio(adata, log=True)
adata.write(results_file)

特征選擇我們用高變基因逢唤,特征提取我們用pca。這里就有一個(gè)都的問(wèn)題:如何選擇高變基因以及pc的維度涤浇?

聚類(lèi)

scanpy提供leiden和louvain兩種圖聚類(lèi)算法鳖藕,圖聚類(lèi)在開(kāi)始之前要先找鄰居:

sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
adata

Out[128]: 
AnnData object with n_obs × n_vars = 2223 × 2208 
    obs: 'n_genes', 'percent_mito', 'n_counts', 'percent_mito1'
    var: 'gene_ids', 'feature_types', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'pca', 'neighbors'
    obsm: 'X_pca'
    varm: 'PCs'

sc.tl.leiden(adata)

adata.obs['leiden']
Out[188]: 
AAACGCTGTGTGATGG-1    10
AAACGCTTCTAGGCAT-1     6
AAAGAACCACCGTCTT-1     4
AAAGAACTCACCTGGG-1     2
AAAGAACTCGGAACTT-1     6
                      ..
TTTGGTTTCCGGTAAT-1     3
TTTGTTGCAGGTACGA-1     1
TTTGTTGCAGTCTCTC-1     3
TTTGTTGTCCTTGGAA-1     7
TTTGTTGTCGCACGAC-1     3
Name: leiden, Length: 2223, dtype: category
Categories (12, object): [0, 1, 2, 3, ..., 8, 9, 10, 11]
#sc.tl.louvain(adata)
#sc.tl.paga(adata)

和Seurat等人一樣,scanpy推薦Traag *等人(2018)提出的Leiden graph-clustering方法(基于優(yōu)化模塊化的社區(qū)檢測(cè))只锭。注意著恩,Leiden集群直接對(duì)cell的鄰域圖進(jìn)行聚類(lèi),我們?cè)趕c.pp.neighbors已經(jīng)計(jì)算過(guò)了纹烹。
讀一下leiden的說(shuō)明文檔:

 help(sc.tl.leiden)
Help on function leiden in module scanpy.tools._leiden:

leiden(adata: anndata._core.anndata.AnnData, resolution: float = 1, *, restrict_to: Union[Tuple[str, Sequence[str]], NoneType] = None, random_state: Union[int, mtrand.RandomState, NoneType] = 0, key_added: str = 'leiden', adjacency: Union[scipy.sparse.base.spmatrix, NoneType] = None, directed: bool = True, use_weights: bool = True, n_iterations: int = -1, partition_type: Union[Type[leidenalg.VertexPartition.MutableVertexPartition], NoneType] = None, copy: bool = False, **partition_kwargs) -> Union[anndata._core.anndata.AnnData, NoneType]
    Cluster cells into subgroups [Traag18]_.
    
    Cluster cells using the Leiden algorithm [Traag18]_,
    an improved version of the Louvain algorithm [Blondel08]_.
    It has been proposed for single-cell analysis by [Levine15]_.
    
    This requires having ran :func:`~scanpy.pp.neighbors` or
    :func:`~scanpy.external.pp.bbknn` first.
    
    Parameters
    ----------
    adata
        The annotated data matrix.
    resolution
        A parameter value controlling the coarseness of the clustering.
        Higher values lead to more clusters.
        Set to `None` if overriding `partition_type`
        to one that doesn’t accept a `resolution_parameter`.
    random_state
        Change the initialization of the optimization.
    restrict_to
        Restrict the clustering to the categories within the key for sample
        annotation, tuple needs to contain `(obs_key, list_of_categories)`.
    key_added
        `adata.obs` key under which to add the cluster labels.
    adjacency
        Sparse adjacency matrix of the graph, defaults to
        `adata.uns['neighbors']['connectivities']`.
    directed
        Whether to treat the graph as directed or undirected.
    use_weights
        If `True`, edge weights from the graph are used in the computation
        (placing more emphasis on stronger edges).
    n_iterations
        How many iterations of the Leiden clustering algorithm to perform.
        Positive values above 2 define the total number of iterations to perform,
        -1 has the algorithm run until it reaches its optimal clustering.
    partition_type
        Type of partition to use.
        Defaults to :class:`~leidenalg.RBConfigurationVertexPartition`.
        For the available options, consult the documentation for
        :func:`~leidenalg.find_partition`.
    copy
        Whether to copy `adata` or modify it inplace.
    **partition_kwargs
        Any further arguments to pass to `~leidenalg.find_partition`
        (which in turn passes arguments to the `partition_type`).
    
    Returns
    -------
    `adata.obs[key_added]`
        Array of dim (number of samples) that stores the subgroup id
        (`'0'`, `'1'`, ...) for each cell.
    `adata.uns['leiden']['params']`
        A dict with the values for the parameters `resolution`, `random_state`,
        and `n_iterations`.

seurat的findclusters中聚類(lèi)的算法亦是有Leiden的页滚,關(guān)于leiden和louvain兩者的區(qū)別,可以看看From Louvain to Leiden: guaranteeing well-connected communities

algorithm   
Algorithm for modularity optimization (
1 = original Louvain algorithm;
2 = Louvain algorithm with multilevel refinement; 
3 = SLM algorithm;
4 = Leiden algorithm). Leiden requires the leidenalg python.

為了可視化聚類(lèi)的結(jié)果我們還是先做一下umap降維吧铺呵,然后看看分群結(jié)果。

sc.tl.umap(adata)
sc.pl.umap(adata, color=['leiden', 'CST3', 'NKG7'])
adata
Out[139]: 
AnnData object with n_obs × n_vars = 2223 × 2208 
    obs: 'n_genes', 'percent_mito', 'n_counts', 'percent_mito1', 'leiden'
    var: 'gene_ids', 'feature_types', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'pca', 'neighbors', 'leiden', 'paga', 'leiden_sizes', 'umap', 'leiden_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
差異分析

分群后我們得到以數(shù)值為標(biāo)識(shí)的亞群編號(hào)隧熙,我們急于知道每一個(gè)數(shù)值背后的含義片挂,意義還得從基因上找。差異分析就是為了這目的的贞盯。

 sc.tl.rank_genes_groups(adata, 'leiden', method='t-test')
ranking genes
D:\Program Files (x86)\anconda\lib\site-packages\scanpy\tools\_rank_genes_groups.py:252: RuntimeWarning: invalid value encountered in log2
  rankings_gene_logfoldchanges.append(np.log2(foldchanges[global_indices]))
    finished: added to `.uns['rank_genes_groups']`
    'names', sorted np.recarray to be indexed by group ids
    'scores', sorted np.recarray to be indexed by group ids
    'logfoldchanges', sorted np.recarray to be indexed by group ids
    'pvals', sorted np.recarray to be indexed by group ids
    'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:00)

為了使我們的文章圖文并茂一些音念,來(lái)看看't-test'檢驗(yàn)的每個(gè)亞群的差異基因排序:

sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
sc.settings.verbosity = 2  # reduce the verbosity
help(sc.tl.rank_genes_groups)
Help on function rank_genes_groups in module scanpy.tools._rank_genes_groups:

rank_genes_groups(adata: anndata._core.anndata.AnnData, groupby: str, use_raw: bool = True, groups: Union[scanpy._compat.Literal_, Iterable[str]] = 'all', reference: str = 'rest', n_genes: int = 100, rankby_abs: bool = False, key_added: Union[str, NoneType] = None, copy: bool = False, method: scanpy._compat.Literal_ = 't-test_overestim_var', corr_method: scanpy._compat.Literal_ = 'benjamini-hochberg', layer: Union[str, NoneType] = None, **kwds) -> Union[anndata._core.anndata.AnnData, NoneType]
    Rank genes for characterizing groups.
    
    Parameters
    ----------
    adata
        Annotated data matrix.
    groupby
        The key of the observations grouping to consider.
    use_raw
        Use `raw` attribute of `adata` if present.
    layer
        Key from `adata.layers` whose value will be used to perform tests on.
    groups
        Subset of groups, e.g. [`'g1'`, `'g2'`, `'g3'`], to which comparison
        shall be restricted, or `'all'` (default), for all groups.
    reference
        If `'rest'`, compare each group to the union of the rest of the group.
        If a group identifier, compare with respect to this group.
    n_genes
        The number of genes that appear in the returned tables.
    method
        The default 't-test_overestim_var' overestimates variance of each group,
        `'t-test'` uses t-test, `'wilcoxon'` uses Wilcoxon rank-sum,
        `'logreg'` uses logistic regression. See [Ntranos18]_,
        `here <https://github.com/theislab/scanpy/issues/95>`__ and `here
        <http://www.nxn.se/valent/2018/3/5/actionable-scrna-seq-clusters>`__,
        for why this is meaningful.
    corr_method
        p-value correction method.
        Used only for `'t-test'`, `'t-test_overestim_var'`, and `'wilcoxon'`.
    rankby_abs
        Rank genes by the absolute value of the score, not by the
        score. The returned scores are never the absolute values.
    key_added
        The key in `adata.uns` information is saved to.
    **kwds
        Are passed to test methods. Currently this affects only parameters that
        are passed to :class:`sklearn.linear_model.LogisticRegression`.
        For instance, you can pass `penalty='l1'` to try to come up with a
        minimal set of genes that are good predictors (sparse solution meaning
        few non-zero fitted coefficients).
    
    Returns
    -------
    **names** : structured `np.ndarray` (`.uns['rank_genes_groups']`)
        Structured array to be indexed by group id storing the gene
        names. Ordered according to scores.
    **scores** : structured `np.ndarray` (`.uns['rank_genes_groups']`)
        Structured array to be indexed by group id storing the z-score
        underlying the computation of a p-value for each gene for each
        group. Ordered according to scores.
    **logfoldchanges** : structured `np.ndarray` (`.uns['rank_genes_groups']`)
        Structured array to be indexed by group id storing the log2
        fold change for each gene for each group. Ordered according to
        scores. Only provided if method is 't-test' like.
        Note: this is an approximation calculated from mean-log values.
    **pvals** : structured `np.ndarray` (`.uns['rank_genes_groups']`)
        p-values.
    **pvals_adj** : structured `np.ndarray` (`.uns['rank_genes_groups']`)
        Corrected p-values.
    
    Notes
    -----
    There are slight inconsistencies depending on whether sparse
    or dense data are passed. See `here <https://github.com/theislab/scanpy/blob/master/scanpy/tests/test_rank_genes_groups.py>`__.
    
    Examples
    --------
import scanpy as sc
adata = sc.datasets.pbmc68k_reduced()
sc.tl.rank_genes_groups(adata, 'bulk_labels', method='wilcoxon')
    
    # to visualize the results
sc.pl.rank_genes_groups(adata)

scanpy差異分析的方法沒(méi)有seurat豐富了,除了t-test躏敢,還有wilcoxon和logreg闷愤。

查看差異分析的結(jié)果:

sc.get.rank_genes_groups_df(adata, group="0")
Out[148]: 
       scores     names  logfoldchanges         pvals     pvals_adj
0   15.179968      IL7R             NaN  1.472705e-43  1.711438e-42
1   13.268551     RGS10             NaN  6.462721e-35  5.595956e-34
2   11.459023     LRRN3             NaN  9.208903e-27  5.465930e-26
3   10.671810       CD7             NaN  3.213348e-24  1.665510e-23
4    8.615908     INTS6             NaN  1.053249e-16  3.856673e-16
..        ...       ...             ...           ...           ...
95   1.755937    DUSP16             NaN  7.978160e-02  1.094148e-01
96   1.724632      GALM             NaN  8.523017e-02  1.163811e-01
97   1.713933  KCNQ1OT1             NaN  8.719264e-02  1.188403e-01
98   1.706563      PASK             NaN  8.851861e-02  1.202764e-01
99   1.696579      SUCO             NaN  9.047560e-02  1.227089e-01

[100 rows x 5 columns]

指定哪兩組進(jìn)行差異分析:

sc.tl.rank_genes_groups(adata, 'leiden', groups=['0'], reference='1', method='wilcoxon')
sc.pl.rank_genes_groups(adata, groups=['0'], n_genes=20)
sc.get.rank_genes_groups_df(adata, group="0")
Out[169]: 
       scores     names  logfoldchanges         pvals     pvals_adj
0   19.284122     EFNB2        2.265986  7.301753e-83  4.030568e-81
1   19.205927  C11orf96             NaN  3.301754e-82  1.350050e-80
2   19.158936     NPDC1       -2.627818  8.152369e-82  2.686632e-80
3   18.702019      EGR4             NaN  4.766577e-78  6.190942e-77
4   18.405581     FBXL8       -0.640331  1.185097e-75  1.104091e-74
..        ...       ...             ...           ...           ...
95  11.469543     AP4S1       -3.958649  1.876464e-30  3.019849e-30
96  11.459700    FBXO33        3.057788  2.102432e-30  3.378580e-30
97  11.423295   LDLRAD4        0.052481  3.198732e-30  5.121683e-30
98  11.368316      H1F0             NaN  6.013658e-30  9.587116e-30
99  11.310737    ZCWPW1        0.455119  1.161100e-29  1.836468e-29

[100 rows x 5 columns]
差異基因的可視化
marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
                'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
                'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']

pd.DataFrame(adata.uns['rank_genes_groups']['names']).head(5)
result = adata.uns['rank_genes_groups']
groups = result['names'].dtype.names
pd.DataFrame(
    {group + '_' + key[:1]: result[key][group]
    for group in groups for key in ['names', 'pvals']}).head(5)

        0_n           0_p
0     EFNB2  7.301753e-83
1  C11orf96  3.301754e-82
2     NPDC1  8.152369e-82
3      EGR4  4.766577e-78
4     FBXL8  1.185097e-75
sc.pl.violin(adata, ['CST3', 'NKG7', 'PPBP'], groupby='leiden')
new_cluster_names = [
    'CD4 T', 'CD14 Monocytes',
    'B', 'CD8 T',
    'NK', 'FCGR3A Monocytes',
    'Dendritic', 'Megakaryocytes',
   'NewCluster1','NewCluster2','NewCluster3','NewCluster4' ]
adata.rename_categories('leiden', new_cluster_names)

\#Omitting rank_genes_groups/scores as old categories do not match.
\#Omitting rank_genes_groups/names as old categories do not match.
\#Omitting rank_genes_groups/logfoldchanges as old categories do not match.
\#Omitting rank_genes_groups/pvals as old categories do not match.
\#Omitting rank_genes_groups/pvals_adj as old categories do not match.

ax = sc.pl.dotplot(adata, marker_genes, groupby='leiden')
ax = sc.pl.stacked_violin(adata, marker_genes, groupby='leiden', rotation=90)

像seurat一樣,scanpy也有豐富的數(shù)據(jù)可視化的方式件余,太多以至于需要單獨(dú)再寫(xiě)另一篇文章讥脐,這里就不介紹了遭居。

保存分析結(jié)果以供下一次調(diào)用
adata.write(results_file, compression='gzip')  # `compression='gzip'` saves disk space, but slows down writing and subsequent reading

adata.X = None
adata.write('E:\learnscanpy\write\pbmc3k_withoutX.h5ad', compression='gzip')
adata.obs[['n_counts', 'leiden']].to_csv(
     'E:\learnscanpy\write\pbmc3k_corrected_lei_groups.csv')

\# Export single fields of the annotation of observations
\# adata.obs[['n_counts', 'louvain_groups']].to_csv(
\#     './write/pbmc3k_corrected_louvain_groups.csv')

\# Export single columns of the multidimensional annotation
\# adata.obsm.to_df()[['X_pca1', 'X_pca2']].to_csv(
\#     './write/pbmc3k_corrected_X_pca.csv')

\# Or export everything except the data using `.write_csvs`.
\# Set `skip_data=False` if you also want to export the data.
\# adata.write_csvs(results_file[:-5], )

今天,我們學(xué)習(xí)了scanpy的一般流程旬渠,我們發(fā)現(xiàn)不管工具如何變,單細(xì)胞轉(zhuǎn)錄組的數(shù)據(jù)分析的大框架是沒(méi)有變化的俱萍,幾個(gè)分析的工具也是相互借鑒的。但是python的就不值得一學(xué)了嗎告丢?



SCANPY: large-scale single-cell gene expression data analysis

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
  • 序言:七十年代末枪蘑,一起剝皮案震驚了整個(gè)濱河市,隨后出現(xiàn)的幾起案子岖免,更是在濱河造成了極大的恐慌,老刑警劉巖栅炒,帶你破解...
    沈念sama閱讀 216,324評(píng)論 6 498
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件赢赊,死亡現(xiàn)場(chǎng)離奇詭異释移,居然都是意外死亡,警方通過(guò)查閱死者的電腦和手機(jī)同诫,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 92,356評(píng)論 3 392
  • 文/潘曉璐 我一進(jìn)店門(mén)秩贰,熙熙樓的掌柜王于貴愁眉苦臉地迎上來(lái),“玉大人想际,你說(shuō)我怎么就攤上這事∮驯牵” “怎么了?”我有些...
    開(kāi)封第一講書(shū)人閱讀 162,328評(píng)論 0 353
  • 文/不壞的土叔 我叫張陵胸梆,是天一觀的道長(zhǎng)。 經(jīng)常有香客問(wèn)我秽荤,道長(zhǎng)牍氛,這世上最難降的妖魔是什么? 我笑而不...
    開(kāi)封第一講書(shū)人閱讀 58,147評(píng)論 1 292
  • 正文 為了忘掉前任楔敌,我火速辦了婚禮,結(jié)果婚禮上,老公的妹妹穿的比我還像新娘。我一直安慰自己,他們只是感情好,可當(dāng)我...
    茶點(diǎn)故事閱讀 67,160評(píng)論 6 388
  • 文/花漫 我一把揭開(kāi)白布萧朝。 她就那樣靜靜地躺著,像睡著了一般剪勿。 火紅的嫁衣襯著肌膚如雪。 梳的紋絲不亂的頭發(fā)上厕吉,一...
    開(kāi)封第一講書(shū)人閱讀 51,115評(píng)論 1 296
  • 那天项钮,我揣著相機(jī)與錄音,去河邊找鬼羞延。 笑死,一個(gè)胖子當(dāng)著我的面吹牛脾还,可吹牛的內(nèi)容都是我干的嗤谚。 我是一名探鬼主播,決...
    沈念sama閱讀 40,025評(píng)論 3 417
  • 文/蒼蘭香墨 我猛地睜開(kāi)眼泥张,長(zhǎng)吁一口氣:“原來(lái)是場(chǎng)噩夢(mèng)啊……” “哼呵恢!你這毒婦竟也來(lái)了?” 一聲冷哼從身側(cè)響起媚创,我...
    開(kāi)封第一講書(shū)人閱讀 38,867評(píng)論 0 274
  • 序言:老撾萬(wàn)榮一對(duì)情侶失蹤渗钉,失蹤者是張志新(化名)和其女友劉穎,沒(méi)想到半個(gè)月后钞钙,有當(dāng)?shù)厝嗽跇?shù)林里發(fā)現(xiàn)了一具尸體鳄橘,經(jīng)...
    沈念sama閱讀 45,307評(píng)論 1 310
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡,尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 37,528評(píng)論 2 332
  • 正文 我和宋清朗相戀三年芒炼,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了瘫怜。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片。...
    茶點(diǎn)故事閱讀 39,688評(píng)論 1 348
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡本刽,死狀恐怖鲸湃,靈堂內(nèi)的尸體忽然破棺而出,到底是詐尸還是另有隱情子寓,我是刑警寧澤暗挑,帶...
    沈念sama閱讀 35,409評(píng)論 5 343
  • 正文 年R本政府宣布,位于F島的核電站斜友,受9級(jí)特大地震影響炸裆,放射性物質(zhì)發(fā)生泄漏。R本人自食惡果不足惜鲜屏,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 41,001評(píng)論 3 325
  • 文/蒙蒙 一烹看、第九天 我趴在偏房一處隱蔽的房頂上張望。 院中可真熱鬧洛史,春花似錦惯殊、人聲如沸。這莊子的主人今日做“春日...
    開(kāi)封第一講書(shū)人閱讀 31,657評(píng)論 0 22
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽(yáng)。三九已至,卻和暖如春浪漠,著一層夾襖步出監(jiān)牢的瞬間,已是汗流浹背霎褐。 一陣腳步聲響...
    開(kāi)封第一講書(shū)人閱讀 32,811評(píng)論 1 268
  • 我被黑心中介騙來(lái)泰國(guó)打工址愿, 沒(méi)想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留,地道東北人冻璃。 一個(gè)月前我還...
    沈念sama閱讀 47,685評(píng)論 2 368
  • 正文 我出身青樓响谓,卻偏偏與公主長(zhǎng)得像,于是被迫代替她去往敵國(guó)和親省艳。 傳聞我的和親對(duì)象是個(gè)殘疾皇子娘纷,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 44,573評(píng)論 2 353

推薦閱讀更多精彩內(nèi)容