--- 大師毅哗,大師貌嫡,我想學(xué)習(xí)單細(xì)胞
··· 閉上眼睛跟我來(lái)
單細(xì)胞測(cè)序有著漫長(zhǎng)的過去,卻只有短暫的歷史---誰(shuí)說(shuō)的哮肚!
說(shuō)她漫長(zhǎng)是因?yàn)榈饺缃褚灿惺畮啄甑臍v史了停局,說(shuō)她段短暫是因?yàn)獒槍?duì)單細(xì)胞的分析工具越來(lái)越有意義開發(fā)周期卻越來(lái)越短很钓。一般生物信息流程主要由軟件(安裝與參數(shù))、數(shù)據(jù)庫(kù)(結(jié)構(gòu)和生物學(xué)意義)和數(shù)據(jù)分析(統(tǒng)計(jì)學(xué)和編程)組成董栽,目前單細(xì)胞分析用到的軟件主要是FastQC码倦、Cellranger和R包Seurat、monocle锭碳;數(shù)據(jù)庫(kù)有相應(yīng)物種的參考基因組袁稽、KEGG、GO擒抛;數(shù)據(jù)分析部分主要基于count矩陣和差異表達(dá)數(shù)據(jù)用R或者Python來(lái)做推汽。
cellranger是10X genomics公司為單細(xì)胞RNA測(cè)序分析量身打造的數(shù)據(jù)分析軟件,可以直接輸入Illumina 原始數(shù)據(jù)(raw base call 歧沪,BCL)輸出表達(dá)定量矩陣歹撒、降維(pca),聚類(Graph-based& K-Means)以及可視化(t-SNE)結(jié)果诊胞,結(jié)合配套的Loupe Cell Browser給予研究者更多探索單細(xì)胞數(shù)據(jù)的機(jī)會(huì)暖夭。cellranger的高度集成化,使得單細(xì)胞測(cè)序數(shù)據(jù)探索變得更加簡(jiǎn)單,研究者有更多的時(shí)間來(lái)做生物學(xué)意義的挖掘迈着。
今天小編就要給大家介紹這下這個(gè)可能成為行業(yè)標(biāo)本的數(shù)據(jù)分析軟件——cellranger竭望。在這之前,還是先來(lái)了解一下10X genomics單細(xì)胞測(cè)序的一般原理吧裕菠。
10X genomics
一個(gè)油滴=一個(gè)單細(xì)胞=一個(gè)凝膠微珠=一個(gè)RNA-Seq市框,可以說(shuō)這就是10X的基本技術(shù)原理。
V2試劑盒產(chǎn)生的文庫(kù)結(jié)構(gòu):
V3試劑盒產(chǎn)生的文庫(kù)結(jié)構(gòu):
- reads 1 :主要用來(lái)定量(barcode糕韧、UMI以及reads的來(lái)源)
- reads 2 :與基因組比對(duì)
- Barcode:標(biāo)記細(xì)胞
- UMI (Unique Molecular Identifier):標(biāo)記基因
- PolyT :真核生物
10X genomics單細(xì)胞測(cè)序通過Barcode來(lái)標(biāo)記細(xì)胞,UMI 來(lái)標(biāo)記轉(zhuǎn)錄本喻圃,這樣與參考基因比對(duì)后就可以定量細(xì)胞以及基因的數(shù)量萤彩。
在2019年3月7日10x官方網(wǎng)站對(duì)“單細(xì)胞基因表達(dá)入門”的直播視頻中提到(只是一場(chǎng)直播提到的信息,僅供參考): 由于10x單細(xì)胞測(cè)序的重復(fù)性較高斧拍,因此如無(wú)特殊原因雀扶,做生物學(xué)重復(fù)的意義不大;如果細(xì)胞大小不一致肆汹,但直徑符合上機(jī)要求愚墓,對(duì)捕獲效率沒有明顯影響;細(xì)胞重懸清洗后要保證鈣昂勉、鎂離子濃度較低浪册,從而不影響反轉(zhuǎn)錄;非常規(guī)形態(tài)或直徑較大的細(xì)胞可以采用抽核的方法進(jìn)行檢測(cè)岗照。當(dāng)被問及測(cè)序時(shí)需不需要加入標(biāo)準(zhǔn)物(如ERCC)的時(shí)候村象,官方給出的建議是不建議加ERCC(考慮到成本和影響細(xì)胞和基因的計(jì)數(shù))。
cell ranger pipeline
cellranger單細(xì)胞分析流程主要分為:數(shù)據(jù)拆分(cellranger mkfastq攒至、細(xì)胞定量cellranger count厚者、組合分析cellranger aggr、參數(shù)調(diào)整cellranger reanalyze迫吐。還有一些用戶可能會(huì)用到的功能:mat2csv库菲、mkgtf、mkref志膀、vdj熙宇、mkvdjref、testrun溉浙、upload奇颠、sitecheck。
本文主要介紹常用的count aggr以及reanlyze放航。
- 拆分
封裝了Illumina's bcl2fastq軟件烈拒,用來(lái)拆分Illumina 原始數(shù)據(jù)(raw base call (BCL)),輸出 FASTQ 文件。
cellranger-cs/3.0.0 -h
cellranger mkfastq (3.0.0)
Copyright (c) 2018 10x Genomics, Inc. All rights reserved.
-------------------------------------------------------------------------------
Run Illumina demultiplexer on sample sheets that contain 10x-specific sample
index sets, and generate 10x-specific quality metrics after the demultiplex.
Any bcl2fastq argument will work (except a few that are set by the pipeline
to ensure proper trimming and sample indexing). The FASTQ output generated
will be the same as when running bcl2fastq directly.
These bcl2fastq arguments are overridden by this pipeline:
--fastq-cluster-count
--minimum-trimmed-read-length
--mask-short-adapter-reads
Usage:
cellranger mkfastq --run=PATH [options]
cellranger mkfastq -h | --help | --version
Required:
--run=PATH Path of Illumina BCL run folder.
Optional:
# Sample Sheet
--id=NAME Name of the folder created by mkfastq. If not
supplied, will default to the name of the
flowcell referred to by the --run argument.
--csv=PATH
--samplesheet=PATH
--sample-sheet=PATH Path to the sample sheet. The sample sheet can either
be a simple CSV with lane, sample and index columns,
or an Illumina Experiment Manager-compatible
sample sheet. Sample sheet indexes can refer to
10x sample index set names (e.g., SI-GA-A12).
--simple-csv=PATH Deprecated. Same meaning as --csv.
--ignore-dual-index On a dual-indexed flowcell, ignore the second sample
index, if the second sample index was not used
for the 10x sample.
--qc Calculate both sequencing and 10x-specific metrics,
including per-sample barcode matching rate. Will
not be performed unless this flag is specified.
# bcl2fastq Pass-Through
--lanes=NUMS Comma-delimited series of lanes to demultiplex.
Shortcut for the --tiles argument.
--use-bases-mask=MASK Same as bcl2fastq; override the read lengths as
specified in RunInfo.xml. See Illumina bcl2fastq
documentation for more information.
--delete-undetermined Delete the Undetermined FASTQ files left by
bcl2fastq. Useful if your sample sheet is only
expected to match a subset of the flowcell.
--output-dir=PATH Same as in bcl2fastq. Folder where FASTQs, reports
and stats will be generated.
--project=NAME Custom project name, to override the samplesheet or
to use in conjunction with the --csv argument.
# Martian Runtime
--jobmode=MODE Job manager to use. Valid options:
local (default), sge, lsf, or a .template file
--localcores=NUM Set max cores the pipeline may request at one time.
Only applies when --jobmode=local.
--localmem=NUM Set max GB the pipeline may request at one time.
Only applies when --jobmode=local.
--mempercore=NUM Set max GB each job may use at one time.
Only applies in cluster jobmodes.
--maxjobs=NUM Set max jobs submitted to cluster at one time.
Only applies in cluster jobmodes.
--jobinterval=NUM Set delay between submitting jobs to cluster, in ms.
Only applies in cluster jobmodes.
--overrides=PATH The path to a JSON file that specifies stage-level
overrides for cores and memory. Finer-grained
than --localcores, --mempercore and --localmem.
Consult the 10x support website for an example
override file.
--uiport=PORT Serve web UI at http://localhost:PORT
--disable-ui Do not serve the UI.
--noexit Keep web UI running after pipestance completes or fails.
--nopreflight Skip preflight checks.
-h --help Show this message.
--version Show version.
Note: 'cellranger mkfastq' can be called in two ways, depending on how you
demultiplexed your BCL data into FASTQ files.
1. If you demultiplexed with 'cellranger mkfastq' or directly with
Illumina bcl2fastq, then set --fastqs to the project folder containing
FASTQ files. In addition, set --sample to the name prefixed to the FASTQ
files comprising your sample. For example, if your FASTQs are named:
subject1_S1_L001_R1_001.fastq.gz
then set --sample=subject1
2. If you demultiplexed with 'cellranger demux', then set --fastqs to a
demux output folder containing FASTQ files. Use the --lanes and --indices
options to specify which FASTQ reads comprise your sample.
This method is deprecated. Please use 'cellranger mkfastq' going forward.
有以下兩種使用方式
$ cellranger mkfastq --id=tiny-bcl \
--run=/path/to/tiny_bcl \
--csv=cellranger-tiny-bcl-simple-1.2.0.csv
cellranger mkfastq
Copyright (c) 2017 10x Genomics, Inc. All rights reserved.
-------------------------------------------------------------------------------
Martian Runtime - 3.0.2-v3.2.0
Running preflight checks (please wait)...
2017-08-09 16:33:54 [runtime] (ready) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET
2017-08-09 16:33:57 [runtime] (split_complete) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET
2017-08-09 16:33:57 [runtime] (run:local) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET.fork0.chnk0.main
2017-08-09 16:34:00 [runtime] (chunks_complete) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET
... # 只有3列荆几,第一列指定lane ID, 第二列指定樣本名稱吓妆,第三列指定index的名稱,10X genomics的每個(gè)index代表4條具體的oligo序列
$ cellranger mkfastq --id=tiny-bcl \
--run=/path/to/tiny_bcl \
--samplesheet=cellranger-tiny-bcl-samplesheet-1.2.0.csv
cellranger mkfastq
Copyright (c) 2017 10x Genomics, Inc. All rights reserved.
-------------------------------------------------------------------------------
Martian Runtime - 3.0.2-v3.2.0
Running preflight checks (please wait)...
2017-08-09 16:25:49 [runtime] (ready) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET
2017-08-09 16:25:52 [runtime] (split_complete) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET
2017-08-09 16:25:52 [runtime] (run:local) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET.fork0.chnk0.main
2017-08-09 16:25:58 [runtime] (chunks_complete) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET
如果使用samplesheet文件吨铸,需要調(diào)整[Reads]中的序列長(zhǎng)度行拢,而使用簡(jiǎn)化版的csv文件,cell ranger可以識(shí)別所用試劑盒版本诞吱,然后自動(dòng)化的調(diào)整reads長(zhǎng)度舟奠。
拆分好之后的目錄結(jié)構(gòu)如下所示
├── fastq_path
│ ├── H35KCBCXY
│ │ └── test_sample
│ │ ├── test_sample_S1_L001_I1_001.fastq.gz #index序列
│ │ ├── test_sample_S1_L001_R1_001.fastq.gz
│ │ └── test_sample_S1_L001_R2_001.fastq.gz
- QC 其實(shí)這部分并不能叫做Quality control而應(yīng)該叫做Quality check.
--qc 參數(shù)輸出序列質(zhì)量情況,保存在outs/qc_summary.json 中:
"sample_qc": {
"Sample1": {
"5": {
"barcode_exact_match_ratio": 0.9336158258904611,
"barcode_q30_base_ratio": 0.9611993091728814,
"bc_on_whitelist": 0.9447542078230667,
"mean_barcode_qscore": 37.770630795934,
"number_reads": 2748155,
"read1_q30_base_ratio": 0.8947676653366835,
"read2_q30_base_ratio": 0.7771883245304577
},
"all": {
"barcode_exact_match_ratio": 0.9336158258904611,
"barcode_q30_base_ratio": 0.9611993091728814,
"bc_on_whitelist": 0.9447542078230667,
"mean_barcode_qscore": 37.770630795934,
"number_reads": 2748155,
"read1_q30_base_ratio": 0.8947676653366835,
"read2_q30_base_ratio": 0.7771883245304577
}
}
}
cellranger count
count是cellranger最主要也是最重要的功能:完成細(xì)胞和基因的定量房维,也就是產(chǎn)生了我們用來(lái)做各種分析的基因表達(dá)矩陣沼瘫。
cellranger count -h
cellranger count (3.0.0)
Copyright (c) 2018 10x Genomics, Inc. All rights reserved.
-------------------------------------------------------------------------------
'cellranger count' quantifies single-cell gene expression.
The commands below should be preceded by 'cellranger':
Usage:
count
--id=ID
[--fastqs=PATH]
[--sample=PREFIX]
--transcriptome=DIR
[options]
count <run_id> <mro> [options]
count -h | --help | --version
Arguments:
id A unique run id, used to name output folder [a-zA-Z0-9_-]+.
fastqs Path of folder created by mkfastq or bcl2fastq.
sample Prefix of the filenames of FASTQs to select.
transcriptome Path of folder containing 10x-compatible reference.
Options:
# Single Cell Gene Expression
--description=TEXT Sample description to embed in output files.
--libraries=CSV CSV file declaring input library data sources.
--expect-cells=NUM Expected number of recovered cells.
--force-cells=NUM Force pipeline to use this number of cells, bypassing
the cell detection algorithm.
--feature-ref=CSV Feature reference CSV file, declaring feature-barcode
constructs and associated barcodes.
--nosecondary Disable secondary analysis, e.g. clustering. Optional.
--r1-length=NUM Hard trim the input Read 1 to this length before
analysis.
--r2-length=NUM Hard trim the input Read 2 to this length before
analysis.
--chemistry=CHEM Assay configuration. NOTE: by default the assay
configuration is detected automatically, which
is the recommened mode. You usually will not need
to specify a chemistry. Options are: 'auto' for
autodetection, 'threeprime' for Single Cell 3',
'fiveprime' for Single Cell 5', 'SC3Pv1' or
'SC3Pv2' or 'SC3Pv3' for Single Cell 3' v1/v2/v3,
'SC5P-PE' or 'SC5P-R2' for Single Cell 5'.
paired-end/R2-only. Default: auto.
--no-libraries Proceed with processing using a --feature-ref but no
feature-barcoding data specified with the
'libraries' flag.
--lanes=NUMS Comma-separated lane numbers.
--indices=INDICES Comma-separated sample index set "SI-001" or sequences.
--project=TEXT Name of the project folder within a mkfastq or
bcl2fastq-generated folder to pick FASTQs from.
# Martian Runtime
--jobmode=MODE Job manager to use. Valid options:
local (default), sge, lsf, or a .template file
--localcores=NUM Set max cores the pipeline may request at one time.
Only applies when --jobmode=local.
--localmem=NUM Set max GB the pipeline may request at one time.
Only applies when --jobmode=local.
--mempercore=NUM Set max GB each job may use at one time.
Only applies in cluster jobmodes.
--maxjobs=NUM Set max jobs submitted to cluster at one time.
Only applies in cluster jobmodes.
--jobinterval=NUM Set delay between submitting jobs to cluster, in ms.
Only applies in cluster jobmodes.
--overrides=PATH The path to a JSON file that specifies stage-level
overrides for cores and memory. Finer-grained
than --localcores, --mempercore and --localmem.
Consult the 10x support website for an example
override file.
--uiport=PORT Serve web UI at http://localhost:PORT
--disable-ui Do not serve the UI.
--noexit Keep web UI running after pipestance completes or fails.
--nopreflight Skip preflight checks.
-h --help Show this message.
--version Show version.
Note: 'cellranger count' can be called in two ways, depending on how you
demultiplexed your BCL data into FASTQ files.
1. If you demultiplexed with 'cellranger mkfastq' or directly with
Illumina bcl2fastq, then set --fastqs to the project folder containing
FASTQ files. In addition, set --sample to the name prefixed to the FASTQ
files comprising your sample. For example, if your FASTQs are named:
subject1_S1_L001_R1_001.fastq.gz
then set --sample=subject1
2. If you demultiplexed with 'cellranger demux', then set --fastqs to a
demux output folder containing FASTQ files. Use the --lanes and --indices
options to specify which FASTQ reads comprise your sample.
This method is deprecated. Please use 'cellranger mkfastq' going forward.
--nosecondary (optional) Add this flag to skip secondary analysis of the feature-barcode matrix (dimensionality reduction, clustering and visualization). Set this if you plan to use cellranger reanalyze or your own custom analysis.
- Output Files
.outs
├── analysis【數(shù)據(jù)分析文件夾】
│ ├── clustering【聚類,圖聚類和k-means聚類】
│ ├── diffexp【差異分析】
│ ├── pca【主成分分析線性降維】
│ └── tsne【非線性降維信息】
├── cloupe.cloupe【Loupe Cell Browser 輸入文件】
├── filtered_feature_bc_matrix【過濾掉的barcode信息】
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── filtered_feature_bc_matrix.h5【過濾掉的barcode信息HDF5 format】
├── metrics_summary.csv【CSV format數(shù)據(jù)摘要】
├── molecule_info.h5【UMI信息咙俩,aggregate的時(shí)候會(huì)用到的文件】
├── raw_feature_bc_matrix【原始barcode信息】
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── possorted_genome_bam.bam【比對(duì)文件】
├── possorted_genome_bam.bam.bai【索引文件】
├── raw_feature_bc_matrix.h5【原始barcode信息HDF5 format】
├── web_summary.html【網(wǎng)頁(yè)簡(jiǎn)版報(bào)告以及可視化】
└── *_gene_bar.csv_temp【過程文件】
cellranger aggr
When doing large studies involving multiple GEM wells, run cellranger count on FASTQ data from each of the GEM wells individually, and then pool the results using cellranger aggr, as described here
$ cd /home/jdoe/runs
$ cellranger aggr --id=AGG123 \
--csv=AGG123_libraries.csv \
--normalize=mapped
library_id,molecule_h5
LV123,/opt/runs/LV123/outs/molecule_info.h5
LB456,/opt/runs/LB456/outs/molecule_info.h5
LP789,/opt/runs/LP789/outs/molecule_info.h5
library_id,molecule_h5,batch
LV123,/opt/runs/LV123/outs/molecule_info.h5,v2_lib
LB456,/opt/runs/LB456/outs/molecule_info.h5,v3_lib
LP789,/opt/runs/LP789/outs/molecule_info.h5,v3_lib
Understanding GEM Wells
Each GEM well is a physically distinct set of GEM partitions, but draws barcode sequences randomly from the pool of valid barcodes, known as the barcode whitelist. To keep the barcodes unique when aggregating multiple libraries, we append a small integer identifying the GEM well to the barcode nucleotide sequence, and use that nucleotide sequence plus ID as the unique identifier in the feature-barcode matrix. For example, AGACCATTGAGACTTA-1
and AGACCATTGAGACTTA-2
are distinct cell barcodes from different GEM wells, despite having the same barcode nucleotide sequence.
This number, which tells us which GEM well this barcode sequence came from, is called the GEM well suffix. The numbering of the GEM wells will reflect the order that the GEM wells were provided in the Aggregation CSV.
Outputs:
- Aggregation metrics summary HTML: /home/jdoe/runs/AGG123/outs/web_summary.html
- Aggregation metrics summary JSON: /home/jdoe/runs/AGG123/outs/summary.json
- Secondary analysis output CSV: /home/jdoe/runs/AGG123/outs/analysis
- Filtered feature-barcode matrices MEX: /home/jdoe/runs/AGG123/outs/filtered_feature_bc_matrix
- Filtered feature-barcode matrices HDF5: /home/jdoe/runs/AGG123/outs/filtered_feature_bc_matrix.h5
- Unfiltered feature-barcode matrices MEX: /home/jdoe/runs/AGG123/outs/raw_feature_bc_matrix
- Unfiltered feature-barcode matrices HDF5: /home/jdoe/runs/AGG123/outs/raw_feature_bc_matrix.h5
- Unfiltered molecule-level info: /home/jdoe/runs/AGG123/outs/raw_molecules.h5
- Barcodes of cell-containing partitions: /home/jdoe/runs/AGG123/outs/cell_barcodes.csv
- Copy of the input aggregation CSV: /home/jdoe/runs/AGG123/outs/aggregation.csv
- Loupe Cell Browser file: /home/jdoe/runs/AGG123/outs/cloupe.cloupe
cellranger reanalyze
相比于count和aggr耿戚,reanalyze接受更多的可選的參數(shù),可以說(shuō)count和aggr只是屬于探索性分析阿趁,這里的reanalyze才是真正開始接觸故事的真相膜蛔。
$ cellranger reanalyze -h
cellranger reanalyze (3.0.0)
Copyright (c) 2018 10x Genomics, Inc. All rights reserved.
-------------------------------------------------------------------------------
'cellranger reanalyze' performs secondary analysis (dimensionality
reduction, clustering and differential expression) on a feature-barcode
matrix generated by the 'cellranger count' or 'cellranger aggr' pipelines.
The analysis takes parameter settings from a CSV file. Please see the following
URL for details on the CSV format:
support.10xgenomics.com/single-cell/software
This pipeline does not support multi-genome samples.
The commands below should be preceded by 'cellranger':
Usage:
reanalyze
--id=ID
--matrix=MATRIX_H5
[options]
reanalyze <run_id> <mro> [options]
reanalyze -h | --help | --version
Arguments:
id A unique run id, used to name output folder [a-zA-Z0-9_-]+.
matrix A feature-barcode matrix containing data for one genome. Should
be the filtered version, unless using --force-cells.
Options:
# Analysis
--description=TEXT Sample description to embed in output files.
--params=PARAMS_CSV A CSV file specifying analysis parameters.
Optional.
--barcodes=BARCODE_CSV A CSV file containing a list of cell barcodes to
use for reanalysis, e.g. barcodes exported from
Loupe Cell Browser. Optional.
--genes=GENES_CSV A CSV file containing a list of feature IDs to
use for reanalysis. For gene expression, this
should correspond to the gene_id field in the
reference GTF should be (e.g. ENSG... for
ENSEMBL-based references).
Optional.
--exclude-genes=GENES_CSV A CSV file containing a list of feature IDs to
exclude for reanalysis. For gene expression,
this should correspond to the gene_id field in
the reference GTF (e.g., ENSG... for
ENSEMBL-based references).
The exclusion is applied after --genes.
Optional.
--agg=AGGREGATION_CSV If the input matrix was produced by
'cellranger aggr', you may pass the same
aggregation CSV in order to retain
per-library tag information in the
resulting .cloupe file. This argument is
required to enable chemistry batch correction
Optional.
--force-cells=NUM Force pipeline to use this number of cells,
bypassing the cell detection algorithm.
Optional.
# Martian Runtime
--jobmode=MODE Job manager to use. Valid options:
local (default), sge, lsf, or a .template file
--localcores=NUM Set max cores the pipeline may request at one time.
Only applies when --jobmode=local.
--localmem=NUM Set max GB the pipeline may request at one time.
Only applies when --jobmode=local.
--mempercore=NUM Set max GB each job may use at one time.
Only applies in cluster jobmodes.
--maxjobs=NUM Set max jobs submitted to cluster at one time.
Only applies in cluster jobmodes.
--jobinterval=NUM Set delay between submitting jobs to cluster, in ms.
Only applies in cluster jobmodes.
--overrides=PATH The path to a JSON file that specifies stage-level
overrides for cores and memory. Finer-grained
than --localcores, --mempercore and --localmem.
Consult the 10x support website for an example
override file.
--uiport=PORT Serve web UI at http://localhost:PORT
--disable-ui Do not serve the UI.
--noexit Keep web UI running after pipestance completes or fails.
--nopreflight Skip preflight checks.
-h --help Show this message.
--version Show version.
Note: 'cellranger reanalyze' can be called in two ways, depending on how you
demultiplexed your BCL data into FASTQ files.
1. If you demultiplexed with 'cellranger mkfastq' or directly with
Illumina bcl2fastq, then set --fastqs to the project folder containing
FASTQ files. In addition, set --sample to the name prefixed to the FASTQ
files comprising your sample. For example, if your FASTQs are named:
subject1_S1_L001_R1_001.fastq.gz
then set --sample=subject1
2. If you demultiplexed with 'cellranger demux', then set --fastqs to a
demux output folder containing FASTQ files. Use the --lanes and --indices
options to specify which FASTQ reads comprise your sample.
This method is deprecated. Please use 'cellranger mkfastq' going forward.
$ cd /home/jdoe/runs
$ ls -1 AGG123/outs/*.h5 # verify the input file exists
AGG123/outs/filtered_feature_bc_matrix.h5
AGG123/outs/filtered_molecules.h5
AGG123/outs/raw_feature_bc_matrix.h5
AGG123/outs/raw_molecules.h5
$ cellranger reanalyze --id=AGG123_reanalysis \
--matrix=AGG123/outs/filtered_feature_bc_matrix.h5 \
--params=AGG123_reanalysis.csv
Outputs:
- Secondary analysis output CSV: /home/jdoe/runs/AGG123_reanalysis/outs/analysis_csv
- Secondary analysis web summary: /home/jdoe/runs/AGG123_reanalysis/outs/web_summary.html
- Copy of the input parameter CSV: /home/jdoe/runs/AGG123_reanalysis/outs/params_csv.csv
- Copy of the input aggregation CSV: /home/jdoe/runs/AGG123_reanalysis/outs/aggregation_csv.csv
- Loupe Cell Browser file: /home/jdoe/runs/AGG123_reanalysis/outs/cloupe.cloupe
數(shù)據(jù)分析概述
Cell Ranger是由10x genomic公司官方提供的專門用于其單細(xì)胞轉(zhuǎn)錄組數(shù)據(jù)分析的軟件包。Cell Ranger將前面產(chǎn)生的fastq測(cè)序數(shù)據(jù)比對(duì)到參考基因組上脖阵,然后進(jìn)行基因表達(dá)定量皂股,生成細(xì)胞-基因表達(dá)矩陣,并基于此進(jìn)行細(xì)胞聚類和差異表達(dá)分析命黔。
比對(duì)
基因組比對(duì)
Cell Ranger使用star比對(duì)軟件將reads比對(duì)到參考基因組上后使用GTF注釋文件進(jìn)行校正屑墨,并區(qū)分出外顯子區(qū)、內(nèi)含子區(qū)纷铣、基因間區(qū)卵史。具體的區(qū)分規(guī)則(mapping QC)為:至少50% 比對(duì)到外顯子上reads記為外顯子區(qū)、將比對(duì)到非外顯子區(qū)且與內(nèi)含子區(qū)有交集的reads記為內(nèi)含子區(qū)搜立,除此之外均為基因間區(qū)以躯。
- MAPQ 校正
對(duì)于比對(duì)到單個(gè)外顯子位點(diǎn)但同時(shí)比對(duì)到1個(gè)或多個(gè)非外顯子位點(diǎn)的reads,對(duì)外顯子位點(diǎn)進(jìn)行優(yōu)先排序啄踊,并將reads記為帶有MAPQ 255的外顯子位點(diǎn)忧设。
- 轉(zhuǎn)錄組比對(duì)
Cell Ranger進(jìn)一步將外顯子reads與參考轉(zhuǎn)錄本比對(duì),尋找兼容性颠通。注釋到單個(gè)基因信息的reads認(rèn)為是一個(gè)特異的轉(zhuǎn)錄本址晕,只有注釋到轉(zhuǎn)錄本的reads才用于UMI計(jì)數(shù)。
參考基因組目錄結(jié)構(gòu):
├── fasta
│ └── genome.fa
├── genes
│ └── genes.gtf
├── pickle
│ └── genes.pickle
├── README.BEFORE.MODIFYING
├── reference.json
├── star
│ ├── chrLength.txt
│ ├── chrNameLength.txt
│ ├── chrName.txt
│ ├── chrStart.txt
│ ├── exonGeTrInfo.tab
│ ├── exonInfo.tab
│ ├── geneInfo.tab
│ ├── Genome
│ ├── genomeParameters.txt
│ ├── SA
│ ├── SAindex
│ ├── sjdbInfo.txt
│ ├── sjdbList.fromGTF.out.tab
│ ├── sjdbList.out.tab
│ └── transcriptInfo.tab
└── version
細(xì)胞計(jì)數(shù)(cell QC)
Cell Ranger 3.0引入了一種改進(jìn)的細(xì)胞計(jì)數(shù)算法顿锰,該算法能夠更好地識(shí)別低RNA含量的細(xì)胞群體谨垃,特別是當(dāng)?shù)蚏NA含量的細(xì)胞與高RNA含量的細(xì)胞混合時(shí)启搂。該算法分為兩步:
在第一步中,使用之前的Cell Ranger細(xì)胞計(jì)數(shù)算法識(shí)別高RNA含量細(xì)胞的主要模式刘陶,使用基于每個(gè)barcode的UMI總數(shù)的cutoff值胳赌。Cell Ranger將期望捕獲的細(xì)胞數(shù)量N作為輸入(see --expect-cells)。然后將barcodes按照各自的UMI總數(shù)由高到低進(jìn)行排序匙隔,取前N個(gè)UMI數(shù)值的99%分位數(shù)為最大估算UMI總數(shù)(m)疑苫,將UMI數(shù)目超過m/10的barcodes用于細(xì)胞計(jì)數(shù)。
在第二步中纷责,選擇一組具有低UMI計(jì)數(shù)的barcode捍掺,這些barcode可能表示“空的”GEM分區(qū),建立RNA圖譜背景模型再膳。利用Simple Good-Turing smoothing平滑算法挺勿,對(duì)典型空GEM集合中未觀測(cè)到的基因進(jìn)行非零模型估計(jì)。最后饵史,將第一步中未作為細(xì)胞計(jì)數(shù)的barcode RNA圖譜與背景模型進(jìn)行比較,其RNA譜與背景模型存在較大差異的barcode用于細(xì)胞計(jì)數(shù)胜榔。
多態(tài)率估計(jì)(Estimating Multiplet Rates)
當(dāng)有多個(gè)參考基因組(如人H和鼠M)時(shí)胳喷,Cell Ranger可以通過多基因組分析區(qū)分多物種混合建庫(kù)的樣品,主要根據(jù)barcode內(nèi)每個(gè)物種對(duì)應(yīng)的UMI數(shù)量進(jìn)行區(qū)分夭织,將其分成H和M兩類吭露。最后還會(huì)根據(jù)H,M各自UMI的分布和最大似然估計(jì)法估計(jì)多細(xì)胞比例(multiplet rate)尊惰,即(H,M)讲竿、(H,H)、(M,M)三種類型的多細(xì)胞占比弄屡。
基因表達(dá)分析(Secondary Analysis of Gene Expression)
盡管我們的表達(dá)矩陣經(jīng)過重重QC题禀,但是單細(xì)胞高通量的數(shù)據(jù)還是表現(xiàn)出高緯度、稀疏性膀捷、非正態(tài)分布的特點(diǎn)迈嘹,每一點(diǎn)都是對(duì)傳統(tǒng)數(shù)據(jù)分析方法的挑戰(zhàn)。于是越來(lái)越多的新方法被開發(fā)出來(lái)全庸,主要借鑒多元分析和機(jī)器學(xué)習(xí)等傳統(tǒng)生物統(tǒng)計(jì)教材很少教授的方法秀仲。越來(lái)越多的機(jī)器學(xué)習(xí)方法應(yīng)用的高通量數(shù)據(jù)分析中來(lái),那么我們就需要了解機(jī)器學(xué)習(xí)的三個(gè)要素壶笼;
公式(方法) = 模型 (目的) + 策略(評(píng)價(jià)) + 算法(實(shí)現(xiàn))
- 降維分析(Dimensionality Reduction)
流形學(xué)習(xí)(manifold learning)是機(jī)器學(xué)習(xí)神僵、模式識(shí)別中的一種方法,在維數(shù)約簡(jiǎn)方面具有廣泛的應(yīng)用覆劈。它的主要思想是將高維的數(shù)據(jù)映射到低維保礼,使該低維的數(shù)據(jù)能夠反映原高維數(shù)據(jù)的某些本質(zhì)結(jié)構(gòu)特征沛励。流形學(xué)習(xí)的前提是有一種假設(shè),即某些高維數(shù)據(jù)氓英,實(shí)際是一種低維的流形結(jié)構(gòu)嵌入在高維空間中侯勉。流形學(xué)習(xí)的目的是將其映射回低維空間中,揭示其本質(zhì)铝阐。
針對(duì)單細(xì)胞測(cè)速數(shù)據(jù)特點(diǎn)址貌,一般為了提取基因表達(dá)矩陣最重要的特征采用降維分析將多維數(shù)據(jù)的差異投影在低維度上,進(jìn)而揭示復(fù)雜數(shù)據(jù)背后的規(guī)律徘键。Cell Ranger先采用基于IRLBA算法的主成分分析(Principal Components Analysis练对,PCA)將數(shù)據(jù)集的維數(shù)從(Cell x genes)改變?yōu)?Cell x M,M是主成分?jǐn)?shù)量)吹害。然后采用非線性降維算法t-SNE將降維后的數(shù)據(jù)展示在2維或三維空間中螟凭,細(xì)胞之間的基因表達(dá)模式越相似,在t-SNE圖中的距離也越接近它呀。
- 聚類分析(Clustering)
聚類是把相似的對(duì)象通過靜態(tài)分類的方法分成不同的組別或者更多的不連續(xù)子集(subset),這樣讓在同一個(gè)子集中的成員對(duì)象都有相似的一些屬性纵穿。在單細(xì)胞研究中下隧,聚類分析是識(shí)別細(xì)胞異質(zhì)性(heterogeneity)常用的算法。
在PCA空間中谓媒,Cell Ranger分別采用Graph-based和K-Means兩種不同的聚類方法對(duì)細(xì)胞進(jìn)行聚類:
- Graph-based
圖聚類算法包括兩步:首先用PCA降維的數(shù)據(jù)構(gòu)建一個(gè)細(xì)胞間的k近鄰稀疏矩陣淆院,即將一個(gè)細(xì)胞與其歐式距離上最近的k個(gè)細(xì)胞聚為一類,然后在此基礎(chǔ)上用Louvain算法進(jìn)行模塊優(yōu)化(Blondel, Guillaume, Lambiotte, & Lefebvre, 2008)句惯,旨在找到圖中高度連接的模塊土辩。最后通過層次聚類將位于同一區(qū)域內(nèi)沒有差異表達(dá)基因(B-H adjusted p-value 低于0.05)的cluster進(jìn)一步融合,重復(fù)該過程直到?jīng)]有clusters可以合并抢野。
- k-means
K-Means聚類是無(wú)監(jiān)督的機(jī)器學(xué)習(xí)算法拷淘。在PCA降維的空間中隨機(jī)選取的k個(gè)初始質(zhì)心點(diǎn),將每個(gè)點(diǎn)劃分到最近的質(zhì)心指孤,形成K個(gè)簇 辕棚,然后對(duì)于每一個(gè)cluster重新計(jì)算質(zhì)心并再次進(jìn)行劃分,重復(fù)該過程直到收斂邓厕。與圖聚類算法的k意義不同逝嚎,這里的k是事先給定的亞群數(shù)目。
差異分析(Differential Expression)
為了找到不同細(xì)胞亞群之間的差異基因详恼,Cell Ranger使用改進(jìn)的sSeq方法(基于負(fù)二項(xiàng)檢驗(yàn); Yu, Huber, & Vitek, 2013)补君。當(dāng)UMI counts數(shù)值較大時(shí),為加快分析速度昧互,Cell Ranger會(huì)自動(dòng)切換到edgeR進(jìn)行beta檢驗(yàn)(Robinson & Smyth, 2007)挽铁。通過樣品的一個(gè)亞群與該樣品的所有其他亞群進(jìn)行比較伟桅,得到該亞群細(xì)胞與其他亞群細(xì)胞之間的差異基因列表。
Cell Ranger's implementation differs slightly from that in the paper: in the sSeq paper, the authors recommend using DESeq's geometric mean-based definition of library size. Cell Ranger instead computes relative library size as the total UMI counts for each cell divided by the median UMI counts per cell. As with sSeq, normalization is implicit in that the per-cell library-size parameter is incorporated as a factor in the exact-test probability calculations.
化學(xué)批次校正(Chemistry Batch Correction)
為了校正V2V3化學(xué)試劑之間的批次效應(yīng)叽掘,Cell Ranger采用一種基于mutual nearest neighbors (MNN;(Haghverdi et al, 2018)的算法來(lái)識(shí)別批次之間的相似細(xì)胞亞群楣铁。MNN定義為來(lái)自彼此最近鄰集合中包含的兩個(gè)不同批次的細(xì)胞群。使用批次之間匹配的細(xì)胞亞群更扁,將多個(gè)批次合并在一起(Hie et al, 2018)盖腕。MNN對(duì)中細(xì)胞間表達(dá)值的差異提供了批次效應(yīng)的估計(jì)。每個(gè)細(xì)胞校正向量的加權(quán)平均值用來(lái)估計(jì)批次效應(yīng)浓镜。
批次效應(yīng)得分(batch effect score )被定義為定量測(cè)量校正前后的批次效應(yīng)溃列。對(duì)于每個(gè)細(xì)胞,計(jì)算其k個(gè)最近的細(xì)胞(nearest-neighbors)中有m個(gè)屬于同一批次膛薛,并在沒有批次效應(yīng)時(shí)將其標(biāo)準(zhǔn)化為相同批次細(xì)胞的期望值M听隐。批次效應(yīng)得分計(jì)算為隨機(jī)抽取10%的細(xì)胞總數(shù)S中的上述度量的平均值。如果沒有批次效應(yīng)哄啄,我們可以預(yù)期每個(gè)單元格的最近鄰居將在所有批次中均勻共享雅任,批次效應(yīng)得分接近1往史。
最近單細(xì)胞的小伙伴都在看的一篇文章国拇,不知道你看了沒:單細(xì)胞測(cè)序(scRNA-seq)數(shù)據(jù)處理必知必會(huì)
全文完
cellranger-3.0.0/cellranger-cs/3.0.0/bin
cellranger (3.0.0)
Copyright (c) 2018 10x Genomics, Inc. All rights reserved.
-------------------------------------------------------------------------------
Usage:
cellranger mkfastq
cellranger count
cellranger aggr
cellranger reanalyze
cellranger mat2csv
cellranger mkgtf
cellranger mkref
cellranger vdj
cellranger mkvdjref
cellranger testrun
cellranger upload
cellranger sitecheck
What is Cell Ranger?
使用cell ranger拆分10X單細(xì)胞轉(zhuǎn)錄組原始數(shù)據(jù)
使用cell ranger進(jìn)行單細(xì)胞轉(zhuǎn)錄組定量分析
專門分析10x genomic公司的單細(xì)胞轉(zhuǎn)錄組數(shù)據(jù)的軟件套件
10X Genomics 單細(xì)胞 RNA-Seq
10X genomics單細(xì)胞數(shù)據(jù)集探索
cellranger
System Requirements
Getting started with Cell Ranger
Cell Ranger 2.1.0 Gene Expression
CG000201_TechNote_Chromium_Single_Cell_3___v3_Reagent__Workflow___Software_Updates_RevA (1)
bcl2fastq Conversion
實(shí)驗(yàn)記錄2:Cellranger count整理序列數(shù)據(jù)
Single-Library Analysis with Cell Ranger
Aggregating Multiple GEM Groups with cellranger aggr
單個(gè)細(xì)胞的測(cè)序?(part 2)
PIPELINES
Gene Expression Algorithms Overview
單細(xì)胞(single cell)分選平臺(tái)比較(10X Genomics阁最,BD Rhapsody)
mnnCorrect: Mutual nearest neighbors correction
Gephi網(wǎng)絡(luò)圖極簡(jiǎn)教程