Single-Cell in Practice (3): Analyzing Single-Cell Data with STAR

Preface

When aligning single-cell reads with cellranger, you may notice STAR processes running in the mix. So can STAR itself be used to align single-cell data? The STAR 2.7 series (2.7.6a here) includes STARsolo, which can align single-cell data, and that shows how powerful STAR is.

Cellranger output

Before using STAR, let's look at the cellranger output first.

.
├── analysis
│   ├── clustering
│   ├── diffexp
│   ├── pca
│   ├── tsne
│   └── umap
├── cloupe.cloupe
├── filtered_feature_bc_matrix
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── filtered_feature_bc_matrix.h5
├── metrics_summary.csv
├── molecule_info.h5
├── possorted_genome_bam.bam
├── possorted_genome_bam.bam.bai
├── raw_feature_bc_matrix
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── raw_feature_bc_matrix.h5
└── web_summary.html

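For convenience, cellranger provides a web-based report (web_summary.html). We mainly need to check the estimated numbers of cells and genes; the downstream clustering is done with Seurat.

In the results directory you can see the following two directories:

raw_feature_bc_matrix

filtered_feature_bc_matrix

The raw directory holds all barcodes, both cell-associated barcodes and background barcodes, while the filtered directory holds only the cell-associated barcodes. Each directory contains: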

│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz

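The file with the .mtx suffix records the gene expression counts and can be loaded into R or Python for inspection. Each entry in barcodes corresponds to one cell, and features lists the genes. The barcodes file will be needed by STARsolo, which is why I went through the cellranger output first.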
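As a rough illustration of "loading into Python", here is a minimal sketch that reads a Matrix Market coordinate file using only the standard library and sums the counts per barcode. It assumes the usual 10X layout (rows are features, columns are barcodes); the function names `parse_matrix_market`, `load_mtx`, and `counts_per_barcode` are made up for this example, and in practice a library reader would normally be used instead.

```python
import gzip


def parse_matrix_market(lines):
    """Parse Matrix Market coordinate text into (n_rows, n_cols, entries).

    entries maps 0-based (row, col) to the stored value.
    """
    shape = None
    entries = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines, the banner and comment lines
        fields = line.split()
        if shape is None:
            # First data line: number of rows, columns and non-zero entries.
            shape = (int(fields[0]), int(fields[1]))
            continue
        row, col, value = int(fields[0]), int(fields[1]), float(fields[2])
        entries[(row - 1, col - 1)] = value  # the file uses 1-based indices
    if shape is None:
        raise ValueError("no size line found")
    return shape[0], shape[1], entries


def load_mtx(path):
    """Open matrix.mtx or matrix.mtx.gz and parse it."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return parse_matrix_market(handle)


def counts_per_barcode(n_cols, entries):
    """Sum all feature counts for each column (assumed to be one barcode)."""
    totals = [0.0] * n_cols
    for (_, col), value in entries.items():
        totals[col] += value
    return totals


# Tiny in-memory example: 3 features x 2 barcodes.
demo = [
    "%%MatrixMarket matrix coordinate integer general",
    "3 2 3",
    "1 1 5",
    "3 1 2",
    "2 2 7",
]
n_rows, n_cols, entries = parse_matrix_market(demo)
totals = counts_per_barcode(n_cols, entries)
```

For a real run, `load_mtx("filtered_feature_bc_matrix/matrix.mtx.gz")` would be the entry point, with the path adjusted to your own output directory.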

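Analyzing 10X data with STAR

STARsolo is designed as a replacement for the alignment and gene quantification software in 10X CellRanger. STARsolo is also said to be about ten times faster than cellranger (I have not verified the exact figure myself).

Build the index: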

STAR --runMode genomeGenerate --genomeDir ghg38/ --genomeFastaFiles Homo_sapiens.GRCh38.dna.primary_assembly.fa --sjdbGTFfile Homo_sapiens.GRCh38.93.filtered.gtf

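In principle the index built by cellranger (under GRCh38/star/) can also be used, but it may not be recognized because of a version mismatch: the STAR that cellranger used to build that index is not the same version as the STAR we run ourselves.

The difference between STARsolo and ordinary RNA-seq alignment is that you need to supply a whitelist at alignment time. The whitelist file format is described on the 10X website; we can obtain one from cellranger's barcodes.tsv.gz file: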

zcat barcodes.tsv.gz > whitelist
sed -i 's/-1$//' whitelist
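The same whitelist preparation can be sketched in Python, for readers who prefer not to edit the file in place with sed. This is only an illustration: the helper names `strip_gem_suffix` and `write_whitelist` are hypothetical, and the sketch assumes every barcode line ends with a "-<number>" suffix such as "-1" that STARsolo should not see.

```python
import gzip
import re


def strip_gem_suffix(barcode):
    """Remove a trailing '-<digits>' suffix (e.g. '-1') and surrounding whitespace."""
    return re.sub(r"-\d+$", "", barcode.strip())


def write_whitelist(barcodes_gz, out_path):
    """Write one bare barcode per line from a gzipped barcodes.tsv.gz."""
    with gzip.open(barcodes_gz, "rt") as src, open(out_path, "w") as dst:
        for line in src:
            barcode = strip_gem_suffix(line)
            if barcode:
                dst.write(barcode + "\n")


example = strip_gem_suffix("AAACCTGAGAAACCAT-1\n")
```

Calling `write_whitelist("barcodes.tsv.gz", "whitelist")` would then produce the same file as the two shell commands above.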

Note that --readFilesIn takes the cDNA reads first and the barcode+UMI reads second, i.e. R2 first and then R1:

STAR  --genomeDir ghg38/ --readFilesCommand zcat --readFilesIn SRR7722939/SRR7722939_S1_L001_R2_001.fastq.gz SRR7722939/SRR7722939_S1_L001_R1_001.fastq.gz --soloType CB_UMI_Simple --soloCBwhitelist whitelist --runThreadN 8

By default the results are saved in the Solo.out directory. With 8 threads the run took only about 10 minutes, so it is indeed somewhat faster.

├── Barcodes.stats
└── Gene
    ├── Features.stats
    ├── filtered
    │   ├── barcodes.tsv
    │   ├── features.tsv
    │   └── matrix.mtx
    ├── raw
    │   ├── barcodes.tsv
    │   ├── features.tsv
    │   └── matrix.mtx
    ├── Summary.csv
    └── UMIperCellSorted.txt

Let's look at what Summary.csv contains:

Number of Reads,23095815
Reads With Valid Barcodes,0.979732
Sequencing Saturation,0.529367
Q30 Bases in CB+UMI,0.991351
Q30 Bases in RNA read,0.842989
Reads Mapped to Genome: Unique+Multiple,0.959769
Reads Mapped to Genome: Unique,0.874368
Reads Mapped to Transcriptome: Unique+Multipe Genes,0.63797
Reads Mapped to Transcriptome: Unique Genes,0.613655
Estimated Number of Cells,2048
Reads in Cells Mapped to Unique Genes,12815372
Fraction of Reads in Cells,0.904219
Mean Reads per Cell,6257
Median Reads per Cell,5544
UMIs in Cells,5978718
Mean UMI per Cell,2919
Median UMI per Cell,2577
Mean Genes per Cell,915
Median Genes per Cell,873
Total Genes Detected,16265

Compared with cellranger, STAR captured slightly fewer cells, and the reads per cell are somewhat lower; the other metrics are about the same. Later I will use the Seurat package to compare the two datasets.
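To compare such metrics programmatically, Summary.csv can be read as simple "name,value" pairs. The sketch below is a minimal example under that assumption; `parse_summary` is a hypothetical helper, and the excerpt reuses values from the Summary.csv shown above. It also checks an observation (not documented behavior, as far as I know): the per-cell means in this file agree with integer division of the totals by the estimated cell number.

```python
def parse_summary(text):
    """Parse 'metric name,value' lines into a dict of ints and floats."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last comma so the metric name is kept whole.
        name, value = line.rsplit(",", 1)
        metrics[name] = float(value) if "." in value else int(value)
    return metrics


# Excerpt copied from the Summary.csv above.
SUMMARY_EXCERPT = """\
Number of Reads,23095815
Reads With Valid Barcodes,0.979732
Estimated Number of Cells,2048
Reads in Cells Mapped to Unique Genes,12815372
Mean Reads per Cell,6257
UMIs in Cells,5978718
Mean UMI per Cell,2919
"""

metrics = parse_summary(SUMMARY_EXCERPT)
cells = metrics["Estimated Number of Cells"]
mean_reads = metrics["Reads in Cells Mapped to Unique Genes"] // cells
mean_umis = metrics["UMIs in Cells"] // cells
print(mean_reads, mean_umis)
```

The same function could be pointed at the contents of Solo.out/Gene/Summary.csv from another run to line the two sets of numbers up side by side.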

Other STARsolo options

--soloType
default: None
string(s): type of single-cell RNA-seq
CB_UMI_Simple
(a.k.a. Droplet) one UMI and one Cell Barcode of fixed length in read2, e.g. Drop-seq and 10X Chromium.
CB_UMI_Complex
one UMI of fixed length, but multiple Cell Barcodes of varying length, as well as adapter sequences, are allowed in read2 only, e.g. inDrop.
CB_samTagOut
output Cell Barcode as CR and/or CB SAM tag. No UMI counting. --readFilesIn cDNA_read1 [cDNA_read2 if paired-end] CellBarcode_read. Requires --outSAMtype BAM Unsorted [and/or SortedByCoordinate]
SmartSeq
Smart-seq: each cell in a separate FASTQ (paired- or single-end), barcodes are corresponding read-groups, no UMI sequences, alignments deduplicated according to alignment start and end (after extending soft-clipped bases)

--soloCBwhitelist
default: -
string(s): file(s) with whitelist(s) of cell barcodes. Only --soloType CB_UMI_Complex allows more than one whitelist file.
None
no whitelist: all cell barcodes are allowed

--soloCBstart
default: 1
int>0: cell barcode start base

--soloCBlen
default: 16
int>0: cell barcode length

--soloUMIstart
default: 17
int>0: UMI start base

--soloUMIlen
default: 10
int>0: UMI length
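The four options above describe fixed positions on the barcode read. A minimal sketch of how those 1-based start positions and lengths translate into slices of the read is shown below; `split_barcode_read` is a hypothetical helper written for illustration, not STAR's code, and its defaults simply mirror the defaults listed above.

```python
def split_barcode_read(seq, cb_start=1, cb_len=16, umi_start=17, umi_len=10):
    """Cut the cell barcode and UMI out of a barcode read.

    Start positions are 1-based, as in the option descriptions, so they are
    shifted by one to get Python's 0-based slice indices.
    """
    cell_barcode = seq[cb_start - 1:cb_start - 1 + cb_len]
    umi = seq[umi_start - 1:umi_start - 1 + umi_len]
    if len(cell_barcode) != cb_len or len(umi) != umi_len:
        raise ValueError("barcode read is too short for the requested layout")
    return cell_barcode, umi


# A made-up 26-base barcode read: 16 bases of cell barcode + 10 bases of UMI.
read = "AAACCTGAGAAACCAT" + "TTTTTGGGGG"
cell_barcode, umi = split_barcode_read(read)
```

With the defaults, the layout is 16 + 10 = 26 bases. Libraries with a 12-base UMI (as in 10X v3 chemistry) would need --soloUMIlen 12, which corresponds to `umi_len=12` here.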

--soloBarcodeReadLength
default: 1
int: length of the barcode read
1
equal to sum of soloCBlen+soloUMIlen
0
not defined, do not check

--soloCBposition
default: -
string(s): position of Cell Barcode(s) on the barcode read.
Presently only works with --soloType CB_UMI_Complex, and barcodes are assumed to be on Read2.
Format for each barcode: startAnchor_startPosition_endAnchor_endPosition
start(end)Anchor defines the Anchor Base for the CB: 0: read start; 1: read end; 2: adapter start; 3: adapter end
start(end)Position is the 0-based position of the CB start(end) with respect to the Anchor Base
Strings for different barcodes are separated by space.
Example: inDrop (Zilionis et al, Nat. Protocols, 2017):
--soloCBposition 0_0_2_-1 3_1_3_8

--soloUMIposition
default: -
string: position of the UMI on the barcode read, same as soloCBposition

--soloAdapterSequence
default: -
string: adapter sequence to anchor barcodes.

--soloAdapterMismatchesNmax
default: 1
int>0: maximum number of mismatches allowed in adapter sequence.

--soloCBmatchWLtype
default: 1MM_multi
string: matching the Cell Barcodes to the WhiteList
Exact
only exact matches allowed
1MM
only one match in whitelist with 1 mismatched base allowed. Allowed CBs have to have at least one read with exact match.
1MM_multi
multiple matches in whitelist with 1 mismatched base allowed, posterior probability calculation is used to choose one of the matches. Allowed CBs have to have at least one read with exact match. Similar to CellRanger 2.2.0
1MM_multi_pseudocounts
same as 1MM_multi, but pseudocounts of 1 are added to all whitelist barcodes. Similar to CellRanger 3.x.x

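To make the "1MM" rule above more concrete, here is a small sketch of one way to read it: an observed barcode that is not in the whitelist is rescued only if exactly one whitelist barcode lies within one mismatch of it, and that whitelist barcode has at least one exactly matching read. This is an interpretation of the option text, not STAR's implementation, and the function names are invented for the example. The posterior-probability step of 1MM_multi is not modeled.

```python
from collections import Counter


def one_mismatch_neighbors(barcode, alphabet="ACGT"):
    """Yield every sequence that differs from barcode at exactly one position."""
    for i, base in enumerate(barcode):
        for alt in alphabet:
            if alt != base:
                yield barcode[:i] + alt + barcode[i + 1:]


def match_barcodes_1mm(observed, whitelist):
    """Assign each observed barcode to a whitelist barcode, or None.

    Exact matches are kept. Otherwise the barcode is corrected only when a
    single whitelist barcode with at least one exact read is 1 mismatch away.
    """
    whitelist = set(whitelist)
    exact_reads = Counter(bc for bc in observed if bc in whitelist)
    assigned = []
    for bc in observed:
        if bc in whitelist:
            assigned.append(bc)
            continue
        candidates = {n for n in one_mismatch_neighbors(bc) if n in exact_reads}
        assigned.append(candidates.pop() if len(candidates) == 1 else None)
    return assigned


whitelist = ["AAAA", "CCCC", "TTTT"]
observed = ["AAAA", "AAAT", "CCCG", "GGGG"]
result = match_barcodes_1mm(observed, whitelist)
print(result)
```

In this toy input, AAAT is corrected to AAAA, while CCCG is dropped because CCCC never appears as an exact match.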
--soloStrand
default: Forward
string: strandedness of the solo libraries:
Unstranded
no strand information
Forward
read strand same as the original RNA molecule
Reverse
read strand opposite to the original RNA molecule
--soloFeatures
default: Gene
string(s): genomic features for which the UMI counts per Cell Barcode are collected
Gene
genes: reads match the gene transcript
SJ
splice junctions: reported in SJ.out.tab
GeneFull
full genes: count all reads overlapping genes' exons and introns

--soloUMIdedup
default: 1MM_All
string(s): type of UMI deduplication (collapsing) algorithm
1MM_All
all UMIs with 1 mismatch distance to each other are collapsed (i.e. counted once)
1MM_Directional
follows the "directional" method from the UMI-tools by Smith, Heger and Sudbery (Genome Research 2017).
Exact
only exactly matching UMIs are collapsed
NoDedup
no deduplication of UMIs, count all reads. Allowed for --soloType SmartSeq
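The difference between NoDedup, Exact, and 1MM_All can be illustrated on the UMIs seen for one gene in one cell. The sketch below treats 1MM_All as counting connected groups of UMIs linked by single mismatches, which is my reading of "collapsed" in the description above and should be taken as an assumption; `count_umis` and `hamming_is_one` are hypothetical names, and 1MM_Directional is not covered.

```python
def hamming_is_one(a, b):
    """True if two equal-length sequences differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def count_umis(umis, mode="1MM_All"):
    """Count molecules for one cell/gene pair under a given dedup mode."""
    if mode == "NoDedup":
        return len(umis)  # every read counts
    unique = sorted(set(umis))
    if mode == "Exact":
        return len(unique)  # identical UMIs collapse, nothing else
    if mode != "1MM_All":
        raise ValueError("unsupported mode: " + mode)

    # Union-find over unique UMIs: link any two that are 1 mismatch apart.
    parent = {u: u for u in unique}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for i, a in enumerate(unique):
        for b in unique[i + 1:]:
            if hamming_is_one(a, b):
                parent[find(a)] = find(b)
    # Each connected group is counted once.
    return len({find(u) for u in unique})


umis = ["AAAA", "AAAA", "AAAT", "AATT", "CCCC"]
counts = {mode: count_umis(umis, mode) for mode in ("NoDedup", "Exact", "1MM_All")}
print(counts)
```

Here AAAA, AAAT and AATT form one chain of single mismatches, so 1MM_All counts them as one molecule alongside CCCC.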

--soloUMIfiltering
default: -
string(s): type of UMI filtering
-
basic filtering: remove UMIs with N and homopolymers (similar to CellRanger 2.2.0)
MultiGeneUMI
remove lower-count UMIs that map to more than one gene (introduced in CellRanger 3.x.x)

--soloOutFileNames
default: Solo.out/ features.tsv barcodes.tsv matrix.mtx
string(s): file names for STARsolo output:
file name prefix, gene names, barcode sequences, cell feature count matrix

--soloCellFilter
default: CellRanger2.2 3000 0.99 10
string(s): cell filtering type and parameters
CellRanger2.2
simple filtering of CellRanger 2.2, followed by three numbers: number of expected cells, robust maximum percentile for UMI count, maximum to minimum ratio for UMI count
TopCells
only report top cells by UMI count, followed by the exact number of cells
None
do not output filtered cells
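The three numbers after CellRanger2.2 can be read as a simple threshold rule: take the UMI count at the robust maximum (here the 99th percentile among the expected number of cells), divide it by the ratio, and keep every barcode at or above that value. The sketch below follows that reading; the exact index and rounding conventions are my assumptions rather than STAR's code, and `cellranger22_filter` is a hypothetical name.

```python
def cellranger22_filter(umi_counts, n_expected=3000, max_percentile=0.99, max_min_ratio=10):
    """Return (threshold, kept_counts) for a simple knee-style cell filter.

    umi_counts: total UMI count per barcode, in any order.
    """
    if not umi_counts:
        return 0, []
    ranked = sorted(umi_counts, reverse=True)
    # Position of the robust maximum among the expected cells (assumed convention).
    index = int(round(n_expected * (1 - max_percentile), 6))
    index = min(index, len(ranked) - 1)
    robust_max = ranked[index]
    # Barcodes need at least robust_max / ratio UMIs, and at least 1.
    threshold = max(1, round(robust_max / max_min_ratio))
    kept = [count for count in ranked if count >= threshold]
    return threshold, kept


# Made-up counts: four clear cells and two background barcodes.
counts = [1000, 900, 800, 95, 50, 5]
threshold, kept = cellranger22_filter(counts, n_expected=100)
print(threshold, len(kept))
```

With these toy numbers the robust maximum is 900, so the threshold is 90 and four barcodes are kept as cells.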

--soloOutFormatFeaturesGeneField3
default: "Gene Expression"
string(s): field 3 in the Gene features.tsv file. If "-", then no 3rd field is output.

Please credit the source when reposting: 周小釗的博客 (Zhou Xiaozhao's blog) >>> Single-Cell in Practice (3): Analyzing Single-Cell Data with STAR
