mRNA-seq轉(zhuǎn)錄組二代測序從raw reads到表達(dá)矩陣：上中游分析pipeline

mRNA-seq上中游分析需要用到Linux系統(tǒng)的terminal！

盡管TCGA數(shù)據(jù)庫中已經(jīng)提供了大量的處理好的表達(dá)矩陣數(shù)據(jù)耳贬，但是無論是自身實驗需要，還是從NCBI的GEO數(shù)據(jù)庫（儲存了許多發(fā)表文章樣本的測序原始數(shù)據(jù)）下載文件晶姊，都會經(jīng)常有處理轉(zhuǎn)錄組測序下機(jī)原始數(shù)據(jù)的需求表锻。因此我嘗試總結(jié)出一個具有一定普適性的mRNA-seq上中游pipeline，可以供有需求的朋友進(jìn)行參考威鹿、學(xué)習(xí)和交流剃斧。第一次寫shell腳本，望包涵忽你！

轉(zhuǎn)錄組應(yīng)該大家都很熟悉了幼东，就不多介紹了，通常我們可以使用針對RNA的NGS（next generation sequencing）來高通量檢測不同條件下細(xì)胞組織內(nèi)的基因（/non coding RNA）的表達(dá)情況科雳，此外更高深度的RNA測序還可以用來檢測融合基因根蟹、可變剪接等。本教程著眼于適用面最廣的表達(dá)量檢測糟秘，嘗試對GEO數(shù)據(jù)庫中的一批二代RNA-seq數(shù)據(jù)進(jìn)行下載和自己重新處理简逮。

從原始文件到表達(dá)矩陣！

mRNA-seq的上中游數(shù)據(jù)處理流程大致可以分為如下步驟：
原始數(shù)據(jù)質(zhì)檢→低質(zhì)量堿基去除+低質(zhì)量reads過濾→（過濾數(shù)據(jù)的再次質(zhì)檢）→reads比對到參考基因組→格式轉(zhuǎn)換→去除多重比對的低質(zhì)量比對結(jié)果和去重復(fù)reads→counts計算（或RPKM/FPKM/TPM計算）

至于這些流程為何被稱為上中游分析尿赚，是相對以表達(dá)矩陣為起點(diǎn)的差異分析散庶、通路富集分析等后續(xù)分析流程命名的~

軟件準(zhǔn)備

由于生信的迅速發(fā)展，現(xiàn)在每一步流程都有不少已經(jīng)開發(fā)好的軟件可供選擇凌净，我這里都是使用了較多人使用的經(jīng)典軟件悲龟。

（1）原始數(shù)據(jù)質(zhì)檢：FastQC

FastQC在reads質(zhì)檢界的地位已經(jīng)是無可撼動的了。關(guān)于FastQC的具體參數(shù)和報告解讀冰寻，字?jǐn)?shù)原因不多贅述须教，推薦簡書教程：利用fastqc檢測原始序列的質(zhì)量

安裝

Anaconda依賴python，其中的bioconda子庫包含了很多常見的生信軟件斩芭，并且通過它下載可以直接添加到環(huán)境變量（調(diào)用時不需要輸入絕對路徑）没卸，非常方便！那么我們需要下載anaconda秒旋，添加bioconda查找路徑，然后就可以愉快地下載這些軟件了诀拭。

下載并配置anaconda3.

#download and install the software.
wget https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh -P ~/Downloads
bash ~/Downloads/Anaconda3-2020.02-Linux-x86_64.sh

#Then we have to go through a long agreement, and enter yes and ENTER to complete the installation.
#Then anaconda3 has been installed at /home/xuzihan/anaconda3

#We need to add the command to the environment.
vi ~/.bashrc
#Go to the bottom of this file, press "i" to start text insertion. And add this command:
export PATH="$PATH:/home/xuzihan/anaconda3/bin"
#Press esc to exit the texting mode, then press ":", and enter "wq!" to save the modulation and exit the vim texter.

#reload the environment setting.
source ~/.bashrc

#add some databases to anaconda3 for bioinfo software searching and downloading.
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

anaconda安裝過程的截圖

fastqc下載

conda install fastqc
#then press "y" to confirm the installation.

然后就安裝好并已經(jīng)加入環(huán)境變量了迁筛！是不是很方便！

使用

普通用法很簡單，輸入fastq文件细卧，指定輸出目錄即可尉桩。我們可以得到每個樣本的html網(wǎng)頁式的QC報告，里面對這一樣本的測序數(shù)據(jù)各項參數(shù)進(jìn)行了解讀贪庙。

fastqc raw.reads.fastq -o ./fastqc.output.directory

（2）低質(zhì)量堿基去除+低質(zhì)量reads過濾：trimmomatic

Trimmomatic是基于java的使用非常廣泛的去除接頭和質(zhì)量過濾的軟件蜘犁，通過它的reads去除了部分reads中也測到的測序建庫時連接的外源接頭，和可能影響reads序列準(zhǔn)確性的低質(zhì)量堿基止邮，使數(shù)據(jù)質(zhì)量更好这橙。關(guān)于它的工作原理和模式可參考簡述教程：NGS 數(shù)據(jù)過濾之 Trimmomatic 詳細(xì)說明

安裝

從開發(fā)者網(wǎng)站上下載壓縮包解壓后就可以直接得到j(luò)ar文件了。

wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip -P /home/xuzihan/
unzip Trimmomatic-0.39.zip
mv ~/Trimmomatic-0.39 ~/trimmomatic
#Then we will get a folder containing adapters and trimmomatic java package.

使用

本教程使用的測序數(shù)據(jù)是單端測序（Single End）导披，所以使用SE模式就好屈扎。

java -jar /home/xuzihan/trimmomatic/trimmomatic-0.39.jar SE -threads 8 -phred33 input_reads.fastq output_reads.fastq ILLUMINACLIP:/home/xuzihan/trimmomatic/adapters/TruSeq3-SE.fa:4:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

參數(shù)含義：用java執(zhí)行jar文件，開啟8線程撩匕，使用單端模式鹰晨，測序質(zhì)量分?jǐn)?shù)體系為phred33（現(xiàn)在的illumina測序都是），輸入input_reads.fastq止毕，輸出output_reads.fastq模蜡。過濾參數(shù)包括：使用trimmomatic自帶的TruSeq3-SE.fa作為測序接頭序列進(jìn)行剪切；對reads頭尾均以堿基質(zhì)量分?jǐn)?shù)為3作為閾值進(jìn)行剪切扁凛；進(jìn)行滑窗質(zhì)檢忍疾，通過標(biāo)準(zhǔn)為4個堿基構(gòu)成的滑窗內(nèi)平均質(zhì)量分?jǐn)?shù)大于15；剪切后可接受的read最小長度為50.

（3）參考基因組比對：bowtie2

對于有參轉(zhuǎn)錄組測序令漂，我們需要將測到的reads回貼到參考基因組上膝昆，從而鑒定其所屬的基因組位置。這一步的選擇其實挺多的叠必，比如適合短序列的bowtie荚孵，相對更適合50bp以上長序列的bowtie2，還有BWA纬朝，HISAT2藕帜，Tophat等。Bowtie2在運(yùn)行時間上比較有優(yōu)勢惰拱，還是很受歡迎的仪吧。關(guān)于bowtie的詳細(xì)解釋和使用參數(shù)，請移步bowtie2 manual：bowtie2 manual

安裝

基于Anaconda3就可以直接安裝了隅茎，一步到位澄峰。

conda install bowtie2
#Then press "y" to confirm the installation.

使用

第一次進(jìn)行參考基因組比對前，需要對各數(shù)據(jù)庫（ENSEMBL/NCBI/UCSC等）發(fā)布的標(biāo)準(zhǔn)人類基因組數(shù)據(jù)進(jìn)行index辟犀。完成index后我們會獲得6個以bt2為后綴的index文件俏竞，一勞永逸，以后就可以直接使用了。

我個人比較喜歡ENSEMBL數(shù)據(jù)庫的版本魂毁，會進(jìn)行定期更新玻佩，此外版本號比較規(guī)整，每個版本的參考基因組（fasta文件）和基因注釋（gtf文件）會同步發(fā)布席楚，不會怕版本間出現(xiàn)錯誤咬崔。

本教程會使用到的參考基因組是剛剛（4.29.20）發(fā)布的release100中的Homo_sapiens.GRCh38.dna.primary_assembly.fa，以及對應(yīng)的release100中的Homo_sapiens.GRCh38.100.gtf注釋文件（在最后數(shù)counts時才會用到）烦秩。（ftp://ftp.ensembl.org/pub/release-100/）

bowtie2-build /The_directory_containig_downloaded_reference_genome_fasta/Homo_sapiens.GRCh38.dna.primary_assembly.fa The_unified_index_prefix
#Then we can get 6 index files: The_unified_index_prefix.1.bt2 ~.2.bt2 ~3.bt2 ~4.bt2 The_unified_index_prefix.rev.1.bt2 ~.rev.2.bt2

然后就是用測序得到的reads進(jìn)行比對了垮斯。由于要下載的是單端測序數(shù)據(jù)，所以使用-U參數(shù)闻镶。-p為線程數(shù)甚脉，-x定義參考基因組index，-S定義輸出sam文件名铆农。其他更復(fù)雜和相對不常用的參數(shù)使用參見manual牺氨。

bowtie2 -p 8 -x /directory/The_unified_index_prefix -U input_reads.fastq -S output.sam

（4）格式轉(zhuǎn)化：samtools

比對后輸出的格式為sam文件，我們選擇將其壓縮到二進(jìn)制的bam文件中墩剖。我們還需要對比對上的bam格式的reads按照基因組的位置進(jìn)行排序猴凹，方便進(jìn)行基因組注釋。samtools絕對是這一過程的不二選擇岭皂。此外郊霎，在有其他需求的時候，它也含有其他功能爷绘。詳細(xì)參數(shù)解讀可右轉(zhuǎn)至：samtools常用命令詳解

安裝

conda install samtools
#Then press "y" to confirm the installation.

使用

-S參數(shù)指輸入格式為.sam书劝，-b參數(shù)指輸出.bam。-o參數(shù)指輸出文件土至。

#format transformation.
samtools view -bS input.sam > output.bam
#sort the reads by genomic sites.
samtools sort input.bam -o sorted_output.bam

（5）過濾低質(zhì)量比對結(jié)果：sambamba

相比picard购对，好像sambamba這個軟件小眾很多了。不過它的語法挺有意思的陶因，在基于關(guān)鍵詞的基礎(chǔ)上支持一點(diǎn)自然語言骡苞。sambamba的處理速度確實是挺快的！這也是它的一個亮點(diǎn)楷扬。具體參數(shù)解讀見：Sambamba: process your BAM data faster!

安裝

conda install sambamba
#Then press "y" to confirm the installation.

使用

-t參數(shù)指線程解幽；-h指保留bam文件的header；-f指輸入文件類型烘苹；-F配合后面的語法表示過濾的條件躲株。這里是表示過濾多重比對/未比對上/鑒定為PCR復(fù)制的reads。為了減少假陽性镣衡，本教程就將條件設(shè)置地嚴(yán)苛一些徘溢，僅僅只保留比對到參考基因組上一次的非PCR重復(fù)reads吞琐。（sam文件中，XS參數(shù)是指reads比對到多個參考基因組時然爆，除了最佳比對位置外的第二好得分，它存在也就意味著read存在多次比對）

sambamba view -h -t 8 -f bam -F "[XS] == null and not unmapped and not duplicate" ordered.bam > ordered_filtered.bam

（6）從比對后bam文件得到各個樣本的表達(dá)矩陣：htseq

TCGA數(shù)據(jù)庫可供下載的開放數(shù)據(jù)包括htseq-counts黍图，就可以說明這一軟件的使用廣泛程度了曾雕。過濾得到的bam通過htseq處理后得到的counts矩陣就是可以進(jìn)行差異分析的標(biāo)準(zhǔn)形式了！

安裝

conda install htseq
#Then press "y" to confirm the installation.

使用

最常用的就是通過htseq獲得各基因組注釋下的count數(shù)了助被。-s規(guī)定是否為鏈特異性建庫（默認(rèn)為yes剖张，不懂為啥要用這個默認(rèn)...）；-f指輸入文件形式揩环；輸入已排序和過濾的bam文件搔弄，以及能和參考基因組對應(yīng)的基因組注釋（這里就是ENSEMBL-release100的Homo_sapiens.GRCh38.100.gtf）；最后輸出該樣本的counts文件丰滑。

htseq-count -f bam -s no ordered_filtered.bam /directory/Homo_sapiens.GRCh38.100.gtf > count.txt

（7）從GEO數(shù)據(jù)庫下載：sra-toolkit

通過sra-toolkit中的prefetch指令顾犹，我們只需要知道GEO數(shù)據(jù)庫中想下載樣本的SRR序列號，就可以直接下載數(shù)據(jù)了褒墨。GEO數(shù)據(jù)庫的結(jié)構(gòu)和使用不是本教程的重點(diǎn)炫刷，所以大家可以自行補(bǔ)充學(xué)習(xí)！

雖然sra-toolkit可以通過apt-get install命令下載郁妈，也可以通過conda install下載浑玛，但它在安裝后常需要configure（比如重新設(shè)置下載文件的所屬路徑），因此我們需要明確sra-toolkit的安裝路徑噩咪。然而不巧的是好像這倆命令的安裝路徑不是很好找顾彰，所以就還是使用麻煩一點(diǎn)但是更透明的辦法：自行下載安裝包-設(shè)置環(huán)境變量。

安裝

#download the software.
mkdir ~/sratoolkit && cd ~/sratoolkit
wget http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.10.5/sratoolkit.2.10.5-ubuntu64.tar.gz -P ~/sratoolkit
tar -xzvf sratoolkit.2.10.5-ubuntu64.tar.gz

#modulate the environment.
vi ~/.bashrc
#add the directory containing commands of sratoolkit into the environment.
export PATH="$PATH:/home/xuzihan/sratoolkit/sratoolkit.2.10.5-ubuntu64/bin"
#restart
source ~/.bashrc

#Configure the software: First change the directory to 'bin' of sratoolkit, then use the command to open an interactive window.
cd /home/xuzihan/sratoolkit/sratoolkit.2.10.5-ubuntu64/bin
vdb-config --interactive

彈出的交互式窗口可以使用鍵盤和鼠標(biāo)結(jié)合操作胃碾。

使用

sratoolkit的默認(rèn)下載路徑是~/ncbi/public涨享。下載的原始數(shù)據(jù)格式為.sra格式。

此外书在，我使用了美國的服務(wù)器灰伟，所以使用起來沒有網(wǎng)速問題。但是據(jù)說國內(nèi)的朋友直接使用prefetch下載數(shù)據(jù)的速度是非常慢的儒旬，解決辦法就是使用aspera協(xié)助下載（本處不提及栏账，網(wǎng)上解決辦法超多的）。

#download a single SRR file.
prefetch SRRxxxxxx

#download multiple files.
prefetch --option-file the_text_containing_one_series_no_per_row.txt

默認(rèn)路徑下載的sra文件放在了~/ncbi/public/sra中栈源。

測序raw數(shù)據(jù)通常為.fastq挡爵，所以我們繼續(xù)用sratoolkit中的fastq-dump對下載得到的.sra文件進(jìn)行格式轉(zhuǎn)換。由于下載的是單端測序數(shù)據(jù)甚垦，所以直接用如下命令就可以在同文件夾中獲得相應(yīng)fastq文件茶鹃，作為一系列分析的起點(diǎn)涣雕。雙端測序的格式轉(zhuǎn)換命令是不同的！那再說吧...

fastq-dump /path/SRRxxxx.sra

開始分析

我把以上的命令集成了寫成了一個shell腳本闭翩，通過輸入GEO SRR_ID挣郭，GEO SRR_ID_phenotype，phenotype的字符數(shù)疗韵，和使用的接頭序列路徑兑障，就可以進(jìn)行一步完整分析了。

單端測序比雙端要簡單一些蕉汪，那就先拿單端練練手吧流译。教程使用的原文件是2013年在ncbi上發(fā)表的卵母細(xì)胞（n=3）和合子（n=3）的轉(zhuǎn)錄組單端測序數(shù)據(jù)。

首先我們需要查詢得到想從GEO下載的原始數(shù)據(jù)編號者疤，然后使用nano文本編輯器編寫成每行含有一個SRR ID的文件（id.txt）福澡，以及對應(yīng)的分組表型文件：我這里自定義了一下，格式為SRR ID_表型分組信息驹马，每行一個（group.txt）革砸。

注：關(guān)于表型分組信息需要特別說明一下：由于腳本編寫時文件的內(nèi)部重命名依賴了分組表型的字符數(shù)，所以運(yùn)行這個腳本需要分組名稱的字符數(shù)相等窥翩。如本例中的分組：oocyte_1/2/3和zygote_1/2/3业岁，都分別是8個字符。所以將8作為第三個參數(shù)輸入寇蚊。

mkdir ~/rna.seq.tutorial && cd ~/rna.seq.tutorial
nano id.txt
#SRR445718 SRR445719 SRR445720 SRR445721 SRR445722 SRR445723
nano group.txt
#SRR445718_oocyte_1 SRR445719_oocyte_2 SRR445720_oocyte_3 SRR445721_zygote_1 SRR445722_zygote_2 SRR445723_zygote_3

兩個輸入文檔的展示

接下來編寫腳本笔时。

這個腳本不需要超級用戶root。它依賴的系統(tǒng)目錄構(gòu)架如下：

系統(tǒng)中的目錄構(gòu)架

所以我們只需要配置好軟件仗岸；把sratoolkit的下載路徑設(shè)置為~/ncbi/public或是默認(rèn)允耿；把rna.seq.tutorial目錄建立，并在其中寫入腳本扒怖，id.txt和group.txt较锡，就可以開始愉快運(yùn)行腳本了。

腳本內(nèi)容如下盗痒，還是建議通讀一下的蚂蕴。

#!/bin/bash
#nano mrna.seq.pipeline.sh
#we can use it by entering: bash mrna_seq_pipeline.sh ~/rna.seq.tutorial/id.txt ~/rna.seq.tutorial/group.txt length_of_the_group_name absolute_directory_of_adapters_for_trimmomatic

#download sra files 
prefetch --option-file $1
cd ../ncbi/public/sra

#get the fastq raw data.
for files in `cat $1`
do 
echo ${files}' is being processed'
fastq-dump ${files}'.sra'
seriesngroup=`grep ${files} $2`
total_length=${#seriesngroup}
group_this_file=${seriesngroup:10:$3} #I postulate the SRR series are 10-digit series numbers.
mv ${files}'.fastq' ${group_this_file}'.fastq'
done
group_names=`ls *.fastq`
echo "fastq raw data of all samples have been all fetched."

#quality control by fastqc.
if [ -d './fastqc.output' ] 
then
    echo 'fastqc.output directory has existed. Good!'
else
    echo 'fastqc.output directory has not been established. Now we will make it!'
    mkdir ./fastqc.output
fi
fastqc *.fastq -o ./fastqc.output
echo "QC on raw reads has been done."

#use trimmomatic to filter the reads with low quality and trim low-quality bases.
if [ -d './trimmed_fastq' ] 
then
    echo 'trimmed_fastq directory has existed. Good!'
else
    echo 'trimmed_fastq directory has not been established. Now we will make it!'
    mkdir ./trimmed_fastq
fi
for files in ${group_names}
do 
echo ${files}' is being filtered and trimmed'
java -jar ~/trimmomatic/trimmomatic-0.39.jar SE -threads 8 -phred33 "./${files}" "./trimmed_fastq/clean_${files}" ILLUMINACLIP:$4:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50
done
echo "Reads trimming has been done."

#perform quality control again on these clean data.
cd ./trimmed_fastq
if [ -d './clean.fastqc.output' ] 
then
    echo 'clean.fastqc.output directory has existed. Good!'
else
    echo 'clean.fastqc.output directory has not been established. Now we will make it!'
    mkdir ./clean.fastqc.output
fi
fastqc *.fastq -o ./clean.fastqc.output
echo "QC on trimmed reads has been done."

#use bowtie2 to align the reads to the reference.
if [ -d '~/hg.38.ref' ] 
then
    cd ~/hg.38.ref
    if [ -e 'Homo_sapiens.GRCh38.dna.primary_assembly.fa' ]
    then
    echo 'the reference genome has existed. Good!'
    else
    echo 'Now we will download the reference genome.'
        wget ftp://ftp.ensembl.org/pub/release-100/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
        gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
    fi
else
    echo 'Now we will make the directory and download the reference genome.'
    mkdir ~/hg.38.ref
    wget ftp://ftp.ensembl.org/pub/release-100/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
    gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
fi
if [ -d '~/hg.38.ref/hg38.index' ] 
then
    cd ~/hg.38.ref/hg38.index
    if [ -e 'hg.38.1.bt2' ] && [ -e 'hg.38.2.bt2' ] && [ -e 'hg.38.3.bt2' ] && [ -e 'hg.38.4.bt2' ] && [ -e 'hg.38.rev.1.bt2' ] && [ -e 'hg.38.rev.2.bt2' ]
    then
        echo 'The reference genome index has existed. Good!'
    else
        echo 'Bowtie2 is working on reference index establishment.'
        bowtie2-build ~/hg.38.ref/Homo_sapiens.GRCh38.dna.primary_assembly.fa hg.38
    fi
else
    echo 'Bowtie2 is working on reference index establishment.'
    mkdir ~/hg.38.ref/hg38.index && cd ~/hg.38.ref/hg38.index
    bowtie2-build ~/hg.38.ref/Homo_sapiens.GRCh38.dna.primary_assembly.fa hg.38
fi
cd ~/ncbi/public/sra/trimmed_fastq
for trimmed_files in `ls *.fastq`
do
    echo ${trimmed_files}' is being aligned with hg38 reference.'
    bowtie2 -p 8 -x ~/hg.38.ref/hg38.index/hg.38 -U ${trimmed_files} -S "${trimmed_files}.sam" --no-unal
done
echo "The bowtie2 alignment on samples has been done."

#use samtools to transform and order the aligned results.
if [ -d '../bam' ]
then
    echo 'bam directory has existed.'
else
    mkdir ../bam
fi
for sams in `ls *.sam`
do
    cd ../bam
    charlength=$((6+$3))
    bam_file=${sams:0:$charlength}
    samtools view -bS "../trimmed_fastq/$sams" > "${bam_file}.bam"
    samtools sort "${bam_file}.bam" -o "${bam_file}_ordered.bam"
    rm "${bam_file}.bam"
    sambamba view -h -t 8 -f bam -F "[XS] == null and not unmapped and not duplicate" "${bam_file}_ordered.bam" > "${bam_file}_ordered_filtered.bam"
    rm "${bam_file}_ordered.bam"
done
echo "alignment reordering and filtering has been done."

#get the annotation for the genome and transform .gff file to .gtf file for quantification.
if [ -e '~/hg.38.ref/Homo_sapiens.GRCh38.100.gtf' ]
then
    echo "GTF file has all set!"
else
    echo "Now download the GTF annotation file."
    wget ftp://ftp.ensembl.org/pub/release-100/gtf/homo_sapiens/Homo_sapiens.GRCh38.100.gtf.gz -P ~/hg.38.ref
    gunzip Homo_sapiens.GRCh38.100.gtf.gz
fi

#Use HTSeq to assemble the reads and complete the annotation.
if [ -d '../count_matrix' ]
then
    echo 'count_matrix directory has existed.'
else
    mkdir ../count_matrix
fi
cd ~/ncbi/public/sra/bam
for bams in `ls * | grep "filtered.bam"`
do
    cd ../count_matrix
    count_file_name=${bams:6:$3}
    htseq-count -f bam -s no "../bam/$bams" ~/hg.38.ref/Homo_sapiens.GRCh38.100.gtf > "${count_file_name}_count.txt"
done
echo "gene annotation and counts extraction have been done."

最終開始運(yùn)行：共4個參數(shù)，第一個為id.txt俯邓，第二個group.txt骡楼，第三個為表型分組的字符數(shù)，第四個為測序用到的接頭序列fasta稽鞭。

bash ~/rna.seq.tutorial/mrna.seq.pipeline.sh ~/rna.seq.tutorial/id.txt ~/rna.seq.tutorial/group.txt 8 ~/trimmomatic/adapters/TruSeq3-SE.fa

運(yùn)行過程中terminal的一些顯式進(jìn)程大致如下鸟整，我截了一些圖：

腳本運(yùn)行過程中的一些進(jìn)程顯示

80%的修剪后保留率，80%的比對率都提示數(shù)據(jù)的總體質(zhì)量還不錯朦蕴。

我們看看數(shù)據(jù)的兩次fastqc結(jié)果：

兩次fastqc結(jié)果的比較（左-原始數(shù)據(jù)篮条、右-trimmomatic修剪后數(shù)據(jù)）

左邊是原始數(shù)據(jù)的QC report弟头，右邊是trimmomatic處理后的QC report。

（1）堿基質(zhì)量分?jǐn)?shù)：可見原始數(shù)據(jù)中涉茧，隨著reads的延伸赴恨，堿基的測序質(zhì)量快速降低，總體已經(jīng)達(dá)到不能接受的程度降瞳。在過濾修剪后嘱支，每個堿基位置的總體質(zhì)量分?jǐn)?shù)分布均在綠色范圍，說明過濾很好地提升了測序reads的質(zhì)量挣饥。

（2）reads長度：可見原始數(shù)據(jù)中，測到的reads都是100bp長度沛膳。在修剪后扔枫，reads長度出現(xiàn)差異，分布在50-100bp間锹安，說明部分reads中有堿基修剪短荐；這也與我們設(shè)置可接受最小reads長度為50bp相符合。

（3）接頭內(nèi)容：可見原始數(shù)據(jù)中叹哭，reads的3’端檢測到少量的接頭內(nèi)容忍宋，可能是由于mRNA片段本身過短所導(dǎo)致的接頭被測序。通過修剪后风罩，右邊的report顯示已經(jīng)基本檢測不到接頭序列的存在糠排，說明修剪成功。

最后我們可以得到如下文件超升，我在terminal中查看了相關(guān)文件夾中的文件：

Screen Shot 2020-05-11 at 11.10.30 AM.png

count_matrix中的6個count.txt就是每個樣本的表達(dá)矩陣了入宦！如下圖所示，展示了其中一個樣本的表達(dá)矩陣：

一個樣本的表達(dá)矩陣概覽

通過htseq獲得的每個樣本的表達(dá)矩陣有2列室琢，第一列為基因的ENSEMBL ID乾闰，第二列為每個注釋的raw counts。此外盈滴，最后五行描述了htseq計算的結(jié)果涯肩，從__not_aligned和__alignment_not_unique均為0即可知bowtie2和sambamba的過濾是有效的（在腳本中已經(jīng)設(shè)置Bowtie2在比對時不輸出未必對的reads，以及sambamba過濾多次比對的reads）巢钓。

最后我們將所有樣本的表達(dá)矩陣合并成一個總矩陣后病苗，就可以進(jìn)行下游一系列分析了。

有興趣的朋友可以學(xué)習(xí)竿报，歡迎大家相互交流铅乡！覺得有幫助就點(diǎn)個贊吧！