2020-10-24 RNAseq 從fq開始分析全流程

分析流程

image.png

1.上傳四個(gè)樣本原始就文件到服務(wù)器

參考http://www.reibang.com/p/a84cd44bac67從原始數(shù)據(jù)開始分析

參考https://github.com/twbattaglia/RNAseq-workflow

Step 1. Analysing Sequence Quality with FastQC

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Description

"FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis."

The first step before processing any samples is to analyze the quality of the data. Within the fastq file is quality information that refers to the accuracy (% confidence) of each base call. FastQC looks at different aspects of the sample sequences to determine any irregularies or features that make affect your results (adapter contamination, sequence duplication levels, etc.)

Installation

conda install -c bioconda fastqc --yes

Command

# Help
fastqc -h

# Run FastQC
fastqc \
-o results/1_initial_qc/ \
--noextract \
input/sample.fastq

Output

── results/1_initial_qc/
    └──  sample_fastqc.html   <-  HTML file of FastQC fquality analysis figures
    └──  sample_fastqc.zip    <- FastQC report data

2.質(zhì)控

fastqc生成質(zhì)控報(bào)告,multiqc將各個(gè)樣本的質(zhì)控報(bào)告整合為一個(gè)左电。

(rna) cheng@super-Super-Server:~/project$ ls
NALM6-ctrl_combined_R1.filtered.1.fq.gz
NALM6-ctrl_combined_R1.filtered.2.fq.gz
NALM6-treat_combined_R1.filtered.1.fq.gz
NALM6-treat_combined_R1.filtered.2.fq.gz
RS411-ctrl_combined_R1.filtered.1.fq.gz
RS411-ctrl_combined_R1.filtered.2.fq.gz
RS411-treat_combined_R1.filtered.1.fq.gz
RS411-treat_combined_R1.filtered.2.fq.gz
(rna) cheng@super-Super-Server:~/project$ ls *gz | xargs fastqc -t 2

第一步fastqc生成質(zhì)控報(bào)告恋沃,-t設(shè)為2時(shí)同時(shí)分析2個(gè)原始fq數(shù)據(jù),等待約30min完成,因?yàn)楸痉?wù)器沒安裝multiqc稽坤,就暫時(shí)不做整合剩拢。

Step2: filter the bad quality reads and remove adaptors.

  • 運(yùn)行如下代碼,得到名為config的文件够话,包含兩列數(shù)據(jù)
mkdir $wkd/clean 
cd $wkd/clean 
ls /home/data/cheng/project/raw/*1.fq.gz >1
ls /home/data/cheng/project/raw/*2.fq.gz >2
paste 1 2  > config
(rna) cheng@super-Super-Server:~/project$ pwd
/home/cheng/project
(rna) cheng@super-Super-Server:~/project$ mkdir -p clean
(rna) cheng@super-Super-Server:~/project$ cd clean/
(rna) cheng@super-Super-Server:~/project/clean$ ls /home/data/cheng/project/raw/*.1.fq.gz >1
(rna) cheng@super-Super-Server:~/project/clean$ ls /home/data/cheng/project/raw/*.2.fq.gz >2
(rna) cheng@super-Super-Server:~/project/clean$ paste 1 2 > config
(rna) cheng@super-Super-Server:~/project/clean$ cat config
/home/cheng/project/raw/NALM6-ctrl_combined_R1.filtered.1.fq.gz /home/cheng/project/raw/NALM6-ctrl_combined_R1.filtered.2.fq.gz
/home/cheng/project/raw/NALM6-treat_combined_R1.filtered.1.fq.gz    /home/cheng/project/raw/NALM6-treat_combined_R1.filtered.2.fq.gz
/home/cheng/project/raw/RS411-ctrl_combined_R1.filtered.1.fq.gz /home/cheng/project/raw/RS411-ctrl_combined_R1.filtered.2.fq.gz
/home/cheng/project/raw/RS411-treat_combined_R1.filtered.1.fq.gz    /home/cheng/project/raw/RS411-treat_combined_R1.filtered.2.fq.gz
(rna) cheng@super-Super-Server:~/project/clean$ vi qc.sh
(rna) cheng@super-Super-Server:~/project/clean$ bash qc.sh config
  • 打開文件 qc.sh 蓝翰,并且寫入如下內(nèi)容

trim_galore光绕,用于去除低質(zhì)量和接頭數(shù)據(jù)

source activate rna
bin_trim_galore=trim_galore
dir='/home/data/cheng/project/clean'
cat $1 |while read id
do
        arr=(${id})
        fq1=${arr[0]}
        fq2=${arr[1]} 
 $bin_trim_galore -q 25 --phred33 --length 36 --stringency 3 --paired -o $dir  $fq1 $fq2 
done 
source deactivate 
  • 運(yùn)行qc.sh
bash qc.sh config #config是傳遞進(jìn)去的參數(shù)
  • 結(jié)果顯示如下
RUN STATISTICS FOR INPUT FILE: /home/cheng/project/raw/RS411-ctrl_combined_R1.filtered.1.fq.gz
=============================================
31046571 sequences processed in total
The length threshold of paired-end sequences gets evaluated later on (in the validation step)

Writing report to '/home/cheng/project/clean/RS411-ctrl_combined_R1.filtered.2.fq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS
==========================
Input filename: /home/cheng/project/raw/RS411-ctrl_combined_R1.filtered.2.fq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.4_dev
Cutadapt version: 2.6
Number of cores used for trimming: 1
Quality Phred score cutoff: 25
Quality encoding type selected: ASCII+33
Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; auto-detected)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 3 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 36 bp
Output file(s) will be GZIP compressed

Cutadapt seems to be fairly up-to-date (version 2.6). Setting -j -j 1
Writing final adapter and quality trimmed output to RS411-ctrl_combined_R1.filtered.2_trimmed.fq.gz


  >>> Now performing quality (cutoff '-q 25') and adapter trimming in a single pass for the adapter sequence: 'AGATCGGAAGAGC' from file /home/cheng/project/raw/RS411-ctrl_combined_R1.filtered.2.fq.gz <<<
Connection to 192.168.1.6 closed by remote host.
Connection to 192.168.1.6 closed.

我的程序被中斷了,因?yàn)閮?nèi)存不足可能

管理員給我發(fā)一個(gè)提醒:

(rna) cheng@super-Super-Server:~$ cat A_Message_from_Admin
your programme almost used all the disk, so I killed it.
Please move all your data to /home/data, this directory is on the other disk and it is big enough to run your programme.

After you read this message, you can delete it by yourself. I changed the owner to you.

更換工作目錄:

(rna) cheng@super-Super-Server:~$ mv project/ /home/data/
(rna) cheng@super-Super-Server:~$ cd /home/data/
(rna) cheng@super-Super-Server:/home/data$ mv project/ cheng/
(rna) cheng@super-Super-Server:/home/data$ cd cheng
(rna) cheng@super-Super-Server:/home/data/cheng$ ls
project

查看目錄下軟件信息

(rna) cheng@super-Super-Server:/home/data/cheng$ ls /home/data/reference/
genome  gtf  hisat  index  trnadb
(rna) cheng@super-Super-Server:/home/data/cheng$ ls /home/data/biosoft/
cellranger-4.0.0          JAGS-4.3.0         ncbi-blast-2.10.1+                   refdata-gex-GRCh38-2020-A.tar.gz  tophat-2.1.1.Linux_x86_64.tar.gz
cellranger-4.0.0.tar.tar  JAGS-4.3.0.tar.gz  ncbi-blast-2.10.1+-x64-linux.tar.gz  RNA-SeQC                          wget.out
(rna) cheng@super-Super-Server:/home/data/cheng$ ls /home/data/biosoft/RNA-SeQC/
gencode.v7.annotation_goodContig.gtf.gz  gencode.v7.gc.txt  Homo_sapiens_assembly19.fasta.gz  rRNA.tar.gz  ThousandReads.bam
(rna) cheng@super-Super-Server:/home/data/cheng$ ls /home/data/reference/genome/
GRCh38_reference_genome  hg19  hg38  M25  mm10
(rna) cheng@super-Super-Server:/home/data/cheng$ ls /home/data/reference/gtf/
ensembl  gencode
(rna) cheng@super-Super-Server:/home/data/cheng$ ls /home/data/reference/hisat/
grcm38.tar.gz  hg19.tar.gz  hg38.tar.gz  mm10.tar.gz
(rna) cheng@super-Super-Server:/home/data/cheng$ ls /home/data/reference/index/
bowtie  bwa  hisat  tophat
(rna) cheng@super-Super-Server:/home/data/cheng$ ls /home/data/reference/trnadb/
hg38  mm10

本服務(wù)器可以用比對(duì)軟件較多:

(rna) cheng@super-Super-Server:/home/data/cheng$ ls /home/data/reference/index/
bowtie  bwa  hisat  tophat

先完成去去接頭:

#確定工作目錄
(rna) cheng@super-Super-Server:/home/data/cheng/project/clean$ pwd
/home/data/cheng/project/clean
#構(gòu)建 config
(rna) cheng@super-Super-Server:/home/data/cheng/project/clean$ ls /home/data/cheng/project/raw/*1.fq.gz >1
(rna) cheng@super-Super-Server:/home/data/cheng/project/clean$ ls /home/data/cheng/project/raw/*2.fq.gz >2
(rna) cheng@super-Super-Server:/home/data/cheng/project/clean$ paste 1 2  > config
(rna) cheng@super-Super-Server:/home/data/cheng/project/clean$ ls
1  2  config
(rna) cheng@super-Super-Server:/home/data/cheng/project/clean$ cat config
/home/data/cheng/project/raw/NALM6-ctrl_combined_R1.filtered.1.fq.gz    /home/data/cheng/project/raw/NALM6-ctrl_combined_R1.filtered.2.fq.gz
/home/data/cheng/project/raw/NALM6-treat_combined_R1.filtered.1.fq.gz   /home/data/cheng/project/raw/NALM6-treat_combined_R1.filtered.2.fq.gz
/home/data/cheng/project/raw/RS411-ctrl_combined_R1.filtered.1.fq.gz    /home/data/cheng/project/raw/RS411-ctrl_combined_R1.filtered.2.fq.gz
/home/data/cheng/project/raw/RS411-treat_combined_R1.filtered.1.fq.gz   /home/data/cheng/project/raw/RS411-treat_combined_R1.filtered.2.fq.gz
#編寫質(zhì)控腳本
(rna) cheng@super-Super-Server:/home/data/cheng/project/clean$ vi qc.sh
#將任務(wù)掛到后臺(tái)運(yùn)行
(rna) cheng@super-Super-Server:/home/data/cheng/project/clean$ nohup bash qc.sh config &
[1] 16284

step4: alignment

star, hisat2, bowtie2, tophat, bwa, subread都是可以用于比到的軟件

可參考http://www.reibang.com/p/a84cd44bac67

  • 先運(yùn)行一個(gè)樣本畜份,測試一下
mkdir $wkd/test 
cd $wkd/test 
source activate rna
ls $wkd/clean/*gz |while read id;do (zcat ${id}|head -1000>  $(basename ${id} ".gz"));done
id=SRR1039508
hisat2 -p 10 -x /public/reference/index/hisat/hg38/genome -1 ${id}_1_val_1.fq   -2 ${id}_2_val_2.fq  -S ${id}.hisat.sam
subjunc -T 5  -i /public/reference/index/subread/hg38 -r ${id}_1_val_1.fq -R ${id}_2_val_2.fq -o ${id}.subjunc.sam  
bowtie2 -p 10 -x /public/reference/index/bowtie/hg38  -1 ${id}_1_val_1.fq   -2 ${id}_2_val_2.fq  -S ${id}.bowtie.sam
bwa mem -t 5 -M  /public/reference/index/bwa/hg38   ${id}_1_val_1.fq   ${id}_2_val_2.fq > ${id}.bwa.sam

事實(shí)上诞帐,對(duì)RNA-seq數(shù)據(jù)來說,不要使用bwa和bowtie這樣的軟件爆雹,它需要的是能進(jìn)行跨越內(nèi)含子比對(duì)的工具停蕉。所以后面就只先擇hisat2,STAR進(jìn)行下一步钙态,但是本服務(wù)器中hisat2的索引文件不完整慧起,最后只用STAR進(jìn)行了比對(duì)

  • 批量比對(duì)代碼
cd $wkd/clean 
ls *gz|cut -d"_" -f 1 |sort -u |while read id;do
ls -lh ${id}_combined_R1.filtered.1_val_1.fq.gz   ${id}_combined_R1.filtered.2_val_2.fq.gz 
hisat2 -p 10 -x /home/data/reference/index/hisat/hg38/genome -1 ${id}_combined_R1.filtered.1_val_1.fq.gz   -2 ${id}_combined_R1.filtered.2_val_2.fq.gz  -S ${id}.hisat.sam
done 
Note the two inputs for this command are the genome located in the (genome/ folder) and the annotation file located in the (annotation/ folder)
# This can take up to 30 minutes to complete
STAR \
--runMode genomeGenerate \
--genomeDir star_index \
--genomeFastaFiles /home/data/reference/genome/hg38/* \
--sjdbGTFfile /home/data/reference/gtf/ensembl/* \
--runThreadN 4
# Help
STAR -h

# Run STAR (~10min)
STAR \
--genomeDir star_index \
--readFilesIn filtered/sample_filtered.fq  \
--runThreadN 4 \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts

# Move the BAM file into the correct folder
mv -v results/4_aligned_sequences/sampleAligned.sortedByCoord.out.bam results/4_aligned_sequences/aligned_bam/

# Move the logs into the correct folder
mv -v results/4_aligned_sequences/${BN}Log.final.out results/4_aligned_sequences/aligned_logs/
mv -v results/4_aligned_sequences/sample*Log.out results/4_aligned_sequences/aligned_logs/

先構(gòu)建STAR索引

Generating Indexes

Similar to the SortMeRNA step, we must first generate an index of the genome we want to align to, so that there tools can efficently map over millions of sequences. The star_index folder will be the location that we will keep the files necessary to run STAR and due to the nature of the program, it can take up to 30GB of space. This step only needs to be run once and can be used for any subsequent RNAseq alignment analyses.

(rna) cheng@super-Super-Server:/home/data/cheng/project/clean$ STAR
Usage: STAR  [options]... --genomeDir /path/to/genome/index/   --readFilesIn R1.fq R2.fq
Spliced Transcripts Alignment to a Reference (c) Alexander Dobin, 2009-2020

STAR version=2.7.5a
STAR compilation time,server,dir=Tue Jun 16 12:17:16 EDT 2020 vega:/home/dobin/data/STAR/STARcode/STAR.master/source
For more details see:
<https://github.com/alexdobin/STAR>
<https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf>

To list all parameters, run STAR --help
(rna) cheng@super-Super-Server:/home/data/cheng/project/clean$ cd /home/data/cheng
(rna) cheng@super-Super-Server:/home/data/cheng$ ls
project
(rna) cheng@super-Super-Server:/home/data/cheng$ nohup STAR \
> --runMode genomeGenerate \
> --genomeDir star_index \
> --genomeFastaFiles /home/data/reference/genome/* \
> --sjdbGTFfile /home/data/reference/gtf/* \
> --runThreadN 4 &
[2] 18522
(rna) cheng@super-Super-Server:/home/data/cheng$ nohup: ignoring input and appending output to 'nohup.out'

[2]+  Exit 104                nohup STAR --runMode genomeGenerate --genomeDir star_index --genomeFastaFiles /home/data/reference/genome/* --sjdbGTFfile /home/data/reference/gtf/* --runThreadN 4
(rna) cheng@super-Super-Server:/home/data/cheng$ ls
nohup.out  project  star_index  _STARtmp

參考https://blog.csdn.net/yssxswl/article/details/105703869完成索引構(gòu)建

STAR --runThreadN 20 --runMode genomeGenerate \
--genomeDir /home/data/cheng/star_index \
--genomeFastaFiles /home/data/reference/genome/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa \
--sjdbGTFfile /home/data/reference/gtf/gencode.v35.annotation.gtf \

–runThreadN 是指構(gòu)建是使用的線程數(shù),在沒有其他數(shù)據(jù)在跑的情況下册倒,可以滿線程跑
–runMode genomeGenerate 讓STAR執(zhí)行基因組索引的生成工作
–genomeDir 構(gòu)建好的參考基因組存放的位置蚓挤,最好是單獨(dú)建立的一個(gè)文件夾
–genomeFastaFiles 參考基因組序列文件
–sjdbGTFfile 基因注釋文件

比對(duì)

我們來看一下比對(duì)的代碼

STAR 
--runThreadN 20 \
--genomeDir /home/data/cheng/star_index \
--readFilesCommand gunzip -c \
--readFilesIn /home/data/cheng/ /home/data/cheng/ \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix N052611_Alb \

–runThreadN 運(yùn)行的線程數(shù),根據(jù)自己的服務(wù)器合理選擇
–genomeDir 構(gòu)建的參考基因組位置
–readFilesCommand 對(duì)于gz壓縮的文件驻子,我們可以在后面添加 gunzip -c
–readFilesIn 輸入文件的位置灿意,對(duì)于雙末端測序文件,用空格分隔開就行了
–outSAMtype 默認(rèn)輸出的是sam文件崇呵,我們這里的BAM SortedByCoordinate是讓他輸出為ban文件脾歧,并排序
–outFileNamePrefix 表示的是輸出文件的位置和前綴

然后就是輸出文件的問題,輸出的文件不止一個(gè)演熟,包含了比對(duì)過程中的一些信息

  1. Aligned.out.sam或者Aligned.out.bam
    它指的就是我們的比對(duì)結(jié)果
  2. Log.progress.out
    它是每分鐘記錄一次的對(duì)比情況
  3. Log.out
    它記錄了STAR程序在運(yùn)行中的各種情況鞭执,當(dāng)我們的結(jié)果出現(xiàn)異常時(shí),我們可以查看具體的運(yùn)行情況芒粹,來查找錯(cuò)誤
  4. Log.final.out
    它包含的是對(duì)比完以后的對(duì)比統(tǒng)計(jì)信息
  5. SJ.out.tab
    它包含了剪切的信息
nohup STAR --runThreadN 20 --genomeDir /home/data/cheng/star_index --readFilesCommand gunzip -c --readFilesIn /home/data/cheng/project/raw/NALM6-ctrl_combined_R1.filtered.1.fq.gz /home/data/cheng/project/raw/NALM6-ctrl_combined_R1.filtered.2.fq.gz --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts &
nohup STAR --runThreadN 4 --genomeDir /home/data/cheng/star_index --readFilesCommand gunzip -c --readFilesIn /home/data/cheng/project/raw/NALM6-ctrl_combined_R1.filtered.1.fq.gz /home/data/cheng/project/raw/NALM6-ctrl_combined_R1.filtered.2.fq.gz --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts --outFileNamePrefix NALM6-ctrl &

nohup STAR --runThreadN 4 --genomeDir /home/data/cheng/star_index --readFilesCommand gunzip -c --readFilesIn /home/data/cheng/project/raw/NALM6-treat_combined_R1.filtered.1.fq.gz /home/data/cheng/project/raw/NALM6-treat_combined_R1.filtered.2.fq.gz --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts --outFileNamePrefix NALM6-treat &

nohup STAR --runThreadN 4 --genomeDir /home/data/cheng/star_index --readFilesCommand gunzip -c --readFilesIn /home/data/cheng/project/raw/RS411-ctrl_combined_R1.filtered.1.fq.gz /home/data/cheng/project/raw/RS411-ctrl_combined_R1.filtered.2.fq.gz --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts --outFileNamePrefix RS411-ctrl &

nohup STAR --runThreadN 4 --genomeDir /home/data/cheng/star_index --readFilesCommand gunzip -c --readFilesIn /home/data/cheng/project/raw/RS411-treat_combined_R1.filtered.1.fq.gz /home/data/cheng/project/raw/RS411-treat_combined_R1.filtered.2.fq.gz --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts --outFileNamePrefix RS411-treat &

# Move the BAM file into the correct folder
mv -v *Aligned.sortedByCoord.out.bam /home/data/cheng/project/results/4_aligned_sequences/aligned_bam/

# Move the logs into the correct folder
mv -v *.out /home/data/cheng/project/results/4_aligned_sequences/aligned_logs/

Step 5. Summarizing Gene Counts with featureCounts

https://www.ncbi.nlm.nih.gov/pubmed/24227677

Description

"featureCounts is a highly efficient general-purpose read summarization program that counts mapped reads for genomic features such as genes, exons, promoter, gene bodies, genomic bins and chromosomal locations. It can be used to count both RNA-seq and genomic DNA-seq reads. featureCounts takes as input SAM/BAM files and an annotation file including chromosomal coordinates of features. It outputs numbers of reads assigned to features (or meta-features). It also outputs stat info for the overall summrization results, including number of successfully assigned reads and number of reads that failed to be assigned due to various reasons (these reasons are included in the stat info)."

Now that we have our .BAM alignment files, we can then proceed to try and summarize these coordinates into genes and abundances. To do this we must summarize the reads using featureCounts or any other read summarizer tool, and produce a table of genes by samples with raw sequence abundances. This table will then be used to perform statistical analysis and find differentially expressed genes.

# Help
featureCounts -h

# Change directory into the aligned .BAM folder
cd /home/data/cheng/project/results/4_aligned_sequences/aligned_bam

# Store list of files as a variable
dirlist=$(ls -t ./*.bam | tr '\n' ' ')
echo $dirlist

# Run featureCounts on all of the samples (~10 minutes)
featureCounts \
-a /home/data/reference/gtf/* \
-o /home/data/cheng/project/results/5_final_counts/final_counts.txt \
-g 'gene_name' \
-T 4 \
$dirlist
(rna) cheng@super-Super-Server:/home/data/cheng/project/results/4_aligned_sequences/aligned_bam$ featureCounts -a /home/data/reference/gtf/* -o /home/data/cheng/project/results/5_final_counts/final_counts.txt -g 'gene_name' -T 4 $dirlist

        ==========     _____ _    _ ____  _____  ______          _____
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
      v2.0.1

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 4 BAM files                                      ||
||                           o NALM6-treatAligned.sortedByCoord.out.bam       ||
||                           o RS411-ctrlAligned.sortedByCoord.out.bam        ||
||                           o RS411-treatAligned.sortedByCoord.out.bam       ||
||                           o NALM6-ctrlAligned.sortedByCoord.out.bam        ||
||                                                                            ||
||             Output file : final_counts.txt                                 ||
||                 Summary : final_counts.txt.summary                         ||
||              Annotation : gencode.v35.annotation.gtf (GTF)                 ||
||      Dir for temp files : /home/data/cheng/project/results/5_final_counts  ||
||                                                                            ||
||                 Threads : 4                                                ||
||                   Level : meta-feature level                               ||
||              Paired-end : no                                               ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file gencode.v35.annotation.gtf ...                        ||
||    Features : 1398443                                                      ||
||    Meta-features : 59609                                                   ||
||    Chromosomes/contigs : 25                                                ||
||                                                                            ||
|| Process BAM file NALM6-treatAligned.sortedByCoord.out.bam...               ||
||    WARNING: Paired-end reads were found.                                   ||
||    Total alignments : 86954860                                             ||
||    Successfully assigned alignments : 37685067 (43.3%)                     ||
||    Running time : 0.57 minutes                                             ||
||                                                                            ||
|| Process BAM file RS411-ctrlAligned.sortedByCoord.out.bam...                ||
||    WARNING: Paired-end reads were found.                                   ||
||    Total alignments : 69733102                                             ||
||    Successfully assigned alignments : 30363294 (43.5%)                     ||
||    Running time : 0.46 minutes                                             ||
||                                                                            ||
|| Process BAM file RS411-treatAligned.sortedByCoord.out.bam...               ||
||    WARNING: Paired-end reads were found.                                   ||
||    Total alignments : 81999795                                             ||
||    Successfully assigned alignments : 34018798 (41.5%)                     ||
||    Running time : 0.56 minutes                                             ||
||                                                                            ||
|| Process BAM file NALM6-ctrlAligned.sortedByCoord.out.bam...                ||
||    WARNING: Paired-end reads were found.                                   ||
||    Total alignments : 70082894                                             ||
||    Successfully assigned alignments : 32230367 (46.0%)                     ||
||    Running time : 0.48 minutes                                             ||
||                                                                            ||
|| Write the final count table.                                               ||
|| Write the read assignment summary.                                         ||
||                                                                            ||
|| Summary of counting results can be found in file "/home/data/cheng/projec  ||
|| t/results/5_final_counts/final_counts.txt.summary"                         ||
||                                                                            ||
\\============================================================================//

下游分析

four_sample_deg

Cheng Liangping

24 十月, 2020

1 Importing Gene Counts into R/RStudio

Once the workflow has completed, you can now use the gene count table as an input into DESeq2 for statistical analysis using the R-programming language. It is highly reccomended to use RStudio when writing R code and generating R-related analyses. You can download RStudio for your system here: https://www.rstudio.com/products/rstudio/download/

1.1 Install required R-libraries

rm(list = ls())
options(BioC_mirror="http://mirrors.tuna.tsinghua.edu.cn/bioconductor/")
options("repos" = c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))
options()$repos 
##                                         CRAN 
## "https://mirrors.tuna.tsinghua.edu.cn/CRAN/"
options()$BioC_mirror
## [1] "http://mirrors.tuna.tsinghua.edu.cn/bioconductor/"
# install.packages("gage")
# BiocManager::install("org.Hs.eg.db",ask = F,update = F)
library(DESeq2)
library(ggplot2)
library(clusterProfiler)
library(biomaRt)
library(ReactomePA)
library(DOSE)
library(KEGG.db)
library(org.Hs.eg.db)
library(pheatmap)
library(genefilter)
library(RColorBrewer)
library(GO.db)
library(topGO)
library(dplyr)
library(gage)
library(ggsci)

1.1.1 Import featureCounts output

One you have an R environment appropriatley set up, you can begin to import the featureCounts table found within the 5_final_counts folder. This tutorial will use DESeq2 to normalize and perform the statistical analysis between sample groups. Be sure to know the full location of the final_counts.txt file generate from featureCounts.

Note: If you would like to use an example final_counts.txt table, look into the example/ folder.

# Import gene counts table
# - skip first row (general command info)
# - make row names the gene identifiers
countdata <- read.table("final_counts.txt", header = TRUE, skip = 1, row.names = 1)

# Remove .bam + '..' from column identifiers
colnames(countdata) <- gsub(".bam", "", colnames(countdata), fixed = T)
colnames(countdata) <- gsub("Aligned.sortedByCoord.out", "", colnames(countdata), fixed = T)
colnames(countdata) <- gsub("..", "", colnames(countdata), fixed = T)
#save(countdata,file = "raw.Rdata")
# Remove length/char columns
countdata <- countdata[ ,c(-1:-5)]

# Make sure ID's are correct
head(countdata)
##             NALM6.treat RS411.ctrl RS411.treat NALM6.ctrl
## DDX11L1              26          5          20         12
## WASH7P             1119        206         202        435
## MIR6859-1            21          6           4          9
## MIR1302-2HG           2          0           3          1
## MIR1302-2             0          0           0          0
## FAM138A               0          0           0          0

1.1.2 Import metadata text file. The SampleID’s must be the first column.

# Import metadata file
# - make row names the matching sampleID's from the countdata
metadata <- read.delim("metadata.txt", row.names = 1)

# Reorder sampleID's to match featureCounts column order. 
metadata <- metadata[match(colnames(countdata), metadata$sampleid), ]

# Make sure ID's are correct
head(metadata)
##             Group Replicate    sampleid
## NALM6.treat Hiexp      Rep2 NALM6.treat
## RS411.ctrl  Loexp      Rep1  RS411.ctrl
## RS411.treat Loexp      Rep2 RS411.treat
## NALM6.ctrl  Hiexp      Rep1  NALM6.ctrl

1.1.3 Make DESeq2 object from counts and metadata

# - countData : count dataframe
# - colData : sample metadata in the dataframe with row names as sampleID's
# - design : The design of the comparisons to use. 
#            Use (~) before the name of the column variable to compare
ddsMat <- DESeqDataSetFromMatrix(countData = countdata,
                                 colData = metadata,
                                 design = ~Group)

# Find differential expressed genes
ddsMat <- DESeq(ddsMat)

1.1.4 Get basic statisics about the number of significant genes

# Get results from testing with FDR adjust pvalues
results <- results(ddsMat, pAdjustMethod = "fdr", alpha = 0.05)

# Generate summary of testing. 
summary(results)
## 
## out of 37196 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 2345, 6.3%
## LFC < 0 (down)     : 1530, 4.1%
## outliers [1]       : 0, 0%
## low counts [2]     : 11169, 30%
## (mean count < 4)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
# Check directionality of the log2 fold changes
## Log2 fold change is set as (Loexp / Hiexp)
## Postive fold changes = Increased in Loexp
## Negative fold changes = Decreased in Loexp
mcols(results, use.names = T)
## DataFrame with 6 rows and 2 columns
##                        type                                  description
##                 <character>                                  <character>
## baseMean       intermediate    mean of normalized counts for all samples
## log2FoldChange      results log2 fold change (MLE): Group Loexp vs Hiexp
## lfcSE               results         standard error: Group Loexp vs Hiexp
## stat                results         Wald statistic: Group Loexp vs Hiexp
## pvalue              results      Wald test p-value: Group Loexp vs Hiexp
## padj                results                        fdr adjusted p-values

1.2 Annotate gene symbols

After alignment and summarization, we only have the annotated gene symbols. To get more information about significant genes, we can use annoated databases to convert gene symbols to full gene names and entrez ID’s for further analysis.

1.2.1 Gather gene annotation information

# Mouse genome database (Select the correct one)
library(org.Hs.eg.db) 

# Add gene full name
results$description <- mapIds(x = org.Hs.eg.db,
                              keys = row.names(results),
                              column = "GENENAME",
                              keytype = "SYMBOL",
                              multiVals = "first")

# Add gene symbol
results$symbol <- row.names(results)

# Add ENTREZ ID
results$entrez <- mapIds(x = org.Hs.eg.db,
                         keys = row.names(results),
                         column = "ENTREZID",
                         keytype = "SYMBOL",
                         multiVals = "first")

# Add ENSEMBL
results$ensembl <- mapIds(x = org.Hs.eg.db,
                          keys = row.names(results),
                          column = "ENSEMBL",
                          keytype = "SYMBOL",
                          multiVals = "first")

# Subset for only significant genes (q < 0.05)
results_sig <- subset(results, padj < 0.05)
head(results_sig)
## log2 fold change (MLE): Group Loexp vs Hiexp 
## Wald test p-value: Group Loexp vs Hiexp 
## DataFrame with 6 rows and 10 columns
##            baseMean log2FoldChange     lfcSE      stat      pvalue        padj
##           <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
## MTND1P23  2424.6057       -4.11145  0.678250  -6.06185 1.34561e-09 4.39978e-08
## MTND2P28  1813.7288        3.75228  0.738809   5.07882 3.79791e-07 8.29959e-06
## MTCO1P12  6187.1454       -3.29924  0.634548  -5.19936 1.99981e-07 4.56171e-06
## MTCO2P12   169.0384       -2.08444  0.773877  -2.69351 7.07044e-03 4.79849e-02
## MXRA8      725.5861        2.58245  0.664147   3.88837 1.00920e-04 1.26220e-03
## LINC01770   54.9444        9.28692  1.703909   5.45036 5.02681e-08 1.27891e-06
##                                           description      symbol      entrez
##                                           <character> <character> <character>
## MTND1P23                         MT-ND1 pseudogene 23    MTND1P23   100887749
## MTND2P28                         MT-ND2 pseudogene 28    MTND2P28   100652939
## MTCO1P12                         MT-CO1 pseudogene 12    MTCO1P12   107075141
## MTCO2P12                         MT-CO2 pseudogene 12    MTCO2P12   107075310
## MXRA8                  matrix remodeling associated 8       MXRA8       54587
## LINC01770 long intergenic non-protein coding RNA 1770   LINC01770   102724312
##                   ensembl
##               <character>
## MTND1P23               NA
## MTND2P28               NA
## MTCO1P12               NA
## MTCO2P12               NA
## MXRA8     ENSG00000162576
## LINC01770              NA

1.2.2 Write all the important results to .txt files

# Write normalized gene counts to a .txt file
write.table(x = as.data.frame(counts(ddsMat), normalized = T), 
            file = 'normalized_counts.txt', 
            sep = '\t', 
            quote = F,
            col.names = NA)

# Write significant normalized gene counts to a .txt file
write.table(x = counts(ddsMat[row.names(results_sig)], normalized = T), 
            file = 'normalized_counts_significant.txt', 
            sep = '\t', 
            quote = F, 
            col.names = NA)

# Write the annotated results table to a .txt file
write.table(x = as.data.frame(results), 
            file = "results_gene_annotated.txt", 
            sep = '\t', 
            quote = F,
            col.names = NA)

# Write significant annotated results table to a .txt file
write.table(x = as.data.frame(results_sig), 
            file = "results_gene_annotated_significant.txt", 
            sep = '\t', 
            quote = F,
            col.names = NA)

1.3 Plotting Gene Expression Data

There are multiple ways to plot gene expression data. Below we are only listing a few popular methods, but there are many more resources (Going Further) that will walk through different R commands/packages for plotting.

1.3.1 PCA plot

# Convert all samples to rlog
ddsMat_rlog <- rlog(ddsMat, blind = FALSE)

# Plot PCA by column variable
plotPCA(ddsMat_rlog, intgroup = "Group", ntop = 500) +
  theme_bw() + # remove default ggplot2 theme
  geom_point(size = 5) + # Increase point size
  scale_y_continuous(limits = c(-5, 5)) + # change limits to fix figure dimensions
  ggtitle(label = "Principal Component Analysis (PCA)", 
          subtitle = "Top 500 most variable genes") 
image.png

1.3.2 Heatmap

# Convert all samples to rlog
ddsMat_rlog <- rlog(ddsMat, blind = FALSE)

# Gather 30 significant genes and make matrix
mat <- assay(ddsMat_rlog[row.names(results_sig)])[1:40, ]

# Choose which column variables you want to annotate the columns by.
annotation_col = data.frame(
  Group = factor(colData(ddsMat_rlog)$Group), 
  Replicate = factor(colData(ddsMat_rlog)$Replicate),
  row.names = colData(ddsMat_rlog)$sampleid
)

# Specify colors you want to annotate the columns by.
ann_colors = list(
  Group = c(Loexp = "lightblue", Hiexp = "darkorange"),
  Replicate = c(Rep1 = "darkred", Rep2 = "forestgreen")
)

# Make Heatmap with pheatmap function.
## See more in documentation for customization
pheatmap(mat = mat, 
         color = colorRampPalette(brewer.pal(9, "YlOrBr"))(255), 
         scale = "row", # Scale genes to Z-score (how many standard deviations)
         annotation_col = annotation_col, # Add multiple annotations to the samples
         annotation_colors = ann_colors,# Change the default colors of the annotations
         fontsize = 6.5, # Make fonts smaller
         cellwidth = 55, # Make the cells wider
         show_colnames = F)
image.png

1.3.3 Volcano Plot

# Gather Log-fold change and FDR-corrected pvalues from DESeq2 results
## - Change pvalues to -log10 (1.3 = 0.05)
data <- data.frame(gene = row.names(results),
                   pval = -log10(results$padj), 
                   lfc = results$log2FoldChange)

# Remove any rows that have NA as an entry
data <- na.omit(data)

# Color the points which are up or down
## If fold-change > 0 and pvalue > 1.3 (Increased significant)
## If fold-change < 0 and pvalue > 1.3 (Decreased significant)
data <- mutate(data, color = case_when(data$lfc > 0 & data$pval > 1.3 ~ "Increased",
                                       data$lfc < 0 & data$pval > 1.3 ~ "Decreased",
                                       data$pval < 1.3 ~ "nonsignificant"))

# Make a basic ggplot2 object with x-y values
vol <- ggplot(data, aes(x = lfc, y = pval, color = color))

# Add ggplot2 layers
vol +   
  ggtitle(label = "Volcano Plot", subtitle = "Colored by fold-change direction") +
  geom_point(size = 2.5, alpha = 0.8, na.rm = T) +
  scale_color_manual(name = "Directionality",
                     values = c(Increased = "#008B00", Decreased = "#CD4F39", nonsignificant = "darkgray")) +
  theme_bw(base_size = 14) + # change overall theme
  theme(legend.position = "right") + # change the legend
  xlab(expression(log[2]("Loexp" / "Hiexp"))) + # Change X-Axis label
  ylab(expression(-log[10]("adjusted p-value"))) + # Change Y-Axis label
  geom_hline(yintercept = 1.3, colour = "darkgrey") + # Add p-adj value cutoff line
  scale_y_continuous(trans = "log1p") # Scale yaxis due to large p-values
image.png

1.3.4 MA Plot

https://en.wikipedia.org/wiki/MA_plot

plotMA(results, ylim = c(-5, 5))
image.png

1.3.5 Plot Dispersions

plotDispEsts(ddsMat)
image.png

1.3.6 Single gene plot

# Convert all samples to rlog
ddsMat_rlog <- rlog(ddsMat, blind = FALSE)

# Get gene with highest expression
top_gene <- rownames(results)[which.min(results$log2FoldChange)]

# Plot single gene
plotCounts(dds = ddsMat, 
           gene = top_gene, 
           intgroup = "Group", 
           normalized = T, 
           transform = T)
image.png

1.4 Finding Pathways from Differential Expressed Genes

Pathway enrichment analysis is a great way to generate overall conclusions based on the individual gene changes. Sometimes individiual gene changes are overwheling and are difficult to interpret. But by analyzing the pathways the genes fall into, we can gather a top level view of gene responses. You can find more information about clusterProfiler here: http://bioconductor.org/packages/release/bioc/vignettes/clusterProfiler/inst/doc/clusterProfiler.html

1.4.1 Set up matrix to take into account EntrezID’s and fold changes for each gene

# Remove any genes that do not have any entrez identifiers
results_sig_entrez <- subset(results_sig, is.na(entrez) == FALSE)

# Create a matrix of gene log2 fold changes
gene_matrix <- results_sig_entrez$log2FoldChange

# Add the entrezID's as names for each logFC entry
names(gene_matrix) <- results_sig_entrez$entrez

# View the format of the gene matrix
##- Names = ENTREZ ID
##- Values = Log2 Fold changes
head(gene_matrix)
## 100887749 100652939 107075141 107075310     54587 102724312 
## -4.111451  3.752277 -3.299241 -2.084445  2.582450  9.286916
gene_matrix=sort(gene_matrix,decreasing = T)

1.4.2 Enrich genes using the KEGG database

kegg_enrich <- enrichKEGG(gene = names(gene_matrix),
                          organism = 'human',
                          pvalueCutoff = 0.05, 
                          qvalueCutoff = 0.10)
kegg_enrich <- setReadable(kegg_enrich, org.Hs.eg.db, keyType = "ENTREZID")
# Plot results
barplot(kegg_enrich, 
        drop = TRUE, 
        showCategory = 10, 
        title = "KEGG Enrichment Pathways",
        font.size = 8)
image.png

1.4.3 Enrich genes using the Gene Onotlogy

go_enrich <- enrichGO(gene = names(gene_matrix),
                      OrgDb = 'org.Hs.eg.db', 
                      readable = T,
                      ont = "BP",
                      pvalueCutoff = 0.05, 
                      qvalueCutoff = 0.10)

# Plot results
barplot(go_enrich, 
        drop = TRUE, 
        showCategory = 10, 
        title = "GO Biological Pathways",
        font.size = 8)
image.png

1.5 Plotting KEGG Pathways

Pathview is a package that can take KEGG identifier and overlay fold changes to the genes which are found to be significantly different. Pathview also works with other organisms found in the KEGG database and can plot any of the KEGG pathways for the particular organism.

# Load pathview
library(pathview)
## ##############################################################################
## Pathview is an open source software package distributed under GNU General
## Public License version 3 (GPLv3). Details of GPLv3 is available at
## http://www.gnu.org/licenses/gpl-3.0.html. Particullary, users are required to
## formally cite the original Pathview paper (not just mention it) in publications
## or products. For details, do citation("pathview") within R.
## 
## The pathview downloads and uses KEGG data. Non-academic uses may require a KEGG
## license agreement (details at http://www.kegg.jp/kegg/legal.html).
## ##############################################################################
# Plot specific KEGG pathways (with fold change) 
## pathway.id : KEGG pathway identifier
pathview(gene.data = gene_matrix, 
         pathway.id = "00020", 
         species = "hsa")
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /Users/cheng/Downloads/TYL1/20201023
## Info: Writing image file hsa00020.pathview.png

1.6 疾病通路

有一顯著差異基因富集到一個(gè)已經(jīng)疾病通路: 相關(guān)疾病為:Lewy body dementia兄纺。

x <- enrichDO(gene          = names(gene_matrix),
              ont           = "DO",
              pvalueCutoff  = 0.05,
              pAdjustMethod = "BH",
              minGSSize     = 5,
              maxGSSize     = 500,
              qvalueCutoff  = 0.05,
              readable      = FALSE)
x <- setReadable(x, org.Hs.eg.db, keyType = "ENTREZID")
head(x)
##                        ID                            Description GeneRatio
## DOID:0060049 DOID:0060049 autoimmune disease of urogenital tract   26/1207
## DOID:12236     DOID:12236              primary biliary cirrhosis   26/1207
## DOID:1037       DOID:1037                 lymphoblastic leukemia   99/1207
## DOID:0060100 DOID:0060100          musculoskeletal system cancer   98/1207
## DOID:9119       DOID:9119                 acute myeloid leukemia   33/1207
## DOID:16           DOID:16           integumentary system disease   88/1207
##               BgRatio       pvalue    p.adjust      qvalue
## DOID:0060049  73/8007 1.097364e-05 0.005426588 0.004509630
## DOID:12236    73/8007 1.097364e-05 0.005426588 0.004509630
## DOID:1037    443/8007 1.839793e-05 0.005426588 0.004509630
## DOID:0060100 439/8007 2.127302e-05 0.005426588 0.004509630
## DOID:9119    107/8007 2.550088e-05 0.005426588 0.004509630
## DOID:16      394/8007 5.539968e-05 0.009824211 0.008164164
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       geneID
## DOID:0060049                                                                                                                                                                                                                                                                                                                                                                                                                                   SNCA/CAV1/NOS2/ICOS/ABCB4/PTPN13/IL7R/ENTPD1/ITGAL/TGFBR2/TNFSF13/TNFSF13B/TLR4/FN1/TNFSF10/TNFSF9/TGFB1/HLA-DRB1/FAS/ICAM1/HGF/CXCL10/CD27/CNR1/FCRL3/ENTPD2
## DOID:12236                                                                                                                                                                                                                                                                                                                                                                                                                                     SNCA/CAV1/NOS2/ICOS/ABCB4/PTPN13/IL7R/ENTPD1/ITGAL/TGFBR2/TNFSF13/TNFSF13B/TLR4/FN1/TNFSF10/TNFSF9/TGFB1/HLA-DRB1/FAS/ICAM1/HGF/CXCL10/CD27/CNR1/FCRL3/ENTPD2
## DOID:1037    NRP1/FLT3/CD44/ERG/CRY1/LAIR1/HSPB1/ADRB2/SLIT2/BLACE/FCMR/ZAP70/FOSL2/IL24/TNFRSF11A/FMOD/NOS2/WT1/CDX2/BGN/IL15/CD34/RAG1/NT5E/F13A1/CSPG4/IGFBP2/MDK/IRF4/IL1B/PTPN6/FLT4/ANPEP/SLC29A2/SDC1/CD33/RUNX1/ZFP36L1/MEF2C/TP73/ARID5B/TNFSF13/FCGR2A/TNFSF13B/LILRB1/CD9/EPHB4/JAK3/CDK6/LMO2/LEF1/CISH/RB1/BCL2/RAPGEF3/CCND3/MYB/DMD/ZMIZ1/CDKN1B/TNFRSF13C/POU2AF1/IGHD/TNFSF10/TGFB1/STAT1/HLA-DRB1/IRF8/BACH2/FAS/CD40/PDE4B/CD83/FCRL5/HGF/LAMA5/FCER2/HLF/ASNS/PRKCB/FCRL2/TNFRSF8/IKZF3/PEG10/BCL2A1/TP63/BIRC3/HES1/TTC12/EFNA5/RHD/FCRL3/MME/CDKN2B/TNFRSF10A/AICDA/MLLT3/CCL22/CDKN2A
## DOID:0060100                       CCN2/PDPN/CD44/ERG/HMGA2/HSPB1/NOG/PROM1/NDRG1/SPRY1/ANGPT1/SLIT2/TIMP1/BMP2/ADAM28/CAV1/LPAR1/F2RL3/FBN2/S100A6/TNFRSF11A/NOS2/HPSE/WT1/NES/CDH11/SPP1/CD34/FSCN1/SALL2/PDGFA/RUNX2/F2R/MCAM/PDGFRB/IL1B/HBEGF/JUP/LGALS1/PLAU/ERBB2/PLAUR/TLE1/KLRK1/CD33/PRUNE2/S100A4/CCL5/PGF/CREB3L1/TP73/TNFSF13B/CD9/CD99/FBN1/DUSP1/FN1/RB1/CAPN2/BCL2/DMD/RASSF1/CDKN1B/IGF2BP2/ASS1/FLI1/TNFRSF10B/PRKCA/URGCP/DYRK1B/OBSCN/TGFB1/LRP1/GLI1/ITGB3/ID1/FAS/ICAM1/HGF/UTS2R/SMO/PRKCG/PAX2/CDH1/TP63/ACKR3/EGFR/IGFBP6/DCN/LUM/GFAP/DNASE1L3/EXT1/CDKN2B/WNT5A/UCHL1/MTAP/CDKN2A
## DOID:9119                                                                                                                                                                                                                                                                                                                                                                                                         NRP1/MEIS1/CD96/PLA2G4A/CAV1/SALL4/LYZ/WT1/CDX2/RAG1/IGFBP2/IL1B/PLAUR/SELL/ITGAL/RUNX1/MGMT/IL3RA/LMO2/RB1/BCL2/CDKN1B/LYL1/IRF8/FAS/CBFA2T3/SKI/RUNX3/CAMP/CDKN2B/TNFRSF10A/KCNH2/CDKN2A
## DOID:16                                                                PDPN/LAMA3/HSPB1/HR/ADRB2/EDA2R/ADAM33/CCL28/FLG/RIN2/SLC16A2/HRH4/TIMP1/VIM/CAV1/CD93/IL24/SERPING1/GHR/CCL1/NLRP3/IL15/CD34/NPY/COL6A5/IL1RL1/IL9R/RETN/F2R/EPX/IRAK3/IL1B/XPNPEP2/IL7R/TBC1D4/PLAU/PLAUR/SELL/ANPEP/SLC29A2/COL7A1/NR3C2/GPX1/NTRK1/TNFRSF1B/CCL5/PGF/TNFSF13/TNFSF13B/TLR4/NOS3/BCL2/SLC29A3/NOP53/HLA-DQA1/TGFB1/HLA-DRB1/ITGB3/ITGA6/FAS/CD40/TNFSF8/ALOX5AP/ICAM1/CXCL10/LAMA5/CYP1A2/C3/ADM2/CCL3L3/CAMP/BCL2A1/TP63/CCL17/EGFR/FBLN5/LMNA/CX3CL1/ALOX5/APP/TNFRSF14/FCRL3/ST14/MME/TRPC1/CDKN2B/CCL22/CDKN2A
##              Count
## DOID:0060049    26
## DOID:12236      26
## DOID:1037       99
## DOID:0060100    98
## DOID:9119       33
## DOID:16         88

1.7 GO數(shù)據(jù)可視化:

1.7.1 barplot、dotplot

用散點(diǎn)圖展示富集到的GO terms

library(enrichplot)
library(clusterProfiler)
library(patchwork)
#(1)條帶圖
barplot(go_enrich,showCategory=10)
image.png
#(2)氣泡圖
dotplot(go_enrich)
image.png

1.7.1.1 GO有向無環(huán)圖

調(diào)用topGO來實(shí)現(xiàn)GO有向無環(huán)圖的繪制,矩形代表富集到的top10個(gè)GO terms, 顏色從黃色過濾到紅色化漆,對(duì)應(yīng)p值從大到小估脆。

plotGOgraph(go_enrich)
image.png
## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 48 
## Number of Edges = 83 
## 
## $complete.dag
## [1] "A graph with 48 nodes."

1.7.1.2 emapplot、goplot座云、cnetplot

#Gene-Concept Network
cnetplot(go_enrich, categorySize="pvalue", foldChange=names(gene_matrix),colorEdge = TRUE)
image.png
cnetplot(go_enrich, foldChange=names(gene_matrix), circular = TRUE, colorEdge = TRUE)
image.png
#Enrichment Map
emapplot(go_enrich)
image.png
#(4)展示通路關(guān)系
goplot(go_enrich)
image.png
#(5)Heatmap-like functional classification
heatplot(go_enrich,foldChange = gene_matrix)
image.png
#太多基因就會(huì)糊疙赠。可通過調(diào)整比例或者減少基因來控制朦拖。

1.7.1.3 04.upsetplot

upsetplot(go_enrich) 
image.png

1.7.2 用所有有entrez的基因做GSEA分析

# Remove any genes that do not have any entrez identifiers
results_entrez <- subset(results, is.na(entrez) == FALSE)

# Create a matrix of gene log2 fold changes
gene_matrix <- results_entrez$log2FoldChange

# Add the entrezID's as names for each logFC entry
names(gene_matrix) <- results_entrez$entrez

gene_matrix=sort(gene_matrix,decreasing = T)
ego_GSE_bp <- gseGO(
      geneList     = gene_matrix,
      OrgDb        = org.Hs.eg.db,
      ont          = "MF",
      nPerm        = 1000,
      minGSSize    = 100,
      maxGSSize    = 500,
      pvalueCutoff = 0.05,
      verbose      = FALSE)

1.7.3 ridgeplot

#gsea
ridgeplot(ego_GSE_bp)
image.png

1.7.4 gseaplot圃阳、gseaplot2及多個(gè)geneset驗(yàn)證結(jié)果

#gesa圖
p15 <- gseaplot(ego_GSE_bp, geneSetID = 1, by = "runningScore",
               title = ego_GSE_bp$Description[1])
p16 <- gseaplot(ego_GSE_bp, geneSetID = 1, by = "preranked", 
               title = ego_GSE_bp$Description[1])
p17 <- gseaplot(ego_GSE_bp, geneSetID = 1, title = ego_GSE_bp$Description[1])

#還有一種展示方式
p18=gseaplot2(ego_GSE_bp, geneSetID = 1, title = ego_GSE_bp$Description[1])

cowplot::plot_grid(p15, p16, p17,p18, ncol=2, labels=LETTERS[1:6])
image.png
#多條、加table(加p值)
gseaplot2(ego_GSE_bp, geneSetID = 1:3, pvalue_table = TRUE)
image.png
#上下3聯(lián)(多個(gè)geneset驗(yàn)證結(jié)果)

pp <- lapply(1:3, function(i) {
    anno <- ego_GSE_bp[i, c("NES", "pvalue", "p.adjust")]
    lab <- paste0(names(anno), "=",  round(anno, 3), collapse="\n")

    gsearank(ego_GSE_bp, i, ego_GSE_bp[i, 2]) + xlab(NULL) +ylab(NULL) +
        annotate("text", 0, ego_GSE_bp[i, "enrichmentScore"] * .9, label = lab, hjust=0, vjust=0)
})
plot_grid(plotlist=pp, ncol=1)
image.png

1.7.5 KEGG數(shù)據(jù)GSEA分析

kk_gse <- gseKEGG(geneList     = gene_matrix,
                  organism     = 'hsa',
                  nPerm        = 1000,
                  minGSSize    = 120,
                  pvalueCutoff = 0.9,
                  verbose      = FALSE)
#使用了所有基因進(jìn)行KEGG的GSEA分析發(fā)現(xiàn)有上調(diào)16個(gè)通路得到驗(yàn)證璧帝,下調(diào)0個(gè)通路顯著捍岳。
down_kegg<-kk_gse[kk_gse$pvalue<0.1 & kk_gse$enrichmentScore < 0,];down_kegg$group=-1
up_kegg<-kk_gse[kk_gse$pvalue<0.05 & kk_gse$enrichmentScore > 0,];up_kegg$group=1
#(4)可視化
source("kegg_plot_function.R")
g_kegg <- kegg_plot(up_kegg,down_kegg)
g_kegg
image.png

1.8 查看kegg富集通路及差異基因可視化

1.8.1 Upsetplot for gene set enrichment analysis.

library(enrichplot)
library(clusterProfiler)
upsetplot(kegg_enrich) 
image.png

1.8.2 可通過瀏覽器打開kegg網(wǎng)站查看具體信號(hào)通路的信息

head(kegg_enrich)[,1:6] 
##                ID                                  Description GeneRatio
## hsa04015 hsa04015                       Rap1 signaling pathway   56/1085
## hsa04060 hsa04060       Cytokine-cytokine receptor interaction   68/1085
## hsa04640 hsa04640                   Hematopoietic cell lineage   30/1085
## hsa04390 hsa04390                      Hippo signaling pathway   39/1085
## hsa05146 hsa05146                                   Amoebiasis   28/1085
## hsa04672 hsa04672 Intestinal immune network for IgA production   17/1085
##           BgRatio       pvalue     p.adjust
## hsa04015 210/8076 1.830708e-07 0.0000583996
## hsa04060 295/8076 3.267347e-06 0.0005211418
## hsa04640  99/8076 8.949786e-06 0.0009516606
## hsa04390 157/8076 7.554828e-05 0.0060249750
## hsa05146 102/8076 1.251732e-04 0.0066839043
## hsa04672  49/8076 1.257161e-04 0.0066839043
#browseKEGG(kk.diff, 'hsa04015')
dotplot(kegg_enrich)
image.png
barplot(kegg_enrich,showCategory=10)
image.png
cnetplot(kegg_enrich, categorySize="pvalue", foldChange=gene_matrix,colorEdge = TRUE)
image.png
cnetplot(kegg_enrich, foldChange=gene_matrix, circular = TRUE, colorEdge = TRUE)
image.png
emapplot(kegg_enrich)
image.png
heatplot(kegg_enrich,foldChange = gene_matrix)
image.png

1.8.3 Going further with RNAseq analysis

You can the links below for a more in depth walk through of RNAseq analysis using R:


1.8.3.1 Citations:

  1. Andrews S. (2010). FastQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc

  2. Martin, Marcel. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, [S.l.], v. 17, n. 1, p. pp. 10-12, may. 2011. ISSN 2226-6089. Available at: http://journal.embnet.org/index.php/embnetjournal/article/view/200. doi:http://dx.doi.org/10.14806/ej.17.1.200.

  3. Kopylova E., Noé L. and Touzet H., “SortMeRNA: Fast and accurate filtering of ribosomal RNAs in metatranscriptomic data”, Bioinformatics (2012), doi: 10.1093/bioinformatics/bts611

  4. Dobin A, Davis CA, Schlesinger F, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15-21. doi:10.1093/bioinformatics/bts635.

  5. Lassmann et al. (2010) “SAMStat: monitoring biases in next generation sequencing data.” Bioinformatics doi:10.1093/bioinformatics/btq614 [PMID: 21088025]

  6. Liao Y, Smyth GK and Shi W (2014). featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics, 30(7):923-30.

  7. Love MI, Huber W and Anders S (2014). “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15, pp. 550.

  8. Yu G, Wang L, Han Y and He Q (2012). “clusterProfiler: an R package for comparing biological themes among gene clusters.” OMICS: A Journal of Integrative Biology, 16(5), pp. 284-287.

  9. Philip Ewels, M?ns Magnusson, Sverker Lundin and Max K?ller. “MultiQC: Summarize analysis results for multiple tools and samples in a single report” Bioinformatics (2016). doi: 10.1093/bioinformatics/btw354. PMID: 27312411

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
禁止轉(zhuǎn)載,如需轉(zhuǎn)載請(qǐng)通過簡信或評(píng)論聯(lián)系作者。
  • 序言:七十年代末锣夹,一起剝皮案震驚了整個(gè)濱河市页徐,隨后出現(xiàn)的幾起案子,更是在濱河造成了極大的恐慌银萍,老刑警劉巖变勇,帶你破解...
    沈念sama閱讀 206,126評(píng)論 6 481
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件,死亡現(xiàn)場離奇詭異贴唇,居然都是意外死亡搀绣,警方通過查閱死者的電腦和手機(jī),發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 88,254評(píng)論 2 382
  • 文/潘曉璐 我一進(jìn)店門滤蝠,熙熙樓的掌柜王于貴愁眉苦臉地迎上來豌熄,“玉大人授嘀,你說我怎么就攤上這事物咳。” “怎么了蹄皱?”我有些...
    開封第一講書人閱讀 152,445評(píng)論 0 341
  • 文/不壞的土叔 我叫張陵览闰,是天一觀的道長。 經(jīng)常有香客問我巷折,道長压鉴,這世上最難降的妖魔是什么? 我笑而不...
    開封第一講書人閱讀 55,185評(píng)論 1 278
  • 正文 為了忘掉前任锻拘,我火速辦了婚禮油吭,結(jié)果婚禮上,老公的妹妹穿的比我還像新娘署拟。我一直安慰自己婉宰,他們只是感情好,可當(dāng)我...
    茶點(diǎn)故事閱讀 64,178評(píng)論 5 371
  • 文/花漫 我一把揭開白布推穷。 她就那樣靜靜地躺著心包,像睡著了一般。 火紅的嫁衣襯著肌膚如雪馒铃。 梳的紋絲不亂的頭發(fā)上蟹腾,一...
    開封第一講書人閱讀 48,970評(píng)論 1 284
  • 那天,我揣著相機(jī)與錄音区宇,去河邊找鬼娃殖。 笑死,一個(gè)胖子當(dāng)著我的面吹牛议谷,可吹牛的內(nèi)容都是我干的珊随。 我是一名探鬼主播,決...
    沈念sama閱讀 38,276評(píng)論 3 399
  • 文/蒼蘭香墨 我猛地睜開眼,長吁一口氣:“原來是場噩夢啊……” “哼叶洞!你這毒婦竟也來了鲫凶?” 一聲冷哼從身側(cè)響起,我...
    開封第一講書人閱讀 36,927評(píng)論 0 259
  • 序言:老撾萬榮一對(duì)情侶失蹤衩辟,失蹤者是張志新(化名)和其女友劉穎螟炫,沒想到半個(gè)月后,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體艺晴,經(jīng)...
    沈念sama閱讀 43,400評(píng)論 1 300
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡昼钻,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 35,883評(píng)論 2 323
  • 正文 我和宋清朗相戀三年,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了封寞。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片然评。...
    茶點(diǎn)故事閱讀 37,997評(píng)論 1 333
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡,死狀恐怖狈究,靈堂內(nèi)的尸體忽然破棺而出碗淌,到底是詐尸還是另有隱情,我是刑警寧澤抖锥,帶...
    沈念sama閱讀 33,646評(píng)論 4 322
  • 正文 年R本政府宣布亿眠,位于F島的核電站,受9級(jí)特大地震影響磅废,放射性物質(zhì)發(fā)生泄漏纳像。R本人自食惡果不足惜,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 39,213評(píng)論 3 307
  • 文/蒙蒙 一拯勉、第九天 我趴在偏房一處隱蔽的房頂上張望竟趾。 院中可真熱鬧,春花似錦宫峦、人聲如沸岔帽。這莊子的主人今日做“春日...
    開封第一講書人閱讀 30,204評(píng)論 0 19
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽山卦。三九已至,卻和暖如春诵次,著一層夾襖步出監(jiān)牢的瞬間账蓉,已是汗流浹背。 一陣腳步聲響...
    開封第一講書人閱讀 31,423評(píng)論 1 260
  • 我被黑心中介騙來泰國打工逾一, 沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留铸本,地道東北人。 一個(gè)月前我還...
    沈念sama閱讀 45,423評(píng)論 2 352
  • 正文 我出身青樓遵堵,卻偏偏與公主長得像箱玷,于是被迫代替她去往敵國和親怨规。 傳聞我的和親對(duì)象是個(gè)殘疾皇子,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 42,722評(píng)論 2 345