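The course author is Maria Nattestad of Cold Spring Harbor Laboratory in the United States. The course is aimed at beginners in bioinformatics and computational biology. The R programming language is well suited to data analysis, statistics, and scientific plotting. It was originally planned as a paid course, but the author later made it a free resource while still welcoming donations. I am taking notes here as I learn; if anyone sends me a tip, I will pass it on to the original author and publish the transfer details when I do.
Download for the DATA/scripts mentioned in the course. Link: http://pan.baidu.com/s/1bpaZ9Rx Password: c439. If you cannot watch the videos on YouTube, leave a comment and I will send you another link, although the quality is not as good as YouTube's.
Course content (previous lessons)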
Lesson 1: A quick start guide — From data to plot with a few magic words
Course content (this lesson)
Lesson 2: Importing and downloading data — From Excel, text files, or publicly available data, this lesson covers how to get all of it into R and addresses a number of common problems with data formatting issues.
# ==========================================================
#
# Lesson 2 -- Importing and downloading data
#   - Importing data from Excel
#   - Downloading from UCSC
#   - Downloading from ENSEMBL
#   - Downloading from ENCODE
#
# ==========================================================
# Getting data from Excel
# Get the excel file from this paper: "Gene expression profiling of breast cell lines identifies potential new basal markers". Supplementary table 1
# Go into excel and save it as "Tab Delimited Text (.txt)"
filename <- "Lesson-02/micro_array_results_table1.txt"
my_data <- read.csv(filename, sep="\t", header=TRUE)
head(my_data)
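Before importing an Excel export, it can help to confirm from the shell that the file really is tab-delimited with the same number of columns on every line, since ragged rows are a common source of import problems. The sketch below is only an illustration; the function name `field_count_summary` is my own and is not part of the course scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the course): summarise how many lines
# have each number of tab-separated fields.
# Output format: "<number of fields> <number of lines>", one pair per line.
# A clean table produces exactly one line of output.
field_count_summary() {
    awk -F'\t' '{ counts[NF]++ } END { for (n in counts) print n, counts[n] }' "$1" | sort -n
}

# Example call, assuming the file from this lesson exists locally:
# field_count_summary Lesson-02/micro_array_results_table1.txt
```

If more than one line is printed, some rows have a different number of fields, and the file is worth inspecting before calling read.csv.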
# Where to find publicly available big data
# UCSC -- RefSeq genes from table browser
# Ensembl -- Mouse regulatory features MultiCell
# ENCODE -- HMM: wgEncodeBroadHmmGm12878HMM.bed
genes <- read.csv("Lesson-02/RefSeq_Genes.dms", sep="\t", header=TRUE)
head(genes)
dim(genes)
regulatory_features <- read.csv("Lesson-02/homo_sapiens.GRCh38.Fetal_Muscle_Leg.Regulatory_Build.regulatory_activity.20161111.gff", sep="\t", header=FALSE)
head(regulatory_features)
dim(regulatory_features)
chromHMM <- read.csv("Lesson-02/wgEncodeBroadHmmGm12878HMM.bed", sep="\t", header=FALSE)
head(chromHMM)
dim(chromHMM)
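The row counts reported by dim() can be cross-checked from the shell. The sketch below counts the lines of a file that do not start with `#`, which is how GFF files mark header lines. The function name `count_data_lines` is hypothetical, and whether a given download contains such header lines is an assumption to verify on your own copy.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the course): count lines that are not
# '#' comment/header lines, for comparison with the row count from dim() in R.
count_data_lines() {
    grep -v -c '^#' "$1"
}

# Example call, assuming the file from this lesson exists locally:
# count_data_lines Lesson-02/wgEncodeBroadHmmGm12878HMM.bed
```

Keep in mind that with header=TRUE R uses the first line for column names, so the row count from dim() is one less than the number of data lines.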
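Finally, a supplementary note on how the versions of the various genome builds correspond to each other. I looked at several sources and found the summary from 生信菜鸟团 to be the best, so here it is:
First, how the NCBI builds correspond to the UCSC names and to ENSEMBL database releases: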
- NCBI36 (hg18; sometimes loosely written GRCh36): ENSEMBL release_52.
- GRCh37 (hg19): ENSEMBL release_59/61/64/68/69/75.
- GRCh38 (hg38): ENSEMBL release_76/77/78/80/81/82.
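As you can see, ENSEMBL's release numbering is particularly complicated and very easy to mix up!
UCSC's versions are much simpler: just hg18, hg19, and hg38. hg19 is the most commonly used, but I recommend that everyone move to hg38.
NCBI also looks simple, with just builds 36, 37, and 38, but there is a lot hidden beneath the surface here too!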
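The correspondence above is easy to keep in a small lookup so that scripts never mix up the names. This is only a sketch: the function name `ucsc_to_ncbi` is made up for illustration, and it covers just the three human builds discussed here (hg18's build is officially named NCBI36).

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a UCSC human genome name to the NCBI/GRC build name.
# Only the three builds mentioned in this post are covered.
ucsc_to_ncbi() {
    case "$1" in
        hg18) echo "NCBI36" ;;
        hg19) echo "GRCh37" ;;
        hg38) echo "GRCh38" ;;
        *)    echo "unknown genome name: $1" >&2; return 1 ;;
    esac
}

ucsc_to_ncbi hg19
```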
Feb 13 2014 00:00 Directory April_14_2003
Apr 06 2006 00:00 Directory BUILD.33
Apr 06 2006 00:00 Directory BUILD.34.1
Apr 06 2006 00:00 Directory BUILD.34.2
Apr 06 2006 00:00 Directory BUILD.34.3
Apr 06 2006 00:00 Directory BUILD.35.1
Aug 03 2009 00:00 Directory BUILD.36.1
Aug 03 2009 00:00 Directory BUILD.36.2
Sep 04 2012 00:00 Directory BUILD.36.3
Jun 30 2011 00:00 Directory BUILD.37.1
Sep 07 2011 00:00 Directory BUILD.37.2
Dec 12 2012 00:00 Directory BUILD.37.3
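As you can see, there are 37.1, 37.2, 37.3, and so on, but these minor versions generally mean that the annotation was updated; the genome sequence itself generally does not change!
In any case, just remember that the hg19 genome is about 3 GB in size, or roughly 800 to 900 MB compressed.
If you are downloading a GTF annotation file, the genome version is especially important!
For NCBI: ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/GFF/ ## latest version (hg38)
ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/ARCHIVE/ ## other versions
For Ensembl: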
ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
Change the release number in the middle of the URL to get any other version; all releases are listed at ftp://ftp.ensembl.org/pub/
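Since only the release number and the assembly name change, the Ensembl URL can be assembled in a script. The sketch below simply follows the pattern of the release-75 URL above; the function name `ensembl_gtf_url` is my own, and I am assuming other releases use the same directory layout, so check the FTP listing if a download fails.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the human GTF URL for a given Ensembl release.
#   $1 = release number (e.g. 75)
#   $2 = assembly name used in the file name (e.g. GRCh37)
# Assumption: every release follows the same path pattern as release 75.
ensembl_gtf_url() {
    echo "ftp://ftp.ensembl.org/pub/release-$1/gtf/homo_sapiens/Homo_sapiens.$2.$1.gtf.gz"
}

# Print the URL only; pass it to wget yourself once it looks right.
ensembl_gtf_url 75 GRCh37
```

The assembly is passed in explicitly rather than guessed from the release number, because pairing a release with the wrong assembly is exactly the mistake this section warns about.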
For UCSC, it is a bit more cumbersome, because you need to select a series of parameters:
Navigate to http://genome.ucsc.edu/cgi-bin/hgTables
Select the following options:
- clade: Mammal
- genome: Human
- assembly: Feb. 2009 (GRCh37/hg19)
- group: Genes and Gene Predictions
- track: UCSC Genes
- table: knownGene
- region: Select "genome" for the entire genome.
- output format: GTF - gene transfer format
- output file: enter a file name to save your results to a file, or leave blank to display results in the browser
Click 'get output'.
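Now for the key point: once the version relationships are clear, it is time to download!
Downloading from UCSC is very convenient. You only need to build the URL from the genome's short name: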
http://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz
http://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/chromFa.tar.gz
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/chromFa.tar.gz
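The four URLs above differ only in the genome name, so they can be generated in a loop. This is a minimal sketch that follows the pattern exactly as listed; the function name `ucsc_chromfa_url` is hypothetical, and the archive name may differ for some assemblies, so it is worth checking the bigZips directory listing before relying on it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the chromFa.tar.gz URL from a UCSC genome name.
# Assumption: the archive is named chromFa.tar.gz for the genome in question.
ucsc_chromfa_url() {
    echo "http://hgdownload.cse.ucsc.edu/goldenPath/$1/bigZips/chromFa.tar.gz"
}

# Print the URLs for the genomes listed above; nothing is downloaded here.
for genome in mm10 mm9 hg19 hg38; do
    ucsc_chromfa_url "$genome"
done
```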
Or use a shell script to download specific chromosomes:
for i in $(seq 1 22) X Y M;
do echo $i;
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr${i}.fa.gz;
## NCBI can also be used here: ftp://ftp.ncbi.nih.gov/genomes/M_musculus/ARCHIVE/MGSCv3_Release3/Assembled_Chromosomes/ (chr prefix)
done
gunzip *.gz
for i in $(seq 1 22) X Y M;
do cat chr${i}.fa >> hg19.fasta;
done
## remove the per-chromosome files (chr*.fa) now that they are merged into hg19.fasta
rm -f chr*.fa
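After merging, a quick sanity check is to count the sequence headers in the combined file. The sketch below counts lines starting with `>`; the function name `count_fasta_records` is my own. Assuming each per-chromosome file holds a single sequence, hg19.fasta built from chromosomes 1 to 22 plus X, Y, and M should contain 25 records.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count FASTA records by counting header lines ('>').
count_fasta_records() {
    grep -c '^>' "$1"
}

# Example call, assuming hg19.fasta was built by the loop above:
# count_fasta_records hg19.fasta
```

A larger number than expected usually means the merge loop was run more than once, since `>>` appends to an existing hg19.fasta.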