R語言基礎(chǔ)知識(shí)的一些檢驗(yàn)喳资,最好是對(duì)照幾本R基礎(chǔ)語法書籍來理解。
全部答案及視頻在:https://github.com/jmzeng1314/R_bilibili
首先做完了周末班工作凡蚜, 包括軟件安裝以及R包安裝:http://www.bio-info-trainee.com/3727.html
1. 打開 Rstudio
告訴我它的工作目錄。
> getwd()
[1] "C:/Users/wlx/Documents"
2.新建6個(gè)向量,基于不同的原子類型
。(重點(diǎn)是字符串柬帕,數(shù)值,邏輯值)
> a <- c(1:10)
> b <- c("a","b","c","d")
> c <- c(T,F,F,T)
> a
[1] 1 2 3 4 5 6 7 8 9 10
> b
[1] "a" "b" "c" "d"
> c
[1] TRUE FALSE FALSE TRUE
> a <- c(1, 2, 5, 3, 6, -2, 4)
> b <- c("one", "two", "three")
> c <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
> a
[1] 1 2 5 3 6 -2 4
> b
[1] "one" "two" "three"
> c
[1] TRUE TRUE TRUE FALSE TRUE FALSE
>
3 告訴我在你打開的rstudio里面 getwd()
代碼運(yùn)行后返回的是什么狡门?
> getwd()
[1] "C:/Users/wlx/Documents"
4新建一些數(shù)據(jù)結(jié)構(gòu)陷寝,比如矩陣,數(shù)組其馏,數(shù)據(jù)框凤跑,列表等重點(diǎn)是數(shù)據(jù)框,矩陣)
4.1 矩陣
> cnames <- c("C1", "C2")
> rnames <- c("g1", "g2")
> mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,
+ dimnames=list(rnames, cnames))
> mymatrix
C1 C2
g1 1 26
g2 24 68
> mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=FALSE,
+ dimnames=list(rnames, cnames))
> mymatrix
C1 C2
g1 1 24
g2 26 68
4.2 數(shù)組
dim1 <- c("A1", "A2")
> dim2 <- c("B1", "B2", "B3")
> dim3 <- c("C1", "C2", "C3", "C4")
> z <- array(1:24, c(2, 3, 4), dimnames=list(dim1, dim2, dim3))
> z
, , C1
B1 B2 B3
A1 1 3 5
A2 2 4 6
, , C2
B1 B2 B3
A1 7 9 11
A2 8 10 12
, , C3
B1 B2 B3
A1 13 15 17
A2 14 16 18
, , C4
B1 B2 B3
A1 19 21 23
A2 20 22 24
4.3 數(shù)據(jù)框
patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> patientdata <- data.frame(patientID, age, diabetes, status)
> patientdata
patientID age diabetes status
1 1 25 Type1 Poor
2 2 34 Type2 Improved
3 3 28 Type1 Excellent
4 4 52 Type1 Poor
4.4 因子
diabetes <- c("Type1", "Type2", "Type1", "Type1")
> diabetes
[1] "Type1" "Type2" "Type1" "Type1"
> diabetes <- factor(diabetes)
> diabetes
[1] Type1 Type2 Type1 Type1
Levels: Type1 Type2
> status <- c("Poor", "Improved", "Excellent", "Poor")
> status
[1] "Poor" "Improved" "Excellent" "Poor"
> status <- factor(status, ordered=TRUE)
> status
[1] Poor Improved Excellent Poor
Levels: Excellent < Improved < Poor
> status <- factor(status, order=TRUE,
+ levels=c("Poor", "Improved", "Excellent"))
> status
[1] Poor Improved Excellent Poor
Levels: Poor < Improved < Excellent
> sex <- factor(sex, levels=c(1, 2), labels=c("Male", "Female"))
Error in factor(sex, levels = c(1, 2), labels = c("Male", "Female")) :
object 'sex' not found
> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> diabetes <- factor(diabetes)
> status <- factor(status, order=TRUE)
> patientdata <- data.frame(patientID, age, diabetes, status)
> str(patientdata)
'data.frame': 4 obs. of 4 variables:
$ patientID: num 1 2 3 4
$ age : num 25 34 28 52
$ diabetes : Factor w/ 2 levels "Type1","Type2": 1 2 1 1
$ status : Ord.factor w/ 3 levels "Excellent"<"Improved"<..: 3 2 1 3
> patientdata
patientID age diabetes status
1 1 25 Type1 Poor
2 2 34 Type2 Improved
3 3 28 Type1 Excellent
4 4 52 Type1 Poor
> summary(patientdata)
patientID age diabetes status
Min. :1.00 Min. :25.00 Type1:3 Excellent:1
1st Qu.:1.75 1st Qu.:27.25 Type2:1 Improved :1
Median :2.50 Median :31.00 Poor :2
Mean :2.50 Mean :34.75
3rd Qu.:3.25 3rd Qu.:38.50
Max. :4.00 Max. :52.00
4.5 list
> g <- "My First List"
> h <- c(25, 26, 18, 39)
> j <- matrix(1:10, nrow=5)
> k <- c("one", "two", "three")
> mylist <- list(title=g, ages=h, j, k)
> mylist
$title
[1] "My First List"
$ages
[1] 25 26 18 39
[[3]]
[,1] [,2]
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9
[5,] 5 10
[[4]]
[1] "one" "two" "three"
> mylist[[2]]
[1] 25 26 18 39
> mylist[["ages"]]
[1] 25 26 18 39
提醒程序員注意的一些事項(xiàng)
經(jīng)驗(yàn)豐富的程序員通常會(huì)發(fā)現(xiàn)R語言的某些方面不太尋常叛复。以下是這門語言中你需要了解
的一些特性仔引。
?對(duì)象名稱中的句點(diǎn)(.)沒有特殊意義,但美元符號(hào)(x是指數(shù)據(jù)框A中的變
量x撬码。
?R不提供多行注釋或塊注釋功能儿倒。你必須以#作為多行注釋每行的開始。出于調(diào)試目的呜笑,
你也可以把想讓解釋器忽略的代碼放到語句if(FALSE){... }中夫否。將FALSE改為TRUE
即允許這塊代碼執(zhí)行。
?將一個(gè)值賦給某個(gè)向量叫胁、矩陣凰慈、數(shù)組或列表中一個(gè)不存在的元素時(shí),R將自動(dòng)擴(kuò)展這個(gè)
數(shù)據(jù)結(jié)構(gòu)以容納新值曹抬。舉例來說溉瓶,考慮以下代碼:
> x <- c(8, 6, 4)
> x[7] <- 10
> x
[1] 8 6 4 NA NA NA 10
通過賦值急鳄,向量x由三個(gè)元素?cái)U(kuò)展到了七個(gè)元素。x <- x[1:3]會(huì)重新將其縮減回三個(gè)
元素堰酿。
?R中沒有標(biāo)量疾宏。標(biāo)量以單元素向量的形式出現(xiàn)。
?R中的下標(biāo)不從0開始触创,而從1開始坎藐。在上述向量中,x[1]的值為8哼绑。
?變量無法被聲明岩馍。它們?cè)谑状伪毁x值時(shí)生成。
5. 在你新建的數(shù)據(jù)框進(jìn)行切片操作抖韩,比如首先取第1蛀恩,3行, 然后取第4茂浮,6列
> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> patientdata <- data.frame(patientID, age, diabetes, status)
> patientdata
patientID age diabetes status
1 1 25 Type1 Poor
2 2 34 Type2 Improved
3 3 28 Type1 Excellent
4 4 52 Type1 Poor
> df[c(1,3), ]
Error in df[c(1, 3), ] : object of type 'closure' is not subsettable
> patientdata[c(1,3), ]
patientID age diabetes status
1 1 25 Type1 Poor
3 3 28 Type1 Excellent
> patientdata[c(1:3), ]
patientID age diabetes status
1 1 25 Type1 Poor
2 2 34 Type2 Improved
3 3 28 Type1 Excellent
> patientdata[,c(2,4) ]
age status
1 25 Poor
2 34 Improved
3 28 Excellent
4 52 Poor
5. 使用data函數(shù)來加載R內(nèi)置數(shù)據(jù)集 rivers
描述它双谆。并且可以查看更多的R語言內(nèi)置的數(shù)據(jù)集:https://mp.weixin.qq.com/s/dZPbCXccTzuj0KkOL7R31g
> data("rivers")
> class(rivers)
[1] "numeric"
> str(rivers)
num [1:141] 735 320 325 392 524 ...
> length(rivers)
[1] 141
> summary(rivers)
Min. 1st Qu. Median Mean 3rd Qu. Max.
135.0 310.0 425.0 591.2 680.0 3710.0
> head(rivers); tail(rivers)
[1] 735 320 325 392 524 450
[1] 500 720 270 430 671 1770
6下載 https://www.ncbi.nlm.nih.gov/sra?term=SRP133642 里面的 RunInfo Table
文件讀入到R里面,了解這個(gè)數(shù)據(jù)框席揽,多少列顽馋,每一列都是什么屬性的元素。(參考B站生信小技巧獲取runinfo table) 這是一個(gè)單細(xì)胞轉(zhuǎn)錄組項(xiàng)目的數(shù)據(jù)幌羞,共768個(gè)細(xì)胞寸谜,如果你找不到RunInfo Table
文件,可以點(diǎn)擊下載属桦,然后讀入你的R里面也可以熊痴。
7下載 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE111229 里面的樣本信息sample.csv
讀入到R里面,了解這個(gè)數(shù)據(jù)框地啰,多少列愁拭,每一列都是什么屬性的元素讲逛。(參考 https://mp.weixin.qq.com/s/fbHMNXOdwiQX5BAlci8brA 獲取樣本信息sample.csv)如果你實(shí)在是找不到樣本信息文件sample.csv亏吝,也可以點(diǎn)擊下載。
> aread.table("SraRunTable.txt",sep="\t",header=T,check.names=F)
Error in aread.table("SraRunTable.txt", sep = "\t", header = T, check.names = F) :
could not find function "aread.table"
> a<- read.table("SraRunTable.txt",sep="\t",header=T,check.names=F)
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'SraRunTable.txt': No such file or directory
> a<- read.table("C:\\Users\\wlx\\Desktop\\scRNA\\SraRunTable.txt",sep="\t",header=T,check.names=F)
> b<- b <- read.csv("C:\\Users\\wlx\\Desktop\\scRNA\\sample.csv",sep="\t",header=T,check.names=F)
> dim(a)
[1] 768 31
> dim(b)
[1] 768 1
> b<- b <- read.table("C:\\Users\\wlx\\Desktop\\scRNA\\sample.csv",sep="\t",header=T,check.names=F)
> b <- read.table("C:\\Users\\wlx\\Desktop\\scRNA\\sample.csv",sep="\t",header=T,check.names=F)
> dim(b)
[1] 768 1
> b <- read.table("C:\\Users\\wlx\\Desktop\\scRNA\\sample.csv")
> dim(b)
[1] 768 6
> a<- read.table("C:\\Users\\wlx\\Desktop\\scRNA\\SraRunTable.txt")
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 44 elements
> a<- read.table("C:\\Users\\wlx\\Desktop\\scRNA\\SraRunTable.txt",sep="\t",header=T,check.names=F)
> dim(a)
[1] 768 31
> str(a)
'data.frame': 768 obs. of 31 variables:
$ BioSample : Factor w/ 768 levels "SAMN08619908",..: 5 4 3 2 1 12 11 14 13 7 ...
$ Experiment : Factor w/ 768 levels "SRX3749901","SRX3749902",..: 2 3 4 5 6 7 8 9 10 1 ...
$ MBases : int 16 16 8 8 11 7 18 5 11 15 ...
$ MBytes : int 8 8 4 4 5 4 9 3 6 8 ...
$ Run : Factor w/ 768 levels "SRR6790711","SRR6790712",..: 1 2 3 4 5 6 7 8 9 10 ...
$ SRA_Sample : Factor w/ 768 levels "SRS3006136","SRS3006137",..: 3 13 2 1 14 5 15 7 6 4 ...
$ Sample_Name : Factor w/ 768 levels "GSM3025845","GSM3025846",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Assay_Type : Factor w/ 1 level "RNA-Seq": 1 1 1 1 1 1 1 1 1 1 ...
$ AssemblyName : Factor w/ 1 level "GCF_000001635.20": 1 1 1 1 1 1 1 1 1 1 ...
$ AvgSpotLen : int 43 43 43 43 43 43 43 43 43 43 ...
$ BioProject : Factor w/ 1 level "PRJNA436229": 1 1 1 1 1 1 1 1 1 1 ...
$ Center_Name : Factor w/ 1 level "GEO": 1 1 1 1 1 1 1 1 1 1 ...
$ Consent : Factor w/ 1 level "public": 1 1 1 1 1 1 1 1 1 1 ...
$ DATASTORE_filetype: Factor w/ 1 level "sra": 1 1 1 1 1 1 1 1 1 1 ...
$ DATASTORE_provider: Factor w/ 1 level "ncbi": 1 1 1 1 1 1 1 1 1 1 ...
$ InsertSize : int 0 0 0 0 0 0 0 0 0 0 ...
$ Instrument : Factor w/ 1 level "Illumina HiSeq 2000": 1 1 1 1 1 1 1 1 1 1 ...
$ LibraryLayout : Factor w/ 1 level "SINGLE": 1 1 1 1 1 1 1 1 1 1 ...
$ LibrarySelection : Factor w/ 1 level "cDNA": 1 1 1 1 1 1 1 1 1 1 ...
$ LibrarySource : Factor w/ 1 level "TRANSCRIPTOMIC": 1 1 1 1 1 1 1 1 1 1 ...
$ LoadDate : Factor w/ 1 level "2018-03-01": 1 1 1 1 1 1 1 1 1 1 ...
$ Organism : Factor w/ 1 level "Mus musculus": 1 1 1 1 1 1 1 1 1 1 ...
$ Platform : Factor w/ 1 level "ILLUMINA": 1 1 1 1 1 1 1 1 1 1 ...
$ ReleaseDate : Factor w/ 1 level "2018-11-23": 1 1 1 1 1 1 1 1 1 1 ...
$ SRA_Study : Factor w/ 1 level "SRP133642": 1 1 1 1 1 1 1 1 1 1 ...
$ age : Factor w/ 1 level "14 weeks": 1 1 1 1 1 1 1 1 1 1 ...
$ cell_type : Factor w/ 1 level "cancer-associated fibroblasts (CAFs)": 1 1 1 1 1 1 1 1 1 1 ...
$ marker_genes : Factor w/ 1 level "EpCAM-, CD45-, CD31-, NG2-": 1 1 1 1 1 1 1 1 1 1 ...
$ source_name : Factor w/ 1 level "Mammary tumor fibroblast": 1 1 1 1 1 1 1 1 1 1 ...
$ strain : Factor w/ 1 level "FVB/N-Tg(MMTVPyVT)634Mul/J": 1 1 1 1 1 1 1 1 1 1 ...
$ tissue : Factor w/ 1 level "Mammary tumor fibroblast": 1 1 1 1 1 1 1 1 1 1 ...
> str(b)
'data.frame': 768 obs. of 6 variables:
$ Accession.Title.Sample : Factor w/ 1 level "musculus\",1,GPL13112,GSE111229,\"SRA": 1 1 1 1 1 1 1 1 1 1 ...
$ Type.Taxonomy.Channels.Platform.Series.Supplementary: Factor w/ 1 level "Run": 1 1 1 1 1 1 1 1 1 1 ...
$ Types.Supplementary : Factor w/ 768 levels "Selector\",\"https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRX3749901\",SRX3749901,\"Kristian",..: 2 3 4 5 6 7 8 9 10 1 ...
$ Links.SRA : Factor w/ 1 level "Pietras\",\"Nov": 1 1 1 1 1 1 1 1 1 1 ...
$ Accession.Contact.Release : Factor w/ 1 level "23,": 1 1 1 1 1 1 1 1 1 1 ...
$ Date : Factor w/ 1 level "2018\"": 1 1 1 1 1 1 1 1 1 1 ...
把前面兩個(gè)步驟的兩個(gè)表(RunInfo Table
文件盏混,樣本信息sample.csv)關(guān)聯(lián)起來蔚鸥,使用merge函數(shù)。
基于下午的統(tǒng)計(jì)可視化
8對(duì)前面讀取的 RunInfo Table
文件在R里面探索其MBases列许赃,包括 箱線圖(boxplot)和五分位數(shù)(fivenum)止喷,還有頻數(shù)圖(hist),以及密度圖(density) 混聊。
把前面讀取的樣本信息
表格的樣本名字根據(jù)下劃線分割
看第3列元素的統(tǒng)計(jì)情況弹谁。第三列代表該樣本所在的plate
根據(jù)plate把關(guān)聯(lián)到的 RunInfo Table
信息的MBases列分組檢驗(yàn)是否有統(tǒng)計(jì)學(xué)顯著的差異。
分組繪制箱線圖(boxplot),頻數(shù)圖(hist)预愤,以及密度圖(density) 沟于。
使用ggplot2把上面的圖進(jìn)行重新繪制。
使用ggpubr把上面的圖進(jìn)行重新繪制植康。
rm(list = ls())
options(stringsAsFactors = F)
# 或者下載:http://www.bio-info-trainee.com/tmp/5years/SraRunTable.txt
a=read.table('SraRunTable.txt',sep = '\t',header = T)
# 或者下載:http://www.bio-info-trainee.com/tmp/5years/sample.csv
b=read.csv('sample.csv')
colnames(a)
colnames(b)
d=merge(a,b,by.x = 'Sample_Name',by.y = 'Accession')
e=d[,c("MBases","Title")]
save(e,file = 'input.Rdata')
rm(list = ls())
options(stringsAsFactors = F)
load(file = 'input.Rdata')
e[,2]
plate=unlist(lapply(e[,2],function(x){
# x=e[1,2]
x
strsplit(x,'_')[[1]][3]
}))
table(plate)
boxplot(e[,1]~plate)
t.test(e[,1]~plate)
e$plate=plate
library(ggplot2)
colnames(e)
ggplot(e,aes(x=plate,y=MBases))+geom_boxplot()
library(ggpubr)
p <- ggboxplot(e, x = "plate", y = "MBases",
color = "plate", palette = "jco",
add = "jitter")
# Add p-value
p + stat_compare_means(method = 't.test')