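Keywords: R, statistical analysis, probability density plot, joint distribution plot, box plot, violin plot
Figures drawn in R often look wrong once they are placed in a paper or on a slide: the fonts are too small, the lines are too thin, or the colors do not work. This post collects the relevant code in one place, so you can copy these snippets and get good-looking figures with the colors and font sizes already adjusted.
Data used:
The Heart Disease dataset from the UCI repository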
(http://archive.ics.uci.edu/ml/datasets/Heart+Disease)
This post uses a subset of that dataset (14 columns in total)
(https://github.com/xjcjiacheng/data-analysis/tree/master/heart%20disease%20UCI)
All code and data are available here:
https://github.com/wushangbin/tripping/tree/master/R_Plot
1 Correlation analysis
A logistic regression of the label on several features reports the deviance residuals, the regression coefficients (Estimate), their standard errors, the z statistics, and the p-values.
data = read.csv("./heart.csv")
# print(names(data)) lists the columns in this data
# data is the data frame that was read; target, age, sex, cp, chol and trestbps are column names in it.
model <- glm(target~age+sex+cp+chol+trestbps, data = data, family='binomial')
summary(model)
The result is as follows:
Call:
glm(formula = target ~ age + sex + cp + chol + trestbps, family = "binomial",
data = data)
Deviance Residuals:
     Min       1Q   Median       3Q      Max
 -2.5320  -0.7584   0.2806   0.7685   2.2828
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  7.579089   1.536048   4.934 8.05e-07 ***
age         -0.059743   0.017652  -3.384 0.000713 ***
sex         -1.916315   0.351638  -5.450 5.05e-08 ***
cp           1.065319   0.151163   7.048 1.82e-12 ***
chol        -0.003965   0.002819  -1.407 0.159552
trestbps    -0.020903   0.008557  -2.443 0.014579 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 417.64 on 302 degrees of freedom
Residual deviance: 302.13 on 297 degrees of freedom
AIC: 314.13
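If you want the numbers from the coefficient table rather than the printed text, one option is to extract the table with coef(summary(...)) and exponentiate the estimates to get odds ratios. The sketch below runs on simulated data so that it is self-contained; the columns x1, x2 and y are made up for illustration and are not from heart.csv.

```r
# Minimal sketch on simulated data; x1, x2 and y are hypothetical stand-ins
# for the heart.csv columns used above.
set.seed(1)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
# Simulated truth: log-odds = 0.5 + 1.2 * x1 (x2 has no effect)
sim$y <- rbinom(n, size = 1, prob = plogis(0.5 + 1.2 * sim$x1))

fit <- glm(y ~ x1 + x2, data = sim, family = "binomial")

# Coefficient table as a numeric matrix:
# columns are Estimate, Std. Error, z value, Pr(>|z|)
coef_table <- coef(summary(fit))

# Odds ratios: exp() of the log-odds coefficients
odds_ratios <- exp(coef(fit))

print(round(coef_table, 4))
print(round(odds_ratios, 3))
```

An odds ratio above 1 means the odds of the outcome rise as that feature increases, and below 1 means they fall.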
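The train function from caret reports resampled error metrics. Because target is a numeric 0/1 column here, caret treats the task as regression, which is why the output shows RMSE, R-squared and MAE rather than accuracy.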
library(caret)
model <- train(target~age+sex+cp+chol+trestbps, data=data, method='glm', family='binomial')
print(model) # Note: use print here; summary(model) gives the same output as above
Output:
Generalized Linear Model
303 samples
5 predictor
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 303, 303, 303, 303, 303, 303, ...
Resampling results:
  RMSE       Rsquared   MAE
  0.4216974  0.2939183  0.339408
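To make RMSE and MAE concrete for a 0/1 outcome, here is a rough base-R sketch that computes both from the predicted probabilities of a logistic model. It uses simulated data (x1 and y are hypothetical) and in-sample predictions, so it illustrates the formulas only. It does not reproduce caret's bootstrap resampling, and its numbers are not comparable to the output above.

```r
# Minimal sketch on simulated data; x1 and y are hypothetical columns.
set.seed(2)
n <- 300
sim <- data.frame(x1 = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(1.0 * sim$x1))

fit <- glm(y ~ x1, data = sim, family = "binomial")

# Predicted probabilities on the same data (in-sample, no resampling)
p_hat <- predict(fit, type = "response")

# Root mean squared error and mean absolute error against the 0/1 labels
rmse <- sqrt(mean((sim$y - p_hat)^2))
mae  <- mean(abs(sim$y - p_hat))

print(c(RMSE = rmse, MAE = mae))
```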
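2 Visualizing the distribution of a single variable
2.1 Probability density plot
This plot suits continuous variables. Here the color and font size are set directly in the call. The code draws the probability density of trestbps and marks the individual observations by sex.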
library(ggplot2)
data$sex <- as.factor(data$sex) # Convert sex to a factor first; otherwise R treats it as the integers 0 and 1
ggplot(data, aes(x = trestbps)) + geom_line(size=3, colour = "cadetblue3", stat = "density") +
geom_rug(aes(colour = sex), sides = "b") +
theme(axis.title.x =element_text(size=20), axis.title.y=element_text(size=20))
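For readers curious about what the density curve represents, the following base-R sketch computes a kernel density estimate with density() and checks that the area under the curve is close to 1. The data are simulated (the mean and standard deviation are arbitrary assumptions, not taken from heart.csv), and the bandwidth is density()'s default.

```r
# Minimal sketch on simulated data; the values only mimic a blood-pressure-like scale.
set.seed(3)
x <- rnorm(200, mean = 130, sd = 17)

# Kernel density estimate: Gaussian kernel, default bandwidth, 512 grid points
d <- density(x)

# Area under the curve by the trapezoid rule; should be close to 1
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)

print(d$bw)   # bandwidth actually used
print(area)
```

A larger bandwidth (the bw or adjust argument of density()) gives a smoother curve, and a smaller one follows the data more closely.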
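2.2 Histograms and bar charts
Take care to distinguish a histogram (hist) from a bar chart (barplot). The two look alike, but a histogram shows the distribution of one column of data, while a bar chart shows the size of each individual item. As an example, we draw both from the data of China's Seventh National Population Census (2020).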
population = read.csv("./China_Population.csv")
hist(population$population2020)
barplot(population$population2020, names.arg = population$ChineseName, las=2)
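As you can see, a histogram needs only one column of data and plots its distribution: the horizontal axis is the population and the vertical axis is Frequency. A bar chart needs a second column to serve as labels, and it shows the size of each value directly.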
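The difference is easy to see in the numbers each function works with. In the small sketch below, the six region names and values are made up (in millions), and plot = FALSE is used so that only the computed quantities are returned.

```r
# Hypothetical populations in millions for six made-up regions
pop <- c(A = 12.5, B = 3.1, C = 8.4, D = 3.6, E = 10.2, F = 2.9)

# Histogram: counts how many values fall into each bin (0,5], (5,10], (10,15]
h <- hist(pop, breaks = seq(0, 15, by = 5), plot = FALSE)
print(h$counts)   # one count per bin, not one per region

# Bar chart: one bar per region, with bar height equal to the value itself
bar_mid <- barplot(pop, plot = FALSE)   # returns the x positions of the bars
print(length(bar_mid))                  # one bar per region
```

The histogram collapses six regions into three bin counts, while the bar chart keeps one bar per region.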
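3 Violin plots and box plots
These two plots are grouped together because both show the relationship between a discrete variable and a continuous variable. I use the same two variables as before, sex and trestbps.
3.1 Box plot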
library(ggplot2)
data$sex <- as.factor(data$sex) # Convert sex to a factor first; otherwise R treats it as the integers 0 and 1
ggplot(data, aes(sex, trestbps)) +
geom_boxplot(aes(fill = sex)) +
stat_summary(fun = "mean", fill = "white", size = 2, geom = "point", shape = 23) +
theme(axis.title.x =element_text(size=20), axis.title.y=element_text(size=20))
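The box plot above summarizes each group by five numbers plus the mean drawn as a diamond. A rough way to see those numbers is boxplot.stats() together with tapply(), as in this sketch on simulated data (group and value are hypothetical columns). As far as I know, ggplot2 computes its hinges with quantile(), so its boxes can differ slightly from what boxplot.stats() returns.

```r
# Minimal sketch on simulated data; group and value are hypothetical columns.
set.seed(5)
df <- data.frame(
  group = factor(rep(c("0", "1"), each = 50)),
  value = c(rnorm(50, mean = 133, sd = 19), rnorm(50, mean = 131, sd = 17))
)

# Per group: lower whisker, lower hinge, median, upper hinge, upper whisker
five_num <- tapply(df$value, df$group, function(v) boxplot.stats(v)$stats)

# Per group: the mean, which is what the stat_summary() diamond marks
group_mean <- tapply(df$value, df$group, mean)

print(five_num)
print(group_mean)
```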
3.2 Violin plot
library(ggplot2)
data$sex <- as.factor(data$sex) # Convert sex to a factor first; otherwise R treats it as the integers 0 and 1
ggplot(data, aes(sex, trestbps)) +
geom_violin(aes(fill = sex), show.legend = FALSE) + geom_jitter(width = 0.1) +
theme(axis.title.x =element_text(size=20), axis.title.y=element_text(size=20))
4 Joint distributions
4.1 Two-dimensional histogram
This time the two variables plotted are chol and trestbps.
library(ggplot2)
ggplot(data, aes(chol, trestbps)) +
geom_bin2d() +
theme(axis.title.x =element_text(size=20), axis.title.y=element_text(size=20))
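A two-dimensional histogram is a table of counts over a grid of rectangular bins. The sketch below builds such a table by hand with cut() and table() on simulated data (x and y are hypothetical stand-ins for chol and trestbps). It uses 5 bins per axis for readability, whereas geom_bin2d() defaults to 30.

```r
# Minimal sketch on simulated data; x and y are hypothetical stand-ins.
set.seed(6)
x <- rnorm(300, mean = 246, sd = 50)
y <- rnorm(300, mean = 131, sd = 17)

# Split each axis into 5 equal-width intervals
x_bin <- cut(x, breaks = 5)
y_bin <- cut(y, breaks = 5)

# Cross-tabulate: each cell is the number of points in that rectangle
counts <- table(x_bin, y_bin)
print(counts)
```

Each cell of the table corresponds to one tile in the plot, and the fill color encodes the count.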
4.2 Joint probability density plot
The two continuous variables are again chol and trestbps, with the points and marginal densities split by sex.
library(ggpubr)
data$sex <- as.factor(data$sex) # Convert sex to a factor first; otherwise R treats it as the integers 0 and 1
ggscatterhist(
data, x ='chol', y = 'trestbps',
shape=21,color ="black",fill= "sex", size =3, alpha = 0.8,
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
margin.plot = "density",
margin.params = list(fill = "sex", color = "black", size = 0.2),
legend = c(0.9,0.15),
ggtheme = theme_minimal()) +
theme(axis.title.x =element_text(size=20), axis.title.y=element_text(size=20))
4.3 Scatter plot
This time we add a few things to the scatter plot: we label the points, and we give the color and size of each point a meaning, so that points differ in both:
ggplot(population, aes(x=popChange, y=percentChange)) +
geom_point(aes(size=population2020, color=population2020)) + # Color and size encode the same variable here; change as needed
geom_text(aes(label=ChineseName), size=4, hjust=1, vjust=-1) # Add a label to each point
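If you run this, you will see the problem: every point gets a label, and the plot looks cluttered. So next we label only the points that meet a condition, and we mark the minimum, the median and the maximum in the legend on the right: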
minPerChange <- 10
minPopChange <- 1000000
population$keyProvince <- population$popChange>minPopChange & population$percentChange > minPerChange
minLabel <- format(min(population$population2020), big.mark = ",", trim = TRUE)
maxLabel <- format(max(population$population2020), big.mark = ",", trim = TRUE)
medianLabel <- format(median(population$population2020), big.mark = ",", trim = TRUE)
g <- ggplot(population, aes(x=popChange, y=percentChange)) +
geom_point(aes(size=population2020, color=population2020, shape=keyProvince)) +
geom_text(data = population[population$popChange > minPopChange & population$percentChange > minPerChange,],
aes(label=ChineseName, hjust=1, vjust=-1)) +
# Add a legend that shows the minimum, the median and the maximum
scale_color_continuous(name="Pop", breaks = with(population, c(
min(population2020), median(population2020), max(population2020))),
labels = c(minLabel, medianLabel, maxLabel), low = "white", high = "black")
g
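The two data-preparation steps in that code, filtering the rows that get a label and formatting the legend labels with thousands separators, can be tried on their own. The sketch below uses a tiny made-up data frame: the column names follow the code above, but the four regions and all of their numbers are invented for illustration.

```r
# Tiny made-up data frame; the column names follow the code above,
# but the regions and all numbers are hypothetical.
pop <- data.frame(
  name           = c("A", "B", "C", "D"),
  population2020 = c(126012510, 3648100, 84567000, 10000000),
  popChange      = c(21709378, 645935, -1000000, 1500000),
  percentChange  = c(20.81, 21.52, -1.2, 5.0)
)

minPerChange <- 10
minPopChange <- 1000000

# Keep only rows that pass both thresholds; these are the points that get a label
key <- pop$popChange > minPopChange & pop$percentChange > minPerChange
labelled <- pop[key, ]
print(labelled$name)

# Legend labels with thousands separators
max_label <- format(max(pop$population2020), big.mark = ",", trim = TRUE)
print(max_label)
```

Only region A passes both thresholds here, so it would be the only labelled point.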
4.4 Regression line
After drawing the scatter plot, add a regression line with its confidence band:
if (TRUE) {
ggplot(population, aes(x=population2010, y=popChange)) +
geom_point() +
stat_smooth(method="lm", col="red")
}
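To get the fitted line and the confidence band as numbers, a roughly equivalent base-R computation is lm() followed by predict() with interval = "confidence". The sketch below uses simulated data (x and y are hypothetical, not the census columns). The 0.95 level and the 80 evaluation points are chosen to match what I understand to be the stat_smooth() defaults.

```r
# Minimal sketch on simulated data; x and y are hypothetical.
set.seed(8)
x <- runif(31, min = 3e6, max = 1e8)
y <- 0.05 * x + rnorm(31, sd = 1e6)

fit <- lm(y ~ x)

# Evaluate the fitted line and its 95% confidence band on an even grid of 80 points
new_x <- data.frame(x = seq(min(x), max(x), length.out = 80))
band <- predict(fit, newdata = new_x, interval = "confidence", level = 0.95)

# Columns: fit (the line), lwr and upr (the band edges)
print(head(band, 3))
```

The band is narrowest near the mean of x and widens toward the ends of the range, which is the shape of the shaded region in the plot.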