Coursera Code Notes: Getting and Cleaning Data (3)

1. Subsetting and Sorting

set.seed(13435)

X<-data.frame("var1"=sample(1:5),"var2"=sample(6:10),"var3"=sample(11:15))

X<-X[sample(1:5),];X$var2[c(1,3)]=NA # modify X: shuffle the rows, then set two var2 values to NA

X

X[,1]

X[,"var1"]

X[1:2,"var2"]

Logicals: ands and ors (selecting rows)

X[(X$var1<=3&X$var3>11),]

X[(X$var1<=3|X$var3>15),]

Dealing with missing values

X[which(X$var2>8),]
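
Why which() is used here: a comparison against NA yields NA, and a logical index that contains NA returns extra rows filled with NA. A minimal sketch on a small hypothetical data frame (df is made up for illustration; it is not the X above):

```r
# Hypothetical data frame for illustration; not the X from the notes
df <- data.frame(a = 1:5, b = c(NA, 10, NA, 9, 6))

# Logical indexing: each NA in the condition produces a row filled with NA
by_logical <- df[df$b > 8, ]

# which() keeps only the positions where the condition is TRUE
by_which <- df[which(df$b > 8), ]

nrow(by_logical)  # 4: two real matches plus two NA rows
nrow(by_which)    # 2: only the real matches
```

So which() is a simple way to drop the NA comparisons when subsetting.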

Sorting

sort(X$var1)

sort(X$var1,decreasing=TRUE)

sort(X$var2,na.last=TRUE)

Ordering

X[order(X$var1),]

X[order(X$var1,X$var3),]

Ordering with plyr

library(plyr)

arrange(X,var1)
arrange(X,desc(var1))

Adding rows and columns

X$var4<-rnorm(5) # add var4 as a new column

X

Y<-cbind(X,rnorm(5))

Y

2. Summarizing Data

Getting the data from the web

if(!file.exists("./data")){dir.create("./data")}

fileUrl<-"https://data.baltimorecity.gov/api/views/k5ry-ef3g/rows.csv?accessType=DOWNLOAD"

download.file(fileUrl,destfile="./data/restaurants.csv",method="curl")

restData<-read.csv("./data/restaurants.csv")

Look a bit at the data

head(restData,n=3) # view the first three rows

tail(restData,n=3) # view the last three rows

Make summary

summary(restData)

str(restData) # look at the structure of the data in more depth

quantile(restData$councilDistrict,na.rm=TRUE) # view the quantiles

quantile(restData$councilDistrict,probs=c(0.5,0.75,0.9))

Make table

table(restData$zipCode,useNA="ifany")

table(restData$councilDistrict,restData$zipCode)

Check for missing values

sum(is.na(restData$councilDistrict))

any(is.na(restData$councilDistrict))

all(restData$zipCode>0)

Row and column sums

colSums(is.na(restData))

all(colSums(is.na(restData))==0) # returns TRUE/FALSE

Values with specific characteristics

table(restData$zipCode%in%c("21212"))

table(restData$zipCode%in%c("21212","21213"))

Values with specific characteristics

restData[restData$zipCode%in%c("21212","21213"),]

Cross tabs (view the data grouped by variables)

data(UCBAdmissions)

DF=as.data.frame(UCBAdmissions)

summary(DF)

xt<-xtabs(Freq~Gender+Admit,data=DF)

xt

        Admit
Gender   Admitted Rejected
  Male       1198     1493
  Female      557     1278

Flat tables

warpbreaks$replicate<-rep(1:9,len=54)

xt=xtabs(breaks~.,data=warpbreaks)

xt

Flat tables
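
The notes stop at this heading. A three-way xtabs result prints as a stack of two-dimensional slices, and base R's ftable() is the usual way to flatten it into a single compact table. A minimal sketch, assuming that is the step intended here (wb, xt3 and flat are illustrative names, used so that the xt above is not overwritten):

```r
# warpbreaks ships with base R (package datasets)
wb <- warpbreaks
wb$replicate <- rep(1:9, length.out = 54)

# Three-way cross tab: breaks summed by wool, tension and replicate
xt3 <- xtabs(breaks ~ ., data = wb)

# ftable() flattens the three-dimensional table into one 2-D layout:
# wool x tension combinations as rows, replicate as columns
flat <- ftable(xt3)
flat
```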

Size of a data set

fakeData=rnorm(1e5)

object.size(fakeData)

print(object.size(fakeData),units="Mb")

3. Creating New Variables

Getting data from the web

if(!file.exists("./data")){dir.create("./data")}

fileUrl<-"https://data.baltimorecity.gov/api/views/k5ry-ef3g/rows.csv?accessType=DOWNLOAD"

download.file(fileUrl,destfile="./data/restaurants.csv",method="curl")

restData<-read.csv("./data/restaurants.csv")

Creating sequences

Sometimes you need an index for your data set

s1<-seq(1,10,by=2) ;s1

[1] 1 3 5 7 9

s2<-seq(1,10,length=3);s2

[1]  1.0  5.5 10.0

x<-c(1,3,8,25,100); seq(along=x)

[1] 1 2 3 4 5

Subsetting variables

restData$nearMe=restData$neighborhood%in%c("Roland Park","Homeland")

table(restData$nearMe)

FALSE  TRUE
 1314    13

Creating binary variables

restData$zipWrong=ifelse(restData$zipCode<0,TRUE,FALSE)

table(restData$zipWrong,restData$zipCode<0)

        FALSE TRUE
  FALSE  1326    0
  TRUE      0    1

Creating categorical variables

restData$zipGroups=cut(restData$zipCode,breaks=quantile(restData$zipCode))

table(restData$zipGroups)

table(restData$zipGroups,restData$zipCode)
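
One thing to watch with cut() and quantile breaks: the intervals are open on the left by default, so the minimum value falls outside the first interval and becomes NA unless include.lowest=TRUE is passed. A minimal sketch on hypothetical values (v is made up for illustration; it is not the restaurant data):

```r
# Hypothetical values for illustration
v <- c(1, 2, 3, 4, 5, 6, 7, 8)

# Default: intervals are (lo, hi], so the minimum gets no group
g_default <- cut(v, breaks = quantile(v))
sum(is.na(g_default))   # 1

# include.lowest = TRUE closes the first interval on the left
g_closed <- cut(v, breaks = quantile(v), include.lowest = TRUE)
sum(is.na(g_closed))    # 0
table(g_closed)
```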

Easier cutting

library(Hmisc)

restData$zipGroups=cut2(restData$zipCode,g=4)

table(restData$zipGroups)

Creating factor variables

restData$zcf<-factor(restData$zipCode)

restData$zcf[1:10]

class(restData$zcf)

[1] "factor"

Levels of factor variables

yesno<-sample(c("yes","no"),size=10,replace=TRUE)

yesnofac=factor(yesno,levels=c("yes","no"))

relevel(yesnofac,ref="no")

[1] yes yes yes yes no  yes yes yes no  no

Levels: no yes

as.numeric(yesnofac)

[1] 1 1 1 1 2 1 1 1 2 2
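
Note that relevel() returns a new factor and does not change yesnofac, which is why as.numeric(yesnofac) above still codes "yes" as 1. A minimal sketch with a fixed vector in place of sample(), so the result is reproducible (yn, f and f2 are illustrative names):

```r
# Fixed vector instead of sample() so the result is reproducible
yn <- c("yes", "no", "yes", "yes", "no")

# Explicit levels: "yes" is level 1, "no" is level 2
f <- factor(yn, levels = c("yes", "no"))
as.numeric(f)           # 1 2 1 1 2

# relevel() returns a new factor; f itself is unchanged
f2 <- relevel(f, ref = "no")
levels(f2)              # "no" "yes"
as.numeric(f2)          # 2 1 2 2 1
```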

Cutting produces factor variables

library(Hmisc)

restData$zipGroups=cut2(restData$zipCode,g=4)

table(restData$zipGroups)

[-21226,21205) [ 21205,21220) [ 21220,21227) [ 21227,21287]
           338            375            300            314

Using the mutate function

library(Hmisc); library(plyr)

restData2=mutate(restData,zipGroups=cut2(zipCode,g=4))

table(restData2$zipGroups)

[-21226,21205) [ 21205,21220) [ 21220,21227) [ 21227,21287]
           338            375            300            314

Common transforms

abs(x): absolute value

sqrt(x): square root

ceiling(x): ceiling(3.475) is 4

floor(x): floor(3.475) is 3

round(x,digits=n): round(3.475,digits=2) is 3.48

signif(x,digits=n): signif(3.475,digits=2) is 3.5

cos(x), sin(x) etc.

log(x): natural logarithm

log2(x), log10(x): other common logs

exp(x): exponentiating x

4. Reshaping Data

Start with reshaping

library(reshape2)

head(mtcars) # returns a data set whose observations are car models, with variables such as each model's horsepower

Last edited: 2017.12.04 02:01:14
