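Variable names
A name must be at most 32 bytes long (one letter is 1 byte; one Chinese character is 2 bytes).
Must begin with a letter or an underscore.
May contain letters, digits, or underscores, but not characters such as %$!*&#@.
May use lowercase or uppercase letters; names are not case-sensitive.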
Missing numeric data are represented by a single period (.) and missing character data are represented by blanks.
library name
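1-8 characters; begins with a letter or an underscore; the remaining characters are letters, digits, or underscores.
Comments
Statement comment: begins with an asterisk (*) and ends with a semicolon (;)
Block comment: begins with slash-asterisk (/*) and ends with asterisk-slash (*/)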
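The two comment styles can be sketched as follows. This is a minimal illustration; the variable x and the step itself are made up for the example.

```sas
* A statement-style comment ends at the first semicolon;

/* A block comment may span several lines
   and may contain semicolons; like this one */
data _null_;
  x = 1;      /* a block comment can also sit at the end of a line */
  put x=;     /* writes x=1 to the log */
run;
```

Because a statement-style comment stops at the first semicolon, the block style is usually the safer choice for commenting out whole statements.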
Differences between DATA steps and PROC steps
The DATA statement does three things:
- Tells SAS that a DATA step is starting.
- Names the SAS dataset being created.
- Sets the variables used in the DATA step to missing values.
Three default windows
1. Program Editor window
2. Log window
3. Output window
The basics of using SAS
- Prepare the SAS program
- Submit it for analysis
- Review the resulting log for errors
- Examine the output files to view the results of your analysis
Executing the program
- Pull down the Locals menu and select Submit.
- Click on the run icon on the toolbar, which is a picture of a running man.
- Push F8.
- Highlight text and click on run symbol
- Note: a DATA or PROC step is not executed until SAS encounters the next DATA or PROC statement. Use a RUN; statement to force execution.
Reading a .dat file
DATA NAME;
INFILE 'E:\data\a.dat' FIRSTOBS=4 DLM=',';
INPUT V1 1-5 V2 6-10 V3 $ 15;
RUN;
PROC PRINT DATA=NAME; RUN;
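INFILE options
Syntax
INFILE 'AAAAA.DAT' XXX;
FIRSTOBS=n  the line at which SAS starts reading data
OBS=n  the last line SAS reads
MISSOVER  when SAS reaches the end of a line before all variables have values, it does not continue on the next line; the remaining variables are set to missing. Without it, SAS automatically goes to the next line for more data.
TRUNCOVER  for column or formatted input when some lines are shorter than the declared fields; SAS takes whatever is in the field instead of going to the next line.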
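The effect of MISSOVER can be sketched with in-stream data. The dataset name, variable names, and values below are made up for illustration.

```sas
data scores;
  /* MISSOVER: a short line leaves the remaining variables missing */
  infile datalines missover;
  input name $ test1 test2 test3;
  datalines;
Alice 90 85 88
Bob 70
Carol 60 75
;
run;

proc print data=scores; run;
```

With MISSOVER, Bob gets missing values for test2 and test3. Without it, SAS would move on to the next line and try to read Carol's record into Bob's remaining variables.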
INPUT Notes
(1) Duplicate formats can be used when variables share the same format. The three statements below are equivalent ways of reading x1-x5 with the same format.
INPUT x1 4. x2 4. x3 4. x4 4. x5 4.;
INPUT (x1 x2 x3 x4 x5) (4. 4. 4. 4. 4.);
INPUT (x1-x5) (5*4.);
(2) @@ tells SAS to hold the line of raw data and use it when processing the next observation. The @@ must be the last entry in the INPUT statement.
(3) @ tells SAS to hold this line of data for possible use by INPUT statements later in the DATA step. The @ must be the last entry in the INPUT statement.
(4) / tells SAS to move to the next line of the raw dataset.
(5) #n tells SAS to skip to the nth line of the raw data for the observation.
(6) @n tells SAS to move to the nth column.
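Special characters
@40 moves to column 40; @'aa' moves to just after the string aa
Slash (/) moves to the next line of raw data
#2 moves to the second line of raw data for the observation
For several observations on one line, put @@ at the end of the INPUT statement
A trailing @ at the end of the INPUT statement holds the line so that part of the data can be selected; see examples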
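A minimal sketch of the two line-hold specifiers, using in-stream data. All dataset names, variable names, and values are hypothetical.

```sas
/* @@ : several observations sit on each raw data line */
data pairs;
  input id score @@;
  datalines;
1 90 2 85 3 77
4 60 5 92
;
run;

/* trailing @ : read one field, test it, then decide whether to read the rest */
data seniors;
  input grade $ @;
  if grade ne 'senior' then delete;   /* skip records that are not wanted */
  input name $ score;                 /* continues on the held line */
  datalines;
junior Ann 80
senior Bob 91
senior Cy 75
;
run;
```

The first step should produce five observations from two lines; the second keeps only the two senior records.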
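Reading delimited files in a DATA step
DLM=',' specifies a comma delimiter; DLM='09'x specifies a tab delimiter
DSD ignores delimiters inside quoted strings. For example, in the record Joseph,76,"Red Racers, Washington" the commas outside the quotes are treated as delimiters, but the comma inside the quotes is not. DSD also removes the quotation marks from character values and treats two consecutive delimiters as a missing value.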
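The DSD behaviour described above can be tried with in-stream data. The first record is the one from the notes; the second record and the variable names (name, num, team) are made up for illustration.

```sas
data racers;
  infile datalines dsd;              /* DSD uses a comma as the default delimiter */
  length name $ 10 team $ 30;        /* make room for the long team string */
  input name $ num team $;
  datalines;
Joseph,76,"Red Racers, Washington"
Mitchell,,"Blue Bunnies, Florida"
;
run;

proc print data=racers; run;
```

The quoted comma stays inside team, the quotation marks are dropped, and the two adjacent commas in the second record leave num missing.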
Reading Excel data
PROC IMPORT DATAFILE='D:\A.XLS' OUT=A REPLACE DBMS=XLS; GETNAMES=YES; SHEET="Sheet1"; RUN;
PROC PRINT DATA=A; RUN;
OUT= name of the output dataset
DBMS= file type, XLS or XLSX
Reading a sas7bdat file (a file on the desktop)
data new; set 'C:\Users\sdkyc\Desktop\hsb2.sas7bdat'; run;
proc print data=new; run;
Temporary vs. permanent datasets
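A short sketch of the distinction, assuming a folder C:\mydata that already exists. The folder, the libref mylib, and the dataset names are all hypothetical.

```sas
/* Temporary: a one-level name is stored in WORK and removed when the session ends */
data temp_scores;
  x = 1;
run;

/* Permanent: a two-level name (libref.dataset) is stored in the library folder */
libname mylib 'C:\mydata';   /* hypothetical folder, must already exist */
data mylib.perm_scores;
  set temp_scores;
run;
```

In a later session, the permanent dataset can be reached again by reissuing the same LIBNAME statement.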
Variable assignment and operations
IF-THEN DO, IF-THEN/ELSE
- DO and END form a pair; all actions between them are executed.
DATA A;
INFILE 'C:\A.DAT';
INPUT V1 $ V2 V3;
IF V2 = . THEN V4='MISSING';
ELSE IF V2<100 THEN V4='LOW';
ELSE IF V2<1000 THEN V4='MEDIUM';
ELSE V4 = 'HIGH';
RUN;
- IF can also be used to build subsets.
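A minimal sketch combining a DO group with a subsetting IF. The data, the extra variable flag, and the cut-off are made up for illustration.

```sas
data b;
  input v1 $ v2;
  length v4 $ 7;                 /* long enough for 'MISSING' */
  if v2 = . then v4 = 'MISSING';
  else if v2 < 100 then do;      /* DO ... END groups several actions */
    v4 = 'LOW';
    flag = 1;
  end;
  else v4 = 'HIGH';
  if v4 ne 'MISSING';            /* subsetting IF: keep only these rows */
  datalines;
a 50
b .
c 500
;
run;
```

Only observations a and c should remain in dataset b, since the subsetting IF drops the row whose v2 is missing.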
Simplifying programs with arrays (ARRAY)
ARRAY array-name <{n}> <$> <length> <elements> <(initial-values)>;
array-name - is the name of the array.
{n} - is either the dimension of the array, or an asterisk (*) to indicate that the dimension is determined from the number of array elements or initial values.
$ - indicates that the array type is character.
length - is the maximum length of elements in the array. For character arrays, the maximum length cannot exceed 200 in early SAS releases (32,767 in current releases).
elements - are the variables that make up the array; they either exist in a dataset or are created before the array definition.
initial-values - are the values used to initialize some or all of the array elements. Separate these values with commas or blanks.
ARRAY rain {5} janr febr marr aprr mayr;
ARRAY days{7} d1-d7;
ARRAY month{*} jan feb jul oct nov;
ARRAY x{*} _NUMERIC_;
ARRAY qbx{10};
ARRAY meal{3};
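A typical use of an array is to apply the same operation to many variables in a loop. The sketch below assumes, purely for illustration, that the value 9 is a code for missing in d1-d7.

```sas
data recode;
  input d1-d7;
  array days{7} d1-d7;
  do i = 1 to 7;
    if days{i} = 9 then days{i} = .;   /* 9 is a hypothetical missing code */
  end;
  drop i;                              /* the loop counter is not needed in the output */
  datalines;
1 9 3 4 9 6 7
;
run;
```

Without the array, the same recoding would need seven separate IF statements.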
Link to annotated output notes for each PROC
https://stats.idre.ucla.edu/other/annotatedoutput/
PROC CONTENTS shows the descriptor portion of a dataset, not the data itself
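For instance, using the sample dataset sashelp.class that ships with SAS (chosen here only for illustration):

```sas
/* VARNUM lists the variables in their order in the dataset instead of alphabetically */
proc contents data=sashelp.class varnum;
run;
```

The output lists the variable names, types, and lengths along with general dataset information, but no data rows.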
PROC MEANS
Prints descriptive statistics; overlaps in function with PROC UNIVARIATE
MAXDEC= number of decimal places
proc means data=a N NMISS MEAN STD STDERR MAXDEC=4; run;
PROC UNIVARIATE t-test sample mean mu0
The Tests for Location table is a two-tailed t-test. Look at the Student's t value: if p < α, the mean of write is significantly different from 30.
proc univariate data = "D:\hsb2" plots normal mu0=30; var write; run;
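Can also be used to test normality and draw plots. Find the Shapiro-Wilk p-value: if it is greater than α, the data are consistent with a normal distribution.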
proc univariate data=a normal plot; var write; run;
1. These tests check the assumption that the data follow a normal distribution.
2. Null hypothesis: the data are normal. Alternative hypothesis: the data are not normal.
3. A large p-value (e.g. > 0.05) indicates the data are consistent with normality (we fail to reject the null hypothesis).
4. If 6 < sample size < 2001, use Shapiro-Wilk.
5. If sample size > 2000, use the Kolmogorov-Smirnov test.
6. Within the appropriate sample size range, Shapiro-Wilk is more powerful than the Kolmogorov-Smirnov test.
7. Any departure from skewness = 0 and kurtosis = 0 suggests non-normality.
PROC FREQ TABLES chisq
Tests whether two variables are associated or independent. Find the chi-square value in the output: a large chi-square statistic corresponds to a small p-value. If the p-value is small enough (say < 0.05), we reject the null hypothesis that the two variables are independent and conclude that there is an association between the row and the column variables.
PROC FREQ DATA=CLASSFIT2; TABLES SEX*HT/CHISQ; RUN;
PROC REG
Assumptions
a. Normality of errors: the error distribution is normal.
b. Normality of errors is checked by residual analysis. First calculate the residuals (r = y - ŷ, the observed value minus the fitted value), then verify the normality of the residuals using PROC UNIVARIATE or Q-Q plots.
c. Independence: the errors or observations are independent of each other. Example: the Apple stock price recorded on 10 consecutive days. Here the 10 observations are not independent.
d. Variables must be numeric.
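The residual check in (b) can be sketched as follows. The sample dataset sashelp.class and the model weight = height are used only for illustration; reg_out, resid, and pred are hypothetical names.

```sas
proc reg data=sashelp.class;
  model weight = height;
  output out=reg_out r=resid p=pred;   /* save residuals and predicted values */
run;
quit;

proc univariate data=reg_out normal;   /* NORMAL requests the normality tests */
  var resid;
  qqplot resid / normal(mu=est sigma=est);
run;
```

If the residuals are roughly normal, the points in the Q-Q plot should lie close to the reference line and the Shapiro-Wilk p-value should be large.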
PROC ANOVA
Assumption: the sampled populations are normally distributed.
One-way ANOVA: only one factor (a single variable, which may have several levels)
See the PPT slides
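A one-way ANOVA might look like the sketch below. It uses the sample dataset sashelp.class with age as the single factor, chosen only for illustration.

```sas
proc anova data=sashelp.class;
  class age;              /* the single factor, with several levels */
  model height = age;
  means age / tukey;      /* pairwise comparisons of the level means */
run;
quit;
```

The F test in the output addresses whether the mean height is the same at every level of age; the MEANS statement with TUKEY then compares the levels pairwise.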
PROC GLM contrast
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#glm_toc.htm
1. Question: is the mean height the same across ages? H0: μ1=μ2=μ3=μ4
proc glm data=a; class age; model height=age; run;
2. Question: does the mean height of 11- and 12-year-olds differ from the mean height of 13- to 16-year-olds?
proc glm data=a; class age;
model height=age;
contrast '11&12 vs. rest'
age 2 2 -1 -1 -1 -1; run; quit;
PROC CORR
Shows the Pearson correlation coefficients between variables; a negative value means negative correlation, a positive value means positive correlation.
NOSIMPLE suppresses the descriptive statistics
proc corr data = "D:\hsb2" pearson nosimple; var read write; run;
PROC TTEST t-test
Assumption: all variables are normally distributed.
- Single-sample t-test. Example: test whether the mean of score equals 50; if p < α, the mean is significantly different from 50.
proc ttest data="D:\hsb2" H0=50; var score; run;
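- Dependent-group t-test (paired t-test). Example: a group of students each took two exams; is the mean of their write scores the same as the mean of their read scores? If p < α, the means are significantly different.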
proc ttest data="D:\hsb2"; paired write*read; run;
- Independent-group t-test. Example: does sex affect the write score?
If Pr > F in the Equality of Variances table is less than α, the variances of the two sex groups differ: use the Satterthwaite (unequal) method and read its Pr > |t|. Otherwise use the Pooled (equal) method.
proc ttest data="D:\hsb2"; class sex; var write; run;
PROC NPAR1WAY
Can be used for the Wilcoxon test. Example questions:
Are test scores different from 4th grade to 5th grade on the same students?
Does a particular diet drug have an effect on BMI when tested on the same individuals?
Assumptions of the test:
Data comes from two matched, or dependent, populations.
The data is continuous.
Because it is a non-parametric test, it does not require the dependent variable to follow any particular distribution.
Especially suitable for small sample sizes
one- and two-tail test
P value
If testing H0 = 0 and p < α, reject H0: the mean is significantly different from 0.
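For matched data such as the example questions above, one common approach in SAS is to compute the paired differences and pass them to PROC UNIVARIATE, whose Tests for Location table includes the Wilcoxon signed-rank test. PROC NPAR1WAY with the WILCOXON option instead compares independent groups defined by a CLASS variable. The sketch below shows both; the paired scores are made-up values, and sashelp.class is used only for illustration.

```sas
/* Matched data: signed-rank test on the paired differences */
data paired;
  input id grade4 grade5;
  diff = grade5 - grade4;
  datalines;
1 72 75
2 80 78
3 65 70
4 90 94
5 58 63
6 77 76
;
run;

proc univariate data=paired mu0=0;   /* H0: the location of diff is 0 */
  var diff;
run;

/* Independent groups: Wilcoxon rank-sum test */
proc npar1way data=sashelp.class wilcoxon;
  class sex;
  var height;
run;
```

In the PROC UNIVARIATE output, the Signed Rank row of Tests for Location gives the p-value to compare against α.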
Template code
proc print data= ; run;