Simulation studies in statistics:
A.?What is a simulation study, and why do one?
B.?Simulations for properties of estimators.
C.?Simulations for properties of hypothesis tests.
Simulation:?A numerical technique for conducting experiments on the computer.?It involves random sampling from probability distributions.
Rationale:?Properties of statistical methods must be established so that the methods may be used with confidence.?Exact analytical derivations of properties are rarely possible.?Large sample case to properties are often possible.
- But what happens in the finite sample size case?
- And what happens when assumptions are violated?
Usual issues - Q&A :?
Is an estimator biased in finite samples??
How does it compare to competing estimators on the basis of bias, precision, etc.?
Does a procedure for constructing a confidence interval for a parameter achieve the advertised nominal level of coverage?
Does a hypothesis testing procedure attain the advertised level of significance or size?
Do different test procedures deliver different power?
?Monte Carlo simulation approximation:?An estimator or test statistic has a true sampling distribution under a particular set of conditions (finite sample size, true distribution of the data).? Ideally, we could want to know this true sampling distribution in order to address the issues on the previous slide.?But derivation of true sampling distribution is not tractable.?Hence we approximate the sampling distribution of an estimator or a test statistic under a particular set of conditions.
?A typical Monte Carlo simulation involves the following:?Generate S independent data sets under the conditions of interest;?Compute the numerical value of the test statistic T?from each data set. Hence we have an array of T series in the items. If S is large enough, summary statistics across many T?should be good approximations to the true properties of the estimator under the conditions of interest.
1. A simple example:?
Compare three estimators for the mean μ of a distribution based on a random sample?.?The three estimators are:?Sample Mean (
) ;?Sample Median (
) ;?Sample 10% trimmed mean (
).
2. Simulation procedures:
For a particular choice of μ, n and true underlying distribution:?Generate independent draws?from the distribution ;?Compute?
,?
,??
?;?Repeat S times. Hence we have,?
,?
,?
.?Compute for k = 1, 2, 3, , and then we have the important value in observation:?
?;?
?;?
;
.
3. A simulation study:?R code
To compare 3 estimators for location μ through a simulation study.?The 3 estimators are:?Sample Mean ;??Sample Median ;?Sample 10% trimmed mean.?Underlying distribution: N(0,1).?Sample size: 15.?Simulation size: 1000?
> # To compare 3 estimators for location,mu, through simulation study
> # The 3 estimators are sample mean, median, and 10% trimmed mean.
> S=1000 # Simulation size (No.of samples)
> n=15 # Sample size
> mu=1 # Mean of the underlying normal distribution
> sd=1 # Standard deviation of the underlying normal distribution
> meax=numeric(S) # A vector contains all the sample means
> medx=numeric(S) # A vector contains all the sample medians
> trmx=numeric(S) # A vector contains all the sample 10% trimmed means
> stdx=numeric(S) # A vector contains all the sample standard deviation
> set.seeds(1234) # Set the seed number
> for (i in 1:S){
+ x=rnorm(n,mu,sd) # Generate a random sample of size n
+ mean[i]=mean(x) # Compute the mean for the i-th
sample,i.e.T∧(1) i
+ medx[i]=median(x)# Compute the median for the i-th
sample,i.e.T∧(2) i
+ trmx[i]=mean(x,trim=0.1)
+# Compute the 10% trimmed mean for the i-th sample, i.e.T∧(3) i
+ stdx[i]=sd(x)# Compute the standard deviation for the i-th sample
}
> # Compute the mean of “S” sample means, medians, and trimmed means
> simumean=apply(cbind(meax,medx,trmx),2,mean)
> # “2” in the second argument asks the computer to fine “means” for
all the columns in the matrix given in argument 1.
> # Hence “simumean” is a 3×1 vector consists of the average of the “S” means, medians, and the 10% trimmed means.
> # Compute the standard deviation of the “S” sample means, medians, and 10% trimmed means
> simustd=apply(cbind(meax,medx,trmx),2,sd)
> # Compute the bias
> simubias=simumean-rep(mu,3)
> # Compute the MSE (Mean Square Error)
> simumse=simubias∧2+simustd∧2
> ests=c(“Sample mean”,“Sample median”,“Sample 10% trimmed mean”) # column heading in the output
> names=c(“True value”,“No. of simu”,“”MC Mean,“MC Std Deviation”,“MC Bias”,“MC MSE”) # row heading in the output
> sumdat=rbind(rep(mu,3),rep(S,3),simumean,simustd,simbias,simumse)
> dimnames(sumdat)=list(names,ests)
> round(sumdat,4)
4. Checking coverage probability of confidence interval: Usual 95% confidence interval for μ,
> #Check the coverage probability of confidence interval
> #We make use of the data obtained from the above simulation study
> t05=qt(0.975,n-1) #Get t {0.025,14}
>coverage=sum((meax-t05*stdx/sqrt(n)<=mu)&(meax+t05*stdx/sqrt(n)>=mu))/S
> #The above statement is equivalent to
> # d=0
> # for(i in 1:S){
> #d=d+meax[i]-t05*stdx[i]/sqrt(n)<=mu)&(meax[i]+t05*stdx[i]/sqrt(n)>=mu))}
> #coverage=d/S
>coverage
[1]0.945
Example: Consider the size and power of the usual t-test for the mean
(i)?Test??against?
?
(ii)?To evaluate whether level of test achieves advertised α :?Generate data under and calculate the proportion of rejections of?
.?
(iii)?Approximation the true probability of rejecting??when it is true:
should be approximately equal to α.
(iv)?To evaluate power:?Generate data under some alternative??and calculate the?
?of rejections of?
.?
(v) Approximate the true probability of rejectingwhen the alternative is true.
> #Check the size of t-test
>ttests=(meax-mu0)/(stdx/sqrt(n))
>size=sum(abs(ttests)>t05)/S
>size
[1]0.045
> #Check the power of t-test when d=1*sd
> Assume the samples generated from N(μ1, σ) where μ1 = 1, σ = 1.
> Test H0 : μ = μ0 vs H1 : μ = μ1 = μ0 + σ where μ0 = 0.
>ttests=(meax-mu0)/(stdx/sqrt(n))
>power=sum(abs(ttests)>t05)/S
>power
[1]0.954
5.?Simulation study in SAS
Simulation to study the size of t-test
*Generate “ns” normal random samples;
* “mcrep” (M.C. Replications) is the index for the s-th random sample;
data simu;
seed=123;
ns=1000;n=25;mu=0;sigma=1;
do mcrep=1 to ns;
do i=1 to n;
x=rannor(seed);
output; end;
keep mcrep x;
end;
run;
proc sort data=simu;
by mcrep;
run;
*Perform the t-test for each sample.
Output the p-value “probt” in the SAS dataset “outtset”;
proc univariate data=simu noprint mu0=0;
by mcrep;
var x;
output out=outtest probt=p;
run;
* Count how many samples have “probt”<0.05;
data outtest;
set outtest;
reject=(p<0.05);
run;
* “reject” is the number of samples with “probt”<0.05. Hence the
mean of “reject” is the rejection rate,“rejrate”;
proc means data=outtest noprint;
var reject;
output out=results mean=rejrate;
proc print data=results;
var
freq
rejrate;
run;