import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(42)
full_data = pd.read_csv('coffee_dataset.csv')
sample_data = full_data.sample(200)
-
If you were interested in whether the average height for coffee drinkers is the same as for non-coffee drinkers, what would the null and alternative hypotheses be? Place them in the cell below, and use your answer to answer the first quiz question below.
Since this statement has no directional component, a "not equal to" alternative is the most reasonable choice.
$H_0: \mu_{coff} - \mu_{no} = 0$
$H_1: \mu_{coff} - \mu_{no} \neq 0$
$\mu_{coff}$ and $\mu_{no}$ are the population mean heights for coffee drinkers and non-coffee drinkers, respectively.
-
If you were interested in whether the average height for coffee drinkers is less than that for non-coffee drinkers, what would the null and alternative hypotheses be? Place them in the cell below, and use your answer to answer the second quiz question below.
In this case, the question has a direction: is the average height for coffee drinkers less than that for non-coffee drinkers? Below is one way you could write the null and alternative. Since the mean for coffee drinkers is listed first here, the alternative states that the difference is negative.
$H_0: \mu_{coff} - \mu_{no} \geq 0$
$H_1: \mu_{coff} - \mu_{no} < 0$
$\mu_{coff}$ and $\mu_{no}$ are the population mean heights for coffee drinkers and non-coffee drinkers, respectively.
For 10,000 iterations: bootstrap the sample data, calculate the mean height for coffee drinkers and for non-coffee drinkers, and calculate the difference in means for each bootstrap sample. You will want to have three arrays at the end of the iterations: one for each mean and one for the difference in means. Use the resulting sampling distributions to answer the third quiz question below.
nocoff_means, coff_means, diffs = [], [], []
for _ in range(10000):
    # resample the 200 rows with replacement
    bootsamp = sample_data.sample(200, replace = True)
    coff_mean = bootsamp[bootsamp['drinks_coffee'] == True]['height'].mean()
    nocoff_mean = bootsamp[bootsamp['drinks_coffee'] == False]['height'].mean()
    # append the info
    coff_means.append(coff_mean)
    nocoff_means.append(nocoff_mean)
    diffs.append(coff_mean - nocoff_mean)
np.std(nocoff_means) # the standard deviation of the sampling distribution for nocoff
np.std(coff_means) # the standard deviation of the sampling distribution for coff
np.std(diffs) # the standard deviation for the sampling distribution for difference in means
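Each of these standard deviations is a bootstrap estimate of a standard error. As a rough sanity check, the bootstrap standard error of a single mean should land close to the textbook value s / sqrt(n). The sketch below illustrates this on synthetic data only: the `heights` array, its mean of 67, and its spread of 3 are made-up stand-ins, not values from `coffee_dataset.csv`, and it uses one group of fixed size rather than the two groups above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for one group's heights (assumed values, not the real data)
heights = rng.normal(67, 3, size=200)

# Bootstrap: resample with replacement and record the mean each time
boot_means = np.array([
    rng.choice(heights, size=heights.size, replace=True).mean()
    for _ in range(5000)
])

boot_se = boot_means.std()                                   # bootstrap standard error
analytic_se = heights.std(ddof=1) / np.sqrt(heights.size)    # s / sqrt(n)

print(boot_se, analytic_se)
```

The two numbers should agree closely; a large gap would suggest a mistake in the resampling step.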
plt.hist(nocoff_means, alpha = 0.5);
plt.hist(coff_means, alpha = 0.5); # They look pretty normal to me!
plt.hist(diffs, alpha = 0.5); # again normal - this is by the central limit theorem
null_vals = np.random.normal(0, np.std(diffs), 10000) # Here are 10000 draws from the sampling distribution under the null
plt.hist(null_vals); #Here is the sampling distribution of the difference under the null
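One possible way to use these simulated null values is to ask how often they are at least as extreme as the observed difference in sample means, once for each of the two sets of hypotheses written earlier. The sketch below is self-contained and uses made-up numbers: `diffs` here is a synthetic stand-in for the bootstrap differences, and `obs_diff`, `p_two_sided`, and `p_lower` are hypothetical names introduced only for illustration. For the one-sided case, the null is simulated at its boundary value of 0.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins (assumed values, not results from coffee_dataset.csv)
diffs = rng.normal(1.3, 0.4, size=10_000)   # pretend bootstrap differences
obs_diff = 1.3                              # pretend observed difference in means

# Draws from the sampling distribution under the null (centered at 0)
null_vals = rng.normal(0, np.std(diffs), size=10_000)

# Alternative "difference != 0": count null draws at least as far from 0
p_two_sided = np.mean(np.abs(null_vals) >= abs(obs_diff))

# Alternative "difference < 0": count null draws at or below the observed value
p_lower = np.mean(null_vals <= obs_diff)

print(p_two_sided, p_lower)
```

With a positive observed difference like this one, the proportion for the "less than" alternative is large, so that alternative would not be supported, while the two-sided proportion is small.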