ccyes <-c(78, 62) # the number of students who said they believe in climate changeccno <-c(22, 38) # the number of students who do not believe in climate changestate <-c('New York', 'Kentucky')contable <-rbind(ccyes, ccno)colnames(contable) <- statecontable
New York Kentucky
ccyes 78 62
ccno 22 38
chisq.test(contable)
Pearson's Chi-squared test with Yates' continuity correction
data: contable
X-squared = 5.3571, df = 1, p-value = 0.02064
Write out the null and alternate hypothesis for the chi-square test you just ran. Please interpret the result. Can you reject the null hypothesis at alpha = 0.01? What about at alpha = 0.05?
\(H_0\) (Null hypothesis): The number of students that believe in climate change is independent of whether they are from New York or Kentucky.
\(H_A\) (Alternative hypothesis): The number of students that believe in climate change is associated with which whether they are from New York or Kentucky.
Result Interpretation: The P-value is 0.02.
For \(\alpha=0.01\), we fail to reject the null hypothesis (\(0.02>0.01\)). The number of students that believe in climate change is independent of whether they are from New York or Kentucky.
For \(\alpha=0.05\), we reject the null hypothesis (\(0.02<0.05\)). The number of students that believe in climate change is associated with whether they are from New York or Kentucky.
1.2 Exercise 1, Part 2
Now run the same analyses as above, but pick two different states. Use this link to get information on the % of people who believe global warming is happening in each state in the US. Hint - look at the numbers for Kentucky and New York. How did I use these to parameterize my ccyes and ccno values above? Please do the same thing, but pick two different states.
ccyes <-c(72, 76) # the percent of people who said they believe in climate changeccno <-c(28, 24) # the percent of people who do not believe in climate changestate <-c('Texas (%)', 'Washington (%)')contable <-rbind(ccyes, ccno)colnames(contable) <- statecontable
Texas (%) Washington (%)
ccyes 72 76
ccno 28 24
chisq.test(contable)
Pearson's Chi-squared test with Yates' continuity correction
data: contable
X-squared = 0.23389, df = 1, p-value = 0.6287
\(H_0\) (Null hypothesis): The percent of people that believe in climate change is independent of whether they are from Texas or Washington.
\(H_A\) (Alternative hypothesis): The percent of people that believe in climate change is associated with whether they are from Texas or Washington.
Result Interpretation: The P-value is \(0.6287>\alpha=0.05\). Therefore, we fail to reject the null hypothesis. The percent of people that believe in climate change is independent of whether they are from Texas or Washington.
2 Exercise 2
# The OrchardSprays datasethead(OrchardSprays)
decrease rowpos colpos treatment
1 57 1 1 D
2 95 2 1 E
3 8 3 1 B
4 69 4 1 H
5 92 5 1 G
6 90 6 1 F
2.1 Exercise 2, Part 1
Which treatments are significantly different from one another according to the results of the TukeyHSD?
\(H_0\) (Null hypothesis): There is no difference in the amount of sugar the bees eat among the 8 groups of treatment.
\(H_A\) (Alternative hypothesis): At least one of the treatment sugar consumption means is different from the others.
boxplot(decrease~treatment, data=OrchardSprays, xlab='Treatment', ylab='Decrease Mean (Effect Size)')
Answer: The P-value of the ANOVA test is \(9.5 \times 10^{-13}<<0.05\). Therefore we reject the null hypothesis. At least one of the treatment sugar consumption means is different from the others. According to the Tukey’s HSD test result, the treatments that are significantly different from one another (P-value < 0.05) are as follows.
# treatments pairs that are significantly different from each otheri.significant <- tukey_output.mat[,4]<0.05tukey_output.mat[i.significant,]
Why do we use the Tukey’s HSD test instead of just running multiple t-tests that compare each pair of treatments in your sample?
Answer: Say we do t tests at \(\alpha=0.05\), multiple t tests will lead to a higher chance of committing at least one Type I error (\(1-0.95^{N_{times}}\)). Therefore, we use ANOVA instead of multiple t-tests. While ANOVA can tell us that there is a significant difference among groups, it doesn’t specify which groups are significantly different from each other. The Tukey’s HSD test is a post-hoc analysis that helps identify which specific group means are significantly different.
par(mfrow =c(1, 2))boxplot(uptake ~ Type, data = CO2, las =1, main ='Uptake by Different Types')boxplot(uptake ~ Treatment, data = CO2, las =1, main ='Uptake by Different Treatments')
Please interpret the results of your two-way ANOVA. Which factors have a significant effect on CO2 uptake?
Df Sum Sq Mean Sq F value Pr(>F)
Treatment 1 988 988 14.95 0.000222 ***
Type 1 3366 3366 50.92 3.68e-10 ***
Residuals 81 5353 66
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Answer:
Hypotheses
Difference in factor Treatment
\(H_0\): There is no difference in uptake between nonchilled and chilled plants, controlling for plant type.
\(H_A\): There is a difference in uptake between nonchilled and chilled plants, controlling for plant type.
Difference in factor Type
\(H_0\): There is no difference in uptake between Quebec and Mississippi plants, controlling for treatment.
\(H_A\): There is a difference in uptake between Quebec and Mississippi plants, controlling for treatment.
Result Interpretation
The P-value of the Treatment test is \(0.000222<<\alpha=0.05\). Therefore we reject the null hypothesis. There is a difference in uptake between nonchilled and chilled plants, controlling for plant type.
The P-value of the Type test is \(3.68 \times 10^{-10}<<\alpha=0.05\). Therefore we reject the null hypothesis. There is a difference in uptake between Quebec and Mississippi plants, controlling for treatment.
Therefore, Both factors have a significant effect on CO2 uptake.
4 Exercise 4
library(fivethirtyeight)head(bechdel, 3) # amount of money different movies earned based on whether or not they passed the Bechdel test or not
year imdb title test clean_test binary budget
1 2013 tt1711425 21 & Over notalk notalk FAIL 13000000
2 2012 tt1343727 Dredd 3D ok-disagree ok PASS 45000000
3 2013 tt2024544 12 Years a Slave notalk-disagree notalk FAIL 20000000
domgross intgross code budget_2013 domgross_2013 intgross_2013
1 25682380 42195766 2013FAIL 13000000 25682380 42195766
2 13414714 40868994 2012PASS 45658735 13611086 41467257
3 53107035 158607035 2013FAIL 20000000 53107035 158607035
period_code decade_code
1 1 1
2 1 1
3 1 1
4.1 Exercise 4, Part 1
Please write out the null and alternate hypothesis. Can you reject the null hypothesis at alpha = 0.05 based on the p value of your ANOVA? Please explain your results in non-technical terms.
\(H_0\) (Null hypothesis): There is no difference in the amount of money earned by each movie domestically (domgross_2013) whether these movies passed the Bechdel test (binary).
\(H_A\) (Alternative hypothesis): There is a difference in the amount of money earned by each movie domestically whether these movies passed the Bechdel test.
Df Sum Sq Mean Sq F value Pr(>F)
binary 1 3.487e+17 3.487e+17 22.24 2.6e-06 ***
Residuals 1774 2.782e+19 1.568e+16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
18 observations deleted due to missingness
Result Interpretation: The P-value is \(2.6\times10^{-6}<<\alpha=0.05\). Therefore, we reject the null hypothesis. There is a difference in the amount of money earned by each movie domestically whether these movies passed the Bechdel test.
4.2 Exercise 4, Part 2
Now run an ANOVA to see if 2013 dollars (domgross_2013) differ based on whether a movie passed the Bechdel test or not (binary) and the decade in which the movie was made (decade_code). Hint - Right now the decade_code variable is an integer and therefore not a categorical variable. To turn it into a categorical variable, please use ‘as.factor(decade_code)’ in your ANOVA function.
Df Sum Sq Mean Sq F value Pr(>F)
binary 1 8.681e+16 8.681e+16 8.995 0.002749 **
as.factor(decade_code) 2 1.673e+17 8.363e+16 8.665 0.000181 ***
Residuals 1596 1.540e+19 9.651e+15
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
194 observations deleted due to missingness
Hypotheses
Difference in factor binary
\(H_0:\) There is no difference in the amount of money earned by each movie domestically whether they passed the Bechdel test, controlling for the decade in which the movie was made (decade_code).
\(H_A:\) There is a difference in the amount of money earned by each movie domestically whether they passed the Bechdel test, controlling for the decade in which the movie was made.
Difference in factor decade_code
\(H_0:\) There is no difference in the amount of money earned by each movie domestically in which decade they are made, controlling for whether they passed the Bechdel test.
\(H_0:\) There is a difference in the amount of money earned by each movie domestically in which decade they are made, controlling for whether they passed the Bechdel test.
Result Interpretation
The P-value for the binary test is \(0.002749<\alpha=0.05\). Therefore, we reject the null hypothesis. There is a difference in the amount of money earned by each movie domestically whether they passed the Bechdel test, controlling for the decade in which the movie was made.
The P-value for the decade_code test is \(0.000181<<\alpha=0.05\). Therefore, we reject the null hypothesis. There is a difference in the amount of money earned by each movie domestically in which decade they are made, controlling for whether they passed the Bechdel test.
Therefore, Both factors have a significant effect on the amount of money earned by each movie domestically.
4.3 Exercise 4, Part 3
BONUS/EXTRA CREDIT - worth 5 credits Pick another dataset from within the fivethirtyeight package and run either a one way or two way ANOVA. Please write out the null and alternate hypothesis and state whether you can reject the null hypothesis based on your p value. Remember, in order for a dataset to work for this question, your indepdent variable has to be categorical (and have 2 or more categories) and your dependent variable has to be continuous. You can ask Kai or the GSIs for help selecting an appropriate dataset, but please do not ask other students.
I found this hate_crimes dataset from the fivethirtyeight package interesting. It’s a data frame with 51 rows representing US states and DC and 13 variables. The data I will be using contain median_house_inc, share_unemp_seas, and hate_crimes_per_100k_splc. The discriptions of these variables are as follows.
Attribute
Meaning
Class
Type
median_house_inc
Median household income, 2016
numeric
Independent
gini_index
Gini Index, 2015
numeric
Independent
hate_crimes_per_100k_splc
Average annual hate crimes per 100k population, 2016
numeric
Dependent
head(hate_crimes, 3) # The dataset I will be using
For doing an ANOVA, I need categorical independent variables. Therefore, I will normalize the income and unemployment to three levels: high, median, and low.
library(dplyr)# function for normalizing a vector to 0-1getNormalized <-function(vec){return((vec-min(vec))/(max(vec)-min(vec)))}levels <-c('Low', 'Medium', 'High') # set levelsmy_hatecrime <- hate_crimes %>%select(state_abbrev, median_house_inc, gini_index, hate_crimes_per_100k_splc) %>%# select only the interested colsna.omit() %>%# remove records with NAsmutate(income.n=getNormalized(median_house_inc)) %>%# get normalized incomemutate(gini_index.n=getNormalized(gini_index)) %>%# get normalized Gini indexmutate(income.level=cut(income.n, breaks=c(-Inf, 1/3, 2/3, Inf), labels=levels)) %>%# get categorical income levelsmutate(gini_index.level=cut(gini_index.n, breaks=c(-Inf, 1/3, 2/3, Inf), labels=levels)) %>%# get categorical Gini indexselect(state_abbrev, income.level, gini_index.level, hate_crimes_per_100k_splc) # select only the interested colshead(my_hatecrime, 5)
# A tibble: 5 × 4
state_abbrev income.level gini_index.level hate_crimes_per_100k_splc
<chr> <fct> <fct> <dbl>
1 AL Low Medium 0.126
2 AK High Low 0.144
3 AZ Medium Low 0.225
4 AR Low Medium 0.0691
5 CA Medium Medium 0.256
\(H_0:\) There is no difference in hate crime frequency between different income levels (income.level), controlling for Gini index level (gini_index.level).
\(H_A:\) There is a difference in hate crime frequency between different income levels, controlling for Gini index level.
Difference in factor gini_index.level
\(H_0:\) There is no difference in hate crime frequency between different Gini index levels, controlling for income level.
\(H_0:\) There is a difference in hate crime frequency between different Gini index levels, controlling for income level.
Result Interpretation
The P-value for the income.level test is \(0.03436<\alpha=0.05\). Therefore, we reject the null hypothesis. There is a difference in hate crime frequency between different income levels, controlling for Gini index level.
The P-value for the gini_index.level test is \(0.00103<\alpha=0.05\). Therefore, we reject the null hypothesis. There is a difference in hate crime frequency between different Gini index levels, controlling for income level.
Therefore, Both factors have a significant effect on the hate crime frequency.