Lab 5: Chi-square & ANOVA

Author

Xiuyu Cao

Published

February 26, 2024

1 Exercise 1

1.1 Exercise 1, Part 1

ccyes <- c(78, 62)  # the number of students who said they believe in climate change
ccno <- c(22, 38)  # the number of students who do not believe in climate change
state <- c('New York', 'Kentucky')
contable <- rbind(ccyes, ccno)
colnames(contable) <- state
contable
      New York Kentucky
ccyes       78       62
ccno        22       38
chisq.test(contable)

    Pearson's Chi-squared test with Yates' continuity correction

data:  contable
X-squared = 5.3571, df = 1, p-value = 0.02064

Write out the null and alternate hypothesis for the chi-square test you just ran. Please interpret the result. Can you reject the null hypothesis at alpha = 0.01? What about at alpha = 0.05?

  • \(H_0\) (Null hypothesis): The number of students that believe in climate change is independent of whether they are from New York or Kentucky.
  • \(H_A\) (Alternative hypothesis): The number of students that believe in climate change is associated with which whether they are from New York or Kentucky.
  • Result Interpretation: The P-value is 0.02.
    • For \(\alpha=0.01\), we fail to reject the null hypothesis (\(0.02>0.01\)). The number of students that believe in climate change is independent of whether they are from New York or Kentucky.
    • For \(\alpha=0.05\), we reject the null hypothesis (\(0.02<0.05\)). The number of students that believe in climate change is associated with whether they are from New York or Kentucky.

1.2 Exercise 1, Part 2

Now run the same analyses as above, but pick two different states. Use this link to get information on the % of people who believe global warming is happening in each state in the US. Hint - look at the numbers for Kentucky and New York. How did I use these to parameterize my ccyes and ccno values above? Please do the same thing, but pick two different states.

ccyes <- c(72, 76)  # the percent of people who said they believe in climate change
ccno <- c(28, 24)  # the percent of people who do not believe in climate change
state <- c('Texas (%)', 'Washington (%)')
contable <- rbind(ccyes, ccno)
colnames(contable) <- state
contable
      Texas (%) Washington (%)
ccyes        72             76
ccno         28             24
chisq.test(contable)

    Pearson's Chi-squared test with Yates' continuity correction

data:  contable
X-squared = 0.23389, df = 1, p-value = 0.6287
  • \(H_0\) (Null hypothesis): The percent of people that believe in climate change is independent of whether they are from Texas or Washington.
  • \(H_A\) (Alternative hypothesis): The percent of people that believe in climate change is associated with whether they are from Texas or Washington.
  • Result Interpretation: The P-value is \(0.6287>\alpha=0.05\). Therefore, we fail to reject the null hypothesis. The percent of people that believe in climate change is independent of whether they are from Texas or Washington.

2 Exercise 2

# The OrchardSprays dataset
head(OrchardSprays)
  decrease rowpos colpos treatment
1       57      1      1         D
2       95      2      1         E
3        8      3      1         B
4       69      4      1         H
5       92      5      1         G
6       90      6      1         F

2.1 Exercise 2, Part 1

Which treatments are significantly different from one another according to the results of the TukeyHSD?

  • \(H_0\) (Null hypothesis): There is no difference in the amount of sugar the bees eat among the 8 groups of treatment.
  • \(H_A\) (Alternative hypothesis): At least one of the treatment sugar consumption means is different from the others.
boxplot(decrease~treatment, data=OrchardSprays, xlab='Treatment', ylab='Decrease Mean (Effect Size)')

spray.aov <- aov(decrease~treatment, data=OrchardSprays)
summary(spray.aov)
            Df Sum Sq Mean Sq F value  Pr(>F)    
treatment    7  56160    8023   19.06 9.5e-13 ***
Residuals   56  23570     421                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Run Tukey test and get significant groups
tukey_output.mat <- TukeyHSD(spray.aov)$treatment
tukey_output.mat
      diff        lwr       upr        p adj
B-A  3.000 -29.294313  35.29431 9.999898e-01
C-A 20.625 -11.669313  52.91931 4.842087e-01
D-A 30.375  -1.919313  62.66931 7.953774e-02
E-A 58.500  26.205687  90.79431 1.227157e-05
F-A 64.375  32.080687  96.66931 1.461405e-06
G-A 63.875  31.580687  96.16931 1.754234e-06
H-A 85.625  53.330687 117.91931 5.841814e-10
C-B 17.625 -14.669313  49.91931 6.756283e-01
D-B 27.375  -4.919313  59.66931 1.540608e-01
E-B 55.500  23.205687  87.79431 3.566395e-05
F-B 61.375  29.080687  93.66931 4.354981e-06
G-B 60.875  28.580687  93.16931 5.218995e-06
H-B 82.625  50.330687 114.91931 1.746015e-09
D-C  9.750 -22.544313  42.04431 9.793381e-01
E-C 37.875   5.580687  70.16931 1.111331e-02
F-C 43.750  11.455687  76.04431 1.872902e-03
G-C 43.250  10.955687  75.54431 2.193463e-03
H-C 65.000  32.705687  97.29431 1.162751e-06
E-D 28.125  -4.169313  60.41931 1.316222e-01
F-D 34.000   1.705687  66.29431 3.230553e-02
G-D 33.500   1.205687  65.79431 3.680078e-02
H-D 55.250  22.955687  87.54431 3.895005e-05
F-E  5.875 -26.419313  38.16931 9.990714e-01
G-E  5.375 -26.919313  37.66931 9.994801e-01
H-E 27.125  -5.169313  59.41931 1.621616e-01
G-F -0.500 -32.794313  31.79431 1.000000e+00
H-F 21.250 -11.044313  53.54431 4.453236e-01
H-G 21.750 -10.544313  54.04431 4.150331e-01

Answer: The P-value of the ANOVA test is \(9.5 \times 10^{-13}<<0.05\). Therefore we reject the null hypothesis. At least one of the treatment sugar consumption means is different from the others. According to the Tukey’s HSD test result, the treatments that are significantly different from one another (P-value < 0.05) are as follows.

# treatments pairs that are significantly different from each other
i.significant <- tukey_output.mat[,4]<0.05
tukey_output.mat[i.significant,]
      diff       lwr       upr        p adj
E-A 58.500 26.205687  90.79431 1.227157e-05
F-A 64.375 32.080687  96.66931 1.461405e-06
G-A 63.875 31.580687  96.16931 1.754234e-06
H-A 85.625 53.330687 117.91931 5.841814e-10
E-B 55.500 23.205687  87.79431 3.566395e-05
F-B 61.375 29.080687  93.66931 4.354981e-06
G-B 60.875 28.580687  93.16931 5.218995e-06
H-B 82.625 50.330687 114.91931 1.746015e-09
E-C 37.875  5.580687  70.16931 1.111331e-02
F-C 43.750 11.455687  76.04431 1.872902e-03
G-C 43.250 10.955687  75.54431 2.193463e-03
H-C 65.000 32.705687  97.29431 1.162751e-06
F-D 34.000  1.705687  66.29431 3.230553e-02
G-D 33.500  1.205687  65.79431 3.680078e-02
H-D 55.250 22.955687  87.54431 3.895005e-05

2.2 Exercise 2, Part 2

Why do we use the Tukey’s HSD test instead of just running multiple t-tests that compare each pair of treatments in your sample?

Answer: Say we do t tests at \(\alpha=0.05\), multiple t tests will lead to a higher chance of committing at least one Type I error (\(1-0.95^{N_{times}}\)). Therefore, we use ANOVA instead of multiple t-tests. While ANOVA can tell us that there is a significant difference among groups, it doesn’t specify which groups are significantly different from each other. The Tukey’s HSD test is a post-hoc analysis that helps identify which specific group means are significantly different.

3 Exercise 3

head(CO2)  # the CO2 data
  Plant   Type  Treatment conc uptake
1   Qn1 Quebec nonchilled   95   16.0
2   Qn1 Quebec nonchilled  175   30.4
3   Qn1 Quebec nonchilled  250   34.8
4   Qn1 Quebec nonchilled  350   37.2
5   Qn1 Quebec nonchilled  500   35.3
6   Qn1 Quebec nonchilled  675   39.2
par(mfrow = c(1, 2))
boxplot(uptake ~ Type, data = CO2, las = 1, main = 'Uptake by Different Types')
boxplot(uptake ~ Treatment, data = CO2, las = 1, main = 'Uptake by Different Treatments')

Please interpret the results of your two-way ANOVA. Which factors have a significant effect on CO2 uptake?

CO2.aov <- aov(uptake ~ Treatment + Type, data=CO2)
summary(CO2.aov)
            Df Sum Sq Mean Sq F value   Pr(>F)    
Treatment    1    988     988   14.95 0.000222 ***
Type         1   3366    3366   50.92 3.68e-10 ***
Residuals   81   5353      66                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Answer:

  • Hypotheses
    • Difference in factor Treatment
      • \(H_0\): There is no difference in uptake between nonchilled and chilled plants, controlling for plant type.
      • \(H_A\): There is a difference in uptake between nonchilled and chilled plants, controlling for plant type.
    • Difference in factor Type
      • \(H_0\): There is no difference in uptake between Quebec and Mississippi plants, controlling for treatment.
      • \(H_A\): There is a difference in uptake between Quebec and Mississippi plants, controlling for treatment.
  • Result Interpretation
    • The P-value of the Treatment test is \(0.000222<<\alpha=0.05\). Therefore we reject the null hypothesis. There is a difference in uptake between nonchilled and chilled plants, controlling for plant type.
    • The P-value of the Type test is \(3.68 \times 10^{-10}<<\alpha=0.05\). Therefore we reject the null hypothesis. There is a difference in uptake between Quebec and Mississippi plants, controlling for treatment.
    • Therefore, Both factors have a significant effect on CO2 uptake.

4 Exercise 4

library(fivethirtyeight)
head(bechdel, 3)  # amount of money different movies earned based on whether or not they passed the Bechdel test or not
  year      imdb            title            test clean_test binary   budget
1 2013 tt1711425        21 & Over          notalk     notalk   FAIL 13000000
2 2012 tt1343727         Dredd 3D     ok-disagree         ok   PASS 45000000
3 2013 tt2024544 12 Years a Slave notalk-disagree     notalk   FAIL 20000000
  domgross  intgross     code budget_2013 domgross_2013 intgross_2013
1 25682380  42195766 2013FAIL    13000000      25682380      42195766
2 13414714  40868994 2012PASS    45658735      13611086      41467257
3 53107035 158607035 2013FAIL    20000000      53107035     158607035
  period_code decade_code
1           1           1
2           1           1
3           1           1

4.1 Exercise 4, Part 1

Please write out the null and alternate hypothesis. Can you reject the null hypothesis at alpha = 0.05 based on the p value of your ANOVA? Please explain your results in non-technical terms.

  • \(H_0\) (Null hypothesis): There is no difference in the amount of money earned by each movie domestically (domgross_2013) whether these movies passed the Bechdel test (binary).
  • \(H_A\) (Alternative hypothesis): There is a difference in the amount of money earned by each movie domestically whether these movies passed the Bechdel test.
gross.aov <- aov(domgross_2013~binary, data=bechdel)
summary(gross.aov)
              Df    Sum Sq   Mean Sq F value  Pr(>F)    
binary         1 3.487e+17 3.487e+17   22.24 2.6e-06 ***
Residuals   1774 2.782e+19 1.568e+16                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
18 observations deleted due to missingness

Result Interpretation: The P-value is \(2.6\times10^{-6}<<\alpha=0.05\). Therefore, we reject the null hypothesis. There is a difference in the amount of money earned by each movie domestically whether these movies passed the Bechdel test.

4.2 Exercise 4, Part 2

Now run an ANOVA to see if 2013 dollars (domgross_2013) differ based on whether a movie passed the Bechdel test or not (binary) and the decade in which the movie was made (decade_code). Hint - Right now the decade_code variable is an integer and therefore not a categorical variable. To turn it into a categorical variable, please use ‘as.factor(decade_code)’ in your ANOVA function.

gross.aov <- aov(domgross_2013~binary+as.factor(decade_code), data=bechdel)
summary(gross.aov)
                         Df    Sum Sq   Mean Sq F value   Pr(>F)    
binary                    1 8.681e+16 8.681e+16   8.995 0.002749 ** 
as.factor(decade_code)    2 1.673e+17 8.363e+16   8.665 0.000181 ***
Residuals              1596 1.540e+19 9.651e+15                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
194 observations deleted due to missingness
  • Hypotheses
    • Difference in factor binary
      • \(H_0:\) There is no difference in the amount of money earned by each movie domestically whether they passed the Bechdel test, controlling for the decade in which the movie was made (decade_code).
      • \(H_A:\) There is a difference in the amount of money earned by each movie domestically whether they passed the Bechdel test, controlling for the decade in which the movie was made.
    • Difference in factor decade_code
      • \(H_0:\) There is no difference in the amount of money earned by each movie domestically in which decade they are made, controlling for whether they passed the Bechdel test.
      • \(H_0:\) There is a difference in the amount of money earned by each movie domestically in which decade they are made, controlling for whether they passed the Bechdel test.
  • Result Interpretation
    • The P-value for the binary test is \(0.002749<\alpha=0.05\). Therefore, we reject the null hypothesis. There is a difference in the amount of money earned by each movie domestically whether they passed the Bechdel test, controlling for the decade in which the movie was made.
    • The P-value for the decade_code test is \(0.000181<<\alpha=0.05\). Therefore, we reject the null hypothesis. There is a difference in the amount of money earned by each movie domestically in which decade they are made, controlling for whether they passed the Bechdel test.
    • Therefore, Both factors have a significant effect on the amount of money earned by each movie domestically.

4.3 Exercise 4, Part 3

BONUS/EXTRA CREDIT - worth 5 credits Pick another dataset from within the fivethirtyeight package and run either a one way or two way ANOVA. Please write out the null and alternate hypothesis and state whether you can reject the null hypothesis based on your p value. Remember, in order for a dataset to work for this question, your indepdent variable has to be categorical (and have 2 or more categories) and your dependent variable has to be continuous. You can ask Kai or the GSIs for help selecting an appropriate dataset, but please do not ask other students.

I found this hate_crimes dataset from the fivethirtyeight package interesting. It’s a data frame with 51 rows representing US states and DC and 13 variables. The data I will be using contain median_house_inc, share_unemp_seas, and hate_crimes_per_100k_splc. The discriptions of these variables are as follows.

Attribute Meaning Class Type
median_house_inc Median household income, 2016 numeric Independent
gini_index Gini Index, 2015 numeric Independent
hate_crimes_per_100k_splc Average annual hate crimes per 100k population, 2016 numeric Dependent
head(hate_crimes, 3)  # The dataset I will be using
    state state_abbrev median_house_inc share_unemp_seas share_pop_metro
1 Alabama           AL            42278            0.060            0.64
2  Alaska           AK            67629            0.064            0.63
3 Arizona           AZ            49254            0.063            0.90
  share_pop_hs share_non_citizen share_white_poverty gini_index share_non_white
1        0.821              0.02                0.12      0.472            0.35
2        0.914              0.04                0.06      0.422            0.42
3        0.842              0.10                0.09      0.455            0.49
  share_vote_trump hate_crimes_per_100k_splc avg_hatecrimes_per_100k_fbi
1             0.63                 0.1258389                    1.806410
2             0.53                 0.1437401                    1.656700
3             0.50                 0.2253200                    3.413928

For doing an ANOVA, I need categorical independent variables. Therefore, I will normalize the income and unemployment to three levels: high, median, and low.

library(dplyr)

# function for normalizing a vector to 0-1
getNormalized <- function(vec){
  return((vec-min(vec))/(max(vec)-min(vec)))
}

levels <- c('Low', 'Medium', 'High')  # set levels


my_hatecrime <- hate_crimes %>%
  select(state_abbrev, median_house_inc, gini_index, hate_crimes_per_100k_splc) %>%  # select only the interested cols
  na.omit() %>%  # remove records with NAs
  mutate(income.n=getNormalized(median_house_inc)) %>%  # get normalized income
  mutate(gini_index.n=getNormalized(gini_index)) %>%  # get normalized Gini index
  mutate(income.level=cut(income.n, breaks=c(-Inf, 1/3, 2/3, Inf), labels=levels)) %>%  # get categorical income levels
  mutate(gini_index.level=cut(gini_index.n, breaks=c(-Inf, 1/3, 2/3, Inf), labels=levels)) %>%  # get categorical Gini index
  select(state_abbrev, income.level, gini_index.level, hate_crimes_per_100k_splc)  # select only the interested cols
head(my_hatecrime, 5)
# A tibble: 5 × 4
  state_abbrev income.level gini_index.level hate_crimes_per_100k_splc
  <chr>        <fct>        <fct>                                <dbl>
1 AL           Low          Medium                              0.126 
2 AK           High         Low                                 0.144 
3 AZ           Medium       Low                                 0.225 
4 AR           Low          Medium                              0.0691
5 CA           Medium       Medium                              0.256 
# two-way ANOVA
hatecrime.aov <- aov(hate_crimes_per_100k_splc~income.level+gini_index.level, data=my_hatecrime)
summary(hatecrime.aov)
                 Df Sum Sq Mean Sq F value  Pr(>F)   
income.level      2 0.3275  0.1638   3.656 0.03436 * 
gini_index.level  2 0.7292  0.3646   8.141 0.00103 **
Residuals        42 1.8810  0.0448                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Hypotheses
    • Difference in factor income.level
      • \(H_0:\) There is no difference in hate crime frequency between different income levels (income.level), controlling for Gini index level (gini_index.level).
      • \(H_A:\) There is a difference in hate crime frequency between different income levels, controlling for Gini index level.
    • Difference in factor gini_index.level
      • \(H_0:\) There is no difference in hate crime frequency between different Gini index levels, controlling for income level.
      • \(H_0:\) There is a difference in hate crime frequency between different Gini index levels, controlling for income level.
  • Result Interpretation
    • The P-value for the income.level test is \(0.03436<\alpha=0.05\). Therefore, we reject the null hypothesis. There is a difference in hate crime frequency between different income levels, controlling for Gini index level.
    • The P-value for the gini_index.level test is \(0.00103<\alpha=0.05\). Therefore, we reject the null hypothesis. There is a difference in hate crime frequency between different Gini index levels, controlling for income level.
    • Therefore, Both factors have a significant effect on the hate crime frequency.