Lab 8: ANCOVA and Interactions

Author

Xiuyu Cao

Published

April 3, 2024

0.1 Lab Setup

# Load libraries
library(dplyr)  # for data manipulation
library(car)  # for equal variance test
library(lmtest)  # for lm assumption test

# Read in data
dataset <- read.csv("../data/lab8/sanddata.csv", sep = ",", header = T,comment.char = "#", stringsAsFactors = T)
head(dataset, 3)

  juveniles     sand temperature  rainfall humanpop country
1  1450.489 348.7163    27.84830  907.7776 666.8436   China
2  1306.366 766.6340    27.02602 1280.4317 780.5680   China
3  1900.798 431.7617    24.70595 1430.2775 441.4950   China

juveniles: number of juvenile cranes counted annually
sand: tons of sand removed annually
temperature: mean annual temperature in degrees Celsius
rainfall: total annual rainfall in mm
humanpop: total human population (in 1000s) within 50 miles of each site
country: country where each site is found (China, India, Indonesia, or Malaysia)

1 Exercise 1

# check multicollinearity
cor(dataset[, -6], dataset[,-6])  # get corr, exclude col 6

              juveniles        sand  temperature    rainfall     humanpop
juveniles    1.00000000 -0.40060237 -0.287562662  0.13030061 -0.093874549
sand        -0.40060237  1.00000000  0.278387847 -0.10758403  0.040841125
temperature -0.28756266  0.27838785  1.000000000 -0.40000000  0.008017561
rainfall     0.13030061 -0.10758403 -0.400000000  1.00000000  0.022808524
humanpop    -0.09387455  0.04084112  0.008017561  0.02280852  1.000000000

# get linear model
linmod <- lm(juveniles ~ sand + temperature + rainfall + humanpop + country, data=dataset)

# check linear relationship
par(mfrow = c(2,2))
plot(juveniles ~ sand, data = dataset)
plot(juveniles ~ temperature, data = dataset)
plot(juveniles ~ rainfall, data = dataset)
plot(juveniles ~ humanpop, data = dataset)

# check residual homoscedasticity and normality
par(mfrow = c(2,2))
plot(linmod)

# get results
summary(linmod)


Call:
lm(formula = juveniles ~ sand + temperature + rainfall + humanpop + 
    country, data = dataset)

Residuals:
     Min       1Q   Median       3Q      Max 
-1991.38  -393.25    21.99   389.21  1766.18 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)      4603.65570  374.53180  12.292  < 2e-16 ***
sand               -1.51796    0.13221 -11.481  < 2e-16 ***
temperature       -76.81031   13.51391  -5.684 1.73e-08 ***
rainfall            0.04423    0.05576   0.793 0.427860    
humanpop           -0.39505    0.14076  -2.807 0.005104 ** 
countryIndia      -33.74344   53.20575  -0.634 0.526092    
countryIndonesia  -46.85508   53.30556  -0.879 0.379619    
countryMalaysia  -190.48339   53.30196  -3.574 0.000369 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 594.7 on 992 degrees of freedom
Multiple R-squared:  0.2125,    Adjusted R-squared:  0.2069 
F-statistic: 38.24 on 7 and 992 DF,  p-value: < 2.2e-16

1.1 Question 1.1

Answer:

Interpretation of the significant coefficients (at \(\alpha=0.05\))

Intercept (\(P=2\times10^{-16}<\alpha\)): With sand, temperature, rainfall, and humanpop being zero, it is predicted that there will be 4603.66 juvenile cranes on average in China.
sand (\(P=2\times10^{-16}<\alpha\)): Every one ton of sand removed is associated with a reduction of 1.5 juvenile cranes (copied from lab instruction).
temperature (\(P=1.73\times10^{-8}<\alpha\)): Every degree Celsius of temperature raised is associated with a reduction of 76.8 juvenile cranes.
humanpop (\(P=0.005<\alpha\)): Every 1,000 increase in population within 50 miles is associated with a reduction of 0.4 juvenile cranes.
countryMalaysia (\(P=0.0004<\alpha\)): On average, Malaysia has 190.5 fewer juvenile cranes than China when controlling for sand, temperature, rainfall, and humanpop.

1.2 Question 1.2

Answer:

Interpretation of the non-significant coefficients (at \(\alpha=0.05\))

rainfall (\(P=0.43>\alpha\)): There is no significant linear relationship between rainfall and juveniles.
countryIndia (\(P=0.53>\alpha\)): There is no significant difference in juveniles between India and China, controlling for sand, temperature, rainfall, and humanpop.
countryIndonesia (\(P=0.38>\alpha\)): There is no significant difference in juveniles between India and China, controlling for sand, temperature, rainfall, and humanpop.

2 Exercise 2

2.1 Question 2

table(dataset$country)


    China     India Indonesia  Malaysia 
      250       250       250       250

# use India as baseline intercept
dataset.releveled <- dataset %>%
  mutate(country=relevel(country, ref='India'))  # relevel India to the first

# rerun the linear model
linmod2 <- lm(juveniles ~ sand + temperature + rainfall + humanpop + country, data = dataset.releveled)
summary(linmod2)


Call:
lm(formula = juveniles ~ sand + temperature + rainfall + humanpop + 
    country, data = dataset.releveled)

Residuals:
     Min       1Q   Median       3Q      Max 
-1991.38  -393.25    21.99   389.21  1766.18 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)      4569.91226  374.62321  12.199  < 2e-16 ***
sand               -1.51796    0.13221 -11.481  < 2e-16 ***
temperature       -76.81031   13.51391  -5.684 1.73e-08 ***
rainfall            0.04423    0.05576   0.793  0.42786    
humanpop           -0.39505    0.14076  -2.807  0.00510 ** 
countryChina       33.74344   53.20575   0.634  0.52609    
countryIndonesia  -13.11165   53.26062  -0.246  0.80559    
countryMalaysia  -156.73996   53.27943  -2.942  0.00334 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 594.7 on 992 degrees of freedom
Multiple R-squared:  0.2125,    Adjusted R-squared:  0.2069 
F-statistic: 38.24 on 7 and 992 DF,  p-value: < 2.2e-16

Answer:

Coefficients of the continuous variables (sand, temperature, rainfall, and humanpop) stay the same while those of the categorical ones and the intercept are changed. This is due to the change of the baseline variable (China to India). Now the intercept is the cranes in India, instead of China, when the continuous variables are zero, and the coefficients of the categorical variables are now the difference of them from the juveniles in India.

3 Exercise 3

# get model
anomod <- aov(juveniles ~ sand + temperature + rainfall + humanpop + country, data = dataset)

# check the assumption of equal variance
leveneTest(juveniles ~ country, data = dataset)  # from the result, met

Levene's Test for Homogeneity of Variance (center = median)
       Df F value Pr(>F)
group   3  0.2372 0.8704
      996

# get model results
summary(anomod)

             Df    Sum Sq  Mean Sq F value   Pr(>F)    
sand          1  71493399 71493399 202.153  < 2e-16 ***
temperature   1  14965606 14965606  42.316 1.23e-10 ***
rainfall      1    145897   145897   0.413  0.52083    
humanpop      1   2763535  2763535   7.814  0.00528 ** 
country       3   5291448  1763816   4.987  0.00194 ** 
Residuals   992 350831089   353660                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# do Tukey HSD to see which groups are different
tukey.mat <- TukeyHSD(anomod)$country

Warning in replications(paste("~", xx), data = mf): non-factors ignored: sand

Warning in replications(paste("~", xx), data = mf): non-factors ignored:
temperature

Warning in replications(paste("~", xx), data = mf): non-factors ignored:
rainfall

Warning in replications(paste("~", xx), data = mf): non-factors ignored:
humanpop

Warning in TukeyHSD.aov(anomod): 'which' specified some non-factors which will
be dropped

tukey.mat

                         diff       lwr        upr       p adj
India-China         -33.65691 -170.5382 103.224349 0.921475036
Indonesia-China     -46.13723 -183.0185  90.744030 0.821702029
Malaysia-China     -189.67861 -326.5599 -52.797353 0.002147257
Indonesia-India     -12.48032 -149.3616 124.400940 0.995451972
Malaysia-India     -156.02170 -292.9030 -19.140443 0.018012541
Malaysia-Indonesia -143.54138 -280.4226  -6.660123 0.035614947

3.1 Question 3

Answer:

Interpretation of the Tukey HSD test result:

We can get additional information on pairwise comparisons instead of only the countries comparing to the baseline country when doing lm(). The countries that are significantly different from one another are as follows.

i.significant <- tukey.mat[,4]<0.05
tukey.mat[i.significant,]

                        diff       lwr        upr       p adj
Malaysia-China     -189.6786 -326.5599 -52.797353 0.002147257
Malaysia-India     -156.0217 -292.9030 -19.140443 0.018012541
Malaysia-Indonesia -143.5414 -280.4226  -6.660123 0.035614947

4 Exercise 4

4.1 Question 4

# change the order of the variables
anomod3 <- aov(juveniles ~ temperature + rainfall + humanpop + country + sand, data = dataset)
summary(anomod3)

             Df    Sum Sq  Mean Sq F value   Pr(>F)    
temperature   1  36838666 36838666 104.164  < 2e-16 ***
rainfall      1    123752   123752   0.350 0.554294    
humanpop      1   3777373  3777373  10.681 0.001120 ** 
country       3   7298916  2432972   6.879 0.000138 ***
sand          1  46621178 46621178 131.825  < 2e-16 ***
Residuals   992 350831089   353660                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Answer:

Comparing the p values between anomod and anomod3. Which variables are significant (at \(\alpha=0.05\)) are the same (only rianfall not being significant). However, the p values are not the same between these two models.

5 Exercise 5

linmod5 <- lm(juveniles ~ sand + temperature + rainfall + humanpop + country + humanpop*country, data = dataset)
summary(linmod5)


Call:
lm(formula = juveniles ~ sand + temperature + rainfall + humanpop + 
    country + humanpop * country, data = dataset)

Residuals:
     Min       1Q   Median       3Q      Max 
-2069.41  -377.15    22.72   372.91  1815.50 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)               5054.11977  393.38223  12.848  < 2e-16 ***
sand                        -1.46417    0.13253 -11.048  < 2e-16 ***
temperature                -75.31818   13.45138  -5.599 2.79e-08 ***
rainfall                     0.03631    0.05553   0.654 0.513325    
humanpop                    -1.24563    0.28476  -4.374 1.35e-05 ***
countryIndia              -818.65030  257.22667  -3.183 0.001505 ** 
countryIndonesia          -961.39246  263.20231  -3.653 0.000273 ***
countryMalaysia           -676.65413  226.87548  -2.982 0.002929 ** 
humanpop:countryIndia        1.31526    0.42169   3.119 0.001867 ** 
humanpop:countryIndonesia    1.53082    0.43181   3.545 0.000411 ***
humanpop:countryMalaysia     0.81239    0.36897   2.202 0.027909 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 591 on 989 degrees of freedom
Multiple R-squared:  0.2247,    Adjusted R-squared:  0.2169 
F-statistic: 28.66 on 10 and 989 DF,  p-value: < 2.2e-16

# visualize
plot(1:1000, rep(-100, 1000), ylim = c(4000, 5500), xlim = c(50, 1000), xlab = 'Human Population', ylab = 'Juveniles')
abline(a = 5054.11977, b = -1.24563, col = 'blue') # China
abline(a = 5054.11977 - 818.65030, b = -1.24563 + 1.31526, col = 'red') # India
abline(a = 5054.11977 - 961.39246 , b = -1.24563 + 1.53082, col = 'orange') # Indonesia
abline(a = 5054.11977 - 676.65413, b = -1.24563 + 0.81239, col = 'green') # Malaysia
legend("topright", legend=c("China", "India", "Indonesia", "Malaysia"),
       col=c("blue", "red", "orange", "green"), lty=1)

5.1 Question 5

Answer:

The numbers to calculate the slopes for each country are from the humanpop and humanpop:countryIndia, humanpop:countryIndonesia, and humanpop:countryMalaysia. The latter three are the difference in slopes from the baseline slope (slope of humanpop).

6 Extra Credit (Code and outputs required)

6.1 Check Data

iris <- iris %>%
  mutate(Species=as.factor(Species))
head(iris, 3)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

I would like to study the relationship of Petal.Width between other variables.

6.2 Check Multicollinearity

# plot relationship
pairs(iris[,-5])  # scatter plot of the continuous variables

# check for multicollinearity
cor(iris[,-5], iris[,-5])

             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

The correlation coefficient between Petal.Length and Sepal.Length (0.87) is high, indicating a strong positive relationship, which may cause multicollinearity. Therefore I would exclude Petal.Length.

6.3 Check Linear Regression Assumptions

mod <- lm(Petal.Width~Sepal.Length+Sepal.Width+Species, data=iris)
summary(mod)


Call:
lm(formula = Petal.Width ~ Sepal.Length + Sepal.Width + Species, 
    data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.50805 -0.10042 -0.01221  0.11416  0.46455 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -0.86897    0.16985  -5.116 9.73e-07 ***
Sepal.Length       0.06360    0.03395   1.873    0.063 .  
Sepal.Width        0.23237    0.05145   4.516 1.29e-05 ***
Speciesversicolor  1.17375    0.06758  17.367  < 2e-16 ***
Speciesvirginica   1.78487    0.07779  22.944  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1797 on 145 degrees of freedom
Multiple R-squared:  0.9459,    Adjusted R-squared:  0.9444 
F-statistic: 634.3 on 4 and 145 DF,  p-value: < 2.2e-16

# linear relationship
par(mfrow = c(1,2))
plot(Petal.Width ~ Sepal.Length, data=iris)
plot(Petal.Width ~ Sepal.Width, data=iris)

# Homoscedasticity
bptest(mod)


    studentized Breusch-Pagan test

data:  mod
BP = 20.627, df = 4, p-value = 0.0003755

# Error Normality
shapiro.test(residuals(mod))


    Shapiro-Wilk normality test

data:  residuals(mod)
W = 0.98906, p-value = 0.2924

Linear Regression Assumptions:

Linear relationship between variables: from the plot - met.
Statistical independence of the errors: assume so.
Homoscedasticity: \(P=0.0003<\alpha=0.05\) –> reject null hypothesis of homoscedasticity - not met.
Error Normality: \(P=0.29>\alpha=0.05\) –> fail to reject null hypothesis of error normality - met.

6.4 Results

summary(mod)


Call:
lm(formula = Petal.Width ~ Sepal.Length + Sepal.Width + Species, 
    data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.50805 -0.10042 -0.01221  0.11416  0.46455 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -0.86897    0.16985  -5.116 9.73e-07 ***
Sepal.Length       0.06360    0.03395   1.873    0.063 .  
Sepal.Width        0.23237    0.05145   4.516 1.29e-05 ***
Speciesversicolor  1.17375    0.06758  17.367  < 2e-16 ***
Speciesvirginica   1.78487    0.07779  22.944  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1797 on 145 degrees of freedom
Multiple R-squared:  0.9459,    Adjusted R-squared:  0.9444 
F-statistic: 634.3 on 4 and 145 DF,  p-value: < 2.2e-16

Result Interpretation:

Intercept (\(P=9.73\times10^{-7}<\alpha=0.05\)): with Sepal.Length and Sepal.Width being zero, it is predicted that Petal.Width of species Setosa is -0.86897 (not meaningful in real life context though).
Sepal.Length (\(P=0.06>\alpha=0.05\)): no significant relationship between Sepal.Length and Petal.Width.
Sepal.Width (\(P=1.29\times10^{-5}<\alpha=0.05\)): every one unit increase in Sepal.Width is associated with 0.23 unit increase in Petal.Width.
Speciesversicolor (\(P=2\times10^{-16}<\alpha=0.05\)): on average, species Versicolor has 1.17 unit more in Petal.Width than species Setosa, controlling for Sepal.Length and Sepal.Width.
Speciesvirginica (\(P=2\times10^{-16}<\alpha=0.05\)): on average, species Virginica has 1.78 unit more in Petal.Width than species Setosa, controlling for Sepal.Length and Sepal.Width.