Lab 6: Linear Regression

Author

Xiuyu Cao

Published

March 23, 2024

1 Exercise 1

pairs(airquality[,c('Ozone','Solar.R', 'Wind', 'Temp')])  # scatter plot

cor(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")], 
    use = "complete.obs")    # use the rows where there is both a value for x and y
             Ozone    Solar.R       Wind       Temp
Ozone    1.0000000  0.3483417 -0.6124966  0.6985414
Solar.R  0.3483417  1.0000000 -0.1271835  0.2940876
Wind    -0.6124966 -0.1271835  1.0000000 -0.4971897
Temp     0.6985414  0.2940876 -0.4971897  1.0000000

1.1 Question 1.1

Answer: The correlation between Ozone and Wind is -0.6124966. In my opinion, this is a moderate to strong negative correlation. It means when ozone level increases, wind speed decreases, and vice versa.

1.2 Question 1.2

Answer: The correlation between Ozone and Solar.R is 0.3483417. In my opinion, this is a weak to moderate positive correlation. It means when the ozone level increases, Solar R increases, and vice versa.

2 Exercise 2

2.1 Question 2.1

plot(eruptions~waiting, data=faithful)  # Plot waiting against eruptions

2.2 Question 2.2

# Calculate the correlation coefficient
cor(faithful$eruptions, faithful$waiting)
[1] 0.9008112

3 Exercise 3

3.1 Question 3.1

Notes: For full points, include model, interpretation and plot.

For a complete interpretation, include slope significance (alpha and p-value), slope and R2 interpretation, and intercept value with alpha and p-value.

# Build linear model
mod <- lm(Temp~Ozone, data=airquality)
summary(mod)

Call:
lm(formula = Temp ~ Ozone, data = airquality)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.147  -4.858   1.828   4.342  12.328 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 69.41072    1.02971   67.41   <2e-16 ***
Ozone        0.20081    0.01928   10.42   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.819 on 114 degrees of freedom
  (37 observations deleted due to missingness)
Multiple R-squared:  0.4877,    Adjusted R-squared:  0.4832 
F-statistic: 108.5 on 1 and 114 DF,  p-value: < 2.2e-16
# Plot scatterplot with estimated regression line
plot(Temp~Ozone, data=airquality)
abline(mod, col='red')

Answer:

  • The slope is 0.20081. This means for every 1 unit increase in ozone, the temperature increases by 0.20081. The P value of the slope is \(2 \times 10^{-16}<<\alpha=0.05\). So we reject \(H_0:Slope=0\), there is a significant relationship between ozone and temperature.
  • \(R^2=0.4877\). 48.77% of the variation in temperature is explained by ozone.
  • The intercept is 69.41072. This means that when ozone is zero, temperature is predicted to be 69.41072. The intercept has a p value of \(2 \times 10^{-16}<<\alpha=0.05\). Meaning the temperature is significantly different from zero when ozone is zero.

3.2 Question 3.2

Note: For full points, include model, interpretation and plot.

# Build linear model
mod <- lm(eruptions~waiting, data=faithful)
summary(mod)

Call:
lm(formula = eruptions ~ waiting, data = faithful)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.29917 -0.37689  0.03508  0.34909  1.19329 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.874016   0.160143  -11.70   <2e-16 ***
waiting      0.075628   0.002219   34.09   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4965 on 270 degrees of freedom
Multiple R-squared:  0.8115,    Adjusted R-squared:  0.8108 
F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16
#Plot scatterplot with estimated regression line
plot(eruptions~waiting, data=faithful)
abline(mod, col='red')

Answer:

  • The slope is 0.075628. This means for every 1 unit increase in waiting time, the eruption time increases by 0.075628. The P value of the slope is \(2 \times 10^{-16}<<\alpha=0.05\). So we reject \(H_0:Slope=0\), there is a significant relationship between waiting and eruptions.
  • \(R^2=0.8115\). 81.15% of the variation in eruptions is explained by waiting time.
  • The intercept is -1.874016. This means that when waiting time is zero, eruption time is predicted to be -1.87 (not meaningful in real world context). The intercept has a p value of \(2 \times 10^{-16}<<\alpha=0.05\). Meaning the eruption time is significantly different from zero when waiting time is zero.

4 Exercise 4

4.1 Question 4.1

Note: To receive full credit, answer the three parts.

mod <- lm(Temp~Wind, data=airquality)
r2 <- summary(mod)$r.squared

# get SSY
y <- airquality$Temp
meany <- mean(y)
diff <- y-meany
SSY <- sum(diff^2)

# get SSYp
x <- data.frame(Wind=airquality$Wind)
predy <- predict.lm(mod, x)
diff <- predy-meany
SSYp <- sum(diff^2)

# get SSE
diff <- predy-y
SSE <- sum(diff^2)
SSE
[1] 10761.49
# Verify SSY - SSYp = SSE
SSY-SSYp==SSE
[1] TRUE

What is SSE measuring in non-technical terms?

SSE is the sum of the squared deviations between the actual value and the predicted value. It’s measuring the amount of vairation not explained by the model.

5 Exercise 5

5.1 Question 5.1

Note: Your final answer should include three parts: the model, tests for assumptions of homoscedasticity (perform all 3) and normality (perform both shapiro.test and qqPlot) and interpret of the results of homoscedasticity and normality tests.

library(lmtest)
library(car)
library(dplyr)

# remove NAs
dat.airquality <- airquality %>%
  select(Temp, Ozone)  %>%
  na.omit()

# Create linear model
mod <- lm(Temp~Ozone, data=dat.airquality)
summary(mod)

Call:
lm(formula = Temp ~ Ozone, data = dat.airquality)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.147  -4.858   1.828   4.342  12.328 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 69.41072    1.02971   67.41   <2e-16 ***
Ozone        0.20081    0.01928   10.42   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.819 on 114 degrees of freedom
Multiple R-squared:  0.4877,    Adjusted R-squared:  0.4832 
F-statistic: 108.5 on 1 and 114 DF,  p-value: < 2.2e-16

5.1.1 Check whether the model meets assumptions

5.1.1.1 Residual Homoscedasticity

# visual
plot(residuals(mod)~fitted(mod), data=dat.airquality,
     main='Residuals v.s. Fitted', xlab='Fitted', ylab='Residuals')
abline(h=0, col='red')

# statistical test
bptest(mod)

    studentized Breusch-Pagan test

data:  mod
BP = 3.2346, df = 1, p-value = 0.0721

Result Interpretation: The p value of the Breush-Pegan test is \(0.0721>\alpha=0.05\). We thus fail to reject \(H_0\): The variance is constant. Therefore, the requirement of homoscedasticity is met.

5.1.1.2 Normal Distributed Residuals

# Q-Q plot
qqPlot(residuals(mod))

117  15 
 82  13 
# Shapiro-Wilk test
shapiro.test(residuals(mod))

    Shapiro-Wilk normality test

data:  residuals(mod)
W = 0.94867, p-value = 0.000226

Result Interpretation: From the Q-Q plot, some points laid out of the intervals. From the Shapiro-Wilk test, the p value is \(0.000226<\alpha=0.05\). We thus reject \(H_0\): the data being tested are drawn from a normally distributed population. Therefore, the requirement of residual normality is NOT met.