Lab 7: Multiple Linear Regression

Author

Xiuyu Cao

Published

March 23, 2024

0.1 Lab Setup

library(lmtest)
library(car)
library(relaimpo)

# load data
yield <- read.csv('../data/lab7/yielddata.csv')
head(yield, 3)
          State   District    Yield Canal_Dist Irrigated_Per  Literate
1 Uttar Pradesh      Unnao 2965.788  8.6252015     1.0000000 0.1739130
2 Uttar Pradesh   Fatehpur 3248.265  2.5355166     1.0000000 0.4814490
3         Bihar Aurangabad 2538.736  0.7909722     0.4166492 0.4355878
  Cultivator    AgLabor     Rain Temperature Elevation  SowDate
1 0.25585284 0.01170569 223.8928    12.81596       116 4.738676
2 0.08881098 0.05316973 235.6471    12.54157       118 4.321697
3 0.08856683 0.08776167 309.5277    14.09054       112 4.117498

1 Exercise 1

1.1 Question 1.1

pairs(yield[,3:12])  # scatter plot of the continuous variables

cor(yield[,3:12], use='complete.obs')  # correlation table of the continuous variables
                    Yield   Canal_Dist Irrigated_Per   Literate  Cultivator
Yield          1.00000000 -0.100792366    0.37291308  0.3457707  0.02234091
Canal_Dist    -0.10079237  1.000000000   -0.07972347 -0.2496504  0.06540877
Irrigated_Per  0.37291308 -0.079723466    1.00000000  0.2509635  0.05162662
Literate       0.34577067 -0.249650351    0.25096345  1.0000000 -0.13693696
Cultivator     0.02234091  0.065408774    0.05162662 -0.1369370  1.00000000
AgLabor       -0.15816302 -0.001745039   -0.21556017 -0.1769701 -0.26558944
Rain          -0.46098180  0.132098845   -0.23158320 -0.3669876 -0.05561866
Temperature   -0.69818335  0.079179775   -0.42512905 -0.3725307 -0.14538940
Elevation      0.57367350 -0.044009816    0.36640038  0.3117005  0.11161674
SowDate       -0.44594054  0.055231777   -0.08520842 -0.1669194 -0.05514825
                   AgLabor        Rain Temperature   Elevation     SowDate
Yield         -0.158163024 -0.46098180 -0.69818335  0.57367350 -0.44594054
Canal_Dist    -0.001745039  0.13209885  0.07917977 -0.04400982  0.05523178
Irrigated_Per -0.215560168 -0.23158320 -0.42512905  0.36640038 -0.08520842
Literate      -0.176970123 -0.36698760 -0.37253074  0.31170047 -0.16691936
Cultivator    -0.265589442 -0.05561866 -0.14538940  0.11161674 -0.05514825
AgLabor        1.000000000  0.09390849  0.09030555  0.01043637  0.11992541
Rain           0.093908491  1.00000000  0.62937563 -0.56896820  0.46630992
Temperature    0.090305550  0.62937563  1.00000000 -0.87606076  0.46552392
Elevation      0.010436371 -0.56896820 -0.87606076  1.00000000 -0.31366412
SowDate        0.119925409  0.46630992  0.46552392 -0.31366412  1.00000000
vif(lm(Yield~Canal_Dist+Irrigated_Per+Literate+Cultivator+AgLabor+Rain+Temperature+Elevation+SowDate, data=yield))  # get VIF
   Canal_Dist Irrigated_Per      Literate    Cultivator       AgLabor 
     1.079929      1.320238      1.392133      1.178451      1.231330 
         Rain   Temperature     Elevation       SowDate 
     1.874005      6.081935      4.742333      1.490840 

From the scatter plot, we can see some variable pairs showing correlation (e.g. Temperature & Elevation has strong negative correlation). From the correlation matrix, the correlation between Temperature and Rain (0.63), Elevation and Rain (-0.57), Temperature and Elevation (-0.88) are middle to high. From the vif() output, the VIF of Temperature is high (6.08>5), which means Temperature is highly correlated with other variables. Therefore, Temperature would lead to multi-collinearity in my model; also, the VIF of Elevation is 4.74, nearly 5, it may also lead to multi-collinearity in my model.

1.2 Question 1.2

According to the output of cor() function, Temperature and Elevation are the two most correlated variables (-0.88).

# two separate univariate models of the two variables
univar_mod1 <- lm(Yield~Temperature, data=yield)
univar_mod2 <- lm(Yield~Elevation, data=yield)

# multivariate model of two variables
multivar_mod <- lm(Yield~Temperature+Elevation, data=yield)

# compare the models
summary(univar_mod1)

Call:
lm(formula = Yield ~ Temperature, data = yield)

Residuals:
    Min      1Q  Median      3Q     Max 
-775.96 -191.99    5.34  202.63  733.14 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5531.24     168.46   32.83   <2e-16 ***
Temperature  -186.43      13.38  -13.94   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 290.9 on 198 degrees of freedom
Multiple R-squared:  0.4951,    Adjusted R-squared:  0.4926 
F-statistic: 194.2 on 1 and 198 DF,  p-value: < 2.2e-16
summary(univar_mod2)

Call:
lm(formula = Yield ~ Elevation, data = yield)

Residuals:
   Min     1Q Median     3Q    Max 
-804.6 -225.0   -6.9  195.9 1063.1 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2779.1943    50.2639  55.292   <2e-16 ***
Elevation      3.4208     0.3582   9.551   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 338.7 on 198 degrees of freedom
Multiple R-squared:  0.3154,    Adjusted R-squared:  0.3119 
F-statistic: 91.22 on 1 and 198 DF,  p-value: < 2.2e-16
summary(multivar_mod)

Call:
lm(formula = Yield ~ Temperature + Elevation, data = yield)

Residuals:
    Min      1Q  Median      3Q     Max 
-786.95 -189.79    4.24  204.59  712.88 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5864.7319   367.2957  15.967  < 2e-16 ***
Temperature -207.4310    24.5211  -8.459 6.02e-15 ***
Elevation     -0.5760     0.5638  -1.022    0.308    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 290.8 on 197 degrees of freedom
Multiple R-squared:  0.4978,    Adjusted R-squared:  0.4927 
F-statistic: 97.64 on 2 and 197 DF,  p-value: < 2.2e-16

Although the beta coefficients all have very small P values (\(<<\alpha=0.05\)), which means they are all significantly different from 0, the ones of the univariate model are different from the ones of multivariate model because the two variables are highly correlated and lead to multicollinearity. As a result, the model may allocate some of the effect of one variable to the other, making the estimation of coefficients unstable. This is why the beta coefficients differ from uni- and multivariate models.

2 Exercise 2

2.1 Question 2.1

fullmod2 <- lm(Yield ~ Canal_Dist + Irrigated_Per + Literate + Cultivator + AgLabor + Temperature + SowDate, data = yield)
smallmod <- lm(Yield ~ Canal_Dist + Irrigated_Per + Temperature + SowDate, data = yield)

# compare the AIC
AIC(fullmod2)
[1] 2275.772
AIC(smallmod)
[1] 2477.5

According to the AIC of the 2 models, the full model has the smaller AIC. Therefore, I would select the full model.

2.2 Question 2.2

## test the assumption of normality
qqPlot(residuals(fullmod2))  # Q-Q plot

154  10 
132  10 
shapiro.test(residuals(fullmod2))  # Shapiro-Wilk test

    Shapiro-Wilk normality test

data:  residuals(fullmod2)
W = 0.99425, p-value = 0.7778

From the Q-Q plot, we can say the residuals are normally distributed. Also, based on the Shapiro-Wilk test, the p value is \(0.78>\alpha=0.05\). We thus fail to reject \(H_0:\) The residuals are normally distributed. Therefore, the requirement of residual normality is met.

## test the assumption of homoscedasticity
# visual check
plot(residuals(fullmod2)~fitted(fullmod2), main='Residuals v.s. Fitted')
abline(h=0, col='red')

# statistical tests
bptest(fullmod2)

    studentized Breusch-Pagan test

data:  fullmod2
BP = 19.046, df = 7, p-value = 0.008045
ncvTest(fullmod2)
Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 8.052352, Df = 1, p = 0.0045445

From both the BP test and NCV test, the p values\(<\alpha=0.05\). We thus reject \(H_0:\) The variance is constant. Therefore, the requirement of homoscedasticity is NOT met.

2.3 Question 2.3

summary(fullmod2)

Call:
lm(formula = Yield ~ Canal_Dist + Irrigated_Per + Literate + 
    Cultivator + AgLabor + Temperature + SowDate, data = yield)

Residuals:
    Min      1Q  Median      3Q     Max 
-767.29 -163.64   -7.71  187.66  808.98 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5194.7420   287.2336  18.085  < 2e-16 ***
Canal_Dist      -0.7176     1.9378  -0.370   0.7117    
Irrigated_Per  132.1409    94.3785   1.400   0.1635    
Literate       131.4272   159.7811   0.823   0.4120    
Cultivator    -382.0980   261.5528  -1.461   0.1461    
AgLabor       -390.2264   284.9760  -1.369   0.1729    
Temperature   -142.9815    18.4862  -7.734 1.27e-12 ***
SowDate        -54.7208    22.0343  -2.483   0.0141 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 263.7 on 154 degrees of freedom
  (38 observations deleted due to missingness)
Multiple R-squared:  0.5335,    Adjusted R-squared:  0.5123 
F-statistic: 25.16 on 7 and 154 DF,  p-value: < 2.2e-16
summary(univar_mod1)

Call:
lm(formula = Yield ~ Temperature, data = yield)

Residuals:
    Min      1Q  Median      3Q     Max 
-775.96 -191.99    5.34  202.63  733.14 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5531.24     168.46   32.83   <2e-16 ***
Temperature  -186.43      13.38  -13.94   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 290.9 on 198 degrees of freedom
Multiple R-squared:  0.4951,    Adjusted R-squared:  0.4926 
F-statistic: 194.2 on 1 and 198 DF,  p-value: < 2.2e-16

From the summary of the full model, the intercept (\(P=2\times10^{-16}<<\alpha=0.05\)), the slope of Temperature (\(P=1.27\times10^{-12}<<\alpha=0.05\)), and the slope of SowDate (\(P=0.014<\alpha=0.05\)) are significant. The absolute value of beta coefficient value of Temperature of the multivariate model (\(|-142.98|\)) is smaller than that of the univariate model (\(|-186.43|\)). This is because with other dependent variables the effect of Temperature on Yield in the multivariate model will be smaller compared to that of the univariate model where only Temperature is considered.

3 Exercise 3

3.1 Question 3.1

calc.relimp(fullmod2, type=c('lmg'), rela=TRUE)
Response variable: Yield 
Total response variance: 142570.8 
Analysis based on 162 observations 

7 Regressors: 
Canal_Dist Irrigated_Per Literate Cultivator AgLabor Temperature SowDate 
Proportion of variance explained by model: 53.35%
Metrics are normalized to sum to 100% (rela=TRUE). 

Relative importance metrics: 

                      lmg
Canal_Dist    0.005832744
Irrigated_Per 0.110412608
Literate      0.084733336
Cultivator    0.007104318
AgLabor       0.018953436
Temperature   0.591199084
SowDate       0.181764474

Average coefficients for different model sizes: 

                       1X         2Xs         3Xs         4Xs          5Xs
Canal_Dist      -3.427562   -2.340213   -1.597039   -1.117275   -0.8397152
Irrigated_Per  556.986053  455.777540  370.378228  297.956931  235.6872750
Literate       862.918239  695.948341  550.924834  424.252770  313.1735417
Cultivator      97.798573   -0.815368  -83.147846 -159.357463 -234.3022470
AgLabor       -747.886598 -556.319902 -429.005304 -358.191022 -334.7664982
Temperature   -174.759453 -168.546054 -162.676162 -157.192254 -152.1052297
SowDate       -154.990658 -132.762298 -113.645195  -96.906542  -81.8602301
                       6Xs          7Xs
Canal_Dist      -0.7189361   -0.7175693
Irrigated_Per  181.0831753  132.1408988
Literate       215.8556081  131.4272404
Cultivator    -309.0948762 -382.0979838
AgLabor       -348.8859001 -390.2263964
Temperature   -147.3890286 -142.9814881
SowDate        -67.9402374  -54.7208073

Based on the relative importance metrics, Temperature explains the most amount of variance (0.59), SowDate explains the second most, Canal_Dist (0.006) explains the least.