From the scatter plot, we can see some variable pairs showing correlation (e.g. Temperature & Elevation has strong negative correlation). From the correlation matrix, the correlation between Temperature and Rain (0.63), Elevation and Rain (-0.57), Temperature and Elevation (-0.88) are middle to high. From the vif() output, the VIF of Temperature is high (6.08>5), which means Temperature is highly correlated with other variables. Therefore, Temperature would lead to multi-collinearity in my model; also, the VIF of Elevation is 4.74, nearly 5, it may also lead to multi-collinearity in my model.
1.2 Question 1.2
According to the output of cor() function, Temperature and Elevation are the two most correlated variables (-0.88).
# two separate univariate models of the two variablesunivar_mod1 <-lm(Yield~Temperature, data=yield)univar_mod2 <-lm(Yield~Elevation, data=yield)# multivariate model of two variablesmultivar_mod <-lm(Yield~Temperature+Elevation, data=yield)# compare the modelssummary(univar_mod1)
Call:
lm(formula = Yield ~ Temperature, data = yield)
Residuals:
Min 1Q Median 3Q Max
-775.96 -191.99 5.34 202.63 733.14
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5531.24 168.46 32.83 <2e-16 ***
Temperature -186.43 13.38 -13.94 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 290.9 on 198 degrees of freedom
Multiple R-squared: 0.4951, Adjusted R-squared: 0.4926
F-statistic: 194.2 on 1 and 198 DF, p-value: < 2.2e-16
summary(univar_mod2)
Call:
lm(formula = Yield ~ Elevation, data = yield)
Residuals:
Min 1Q Median 3Q Max
-804.6 -225.0 -6.9 195.9 1063.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2779.1943 50.2639 55.292 <2e-16 ***
Elevation 3.4208 0.3582 9.551 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 338.7 on 198 degrees of freedom
Multiple R-squared: 0.3154, Adjusted R-squared: 0.3119
F-statistic: 91.22 on 1 and 198 DF, p-value: < 2.2e-16
summary(multivar_mod)
Call:
lm(formula = Yield ~ Temperature + Elevation, data = yield)
Residuals:
Min 1Q Median 3Q Max
-786.95 -189.79 4.24 204.59 712.88
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5864.7319 367.2957 15.967 < 2e-16 ***
Temperature -207.4310 24.5211 -8.459 6.02e-15 ***
Elevation -0.5760 0.5638 -1.022 0.308
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 290.8 on 197 degrees of freedom
Multiple R-squared: 0.4978, Adjusted R-squared: 0.4927
F-statistic: 97.64 on 2 and 197 DF, p-value: < 2.2e-16
Although the beta coefficients all have very small P values (\(<<\alpha=0.05\)), which means they are all significantly different from 0, the ones of the univariate model are different from the ones of multivariate model because the two variables are highly correlated and lead to multicollinearity. As a result, the model may allocate some of the effect of one variable to the other, making the estimation of coefficients unstable. This is why the beta coefficients differ from uni- and multivariate models.
2 Exercise 2
2.1 Question 2.1
fullmod2 <-lm(Yield ~ Canal_Dist + Irrigated_Per + Literate + Cultivator + AgLabor + Temperature + SowDate, data = yield)smallmod <-lm(Yield ~ Canal_Dist + Irrigated_Per + Temperature + SowDate, data = yield)# compare the AICAIC(fullmod2)
[1] 2275.772
AIC(smallmod)
[1] 2477.5
According to the AIC of the 2 models, the full model has the smaller AIC. Therefore, I would select the full model.
2.2 Question 2.2
## test the assumption of normalityqqPlot(residuals(fullmod2)) # Q-Q plot
154 10
132 10
shapiro.test(residuals(fullmod2)) # Shapiro-Wilk test
Shapiro-Wilk normality test
data: residuals(fullmod2)
W = 0.99425, p-value = 0.7778
From the Q-Q plot, we can say the residuals are normally distributed. Also, based on the Shapiro-Wilk test, the p value is \(0.78>\alpha=0.05\). We thus fail to reject \(H_0:\)The residuals are normally distributed. Therefore, the requirement of residual normality is met.
## test the assumption of homoscedasticity# visual checkplot(residuals(fullmod2)~fitted(fullmod2), main='Residuals v.s. Fitted')abline(h=0, col='red')
# statistical testsbptest(fullmod2)
studentized Breusch-Pagan test
data: fullmod2
BP = 19.046, df = 7, p-value = 0.008045
ncvTest(fullmod2)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 8.052352, Df = 1, p = 0.0045445
From both the BP test and NCV test, the p values\(<\alpha=0.05\). We thus reject \(H_0:\)The variance is constant. Therefore, the requirement of homoscedasticity is NOT met.
2.3 Question 2.3
summary(fullmod2)
Call:
lm(formula = Yield ~ Canal_Dist + Irrigated_Per + Literate +
Cultivator + AgLabor + Temperature + SowDate, data = yield)
Residuals:
Min 1Q Median 3Q Max
-767.29 -163.64 -7.71 187.66 808.98
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5194.7420 287.2336 18.085 < 2e-16 ***
Canal_Dist -0.7176 1.9378 -0.370 0.7117
Irrigated_Per 132.1409 94.3785 1.400 0.1635
Literate 131.4272 159.7811 0.823 0.4120
Cultivator -382.0980 261.5528 -1.461 0.1461
AgLabor -390.2264 284.9760 -1.369 0.1729
Temperature -142.9815 18.4862 -7.734 1.27e-12 ***
SowDate -54.7208 22.0343 -2.483 0.0141 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 263.7 on 154 degrees of freedom
(38 observations deleted due to missingness)
Multiple R-squared: 0.5335, Adjusted R-squared: 0.5123
F-statistic: 25.16 on 7 and 154 DF, p-value: < 2.2e-16
summary(univar_mod1)
Call:
lm(formula = Yield ~ Temperature, data = yield)
Residuals:
Min 1Q Median 3Q Max
-775.96 -191.99 5.34 202.63 733.14
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5531.24 168.46 32.83 <2e-16 ***
Temperature -186.43 13.38 -13.94 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 290.9 on 198 degrees of freedom
Multiple R-squared: 0.4951, Adjusted R-squared: 0.4926
F-statistic: 194.2 on 1 and 198 DF, p-value: < 2.2e-16
From the summary of the full model, the intercept (\(P=2\times10^{-16}<<\alpha=0.05\)), the slope of Temperature (\(P=1.27\times10^{-12}<<\alpha=0.05\)), and the slope of SowDate (\(P=0.014<\alpha=0.05\)) are significant. The absolute value of beta coefficient value of Temperature of the multivariate model (\(|-142.98|\)) is smaller than that of the univariate model (\(|-186.43|\)). This is because with other dependent variables the effect of Temperature on Yield in the multivariate model will be smaller compared to that of the univariate model where only Temperature is considered.
3 Exercise 3
3.1 Question 3.1
calc.relimp(fullmod2, type=c('lmg'), rela=TRUE)
Response variable: Yield
Total response variance: 142570.8
Analysis based on 162 observations
7 Regressors:
Canal_Dist Irrigated_Per Literate Cultivator AgLabor Temperature SowDate
Proportion of variance explained by model: 53.35%
Metrics are normalized to sum to 100% (rela=TRUE).
Relative importance metrics:
lmg
Canal_Dist 0.005832744
Irrigated_Per 0.110412608
Literate 0.084733336
Cultivator 0.007104318
AgLabor 0.018953436
Temperature 0.591199084
SowDate 0.181764474
Average coefficients for different model sizes:
1X 2Xs 3Xs 4Xs 5Xs
Canal_Dist -3.427562 -2.340213 -1.597039 -1.117275 -0.8397152
Irrigated_Per 556.986053 455.777540 370.378228 297.956931 235.6872750
Literate 862.918239 695.948341 550.924834 424.252770 313.1735417
Cultivator 97.798573 -0.815368 -83.147846 -159.357463 -234.3022470
AgLabor -747.886598 -556.319902 -429.005304 -358.191022 -334.7664982
Temperature -174.759453 -168.546054 -162.676162 -157.192254 -152.1052297
SowDate -154.990658 -132.762298 -113.645195 -96.906542 -81.8602301
6Xs 7Xs
Canal_Dist -0.7189361 -0.7175693
Irrigated_Per 181.0831753 132.1408988
Literate 215.8556081 131.4272404
Cultivator -309.0948762 -382.0979838
AgLabor -348.8859001 -390.2263964
Temperature -147.3890286 -142.9814881
SowDate -67.9402374 -54.7208073
Based on the relative importance metrics, Temperature explains the most amount of variance (0.59), SowDate explains the second most, Canal_Dist (0.006) explains the least.