library(nlme) # for running mixed models
Lab 9: Fixed vs Random Effects
0.1 Lab Setup
Set your working directory and load libraries.
1 Exercise 1
1.1 Question 1.1
Answer: In a previous research project I have worked on, I was thinking about modeling microclimate (fine-resolution temperature) using macroclimate (coarser-resolution temperature) and land cover type (micro_temp ~ macro_temp + land_cover
). However, there may be potential omitted variables introducing bias to my model. For example, the distance to the center of urban area may influence both the independent variable macro_temp
and dependent variable micro_temp
, since the Urban Heat Island effect would cause the averaged large scale temperature macro_temp
to raise, as well as the small scale micro_temp
.
2 Exercise 2
<- read.csv('../data/lab9/irrdata.csv')
irrdata head(irrdata, 3)
irrigation yield state
1 1 2884.490 Bihar
2 1 3466.900 Bihar
3 1 2973.309 Bihar
irrigation
: the number of times a farmer irrigated his/her crop throughout the growing seasonyield
: the yield of the wheat crop (in kg/ha)state
: the State the farmer lives in
2.1 Question 2.1
= lm(yield ~ irrigation, data = irrdata)
lmod3 summary(lmod3)
Call:
lm(formula = yield ~ irrigation, data = irrdata)
Residuals:
Min 1Q Median 3Q Max
-1051.94 -296.14 39.53 247.25 989.78
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2285.64 93.10 24.55 <2e-16 ***
irrigation 498.36 24.38 20.44 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 404.1 on 98 degrees of freedom
Multiple R-squared: 0.81, Adjusted R-squared: 0.8081
F-statistic: 417.8 on 1 and 98 DF, p-value: < 2.2e-16
# state farmer live in --> the wealth of the farmer
table(irrdata$state)
Bihar Punjab
48 52
= lm(yield ~ irrigation + state, data = irrdata)
lmod4 summary(lmod4)
Call:
lm(formula = yield ~ irrigation + state, data = irrdata)
Residuals:
Min 1Q Median 3Q Max
-1047.24 -309.10 27.47 245.32 983.74
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2297.89 98.74 23.273 <2e-16 ***
irrigation 487.62 37.04 13.164 <2e-16 ***
statePunjab 47.48 122.87 0.386 0.7
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 405.8 on 97 degrees of freedom
Multiple R-squared: 0.8103, Adjusted R-squared: 0.8064
F-statistic: 207.2 on 2 and 97 DF, p-value: < 2.2e-16
Answer:
- Differences:
- The beta coefficients are not the the same
- Second model’s F-statistic decreases
- Similarities:
- What is significant: In both models, the intercept and slope for
irrigation
are significant - \(R^2\) being similar
- Residual standard error being similar
- What is significant: In both models, the intercept and slope for
2.2 Question 2.2
Answer: The beta coefficient on irrigation
decreased a bit when including state
in the model because state
explains some degrees of the yield, due to the difference in wealth of farmers from different states. And since the farmers in Punjab are much richer than those from Bihar, statePunjab
has a positive intercept difference, though not significant.
3 Exercise 3
<- read.csv('../data/lab9/lakedf.csv')
lakedf head(lakedf, 3)
lake diversity prod temp rain sun
1 lake2 4 855.1392 16.64514 129.3730 3.837627
2 lake6 13 989.0435 17.00123 156.7222 3.832488
3 lake10 19 1034.0101 16.62721 153.1164 3.870237
lake
: the number ID of the lake where the sample was takendiversity
: the number of different algae speciesprod
: the productivity of the algae (in kg/m)temp
: mean temperature for 2017 (in Celsius)rainfall
: total rainfall in 2017 (in mm)sun
: the amount of sunlight measured as insolation (in W/m2)
3.1 Question 3.1
= lm(prod ~ diversity + temp + rain + sun, data = lakedf)
lmod5 summary(lmod5)
Call:
lm(formula = prod ~ diversity + temp + rain + sun, data = lakedf)
Residuals:
Min 1Q Median 3Q Max
-114.272 -30.806 1.499 24.684 111.091
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -395.0918 286.4862 -1.379 0.171104
diversity 7.0336 0.9518 7.390 5.68e-11 ***
temp 35.1810 14.4823 2.429 0.017011 *
rain 1.6387 0.4178 3.923 0.000166 ***
sun 118.5825 43.1220 2.750 0.007137 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 44.34 on 95 degrees of freedom
Multiple R-squared: 0.619, Adjusted R-squared: 0.6029
F-statistic: 38.58 on 4 and 95 DF, p-value: < 2.2e-16
# add random effect
= lme(prod ~ diversity + temp + rain + sun, random = ~1|lake, data = lakedf)
mod summary(mod)
Linear mixed-effects model fit by REML
Data: lakedf
AIC BIC logLik
1028.193 1046.07 -507.0965
Random effects:
Formula: ~1 | lake
(Intercept) Residual
StdDev: 4.588576 44.13223
Fixed effects: prod ~ diversity + temp + rain + sun
Value Std.Error DF t-value p-value
(Intercept) -389.9595 285.62895 85 -1.365266 0.1758
diversity 7.1197 0.95716 85 7.438388 0.0000
temp 34.8797 14.42965 85 2.417227 0.0178
rain 1.6436 0.41827 85 3.929440 0.0002
sun 118.1160 43.33916 85 2.725387 0.0078
Correlation:
(Intr) dvrsty temp rain
diversity 0.356
temp -0.796 -0.140
rain -0.191 -0.143 0.049
sun -0.473 -0.397 -0.112 -0.108
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-2.56672231 -0.67850679 0.04748643 0.55456902 2.44784918
Number of Observations: 100
Number of Groups: 11
Answer: The beta coefficient of the continuous variables are similar, as well as the p values of the continuous variables. While for the intercept, although the p values of the two models are similar, the beta coefficient of the intercept is different, due to the second model’s intercept being added a random effect.
3.2 Question 3.2
Answer: In a previous course project, I was trying to model income of a place using variables including the distance to parks, forests, and water, density of roads, elevation, and population. I was trying to get a generalized model that can get income across the US, instead of only the few states whose data I have access to. Also, in this case I believe I have collected enough factors and I don’t think there is a problem of OVB. Therefore, in this case a random effect on the states would be more appropriate than a fixed effect.