Lab 9: Fixed vs Random Effects

Author

Xiuyu Cao

Published

April 6, 2024

0.1 Lab Setup

Set your working directory and load libraries.

library(nlme)  # for running mixed models

1 Exercise 1

1.1 Question 1.1

Answer: In a previous research project I have worked on, I was thinking about modeling microclimate (fine-resolution temperature) using macroclimate (coarser-resolution temperature) and land cover type (micro_temp ~ macro_temp + land_cover). However, there may be potential omitted variables introducing bias to my model. For example, the distance to the center of urban area may influence both the independent variable macro_temp and dependent variable micro_temp, since the Urban Heat Island effect would cause the averaged large scale temperature macro_temp to raise, as well as the small scale micro_temp.

2 Exercise 2

irrdata <- read.csv('../data/lab9/irrdata.csv')
head(irrdata, 3)
  irrigation    yield state
1          1 2884.490 Bihar
2          1 3466.900 Bihar
3          1 2973.309 Bihar
  • irrigation: the number of times a farmer irrigated his/her crop throughout the growing season
  • yield: the yield of the wheat crop (in kg/ha)
  • state: the State the farmer lives in

2.1 Question 2.1

lmod3 = lm(yield ~ irrigation, data = irrdata)
summary(lmod3)

Call:
lm(formula = yield ~ irrigation, data = irrdata)

Residuals:
     Min       1Q   Median       3Q      Max 
-1051.94  -296.14    39.53   247.25   989.78 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2285.64      93.10   24.55   <2e-16 ***
irrigation    498.36      24.38   20.44   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 404.1 on 98 degrees of freedom
Multiple R-squared:   0.81, Adjusted R-squared:  0.8081 
F-statistic: 417.8 on 1 and 98 DF,  p-value: < 2.2e-16
# state farmer live in --> the wealth of the farmer
table(irrdata$state)

 Bihar Punjab 
    48     52 
lmod4 = lm(yield ~ irrigation + state, data = irrdata)
summary(lmod4)

Call:
lm(formula = yield ~ irrigation + state, data = irrdata)

Residuals:
     Min       1Q   Median       3Q      Max 
-1047.24  -309.10    27.47   245.32   983.74 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2297.89      98.74  23.273   <2e-16 ***
irrigation    487.62      37.04  13.164   <2e-16 ***
statePunjab    47.48     122.87   0.386      0.7    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 405.8 on 97 degrees of freedom
Multiple R-squared:  0.8103,    Adjusted R-squared:  0.8064 
F-statistic: 207.2 on 2 and 97 DF,  p-value: < 2.2e-16

Answer:

  • Differences:
    • The beta coefficients are not the the same
    • Second model’s F-statistic decreases
  • Similarities:
    • What is significant: In both models, the intercept and slope for irrigation are significant
    • \(R^2\) being similar
    • Residual standard error being similar

2.2 Question 2.2

Answer: The beta coefficient on irrigation decreased a bit when including state in the model because state explains some degrees of the yield, due to the difference in wealth of farmers from different states. And since the farmers in Punjab are much richer than those from Bihar, statePunjab has a positive intercept difference, though not significant.

3 Exercise 3

lakedf <- read.csv('../data/lab9/lakedf.csv')
head(lakedf, 3)
    lake diversity      prod     temp     rain      sun
1  lake2         4  855.1392 16.64514 129.3730 3.837627
2  lake6        13  989.0435 17.00123 156.7222 3.832488
3 lake10        19 1034.0101 16.62721 153.1164 3.870237
  • lake: the number ID of the lake where the sample was taken
  • diversity: the number of different algae species
  • prod: the productivity of the algae (in kg/m)
  • temp: mean temperature for 2017 (in Celsius)
  • rainfall: total rainfall in 2017 (in mm)
  • sun: the amount of sunlight measured as insolation (in W/m2)

3.1 Question 3.1

lmod5 = lm(prod ~ diversity + temp + rain + sun, data = lakedf)
summary(lmod5)

Call:
lm(formula = prod ~ diversity + temp + rain + sun, data = lakedf)

Residuals:
     Min       1Q   Median       3Q      Max 
-114.272  -30.806    1.499   24.684  111.091 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -395.0918   286.4862  -1.379 0.171104    
diversity      7.0336     0.9518   7.390 5.68e-11 ***
temp          35.1810    14.4823   2.429 0.017011 *  
rain           1.6387     0.4178   3.923 0.000166 ***
sun          118.5825    43.1220   2.750 0.007137 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 44.34 on 95 degrees of freedom
Multiple R-squared:  0.619, Adjusted R-squared:  0.6029 
F-statistic: 38.58 on 4 and 95 DF,  p-value: < 2.2e-16
# add random effect
mod = lme(prod ~ diversity + temp + rain + sun, random = ~1|lake, data = lakedf)
summary(mod)
Linear mixed-effects model fit by REML
  Data: lakedf 
       AIC     BIC    logLik
  1028.193 1046.07 -507.0965

Random effects:
 Formula: ~1 | lake
        (Intercept) Residual
StdDev:    4.588576 44.13223

Fixed effects:  prod ~ diversity + temp + rain + sun 
                Value Std.Error DF   t-value p-value
(Intercept) -389.9595 285.62895 85 -1.365266  0.1758
diversity      7.1197   0.95716 85  7.438388  0.0000
temp          34.8797  14.42965 85  2.417227  0.0178
rain           1.6436   0.41827 85  3.929440  0.0002
sun          118.1160  43.33916 85  2.725387  0.0078
 Correlation: 
          (Intr) dvrsty temp   rain  
diversity  0.356                     
temp      -0.796 -0.140              
rain      -0.191 -0.143  0.049       
sun       -0.473 -0.397 -0.112 -0.108

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-2.56672231 -0.67850679  0.04748643  0.55456902  2.44784918 

Number of Observations: 100
Number of Groups: 11 

Answer: The beta coefficient of the continuous variables are similar, as well as the p values of the continuous variables. While for the intercept, although the p values of the two models are similar, the beta coefficient of the intercept is different, due to the second model’s intercept being added a random effect.

3.2 Question 3.2

Answer: In a previous course project, I was trying to model income of a place using variables including the distance to parks, forests, and water, density of roads, elevation, and population. I was trying to get a generalized model that can get income across the US, instead of only the few states whose data I have access to. Also, in this case I believe I have collected enough factors and I don’t think there is a problem of OVB. Therefore, in this case a random effect on the states would be more appropriate than a fixed effect.