# get linear modellinmod <-lm(juveniles ~ sand + temperature + rainfall + humanpop + country, data=dataset)# check linear relationshippar(mfrow =c(2,2))plot(juveniles ~ sand, data = dataset)plot(juveniles ~ temperature, data = dataset)plot(juveniles ~ rainfall, data = dataset)plot(juveniles ~ humanpop, data = dataset)
# check residual homoscedasticity and normalitypar(mfrow =c(2,2))plot(linmod)
# get resultssummary(linmod)
Call:
lm(formula = juveniles ~ sand + temperature + rainfall + humanpop +
country, data = dataset)
Residuals:
Min 1Q Median 3Q Max
-1991.38 -393.25 21.99 389.21 1766.18
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4603.65570 374.53180 12.292 < 2e-16 ***
sand -1.51796 0.13221 -11.481 < 2e-16 ***
temperature -76.81031 13.51391 -5.684 1.73e-08 ***
rainfall 0.04423 0.05576 0.793 0.427860
humanpop -0.39505 0.14076 -2.807 0.005104 **
countryIndia -33.74344 53.20575 -0.634 0.526092
countryIndonesia -46.85508 53.30556 -0.879 0.379619
countryMalaysia -190.48339 53.30196 -3.574 0.000369 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 594.7 on 992 degrees of freedom
Multiple R-squared: 0.2125, Adjusted R-squared: 0.2069
F-statistic: 38.24 on 7 and 992 DF, p-value: < 2.2e-16
1.1 Question 1.1
Answer:
Interpretation of the significant coefficients (at \(\alpha=0.05\))
Intercept (\(P=2\times10^{-16}<\alpha\)): With sand, temperature, rainfall, and humanpop being zero, it is predicted that there will be 4603.66 juvenile cranes on average in China.
sand (\(P=2\times10^{-16}<\alpha\)): Every one ton of sand removed is associated with a reduction of 1.5 juvenile cranes (copied from lab instruction).
temperature (\(P=1.73\times10^{-8}<\alpha\)): Every degree Celsius of temperature raised is associated with a reduction of 76.8 juvenile cranes.
humanpop (\(P=0.005<\alpha\)): Every 1,000 increase in population within 50 miles is associated with a reduction of 0.4 juvenile cranes.
countryMalaysia (\(P=0.0004<\alpha\)): On average, Malaysia has 190.5 fewer juvenile cranes than China when controlling for sand, temperature, rainfall, and humanpop.
1.2 Question 1.2
Answer:
Interpretation of the non-significant coefficients (at \(\alpha=0.05\))
rainfall (\(P=0.43>\alpha\)): There is no significant linear relationship between rainfall and juveniles.
countryIndia (\(P=0.53>\alpha\)): There is no significant difference in juveniles between India and China, controlling for sand, temperature, rainfall, and humanpop.
countryIndonesia (\(P=0.38>\alpha\)): There is no significant difference in juveniles between India and China, controlling for sand, temperature, rainfall, and humanpop.
2 Exercise 2
2.1 Question 2
table(dataset$country)
China India Indonesia Malaysia
250 250 250 250
# use India as baseline interceptdataset.releveled <- dataset %>%mutate(country=relevel(country, ref='India')) # relevel India to the first# rerun the linear modellinmod2 <-lm(juveniles ~ sand + temperature + rainfall + humanpop + country, data = dataset.releveled)summary(linmod2)
Call:
lm(formula = juveniles ~ sand + temperature + rainfall + humanpop +
country, data = dataset.releveled)
Residuals:
Min 1Q Median 3Q Max
-1991.38 -393.25 21.99 389.21 1766.18
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4569.91226 374.62321 12.199 < 2e-16 ***
sand -1.51796 0.13221 -11.481 < 2e-16 ***
temperature -76.81031 13.51391 -5.684 1.73e-08 ***
rainfall 0.04423 0.05576 0.793 0.42786
humanpop -0.39505 0.14076 -2.807 0.00510 **
countryChina 33.74344 53.20575 0.634 0.52609
countryIndonesia -13.11165 53.26062 -0.246 0.80559
countryMalaysia -156.73996 53.27943 -2.942 0.00334 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 594.7 on 992 degrees of freedom
Multiple R-squared: 0.2125, Adjusted R-squared: 0.2069
F-statistic: 38.24 on 7 and 992 DF, p-value: < 2.2e-16
Answer:
Coefficients of the continuous variables (sand, temperature, rainfall, and humanpop) stay the same while those of the categorical ones and the intercept are changed. This is due to the change of the baseline variable (China to India). Now the intercept is the cranes in India, instead of China, when the continuous variables are zero, and the coefficients of the categorical variables are now the difference of them from the juveniles in India.
3 Exercise 3
# get modelanomod <-aov(juveniles ~ sand + temperature + rainfall + humanpop + country, data = dataset)# check the assumption of equal varianceleveneTest(juveniles ~ country, data = dataset) # from the result, met
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 3 0.2372 0.8704
996
We can get additional information on pairwise comparisons instead of only the countries comparing to the baseline country when doing lm(). The countries that are significantly different from one another are as follows.
Comparing the p values between anomod and anomod3. Which variables are significant (at \(\alpha=0.05\)) are the same (only rianfall not being significant). However, the p values are not the same between these two models.
5 Exercise 5
linmod5 <-lm(juveniles ~ sand + temperature + rainfall + humanpop + country + humanpop*country, data = dataset)summary(linmod5)
# visualizeplot(1:1000, rep(-100, 1000), ylim =c(4000, 5500), xlim =c(50, 1000), xlab ='Human Population', ylab ='Juveniles')abline(a =5054.11977, b =-1.24563, col ='blue') # Chinaabline(a =5054.11977-818.65030, b =-1.24563+1.31526, col ='red') # Indiaabline(a =5054.11977-961.39246 , b =-1.24563+1.53082, col ='orange') # Indonesiaabline(a =5054.11977-676.65413, b =-1.24563+0.81239, col ='green') # Malaysialegend("topright", legend=c("China", "India", "Indonesia", "Malaysia"),col=c("blue", "red", "orange", "green"), lty=1)
5.1 Question 5
Answer:
The numbers to calculate the slopes for each country are from the humanpop and humanpop:countryIndia, humanpop:countryIndonesia, and humanpop:countryMalaysia. The latter three are the difference in slopes from the baseline slope (slope of humanpop).
The correlation coefficient between Petal.Length and Sepal.Length (0.87) is high, indicating a strong positive relationship, which may cause multicollinearity. Therefore I would exclude Petal.Length.
6.3 Check Linear Regression Assumptions
mod <-lm(Petal.Width~Sepal.Length+Sepal.Width+Species, data=iris)summary(mod)
Call:
lm(formula = Petal.Width ~ Sepal.Length + Sepal.Width + Species,
data = iris)
Residuals:
Min 1Q Median 3Q Max
-0.50805 -0.10042 -0.01221 0.11416 0.46455
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.86897 0.16985 -5.116 9.73e-07 ***
Sepal.Length 0.06360 0.03395 1.873 0.063 .
Sepal.Width 0.23237 0.05145 4.516 1.29e-05 ***
Speciesversicolor 1.17375 0.06758 17.367 < 2e-16 ***
Speciesvirginica 1.78487 0.07779 22.944 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1797 on 145 degrees of freedom
Multiple R-squared: 0.9459, Adjusted R-squared: 0.9444
F-statistic: 634.3 on 4 and 145 DF, p-value: < 2.2e-16
# linear relationshippar(mfrow =c(1,2))plot(Petal.Width ~ Sepal.Length, data=iris)plot(Petal.Width ~ Sepal.Width, data=iris)
# Homoscedasticitybptest(mod)
studentized Breusch-Pagan test
data: mod
BP = 20.627, df = 4, p-value = 0.0003755
# Error Normalityshapiro.test(residuals(mod))
Shapiro-Wilk normality test
data: residuals(mod)
W = 0.98906, p-value = 0.2924
Linear Regression Assumptions:
Linear relationship between variables: from the plot - met.
Statistical independence of the errors: assume so.
Homoscedasticity: \(P=0.0003<\alpha=0.05\) –> reject null hypothesis of homoscedasticity - not met.
Error Normality: \(P=0.29>\alpha=0.05\) –> fail to reject null hypothesis of error normality - met.
6.4 Results
summary(mod)
Call:
lm(formula = Petal.Width ~ Sepal.Length + Sepal.Width + Species,
data = iris)
Residuals:
Min 1Q Median 3Q Max
-0.50805 -0.10042 -0.01221 0.11416 0.46455
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.86897 0.16985 -5.116 9.73e-07 ***
Sepal.Length 0.06360 0.03395 1.873 0.063 .
Sepal.Width 0.23237 0.05145 4.516 1.29e-05 ***
Speciesversicolor 1.17375 0.06758 17.367 < 2e-16 ***
Speciesvirginica 1.78487 0.07779 22.944 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1797 on 145 degrees of freedom
Multiple R-squared: 0.9459, Adjusted R-squared: 0.9444
F-statistic: 634.3 on 4 and 145 DF, p-value: < 2.2e-16
Result Interpretation:
Intercept (\(P=9.73\times10^{-7}<\alpha=0.05\)): with Sepal.Length and Sepal.Width being zero, it is predicted that Petal.Width of species Setosa is -0.86897 (not meaningful in real life context though).
Sepal.Length (\(P=0.06>\alpha=0.05\)): no significant relationship between Sepal.Length and Petal.Width.
Sepal.Width (\(P=1.29\times10^{-5}<\alpha=0.05\)): every one unit increase in Sepal.Width is associated with 0.23 unit increase in Petal.Width.
Speciesversicolor (\(P=2\times10^{-16}<\alpha=0.05\)): on average, species Versicolor has 1.17 unit more in Petal.Width than species Setosa, controlling for Sepal.Length and Sepal.Width.
Speciesvirginica (\(P=2\times10^{-16}<\alpha=0.05\)): on average, species Virginica has 1.78 unit more in Petal.Width than species Setosa, controlling for Sepal.Length and Sepal.Width.