library(terra) # for working with raster data
library(tidyverse) # for manipulating the data
library(ape) # for getting autocorrelation coefficient
library(reshape2) # for reshaping data frames
library(ROCR) # for getting the ROC curve
library(sf) # for working with simple feature data
In this comprehensive study, I examined data from both Salt Lake City and the United States. My aim was to discern the factors that have contributed to urban development in recent decades, thereby gaining an insight into the intricacies of urban growth. The findings of this analysis not only enhance our understanding of urban development dynamics but also offer valuable insights for informed decision-making in the realms of land management and conservation.
In recent decades, Salt Lake City, Utah, has experienced significant urban development, marked by substantial population growth, immigration, and increased housing demand. This expansion has led to a noteworthy increase in impervious urban development. To facilitate effective land management and conservation efforts, it is crucial to study the factors contributing to urban development and its trends.
First read in the data I will be using to analyze Salt Lake City.
They are raster data of the same resolution and extent. Based on that,
they can be combined into a list using the function
terra::c()
, and further turned into a data frame.
# read data
NLCD_2001 <- rast("../data/lab5/NLCD_2001_SL.tif") # land use data 2001
NLCD_2004 <- rast("../data/lab5/NLCD_2004_SL.tif") # land use data 2004
NLCD_2006 <- rast("../data/lab5/NLCD_2006_SL.tif") # land use data 2006
NLCD_2008 <- rast("../data/lab5/NLCD_2008_SL.tif") # land use data 2008
NLCD_2011 <- rast("../data/lab5/NLCD_2011_SL.tif") # land use data 2011
NLCD_2013 <- rast("../data/lab5/NLCD_2013_SL.tif") # land use data 2013
NLCD_2016 <- rast("../data/lab5/NLCD_2016_SL.tif") # land use data 2016
Park_dist <- rast("../data/lab5/Parks_dist_SL.tif") # distance (km) to parks and the protected
Rd_dns1km <- rast("../data/lab5/Rd_dns1km_SL.tif") # road density for a 1 km neighborhood
WaterDist <- rast("../data/lab5/WaterDist_SL.tif") # distance (km) to water bodies
DEM <- rast("../data/lab5/DEM_SL.tif") # elevation
# stack the raster layers
allrasters <- c(NLCD_2001, NLCD_2004, NLCD_2006, NLCD_2008, NLCD_2011, NLCD_2013, NLCD_2016, Park_dist, Rd_dns1km, WaterDist, DEM)
allrasters[[1]] # check the data
## class : SpatRaster
## dimensions : 1881, 1856, 1 (nrow, ncol, nlyr)
## resolution : 30, 30 (x, y)
## extent : -1354815, -1299135, 2046165, 2102595 (xmin, xmax, ymin, ymax)
## coord. ref. : Albers_Conical_Equal_Area
## source : NLCD_2001_SL.tif
## name : NLCD_2001_SL
plot(allrasters[[1]]) # check the data
# turn raster layers into a data frame
allrasters.df <- allrasters %>%
as.data.frame(xy=T) %>% # transform to a data frame, keep the xy
filter(NLCD_2001_SL != 128) # remove no data value (stored as 128)
head(allrasters.df)
## x y NLCD_2001_SL NLCD_2004_SL NLCD_2006_SL NLCD_2008_SL
## 1 -1330860 2102550 11 52 52 95
## 2 -1330830 2102550 11 52 52 95
## 3 -1330800 2102550 11 11 11 11
## 4 -1330770 2102550 11 11 11 11
## 5 -1330740 2102550 11 11 11 11
## 6 -1330710 2102550 11 11 71 71
## NLCD_2011_SL NLCD_2013_SL NLCD_2016_SL Parks_dist_SL Rd_dns1km_SL
## 1 95 95 95 1358.308 0
## 2 95 95 95 1343.317 0
## 3 95 11 11 1328.834 0
## 4 95 11 11 1314.876 0
## 5 95 11 95 1301.461 0
## 6 71 71 71 1288.255 0
## WaterDist_SL DEM_SL
## 1 301.4963 1281
## 2 305.9412 1281
## 3 313.2092 1281
## 4 323.1099 1281
## 5 330.0000 1281
## 6 331.3608 1281
From 2001 to 2016, many areas in Salt Lake City have changed to urban areas.
allrasters.df <- allrasters.df %>% # get is_changed: whether changed to urban area from 2001 to 2016
mutate(is_changed = (NLCD_2001_SL != 21 & NLCD_2001_SL != 22 & NLCD_2001_SL != 23 & NLCD_2001_SL != 24) & (NLCD_2016_SL == 21 | NLCD_2016_SL == 22 | NLCD_2016_SL == 23 | NLCD_2016_SL == 24))
# plot
ggplot(allrasters.df, aes(y=y, x=x, color=is_changed)) +
geom_point(size=2, shape=15) +
scale_color_manual(values = c("FALSE" = "snow3", "TRUE" = "yellow")) +
labs(title='Areas Changed to Urban in Salt Lake City, 2001-2016',
x='X (m)',y='Y (m)', color='Whether Changed')
From the map we can see the significant changes in Salt Lake City from 2001 to 2016. Let’s analyze this quantitatively.
as.numeric()
converts the logical vector obtained in the
previous step to a numeric vector, where TRUE is converted to 1 and
FALSE is converted to 0. They can thus be added up using the
sum()
function and get the count of the cells.
# get the new urban areas after 2001
urban_new <- with(allrasters.df, (sum(as.numeric(NLCD_2016_SL == 21 | NLCD_2016_SL == 22 | NLCD_2016_SL == 23 | NLCD_2016_SL == 24))) - (sum(as.numeric(NLCD_2001_SL == 21| NLCD_2001_SL == 22| NLCD_2001_SL == 23| NLCD_2001_SL == 24))))
# get original urban areas in 2001
urban_ori <- with(allrasters.df,(sum(as.numeric(NLCD_2001_SL == 21| NLCD_2001_SL == 22| NLCD_2001_SL == 23| NLCD_2001_SL == 24))))
# get percentage change
urban_new/urban_ori* 100
## [1] 15.88895
Urban areas increased by About 15.89% in Salt Lake City from 2001 to 2016, this is huge!
Urbanization is necessary for more economic opportunities and development of infrastructure for the well being of citizens, while it can cause various consequences including the habitat loss of many species and Urban Heat Island effect. Therefore, it is an important land management problem to decided where to develop new urban area and predict the urban area development to plan future land management and conservation. Let’s quantitatively analyze where these changes took place during 2001-2016.
# get distance to different areas
data2plot <- allrasters.df %>%
filter(NLCD_2001_SL != 21 & NLCD_2001_SL != 22 & NLCD_2001_SL != 23 & NLCD_2001_SL != 24) %>% # get potential areas to develop in 2001
select(Parks_dist_SL:is_changed) %>% # select the last columns
melt() # turn into long format for plot
# set plot legend labels
legend_labels <- c('Distance to Green Areas (km)',
'Road Density (1km Neighborhood)',
'Distance to Water Bodies (km)',
'Elevation (m)')
# plot
ggplot(data2plot, aes(x=is_changed, y=value,fill=variable)) + # set plot
scale_fill_manual(values = c('green3','gray','steelblue1','yellow'), # set fill color
labels=legend_labels) + # set legend names
geom_boxplot() + # set box plot
facet_wrap(~variable, scales='free_y') + # different plots
labs(x="Whether Changed to Urban Area", y='Value') + # set x y labels
guides(fill = guide_legend(title = NULL)) + # remove legend title
theme(strip.text = element_blank()) + # remove subplot titles
ggtitle('Where Urbanization Took Place in Salt Lake City, 2001-2016') # set title
As shown in the plot, in Salt Lake City during 2001-2016, Urban Development followed these patterns:
These patterns make sense. For future land management and conservation, a prediction of probable developing area is needed.
To forecast future urban development in Salt Lake City, I will develop a predictive model using the areas that were not urban in 2001 and to see what contributed to the urban change during 2001-2016.
Due to the huge amount of the data, a random sample is needed to do the further analysis. Here my random seed is set to 77. Since there are many non-urban areas and few urban areas. I will sample all of the area changed to urban and some of the areas not changed during 2001 to 2016, making them 1:2
set.seed(77) # set a random seed
# set.seed(NULL)
# get all of the area that are not urban in 2001
non_urban01 <- filter(allrasters.df,
NLCD_2001_SL != 21 & NLCD_2001_SL != 22 & NLCD_2001_SL != 23 & NLCD_2001_SL != 24)
# separately get the area changed and not changed to urban
sl_chg <- filter(non_urban01, is_changed == T) # changed
sl_nchg <- filter(non_urban01, is_changed == F) # not changed
# get sample
sample_index <- sample(1:nrow(sl_nchg), nrow(sl_chg)* 2) # get sample index
sl_sample <- rbind(sl_chg, sl_nchg[sample_index,]) # combine the area changed and not changed
To assess my sampling result, first check the histogram of the original data and the sampled data.
par(mfrow=c(2, 5), # set 2*5 sub plots
mar=c(4, 4, 0.8, 0.65)) # set bottom, left, top, right margin
title=c('LandUse16','Park Dist', 'Road Dense', 'Water Dist', 'DEM') # set title
for (i in 9:13) { # plot original on the first row
hist(non_urban01[, i], main=title[i-8], xlab=NA, col='plum1', ylab=ifelse(i==9,'Frequency (Original)',NA))
}
for (i in 9:13){ # plot sampled on the second row
hist(sl_sample[, i], main=NA, xlab="Value", col='skyblue', ylab=ifelse(i==9,'Frequency (Sampled)',NA))
}
par(mfrow=c(1, 1)) # set graphic parameter to normal
As shown in the graph, the distribution of the original data closely mirrors that of the sampled data, indicating an unbiased sample that can represent the original data distribution.
Also, it is important to check the spatial dependency.
sd_sample <- sl_sample[sample(1:nrow(sl_sample), 100),] # randomly get 100 records for checking the spatial dependency
dist_mat <- as.matrix(dist(cbind(sd_sample$x, sd_sample$y))) # get distance matrix
dist_mat.i <- 1/dist_mat # get reciprocal distance matrix
diag(dist_mat.i) <- 0 # set diagonal to 0
raster_names <- names(sd_sample)[3:ncol(sd_sample)] # get raster names
raster_names <- data.frame(Raster_Name = raster_names) # create data frame with raster names being the first column
Moran_res <- data.frame() # create blank data frame
for(i in 3:ncol(sd_sample)){ # get each record
Moran_res <- rbind(Moran_res, Moran.I(sd_sample[,i], dist_mat.i))
}
cbind(raster_names, Moran_res) # combine the to data frames by column
## Raster_Name observed expected sd p.value
## 1 NLCD_2001_SL 0.10528385 -0.01010101 0.02191363 1.398557e-07
## 2 NLCD_2004_SL 0.08686153 -0.01010101 0.02187911 9.347556e-06
## 3 NLCD_2006_SL 0.06864874 -0.01010101 0.02189475 3.222294e-04
## 4 NLCD_2008_SL 0.04360302 -0.01010101 0.02188096 1.411306e-02
## 5 NLCD_2011_SL 0.07570411 -0.01010101 0.02187616 8.770368e-05
## 6 NLCD_2013_SL 0.07741309 -0.01010101 0.02187854 6.334317e-05
## 7 NLCD_2016_SL 0.10247490 -0.01010101 0.02183666 2.531398e-07
## 8 Parks_dist_SL 0.42364353 -0.01010101 0.02202395 0.000000e+00
## 9 Rd_dns1km_SL 0.31727483 -0.01010101 0.02169181 0.000000e+00
## 10 WaterDist_SL 0.18546248 -0.01010101 0.02186385 0.000000e+00
## 11 DEM_SL 0.36709667 -0.01010101 0.02194264 0.000000e+00
## 12 is_changed 0.23006093 -0.01010101 0.02201833 0.000000e+00
From the result, the expected value is -0.01. The observed values of each raster layer vary.
sd
and p
are all small, showing a good
precision and significance in the result.The Generalized Linear Model (GLM) is a flexible generalization of
ordinary linear regression that allows for response variables that have
error non-normal distributions. The GLM generalizes linear regression by
allowing the linear model to be related to the response variable via a
link function and by allowing the magnitude of the variance of each
measurement to be a function of its predicted value. Here I use the
glm
function to call the GLM, and set
family=binomial
, meaning that it is fitting a logistic
regression model since our is_changed
attribute is binary
with only the value True
or False
.
# get train (70%) and test (30%) set
sl_sample <- sl_sample %>% mutate(id = row_number()) # add row number
sl_train <- sl_sample %>% sample_frac(.7) # get train set
sl_test <- anti_join(sl_sample, sl_train, by='id') # get test set
# get model using the train set
fit <- glm(is_changed ~ Parks_dist_SL + Rd_dns1km_SL + WaterDist_SL + DEM_SL, data=sl_train, family=binomial())
summary(fit)
##
## Call:
## glm(formula = is_changed ~ Parks_dist_SL + Rd_dns1km_SL + WaterDist_SL +
## DEM_SL, family = binomial(), data = sl_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.310e+00 6.865e-02 62.78 <2e-16 ***
## Parks_dist_SL 1.055e-04 1.610e-06 65.55 <2e-16 ***
## Rd_dns1km_SL 2.115e+01 1.177e-01 179.69 <2e-16 ***
## WaterDist_SL 2.001e-04 8.522e-06 23.48 <2e-16 ***
## DEM_SL -5.218e-03 5.187e-05 -100.59 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 280958 on 220709 degrees of freedom
## Residual deviance: 115555 on 220705 degrees of freedom
## AIC: 115565
##
## Number of Fisher Scoring iterations: 7
As shown in the result, the Estimate
shows the weight of
influence on the independent variable. From the values, we can conclude
that the distance to the parks, road density, and distance to the water
bodies all have a positive relationship with urban development, while
the elevation has a negative one. The Std. Error
are all
small, indicating the small uncertainty of the estimate. The
z value
are all greater than 2 when positive or smaller
than -2 when negative, showing that the observed result is unlikely to
have occurred by chance alone. The Pr(>|z|)
all being
very small also indicates the significance in the model.
The null deviance shows how well the response variable is predicted by a model that includes only the intercept. The residual deviance shows how well the response is predicted by the model when the predictors are included. It can be seen that the deviance goes down by 280958 - 115555 = 1.65403^{5} when 4 predictor variables are added. This decrease in deviance is evidence of a significant increase in fit.
To further assess the goodness of fit, I will calculate the Revceiver Operating Characteristic (ROC) curve using the test set.
pred <- prediction(predict(fit, newdata=sl_test), sl_test$is_changed) # use the test set to calculate the ROC curve
perf <- performance(pred,"tpr","fpr")
plot(perf,colorize=TRUE, main = "ROC Curve")
From the ROC curve we can see that the ROC curve rises steeply and approaches the upper-left corner, indicating high sensitivity and specificity. To quantitatively assess the curve, the Area Under the Curve (AUC) needs to be calculated.
# get AUC value
auc_ROCR <- performance(pred, measure = "auc")
auc_ROCR@y.values[[1]]
## [1] 0.9566476
The AUC is about 0.96, showing that the model is performing exceptionally well in terms of separating true positives from false positives and true negatives from false negatives, likely to make accurate predictions as well.
Use the whole data set to generate the final model and predict the urban development.
fit <- glm(is_changed ~ Parks_dist_SL + Rd_dns1km_SL + WaterDist_SL + DEM_SL, data=sl_sample, family=binomial()) # get model
predicted <- predict(allrasters, fit) # get prediction
predicted <- predicted %>%
as.data.frame(xy=T) %>% # change to data frame
mutate(probability=plogis(lyr1)) # get probability
# plot
ggplot(predicted, aes(y=y, x=x, color=probability)) +
geom_point(size=2, shape=15) +
scale_color_gradient(low = "snow3", high = "yellow") +
labs(title='Likely Locations of Urban Development, Salt Lake City',
x='X (m)',y='Y (m)')
Salt Lake City has undergone huge urban development. From the results of the statistical analysis and the GLM, we can see the distance to the parks, road density, and distance to the water bodies all have a positive relationship with urban development, while the elevation has a negative one. And the road density contribute the most to the urban development. In order to better explain urban development, besides the land-use type, other possible data can be used are as follows:
From the case of Salt Lake City, we learnt that distance to the green, road density, distance to the water, and elevation all contribute to the urban development. I want to further analyze whether there is a similar pattern across the country using cluster analysis.
The dataset includes various amenity characteristics for the entire United States on a 10x10 km grid. It is developed by Derek Van Berkel, from different sources of data (e.g. NLCD, SEDAC, social media, culture monuments). First read the data. An overview of the data can be found here.
amenity<- st_read("../data/lab5/AmenDataAll.shp") # read data
## Reading layer `AmenDataAll' from data source
## `/Users/xycao/Desktop/EAS648/EAS648-assignments/data/lab5/AmenDataAll.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 79971 features and 138 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -2356114 ymin: 272043.6 xmax: 2263886 ymax: 3182044
## Projected CRS: NAD83 / Conus Albers
names(amenity) # all of the attributes of the data
## [1] "GRIDCODE_1" "ZooAmusSAL" "ZooAmusEMP" "ShopTOT" "ShopTOTsho"
## [6] "HotelTOT" "HotelSALEV" "HotelEMPL" "HosptTOT" "HosptEMP"
## [11] "HighE_TOT" "HighE_EMP" "GolfTOT" "GolfSALEV" "GolfEMP"
## [16] "All_Pano" "Natr_Pano" "All_Flickr" "Nat_Flickr" "Agri2011"
## [21] "Agri2001" "Frst2001" "Urb2001" "Agri2006" "Frst2006"
## [26] "Urb2006" "serhouse00" "serhouse10" "Frst2011" "Urb2011"
## [31] "serPop90" "serPop00" "serPop10" "secPop65pc" "SecSeas00"
## [36] "secHiSPl00" "ser00publi" "PanoNature" "WaterPct" "cstdstL"
## [41] "distcoast" "FlickrNatu" "serhous90" "DEM_min" "DEM_max"
## [46] "DEM_range" "DEM_mean" "DEM_Variet" "CropVariet" "HikeLength"
## [51] "ecoZone" "Landmark" "HistStruc" "HistObject" "Histdistr"
## [56] "Histbuild" "HistSite" "Wd_wetland" "Emg_wetld" "Beach"
## [61] "barren" "Grassland" "Shrubland" "housingChg" "PopChg"
## [66] "coast" "settlment" "Hist" "park" "density"
## [71] "forestGrou" "agriGroups" "wetlandGro" "GrasslandG" "barrenGrou"
## [76] "ShrublandG" "ZooEmp" "ZooCount" "ZooDist" "ZooDistEmp"
## [81] "ShopCount" "ShopGLATot" "DistGLA" "DistanceSh" "HotelCount"
## [86] "HotelEmp" "HotelDistE" "HotelDist" "HospitalCo" "HospitalEm"
## [91] "Hospital_D" "Hospital_1" "HistMonCou" "HistMonEmp" "HistMon_Di"
## [96] "HistMonume" "HigherEdCo" "HigherEdEm" "HigherEdDi" "HigherEd_1"
## [101] "GolfCount" "GolfEmpNum" "GolfDistEm" "GolfDist" "ZPanoNatur"
## [106] "ZFlickrNat" "ZDEM_range" "ZhousingCh" "ZPopChg" "Zpredicts"
## [111] "ZZooEmp" "ZZooCount" "ZZooDist" "ZZooDistEm" "ZShopCount"
## [116] "ZShopGLATo" "ZDistGLA" "ZDistanceS" "ZHotelCoun" "ZHotelEmp"
## [121] "ZHotelDist" "ZHotelDi_1" "ZHospitalC" "ZHospitalE" "ZHospital_"
## [126] "ZHospital1" "ZHistMonCo" "ZHistMonEm" "ZHistMon_D" "ZHistMonum"
## [131] "ZHigherEdC" "ZHigherEdE" "ZHigherEdD" "ZHigherEd_" "ZGolfCount"
## [136] "ZGolfEmpNu" "ZGolfDistE" "ZGolfDist" "geometry"
amenity_geom <- st_sfc(amenity$geometry) # store the geometry data
According to the available data in amenity
, I will use
the following data to deploy the cluster analysis:
Frst2001
Urb2001
serPop00
PopChg
–>
derived from serPop00
and serPop10
DEM_min
, DEM_max
,
DEM_mean
, and DEM_Variet
amenity$PopChg <- amenity$serPop10 - amenity$serPop00 # get the population change data
amenity.mat <- amenity %>% # get the matrix for cluster analysis
select(Frst2001, Urb2001, serPop00, PopChg, DEM_min, DEM_max, DEM_mean, DEM_Variet) %>% # select columns
na.omit() %>% # omit NA data
st_drop_geometry() %>% # get rid of the geometry data
scale() # z-score standardization: mean of 0 and std of 1
I will use K-means cluster analysis. K-means is typical clustering technique that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. K-means clustering minimizes within-cluster variances
Before deploying K-means cluster analysis, the number of clusters needs to be decided. Here I iterate over each number of clusters from 2 to 15 to determine the optimal cluster count based on the ‘elbow’ point, where the reduction in within-group sum of squares begins to show a slower rate of decline.
set.seed(77)
# Determine number of clusters
wss <- (nrow(amenity.mat)-1)*sum(apply(amenity.mat,2,var)) # total within-group sum of squares for one cluster.
for (i in 2:15) wss[i] <- sum(kmeans(amenity.mat, centers=i)$withinss) # try different cluster numbers
plot(1:15, wss, type="b", # plot and compare
xlab="Number of Clusters", ylab="Within groups sum of squares")
As shown in the graph, 4 clusters is a good choice. Let’s deploy the K-means cluster analysis.
fit <- kmeans(amenity.mat, 4) # 4 cluster k-means
kmeans_res <- aggregate(amenity.mat,by=list(fit$cluster),FUN=mean) # get cluster means
amenity <- data.frame(amenity, fit$cluster) # add cluster attribute to the original data
st_geometry(amenity) <- amenity_geom # add geometry to the original data
kmeans_res
## Group.1 Frst2001 Urb2001 serPop00 PopChg DEM_min DEM_max
## 1 1 1.3265458 0.1582036 0.01417498 -0.08241356 -0.6735107 -0.5242585
## 2 2 -0.4746195 0.5157830 0.32125132 0.16885370 -0.6653491 -0.7254456
## 3 3 -0.7566470 -0.5603260 -0.34184680 -0.13395008 0.8184188 0.5973046
## 4 4 0.9647117 -0.7177169 -0.33208595 -0.12883795 1.4530229 1.8561120
## DEM_mean DEM_Variet
## 1 -0.6123002 -0.070874801
## 2 -0.7037110 -0.577430349
## 3 0.7110041 -0.007634545
## 4 1.6946482 1.918763840
# plot result
ggplot() +
geom_sf(data = amenity, mapping = aes(fill = as.factor(fit.cluster)), color = NA) +
labs(fill = "Clusters") +
ggtitle("Clusters based on Kmeans")
# visualize the k-means result
kmeans_res_long <- pivot_longer(kmeans_res, cols=c(2:9), names_to = 'var', values_to = 'val') # turn into long format for plotting
# plot
kmeans_res_long %>%
ggplot(aes(x = var, y = val, group=Group.1)) + # set x, y, and group
geom_line(aes(color=as.factor(Group.1))) + # plot lines by groups
labs(title = "Mean Values of Different Cluster Variables", # set title
x = "Variable", y = "Standardized Value") + # set x, y, legend
scale_color_discrete(name = "Clusters") + # set legend title
theme_classic() # remove background color
As shown in the results, based on the population change during
2001-2011 PopChg
, Cluster 2 underwent significant urban
development during this period, as indicated by the green areas on the
map. In contrast, Clusters 3 and 4 (the purple and blue areas)
experienced relatively less development. The map reveals that areas of
higher development are concentrated near the West Coast and in the
Middle-Eastern part of America. Conversely, places with less development
are characterized by high-elevation areas in America. The line plot
illustrates that in clusters with higher development, the elevation is
low and shows low variance. Additionally, these clusters tend to have
lower forest cover. Moreover, the clusters with high development levels
typically exhibit larger urban areas and higher populations before the
development.
From the results of the K-means cluster analysis, we can conclude that the elevation, forest cover, and the population and urban area before development all contribute to the urban development in the United States.