Needed Packages

library(terra)  # for working with raster data
library(tidyverse)  # for manipulating the data
library(ape)  # for getting autocorrelation coefficient
library(reshape2)  # for reshaping data frames
library(ROCR)  # for getting the ROC curve
library(sf)  # for working with simple feature data

Introduction

In this comprehensive study, I examined data from both Salt Lake City and the United States. My aim was to discern the factors that have contributed to urban development in recent decades, thereby gaining an insight into the intricacies of urban growth. The findings of this analysis not only enhance our understanding of urban development dynamics but also offer valuable insights for informed decision-making in the realms of land management and conservation.


Salt Lake City

In recent decades, Salt Lake City, Utah, has experienced significant urban development, marked by substantial population growth, immigration, and increased housing demand. This expansion has led to a noteworthy increase in impervious urban development. To facilitate effective land management and conservation efforts, it is crucial to study the factors contributing to urban development and its trends.

Read Data

First read in the data I will be using to analyze Salt Lake City. They are raster data of the same resolution and extent. Based on that, they can be combined into a list using the function terra::c(), and further turned into a data frame.

# read data
NLCD_2001 <- rast("../data/lab5/NLCD_2001_SL.tif")  # land use data 2001
NLCD_2004 <- rast("../data/lab5/NLCD_2004_SL.tif")  # land use data 2004
NLCD_2006 <- rast("../data/lab5/NLCD_2006_SL.tif")  # land use data 2006
NLCD_2008 <- rast("../data/lab5/NLCD_2008_SL.tif")  # land use data 2008
NLCD_2011 <- rast("../data/lab5/NLCD_2011_SL.tif")  # land use data 2011
NLCD_2013 <- rast("../data/lab5/NLCD_2013_SL.tif")  # land use data 2013
NLCD_2016 <- rast("../data/lab5/NLCD_2016_SL.tif")  # land use data 2016
Park_dist <- rast("../data/lab5/Parks_dist_SL.tif")  # distance (km) to parks and the protected
Rd_dns1km <- rast("../data/lab5/Rd_dns1km_SL.tif")  # road density for a 1 km neighborhood
WaterDist <- rast("../data/lab5/WaterDist_SL.tif")  # distance (km) to water bodies
DEM <- rast("../data/lab5/DEM_SL.tif")  # elevation

# stack the raster layers
allrasters <- c(NLCD_2001, NLCD_2004, NLCD_2006, NLCD_2008, NLCD_2011, NLCD_2013, NLCD_2016, Park_dist, Rd_dns1km, WaterDist, DEM)
allrasters[[1]]  # check the data
## class       : SpatRaster 
## dimensions  : 1881, 1856, 1  (nrow, ncol, nlyr)
## resolution  : 30, 30  (x, y)
## extent      : -1354815, -1299135, 2046165, 2102595  (xmin, xmax, ymin, ymax)
## coord. ref. : Albers_Conical_Equal_Area 
## source      : NLCD_2001_SL.tif 
## name        : NLCD_2001_SL
plot(allrasters[[1]])  # check the data

# turn raster layers into a data frame
allrasters.df <- allrasters %>%
  as.data.frame(xy=T) %>%  # transform to a data frame, keep the xy
  filter(NLCD_2001_SL != 128)  # remove no data value (stored as 128)
head(allrasters.df)
##          x       y NLCD_2001_SL NLCD_2004_SL NLCD_2006_SL NLCD_2008_SL
## 1 -1330860 2102550           11           52           52           95
## 2 -1330830 2102550           11           52           52           95
## 3 -1330800 2102550           11           11           11           11
## 4 -1330770 2102550           11           11           11           11
## 5 -1330740 2102550           11           11           11           11
## 6 -1330710 2102550           11           11           71           71
##   NLCD_2011_SL NLCD_2013_SL NLCD_2016_SL Parks_dist_SL Rd_dns1km_SL
## 1           95           95           95      1358.308            0
## 2           95           95           95      1343.317            0
## 3           95           11           11      1328.834            0
## 4           95           11           11      1314.876            0
## 5           95           11           95      1301.461            0
## 6           71           71           71      1288.255            0
##   WaterDist_SL DEM_SL
## 1     301.4963   1281
## 2     305.9412   1281
## 3     313.2092   1281
## 4     323.1099   1281
## 5     330.0000   1281
## 6     331.3608   1281

Statistical Analysis

From 2001 to 2016, many areas in Salt Lake City have changed to urban areas.

allrasters.df <- allrasters.df %>%  # get is_changed: whether changed to urban area from 2001 to 2016
  mutate(is_changed = (NLCD_2001_SL != 21 & NLCD_2001_SL != 22 & NLCD_2001_SL != 23 & NLCD_2001_SL != 24) & (NLCD_2016_SL == 21 | NLCD_2016_SL == 22  | NLCD_2016_SL == 23 | NLCD_2016_SL == 24))
# plot
ggplot(allrasters.df, aes(y=y, x=x, color=is_changed)) +
  geom_point(size=2, shape=15) +
  scale_color_manual(values = c("FALSE" = "snow3", "TRUE" = "yellow")) +
  labs(title='Areas Changed to Urban in Salt Lake City, 2001-2016',
       x='X (m)',y='Y (m)', color='Whether Changed')

From the map we can see the significant changes in Salt Lake City from 2001 to 2016. Let’s analyze this quantitatively.

as.numeric() converts the logical vector obtained in the previous step to a numeric vector, where TRUE is converted to 1 and FALSE is converted to 0. They can thus be added up using the sum() function and get the count of the cells.

# get the new urban areas after 2001
urban_new <- with(allrasters.df, (sum(as.numeric(NLCD_2016_SL == 21 | NLCD_2016_SL == 22 | NLCD_2016_SL == 23 | NLCD_2016_SL == 24))) - (sum(as.numeric(NLCD_2001_SL == 21| NLCD_2001_SL == 22| NLCD_2001_SL == 23| NLCD_2001_SL == 24))))
# get original urban areas in 2001
urban_ori <- with(allrasters.df,(sum(as.numeric(NLCD_2001_SL == 21| NLCD_2001_SL == 22| NLCD_2001_SL == 23| NLCD_2001_SL == 24))))
# get percentage change
urban_new/urban_ori* 100
## [1] 15.88895

Urban areas increased by About 15.89% in Salt Lake City from 2001 to 2016, this is huge!

Urbanization is necessary for more economic opportunities and development of infrastructure for the well being of citizens, while it can cause various consequences including the habitat loss of many species and Urban Heat Island effect. Therefore, it is an important land management problem to decided where to develop new urban area and predict the urban area development to plan future land management and conservation. Let’s quantitatively analyze where these changes took place during 2001-2016.

# get distance to different areas
data2plot <- allrasters.df %>%
  filter(NLCD_2001_SL != 21 & NLCD_2001_SL != 22 & NLCD_2001_SL != 23 & NLCD_2001_SL != 24) %>%  # get potential areas to develop in 2001
  select(Parks_dist_SL:is_changed) %>%  # select the last columns
  melt()  # turn into long format for plot

# set plot legend labels
legend_labels <- c('Distance to Green Areas (km)',
                   'Road Density (1km Neighborhood)',
                   'Distance to Water Bodies (km)',
                   'Elevation (m)')
# plot
ggplot(data2plot, aes(x=is_changed, y=value,fill=variable)) +  # set plot
  scale_fill_manual(values = c('green3','gray','steelblue1','yellow'),  # set fill color
                    labels=legend_labels) +  # set legend names
  geom_boxplot() +  # set box plot
  facet_wrap(~variable, scales='free_y') +  # different plots
  labs(x="Whether Changed to Urban Area", y='Value') +  # set x y labels
  guides(fill = guide_legend(title = NULL)) +  # remove legend title
  theme(strip.text = element_blank()) +  # remove subplot titles
  ggtitle('Where Urbanization Took Place in Salt Lake City, 2001-2016')  # set title

As shown in the plot, in Salt Lake City during 2001-2016, Urban Development followed these patterns:

  • farther from the green area (parks and protected areas)
  • more prevalent in areas with higher road density
  • more prevalent in areas with lower elevation

These patterns make sense. For future land management and conservation, a prediction of probable developing area is needed.

Urban Development Model

To forecast future urban development in Salt Lake City, I will develop a predictive model using the areas that were not urban in 2001 and to see what contributed to the urban change during 2001-2016.

Sampling

Due to the huge amount of the data, a random sample is needed to do the further analysis. Here my random seed is set to 77. Since there are many non-urban areas and few urban areas. I will sample all of the area changed to urban and some of the areas not changed during 2001 to 2016, making them 1:2

set.seed(77)  # set a random seed
# set.seed(NULL)

# get all of the area that are not urban in 2001
non_urban01 <- filter(allrasters.df,
                   NLCD_2001_SL != 21 & NLCD_2001_SL != 22 & NLCD_2001_SL != 23 & NLCD_2001_SL != 24)
# separately get the area changed and not changed to urban
sl_chg <- filter(non_urban01, is_changed == T)  # changed
sl_nchg <- filter(non_urban01, is_changed == F)  # not changed
# get sample
sample_index <- sample(1:nrow(sl_nchg), nrow(sl_chg)* 2)  # get sample index
sl_sample <- rbind(sl_chg, sl_nchg[sample_index,])  # combine the area changed and not changed

To assess my sampling result, first check the histogram of the original data and the sampled data.

par(mfrow=c(2, 5),  # set 2*5 sub plots
    mar=c(4, 4, 0.8, 0.65))  # set bottom, left, top, right margin

title=c('LandUse16','Park Dist', 'Road Dense', 'Water Dist', 'DEM')  # set title
for (i in 9:13) {  # plot original on the first row
  hist(non_urban01[, i], main=title[i-8], xlab=NA, col='plum1', ylab=ifelse(i==9,'Frequency (Original)',NA))
}
for (i in 9:13){  # plot sampled on the second row
  hist(sl_sample[, i], main=NA, xlab="Value", col='skyblue', ylab=ifelse(i==9,'Frequency (Sampled)',NA))
}

par(mfrow=c(1, 1))  # set graphic parameter to normal

As shown in the graph, the distribution of the original data closely mirrors that of the sampled data, indicating an unbiased sample that can represent the original data distribution.

Also, it is important to check the spatial dependency.

sd_sample <- sl_sample[sample(1:nrow(sl_sample), 100),]  # randomly get 100 records for checking the spatial dependency
dist_mat <- as.matrix(dist(cbind(sd_sample$x, sd_sample$y)))  # get distance matrix
dist_mat.i <- 1/dist_mat  # get reciprocal distance matrix
diag(dist_mat.i) <- 0  # set diagonal to 0

raster_names <- names(sd_sample)[3:ncol(sd_sample)]  # get raster names
raster_names <- data.frame(Raster_Name = raster_names)  # create data frame with raster names being the first column
Moran_res <- data.frame()  # create blank data frame
for(i in 3:ncol(sd_sample)){  # get each record
  Moran_res <- rbind(Moran_res, Moran.I(sd_sample[,i], dist_mat.i))
}
cbind(raster_names, Moran_res)  # combine the to data frames by column
##      Raster_Name   observed    expected         sd      p.value
## 1   NLCD_2001_SL 0.10528385 -0.01010101 0.02191363 1.398557e-07
## 2   NLCD_2004_SL 0.08686153 -0.01010101 0.02187911 9.347556e-06
## 3   NLCD_2006_SL 0.06864874 -0.01010101 0.02189475 3.222294e-04
## 4   NLCD_2008_SL 0.04360302 -0.01010101 0.02188096 1.411306e-02
## 5   NLCD_2011_SL 0.07570411 -0.01010101 0.02187616 8.770368e-05
## 6   NLCD_2013_SL 0.07741309 -0.01010101 0.02187854 6.334317e-05
## 7   NLCD_2016_SL 0.10247490 -0.01010101 0.02183666 2.531398e-07
## 8  Parks_dist_SL 0.42364353 -0.01010101 0.02202395 0.000000e+00
## 9   Rd_dns1km_SL 0.31727483 -0.01010101 0.02169181 0.000000e+00
## 10  WaterDist_SL 0.18546248 -0.01010101 0.02186385 0.000000e+00
## 11        DEM_SL 0.36709667 -0.01010101 0.02194264 0.000000e+00
## 12    is_changed 0.23006093 -0.01010101 0.02201833 0.000000e+00

From the result, the expected value is -0.01. The observed values of each raster layer vary.

  • The sd and p are all small, showing a good precision and significance in the result.
  • For the land use type, most of their Moran’s Indexes are about or below 0.1 and above 0, indicating very weak positive spatial autocorrelation.
  • For the distance to park, road density, distance to water bodies, and elevation their Moran’s I range from 0.2 to 0.4, indicating a moderate to strong positive spatial autocorrelation, meaning that there is some level of clustering or pattern in these spatial data.

Model Development

The Generalized Linear Model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables that have error non-normal distributions. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value. Here I use the glm function to call the GLM, and set family=binomial, meaning that it is fitting a logistic regression model since our is_changed attribute is binary with only the value True or False.

# get train (70%) and test (30%) set
sl_sample <- sl_sample %>% mutate(id = row_number())  # add row number
sl_train <- sl_sample %>% sample_frac(.7)  # get train set
sl_test <- anti_join(sl_sample, sl_train, by='id')  # get test set
# get model using the train set
fit <- glm(is_changed ~ Parks_dist_SL + Rd_dns1km_SL + WaterDist_SL + DEM_SL, data=sl_train, family=binomial())
summary(fit)
## 
## Call:
## glm(formula = is_changed ~ Parks_dist_SL + Rd_dns1km_SL + WaterDist_SL + 
##     DEM_SL, family = binomial(), data = sl_train)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    4.310e+00  6.865e-02   62.78   <2e-16 ***
## Parks_dist_SL  1.055e-04  1.610e-06   65.55   <2e-16 ***
## Rd_dns1km_SL   2.115e+01  1.177e-01  179.69   <2e-16 ***
## WaterDist_SL   2.001e-04  8.522e-06   23.48   <2e-16 ***
## DEM_SL        -5.218e-03  5.187e-05 -100.59   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 280958  on 220709  degrees of freedom
## Residual deviance: 115555  on 220705  degrees of freedom
## AIC: 115565
## 
## Number of Fisher Scoring iterations: 7

As shown in the result, the Estimate shows the weight of influence on the independent variable. From the values, we can conclude that the distance to the parks, road density, and distance to the water bodies all have a positive relationship with urban development, while the elevation has a negative one. The Std. Error are all small, indicating the small uncertainty of the estimate. The z value are all greater than 2 when positive or smaller than -2 when negative, showing that the observed result is unlikely to have occurred by chance alone. The Pr(>|z|) all being very small also indicates the significance in the model.

The null deviance shows how well the response variable is predicted by a model that includes only the intercept. The residual deviance shows how well the response is predicted by the model when the predictors are included. It can be seen that the deviance goes down by 280958 - 115555 = 1.65403^{5} when 4 predictor variables are added. This decrease in deviance is evidence of a significant increase in fit.

To further assess the goodness of fit, I will calculate the Revceiver Operating Characteristic (ROC) curve using the test set.

pred <- prediction(predict(fit, newdata=sl_test), sl_test$is_changed)  # use the test set to calculate the ROC curve
perf <- performance(pred,"tpr","fpr")
plot(perf,colorize=TRUE, main = "ROC Curve")

From the ROC curve we can see that the ROC curve rises steeply and approaches the upper-left corner, indicating high sensitivity and specificity. To quantitatively assess the curve, the Area Under the Curve (AUC) needs to be calculated.

# get AUC value
auc_ROCR <- performance(pred, measure = "auc")
auc_ROCR@y.values[[1]]
## [1] 0.9566476

The AUC is about 0.96, showing that the model is performing exceptionally well in terms of separating true positives from false positives and true negatives from false negatives, likely to make accurate predictions as well.

Prediction

Use the whole data set to generate the final model and predict the urban development.

fit <- glm(is_changed ~ Parks_dist_SL + Rd_dns1km_SL + WaterDist_SL + DEM_SL, data=sl_sample, family=binomial())  # get model
predicted <- predict(allrasters, fit)  # get prediction
predicted <- predicted %>%
  as.data.frame(xy=T) %>%  # change to data frame
  mutate(probability=plogis(lyr1))  # get probability

# plot
ggplot(predicted, aes(y=y, x=x, color=probability)) +
  geom_point(size=2, shape=15) +
  scale_color_gradient(low = "snow3", high = "yellow") +
  labs(title='Likely Locations of Urban Development, Salt Lake City',
       x='X (m)',y='Y (m)')

Conclusion

Salt Lake City has undergone huge urban development. From the results of the statistical analysis and the GLM, we can see the distance to the parks, road density, and distance to the water bodies all have a positive relationship with urban development, while the elevation has a negative one. And the road density contribute the most to the urban development. In order to better explain urban development, besides the land-use type, other possible data can be used are as follows:

  • population density: High population density can be associated with increased urbanization. It reflects the concentration of people in a given area.
  • economic indicators: Variables related to economic activities, such as employment rates, income levels, and economic growth
  • environmental factors: air quality (Aerosol Optical Depth by remote sensing images), Urban Heat Island effect (Land Surface Temperature by Landsat), etc.
  • other data like Nighttime Light Remote Sensing Imagery, which can be an indicator of the urban development.

The United States

From the case of Salt Lake City, we learnt that distance to the green, road density, distance to the water, and elevation all contribute to the urban development. I want to further analyze whether there is a similar pattern across the country using cluster analysis.

Read Data

The dataset includes various amenity characteristics for the entire United States on a 10x10 km grid. It is developed by Derek Van Berkel, from different sources of data (e.g. NLCD, SEDAC, social media, culture monuments). First read the data. An overview of the data can be found here.

amenity<- st_read("../data/lab5/AmenDataAll.shp")  # read data
## Reading layer `AmenDataAll' from data source 
##   `/Users/xycao/Desktop/EAS648/EAS648-assignments/data/lab5/AmenDataAll.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 79971 features and 138 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -2356114 ymin: 272043.6 xmax: 2263886 ymax: 3182044
## Projected CRS: NAD83 / Conus Albers
names(amenity)  # all of the attributes of the data
##   [1] "GRIDCODE_1" "ZooAmusSAL" "ZooAmusEMP" "ShopTOT"    "ShopTOTsho"
##   [6] "HotelTOT"   "HotelSALEV" "HotelEMPL"  "HosptTOT"   "HosptEMP"  
##  [11] "HighE_TOT"  "HighE_EMP"  "GolfTOT"    "GolfSALEV"  "GolfEMP"   
##  [16] "All_Pano"   "Natr_Pano"  "All_Flickr" "Nat_Flickr" "Agri2011"  
##  [21] "Agri2001"   "Frst2001"   "Urb2001"    "Agri2006"   "Frst2006"  
##  [26] "Urb2006"    "serhouse00" "serhouse10" "Frst2011"   "Urb2011"   
##  [31] "serPop90"   "serPop00"   "serPop10"   "secPop65pc" "SecSeas00" 
##  [36] "secHiSPl00" "ser00publi" "PanoNature" "WaterPct"   "cstdstL"   
##  [41] "distcoast"  "FlickrNatu" "serhous90"  "DEM_min"    "DEM_max"   
##  [46] "DEM_range"  "DEM_mean"   "DEM_Variet" "CropVariet" "HikeLength"
##  [51] "ecoZone"    "Landmark"   "HistStruc"  "HistObject" "Histdistr" 
##  [56] "Histbuild"  "HistSite"   "Wd_wetland" "Emg_wetld"  "Beach"     
##  [61] "barren"     "Grassland"  "Shrubland"  "housingChg" "PopChg"    
##  [66] "coast"      "settlment"  "Hist"       "park"       "density"   
##  [71] "forestGrou" "agriGroups" "wetlandGro" "GrasslandG" "barrenGrou"
##  [76] "ShrublandG" "ZooEmp"     "ZooCount"   "ZooDist"    "ZooDistEmp"
##  [81] "ShopCount"  "ShopGLATot" "DistGLA"    "DistanceSh" "HotelCount"
##  [86] "HotelEmp"   "HotelDistE" "HotelDist"  "HospitalCo" "HospitalEm"
##  [91] "Hospital_D" "Hospital_1" "HistMonCou" "HistMonEmp" "HistMon_Di"
##  [96] "HistMonume" "HigherEdCo" "HigherEdEm" "HigherEdDi" "HigherEd_1"
## [101] "GolfCount"  "GolfEmpNum" "GolfDistEm" "GolfDist"   "ZPanoNatur"
## [106] "ZFlickrNat" "ZDEM_range" "ZhousingCh" "ZPopChg"    "Zpredicts" 
## [111] "ZZooEmp"    "ZZooCount"  "ZZooDist"   "ZZooDistEm" "ZShopCount"
## [116] "ZShopGLATo" "ZDistGLA"   "ZDistanceS" "ZHotelCoun" "ZHotelEmp" 
## [121] "ZHotelDist" "ZHotelDi_1" "ZHospitalC" "ZHospitalE" "ZHospital_"
## [126] "ZHospital1" "ZHistMonCo" "ZHistMonEm" "ZHistMon_D" "ZHistMonum"
## [131] "ZHigherEdC" "ZHigherEdE" "ZHigherEdD" "ZHigherEd_" "ZGolfCount"
## [136] "ZGolfEmpNu" "ZGolfDistE" "ZGolfDist"  "geometry"
amenity_geom <- st_sfc(amenity$geometry)  # store the geometry data

Data Preprocessing

According to the available data in amenity, I will use the following data to deploy the cluster analysis:

  • Forest percentage in 2001 Frst2001
  • Urban percentage in 2001 Urb2001
  • Population in 2000 serPop00
  • Population change from 2000 to 2010 PopChg –> derived from serPop00 and serPop10
  • Elevation data DEM_min, DEM_max, DEM_mean, and DEM_Variet
amenity$PopChg <- amenity$serPop10 - amenity$serPop00  # get the population change data

amenity.mat <- amenity %>%  # get the matrix for cluster analysis
  select(Frst2001, Urb2001, serPop00, PopChg, DEM_min, DEM_max, DEM_mean, DEM_Variet) %>%  # select columns
  na.omit() %>%  # omit NA data
  st_drop_geometry() %>%  # get rid of the geometry data
  scale()  # z-score standardization: mean of 0 and std of 1

Cluster Analysis

I will use K-means cluster analysis. K-means is typical clustering technique that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. K-means clustering minimizes within-cluster variances

Before deploying K-means cluster analysis, the number of clusters needs to be decided. Here I iterate over each number of clusters from 2 to 15 to determine the optimal cluster count based on the ‘elbow’ point, where the reduction in within-group sum of squares begins to show a slower rate of decline.

set.seed(77)
# Determine number of clusters
wss <- (nrow(amenity.mat)-1)*sum(apply(amenity.mat,2,var))  # total within-group sum of squares for one cluster.
for (i in 2:15) wss[i] <- sum(kmeans(amenity.mat, centers=i)$withinss)  # try different cluster numbers
plot(1:15, wss, type="b",    # plot and compare
     xlab="Number of Clusters", ylab="Within groups sum of squares")

As shown in the graph, 4 clusters is a good choice. Let’s deploy the K-means cluster analysis.

fit <- kmeans(amenity.mat, 4) # 4 cluster k-means
kmeans_res <- aggregate(amenity.mat,by=list(fit$cluster),FUN=mean)  # get cluster means
amenity <- data.frame(amenity, fit$cluster)  # add cluster attribute to the original data
st_geometry(amenity) <- amenity_geom  # add geometry to the original data

kmeans_res
##   Group.1   Frst2001    Urb2001    serPop00      PopChg    DEM_min    DEM_max
## 1       1  1.3265458  0.1582036  0.01417498 -0.08241356 -0.6735107 -0.5242585
## 2       2 -0.4746195  0.5157830  0.32125132  0.16885370 -0.6653491 -0.7254456
## 3       3 -0.7566470 -0.5603260 -0.34184680 -0.13395008  0.8184188  0.5973046
## 4       4  0.9647117 -0.7177169 -0.33208595 -0.12883795  1.4530229  1.8561120
##     DEM_mean   DEM_Variet
## 1 -0.6123002 -0.070874801
## 2 -0.7037110 -0.577430349
## 3  0.7110041 -0.007634545
## 4  1.6946482  1.918763840
# plot result
ggplot() +
  geom_sf(data = amenity, mapping = aes(fill = as.factor(fit.cluster)), color = NA) +
  labs(fill = "Clusters") +
  ggtitle("Clusters based on Kmeans")

# visualize the k-means result
kmeans_res_long <- pivot_longer(kmeans_res, cols=c(2:9), names_to = 'var', values_to = 'val')  # turn into long format for plotting

# plot
kmeans_res_long  %>%
  ggplot(aes(x = var, y = val, group=Group.1)) +  # set x, y, and group
  geom_line(aes(color=as.factor(Group.1))) +  # plot lines by groups
  labs(title = "Mean Values of Different Cluster Variables",  # set title
       x = "Variable", y = "Standardized Value") +  # set x, y, legend
  scale_color_discrete(name = "Clusters") +  # set legend title
  theme_classic()  # remove background color

As shown in the results, based on the population change during 2001-2011 PopChg, Cluster 2 underwent significant urban development during this period, as indicated by the green areas on the map. In contrast, Clusters 3 and 4 (the purple and blue areas) experienced relatively less development. The map reveals that areas of higher development are concentrated near the West Coast and in the Middle-Eastern part of America. Conversely, places with less development are characterized by high-elevation areas in America. The line plot illustrates that in clusters with higher development, the elevation is low and shows low variance. Additionally, these clusters tend to have lower forest cover. Moreover, the clusters with high development levels typically exhibit larger urban areas and higher populations before the development.

Conclusion

From the results of the K-means cluster analysis, we can conclude that the elevation, forest cover, and the population and urban area before development all contribute to the urban development in the United States.