Lab 2: Central Limit Theorem

Author
Published

February 2, 2024

Lab Setup: Set up and read in the data.

Incid_Type –> the type of fire that occurred.

BurnBndAc –> the burned area (acres) of each fire.

# setup
set.seed(100)  # set random seed
fireinfo <- read.csv("../data/lab2/firinfo.csv")
head(fireinfo)
  Incid_Type BurnBndAc
1   Wildfire  7.143618
2   Wildfire  6.954639
3   Wildfire  8.215006
4   Wildfire  7.287561
5   Wildfire  7.012115
6   Wildfire  6.728629
# get the mean and sd of the sample data
mean(fireinfo$BurnBndAc)
[1] 7.827802
sd(fireinfo$BurnBndAc)
[1] 1.115186
# histogram of the data
hist(fireinfo$BurnBndAc, probability=T, main="Distribution of Fire Burn Area", xlab='Burn Area (acre)')

Exercise 1

# define a function that returns a vector of subsample means; var is the name of the column to sample from
meansVector <- function(data, sample_times, sample_size, var) {
  v <- c()  # create blank vector
  for (i in 1:sample_times) {
    y <- sample(data[, var], sample_size, replace = TRUE)  # draw a subsample with replacement
    m <- mean(y)
    v[i] <- m
  }
  return(v)
}

Question 1.1

Explain what happens to the sampling distribution as you increase the number of subsamples you take. You should display four graphs: four sampling distributions with increasing numbers of subsamples.

# increasing the number of subsamples
set.seed(100)
sample_times <- c(10, 100, 1000, 10000) # number of subsamples

par(mfrow = c(2, 2))  # 2 x 2 grid of subplots
for (i in c(1:length(sample_times))) {
  means <- meansVector(fireinfo, sample_times[i], 100, "BurnBndAc")  # get sample means
  avg <- round(mean(means), 5)  # mean of the sample means
  SD <- round(sd(means), 5)  # std of the sample means
  hist(means, probability=T, main = paste0(sample_times[i], " Subsamples with 100 Values Each"), breaks=30, xlab='Mean Burn Area (acre)')
  legend("topright", bty = 'n', legend = c(paste0("mean=", avg), paste0("SD=", SD)), text.col = c("blue", "dark green"))
}

Answer 1.1:

  • The sampling distribution looks increasingly like a normal distribution.
  • The mean of the sampling distribution approaches the population mean (7.8278).
  • The standard deviation of the sample means stabilizes around the theoretical standard error sd/sqrt(100) ≈ 0.11 as the number of subsamples increases (from 0.18 with 10 subsamples to about 0.11 with 10,000); see the quick check below.
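As a quick check (my addition, not part of the lab template), the CLT predicts that the spread of subsample means of size 100 should settle near sd(x)/sqrt(100) rather than keep shrinking:

# theoretical standard error for subsamples of size 100 (assumed comparison, not lab code)
sd(fireinfo$BurnBndAc) / sqrt(100)  # the observed SDs above stabilize near this value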

Question 1.2

Explain what happens to the sampling distribution as you increase the number of values within each subsample. You should display four graphs: four sampling distributions with increasing numbers of values within each subsample.

set.seed(100)
# Increasing the number of values within each subsample
sample_size <- c(10, 100, 1000, 5000)

par(mfrow = c(2, 2))
for (i in c(1:length(sample_size))) {
  means <- meansVector(fireinfo, 1000, sample_size[i], "BurnBndAc")
  avg <- round(mean(means), 5)
  SD <- round(sd(means), 5)
  hist(means, probability = TRUE, main = paste0("1000 subsamples with ", sample_size[i], " values each"), breaks=30, xlab='Mean Burn Area (acre)')
  legend("topright", bty = "n", legend = c(paste0("mean=", avg), paste0("SD=", SD)), text.col = c("blue", "dark green"))
}

Answer 1.2:

  • The sampling distribution approaches a normal distribution.
  • Once the subsample size exceeds about 30, the sampling distribution is already close to normal.
  • The subsample means become more tightly concentrated around the population mean.
  • The standard deviation of the means shrinks rapidly, roughly as sd/sqrt(n), as the number of values within each subsample increases; see the check below.
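A minimal check of this 1/sqrt(n) scaling (my addition; it compares against the theoretical standard error computed from the full dataset):

# theoretical standard errors sd(x)/sqrt(n) for each subsample size used above
round(sd(fireinfo$BurnBndAc) / sqrt(c(10, 100, 1000, 5000)), 5)
# the observed SDs in the four panels should land close to these values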

Question 1.3

How are the processes you described in questions 1 and 2 similar? How are they different?

Answer 1.3:

  • Similarity
    • Both give a more reliable estimate of the population mean.
  • Difference
    • Increasing the number of values in each subsample makes each subsample more representative of the population, so the subsample means concentrate more tightly around the population mean: the standard error shrinks (≈ sd/sqrt(n)) and, by the CLT, the sampling distribution itself also becomes more normal.
    • Increasing the number of subsamples does not change the spread of the sampling distribution; it gives a smoother, more complete picture of it, so the histogram looks more bell-shaped and its estimated mean and SD stabilize around their true values (see the sketch below).
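A minimal sketch contrasting the two effects, reusing meansVector() from Exercise 1 (the specific subsample counts and sizes here are my own illustrative choices):

set.seed(100)
# more subsamples, same size: the spread of the means stays near sd(x)/sqrt(100)
sd(meansVector(fireinfo, 100,   100, "BurnBndAc"))
sd(meansVector(fireinfo, 10000, 100, "BurnBndAc"))
# same number of subsamples, larger size: the spread of the means shrinks
sd(meansVector(fireinfo, 1000, 100,  "BurnBndAc"))
sd(meansVector(fireinfo, 1000, 1000, "BurnBndAc"))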

Exercise 2

Now demonstrate the central limit theorem on your own by creating a dataset rentdata.

# create a dataset called "rentdata"
set.seed(100)
rentdata <- rnorm(n = 40000, mean = 1100, sd = 200)

This dataset represents the rent for every single undergraduate student at the University of Michigan (i.e., the full population). You can use the meansVector() function below to generate a vector of means.

# vector version of meansVector(): data is a numeric vector rather than a data frame
meansVector <- function(data, times, size) {
  v <- c()  # create blank vector
  for (i in 1:times) {
    y <- sample(data, size, replace = TRUE)  # draw a subsample with replacement
    m <- mean(y)
    v[i] <- m
  }
  return(v)
}

Hint: Make sure that you are able to (1) plot the histogram of the "rentdata" dataset; (2) create a series of histograms where you gradually increase the number of subsamples you take, but keep the number of values in each subsample constant; (3) create a series of histograms where you keep the number of subsamples you take constant, but increase the number of values within each subsample.

You should show one histogram and two sets of four graphs. Refer to your answers in 1.1 and 1.2 if you’re looking for a place to start.

set.seed(100)
# (1) Plot the histogram of the rent data
hist(rentdata, probability=T, main='Histogram of the Rent for UM Students', xlab='Rent ($)')

# (2) Create a series of histograms where you gradually increase the number of subsamples you take, but keep the number of values in each subsample constant
sample_times <- c(10, 100, 1000, 10000)
par(mfrow=c(2,2))
for (i in c(1:length(sample_times))) {
  means <- meansVector(rentdata, sample_times[i], 50)
  avg <- round(mean(means), 5)
  SD <- round(sd(means), 5)
  hist(means, probability = TRUE, main = paste0(sample_times[i], ' Subsamples with 50 Values Each'), breaks=30, xlab='Mean Rent ($)')
  legend("topright", bty = "n", legend = c(paste0("mean=", avg), paste0("SD=", SD)), text.col = c("blue", "dark green"))
}

# (3) Create a series of histograms where you keep the number of subsamples you take constant, but increase the number of values within each subsample
sample_size <- c(10, 100, 500, 2000)
par(mfrow=c(2,2))
for (i in c(1:length(sample_size))) {
  means <- meansVector(rentdata, 1000, sample_size[i])
  avg <- round(mean(means), 5)
  SD <- round(sd(means), 5)
  hist(means, probability = TRUE, main = paste0('1000 Subsamples with ', sample_size[i], ' Values Each'), breaks=30, xlab='Mean Rent ($)')
  legend("topright", bty = "n", legend = c(paste0("mean=", avg), paste0("SD=", SD)), text.col = c("blue", "dark green"))
}

Exercise 3

set.seed(100)
# sample data --> 10,000 students' heights
heights <- rnorm(10000, mean=65, sd=2)

# define the function to compute the confidence interval: mean ± zcrit * sem
cifun <- function(means, zcrit, sem) {
  cilower <- means - zcrit * sem
  ciupper <- means + zcrit * sem
  cirange <- c(cilower, ciupper)
  return(cirange)
}

Question 3.1

Please interpret the meaning of the CIs you just calculated. What can you say about the true population parameter (e.g., mean height of all high school students)? Please include the code calculating the CIs.

# get the parameters needed for CI computation
means <- mean(heights)
zcrit <- qnorm(.975)  # 95% CI
sem <- (sd(heights)/sqrt(length(heights)))

# get CI
cifun(means, zcrit, sem)
[1] 64.96876 65.04675

Answer 3.1:

If we drew random samples of this size over and over and computed a 95% CI from each, about 95% of those intervals would contain the true mean height of all high school students. For this particular sample, the interval is roughly (64.97, 65.05), so we are 95% confident that the true population mean lies in that range.
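As a cross-check (my addition), t.test() produces essentially the same interval here, since with n = 10,000 the t critical value is nearly identical to the z critical value used above:

t.test(heights)$conf.int  # matches the z-based CI above to within rounding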

Question 3.2

Please calculate the 90% CIs. How do these differ from the 95% CIs you first calculated?

# Calculate the 90% CIs.
# get the parameters needed for CI computation
means <- mean(heights)
zcrit <- qnorm(.95)  # 90% CI
sem <- (sd(heights)/sqrt(length(heights)))

# get CI
cifun(means, zcrit, sem)
[1] 64.97503 65.04048

Answer 3.2:

  • The 90% CI is narrower than the 95% CI, because its critical value is smaller (qnorm(.95) ≈ 1.64 vs. qnorm(.975) ≈ 1.96).
  • With a 90% CI we are less confident that the interval captures the true value: only 90% of such intervals would contain it.
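A quick check of how much narrower (my addition): since both intervals use the same sem, the width ratio is just the ratio of critical values.

qnorm(.95) / qnorm(.975)  # ~0.84: the 90% CI is about 84% as wide as the 95% CI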

Question 3.3

Say instead of sampling 10,000 students we only sampled 100. Calculate the 95% CIs of this new sample. How do they compare to the 95% CIs of the 10,000 sample data?

set.seed(100)
# get the sample of 100
heights <- rnorm(100, mean=65, sd=2)

# calculate the 95% CI
means <- mean(heights)
zcrit <- qnorm(.975)
sem <- (sd(heights)/sqrt(length(heights)))
cifun(means, zcrit, sem)
[1] 64.60571 65.40594

Answer 3.3:

The CI is much wider because the standard error of the mean (sd/sqrt(n)) grows as the sample size shrinks: with n = 100 instead of n = 10,000, the half-width grows from about 0.04 to about 0.40.
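Because sem = sd/sqrt(n), shrinking the sample from 10,000 to 100 should multiply the interval width by roughly sqrt(10000/100) = 10 (sampling noise in sd aside); a quick check using the two intervals reported above (my addition):

(65.40594 - 64.60571) / (65.04675 - 64.96876)  # ~10.3, close to the predicted factor of 10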