# get the mean and sd of the sample data
mean(fireinfo$BurnBndAc)
[1] 7.827802
sd(fireinfo$BurnBndAc)
[1] 1.115186
# histogram of the data
hist(fireinfo$BurnBndAc, probability = TRUE, main = "Distribution of Fire Burn Area", xlab = "Burn Area (acre)")
Exercise 1
# define function: get the vector of sample means; var stands for the attribute in the data
meansVector <- function(data, sample_times, sample_size, var) {
  v <- c()  # create blank vector
  for (i in 1:sample_times) {
    y <- sample(data[, var], sample_size, replace = TRUE)  # draw a sample with replacement
    m <- mean(y)
    v[i] <- m
  }
  return(v)
}
Question 1.1
Explain what happens to the sampling distribution as you increase the number of subsamples you take. You should display four graphs: four sampling distributions with increasing numbers of subsamples.
# increasing the number of subsamples
set.seed(100)
sample_times <- c(10, 100, 1000, 10000)  # number of subsamples
par(mfrow = c(2, 2))  # set subplot
for (i in seq_along(sample_times)) {
  means <- meansVector(fireinfo, sample_times[i], 100, "BurnBndAc")  # get sample means
  avg <- round(mean(means), 5)  # mean of the sample means
  SD <- round(sd(means), 5)     # std of the sample means
  hist(means, probability = TRUE,
       main = paste0(sample_times[i], " Subsamples with 100 Values Each"),
       breaks = 30, xlab = "Mean Burn Area (acre)")
  legend("topright", bty = "n",
         legend = c(paste0("mean=", avg), paste0("SD=", SD)),
         text.col = c("blue", "dark green"))
}
Answer 1.1:
The sampling distribution approaches a normal distribution.
The mean of the sampling distribution approaches the population mean (7.8278).
The standard deviation of the sample means stabilizes around the theoretical standard error sd/sqrt(n) = 1.115/sqrt(100) ≈ 0.112 (here from 0.18 with 10 subsamples to 0.11 with 10,000); taking more subsamples does not shrink the spread itself, it only estimates it more accurately.
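As a quick check (a minimal sketch, assuming fireinfo is loaded and meansVector is defined as above), the empirical SD of the sample means can be compared with the theoretical standard error:

set.seed(100)
means <- meansVector(fireinfo, 10000, 100, "BurnBndAc")
sd(means)                           # empirical SD of the 10,000 sample means
sd(fireinfo$BurnBndAc) / sqrt(100)  # theoretical standard error, about 0.112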
Question 1.2
Explain what happens to the sampling distribution as you increase the number of values within each subsample. You should display four graphs: four sampling distributions with increasing numbers of values within each subsample.
set.seed(100)
# increasing the number of values within each subsample
sample_size <- c(10, 100, 1000, 5000)
par(mfrow = c(2, 2))
for (i in seq_along(sample_size)) {
  means <- meansVector(fireinfo, 1000, sample_size[i], "BurnBndAc")
  avg <- round(mean(means), 5)
  SD <- round(sd(means), 5)
  hist(means, probability = TRUE,
       main = paste0("1000 Subsamples with ", sample_size[i], " Values Each"),
       breaks = 30, xlab = "Mean Burn Area (acre)")
  legend("topright", bty = "n",
         legend = c(paste0("mean=", avg), paste0("SD=", SD)),
         text.col = c("blue", "dark green"))
}
Answer 1.2:
The sampling distribution approaches a normal distribution.
Once the subsample size exceeds roughly 30, the sampling distribution is approximately normal (the usual CLT rule of thumb).
The sample means become more tightly concentrated around the population mean.
The standard deviation of the sample means decreases quickly, roughly in proportion to 1/sqrt(n), as the number of values within each subsample increases.
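A minimal sketch of this 1/sqrt(n) scaling, assuming the same fireinfo data and meansVector function from above:

set.seed(100)
for (n in c(10, 100, 1000, 5000)) {
  means <- meansVector(fireinfo, 1000, n, "BurnBndAc")
  cat("n =", n,
      "| empirical SD =", round(sd(means), 4),
      "| theoretical sd/sqrt(n) =", round(sd(fireinfo$BurnBndAc) / sqrt(n), 4), "\n")
}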
Question 1.3
How are the processes you described in questions 1.1 and 1.2 similar? How are they different?
Answer 1.3:
Similarity
Both can lead to more reliable estimates of the population parameter.
Difference
Increasing the number of values in each subsample makes each subsample more representative of the population, so the sample means concentrate more tightly around the population mean; the standard error shrinks and the estimate becomes more precise.
Increasing the number of subsamples gives a smoother, more accurate picture of the sampling distribution, which by the CLT is approximately normal; it sharpens our view of that distribution without shrinking its spread.
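A compact sketch contrasting the two effects (again assuming fireinfo and meansVector from above): holding the subsample size fixed while adding subsamples leaves the spread of the means roughly constant, while enlarging each subsample shrinks it.

set.seed(100)
# more subsamples, fixed subsample size: SD of the means hovers near sd/sqrt(100) ~ 0.112
sapply(c(100, 1000, 10000), function(t) sd(meansVector(fireinfo, t, 100, "BurnBndAc")))
# fixed number of subsamples, larger subsample size: SD of the means shrinks
sapply(c(100, 1000, 10000), function(n) sd(meansVector(fireinfo, 1000, n, "BurnBndAc")))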
Exercise 2
Now demonstrate the central limit theorem on your own by creating a dataset rentdata.
# create a dataset called "rentdata"
set.seed(100)
rentdata <- rnorm(n = 40000, mean = 1100, sd = 200)
This dataset represents the rent for every single undergraduate student at the University of Michigan (i.e. the full population). You can use the meansVector() function to generate a vector of means.
# redefine meansVector for a plain numeric vector (no attribute name needed)
meansVector <- function(data, times, size) {
  v <- c()
  for (i in 1:times) {
    y <- sample(data, size, replace = TRUE)
    m <- mean(y)
    v[i] <- m
  }
  return(v)
}
Hint: Make sure that you are able to (1) plot the histogram of the "rentdata" dataset; (2) create a series of histograms where you gradually increase the number of subsamples you take, but keep the number of values in each subsample constant; (3) create a series of histograms where you keep the number of subsamples you take constant, but increase the number of values within each subsample.
You should show one histogram and two sets of four graphs. Refer to your answers in 1.1 and 1.2 if you’re looking for a place to start.
set.seed(100)
# (1) plot the histogram of the rent data
hist(rentdata, probability = TRUE, main = "Histogram of the Rent for UM Students", xlab = "Rent ($)")
# (2) increase the number of subsamples, keeping the number of values in each subsample constant
sample_times <- c(10, 100, 1000, 10000)
par(mfrow = c(2, 2))
for (i in seq_along(sample_times)) {
  means <- meansVector(rentdata, sample_times[i], 50)
  avg <- round(mean(means), 5)
  SD <- round(sd(means), 5)
  hist(means, probability = TRUE,
       main = paste0(sample_times[i], " Subsamples with 50 Values Each"),
       breaks = 30, xlab = "Mean Rent ($)")
  legend("topright", bty = "n",
         legend = c(paste0("mean=", avg), paste0("SD=", SD)),
         text.col = c("blue", "dark green"))
}
# (3) keep the number of subsamples constant, increasing the number of values within each subsample
sample_size <- c(10, 100, 500, 2000)
par(mfrow = c(2, 2))
for (i in seq_along(sample_size)) {
  means <- meansVector(rentdata, 1000, sample_size[i])
  avg <- round(mean(means), 5)
  SD <- round(sd(means), 5)
  hist(means, probability = TRUE,
       main = paste0("1000 Subsamples with ", sample_size[i], " Values Each"),
       breaks = 30, xlab = "Mean Rent ($)")
  legend("topright", bty = "n",
         legend = c(paste0("mean=", avg), paste0("SD=", SD)),
         text.col = c("blue", "dark green"))
}
Exercise 3
set.seed(100)
# sample data --> 10,000 students' heights
heights <- rnorm(10000, mean = 65, sd = 2)
# define the function to compute the confidence interval
cifun <- function(means, zcrit, sem) {
  cilower <- means - zcrit * sem
  ciupper <- means + zcrit * sem
  cirange <- c(cilower, ciupper)
  return(cirange)
}
Question 3.1
Please interpret the meaning of the CIs you just calculated. What can you say about the true population parameter (e.g., mean height of all high school students)? Please include the code calculating the CIs.
# get the parameters needed for CI computation
means <- mean(heights)
zcrit <- qnorm(.975)  # 95% CI
sem <- sd(heights) / sqrt(length(heights))
# get CI
cifun(means, zcrit, sem)
[1] 64.96876 65.04675
Answer 3.1:
If we drew a random sample over and over again and computed a 95% CI from each sample, about 95% of those intervals would contain the true mean height of all high school students.
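A minimal simulation sketch of this interpretation, reusing cifun and the same population parameters (mean 65, sd 2) assumed above: draw many samples, build a 95% CI from each, and count how often the interval covers the true mean.

set.seed(100)
covered <- replicate(10000, {
  s <- rnorm(100, mean = 65, sd = 2)                         # one fresh sample
  ci <- cifun(mean(s), qnorm(.975), sd(s) / sqrt(length(s)))
  ci[1] <= 65 && 65 <= ci[2]                                 # does the CI cover the true mean?
})
mean(covered)  # should come out close to 0.95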
Question 3.2
Please calculate the 90% CIs. How do these differ from the 95% CIs you first calculated?
# calculate the 90% CIs
# get the parameters needed for CI computation
means <- mean(heights)
zcrit <- qnorm(.95)  # 90% CI
sem <- sd(heights) / sqrt(length(heights))
# get CI
cifun(means, zcrit, sem)
[1] 64.97503 65.04048
Answer 3.2:
The 90% CIs are narrower than the 95% CIs.
With a 90% CI we are less confident that the interval captures the true value; the lower confidence level uses a smaller critical value, which shrinks the interval.
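The difference comes entirely from the critical value; a quick check:

qnorm(.95)   # 90% CI critical value, about 1.645
qnorm(.975)  # 95% CI critical value, about 1.960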
Question 3.3
Say instead of sampling 10,000 students we only sampled 100. Calculate the 95% CIs of this new sample. How do they compare to the 95% CIs of the 10,000 sample data?
set.seed(100)
# get the sample of 100
heights <- rnorm(100, mean = 65, sd = 2)
# calculate the 95% CI
means <- mean(heights)
zcrit <- qnorm(.975)
sem <- sd(heights) / sqrt(length(heights))
cifun(means, zcrit, sem)
[1] 64.60571 65.40594
Answer 3.3:
The CIs are much wider because the standard error of the mean, sd/sqrt(n), grows as the sample size shrinks; with n = 100 instead of n = 10,000, the standard error is roughly sqrt(10000/100) = 10 times larger.
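A quick numeric check of the 1/sqrt(n) effect, assuming the population sd of 2 used to generate the data:

2 / sqrt(10000)  # SE with n = 10,000: 0.02
2 / sqrt(100)    # SE with n = 100:    0.2, ten times larger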