In my last post, I demonstrated how repeated sampling from any probability distribution produces an approximately normal distribution of sample means, given a sufficiently large sample size. Here I describe how to use this distribution of sample means to define a confidence interval around the mean of any given sample, and I simulate the production of such intervals to show that they sometimes do not contain the population mean.
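The sampling distribution of the mean is a normal distribution with mean $\mu_{\bar{x}} = \mu$ and standard deviation $\sigma_{\bar{x}} = \sigma / \sqrt{n}$, where $\mu$ and $\sigma$ are the population mean and standard deviation and $n$ is the sample size.
For the purposes of this demonstration, we assume that the population standard deviation is known, which is often not the case in real life.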
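It follows from the Empirical Rule that 95.45% of all values of the sample mean lie within two standard deviations of the population mean, that is, within the interval $\mu \pm 2\sigma/\sqrt{n}$, where $\sigma$ is the population standard deviation and $n$ is the sample size.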
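The 95.45% figure can be checked numerically. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`; the variable names `std_normal` and `coverage` are my own and are not taken from the original post.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1).
std_normal = NormalDist()

# Probability that a value falls within two standard deviations of the mean.
coverage = std_normal.cdf(2) - std_normal.cdf(-2)

print(f"{coverage:.2%}")  # roughly 95.45%
```

Because the sampling distribution of the mean is itself normal, the same proportion applies to sample means measured in units of $\sigma/\sqrt{n}$.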
Here the number “2” is the confidence coefficient, and the number “95.45%” is the confidence interval percentage.
Suppose, however, that we want to define a 95% confidence interval, a much more natural confidence interval percentage than 95.45%. We first need to determine the new confidence coefficient.
Let $1 - a = 0.95$, so that $a = 0.05$. We then look up the quantile function value for the standard normal distribution at $1 - a/2 = 0.975$, which is approximately 1.96.
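As an illustration of that lookup, here is a short sketch in Python. It assumes the standard-library `statistics.NormalDist`, whose `inv_cdf` method is the quantile function of the normal distribution; the name `a` follows the text, and `z` is the variable that will hold the new confidence coefficient.

```python
from statistics import NormalDist

a = 0.05  # chosen so that 1 - a = 0.95

# Quantile function (inverse CDF) of the standard normal at 1 - a/2 = 0.975.
z = NormalDist().inv_cdf(1 - a / 2)

print(round(z, 4))  # 1.96
```

The quantile is taken at $1 - a/2$ rather than $1 - a$ because the excluded 5% is split evenly between the two tails of the distribution.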
Using the variable “z” to store the new confidence coefficient, we get a 95% confidence interval around the sample mean, $\bar{x} \pm z\sigma/\sqrt{n}$. The true population mean will lie in this interval for 95% of all sample means in the distribution.
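To make the interval concrete, the sketch below wraps the calculation in a small Python function. The function name `confidence_interval`, its parameters, and the four sample values are hypothetical and chosen only for illustration; the population standard deviation `sigma` is assumed to be known, as stated earlier.

```python
from math import sqrt
from statistics import NormalDist, fmean


def confidence_interval(sample, sigma, level=0.95):
    """Return (lower, upper) bounds of a z-based confidence interval
    around the sample mean, given a known population standard deviation."""
    a = 1 - level
    z = NormalDist().inv_cdf(1 - a / 2)        # confidence coefficient
    margin = z * sigma / sqrt(len(sample))     # z * sigma / sqrt(n)
    mean = fmean(sample)
    return mean - margin, mean + margin


# Made-up sample values, purely for illustration.
sample = [18.0, 22.5, 19.0, 21.5]
lower, upper = confidence_interval(sample, sigma=5)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

Note that the width of the interval depends only on `sigma`, the sample size, and the confidence level; the sample values only determine where the interval is centred.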
Simulating the Generation of Confidence Intervals
Suppose we have a population that is normally distributed with a mean of 20 and a standard deviation of 5.
We then draw 20 samples, each of size 20, from this population and compute each sample’s 95% confidence interval around the sample mean.
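A simulation along these lines might look like the following Python sketch. It is not the original post's code: the constant names, the use of `random.Random.gauss` to draw from the population, and the seed value are all my own assumptions, so the exact count it prints need not match the result reported here.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

MU, SIGMA = 20, 5       # population mean and standard deviation from the text
N_SAMPLES, N = 20, 20   # 20 samples, each of size 20

z = NormalDist().inv_cdf(0.975)   # 95% confidence coefficient
margin = z * SIGMA / sqrt(N)      # half-width shared by every interval

rng = random.Random(42)  # arbitrary seed, used only for reproducibility
intervals = []
hits = 0
for _ in range(N_SAMPLES):
    # Draw one sample from the normal population and build its interval.
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    mean = fmean(sample)
    lower, upper = mean - margin, mean + margin
    intervals.append((lower, upper))
    hits += lower <= MU <= upper  # True counts as 1

print(f"{hits} of {N_SAMPLES} intervals contain the population mean")
```

With only 20 intervals, the number that capture the population mean varies from run to run; it is the long-run proportion over many samples that approaches 95%.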
Here we see that 95% of the simulated confidence intervals (19 of the 20) contain the true population mean, while the remaining interval does not.