# Central Limit Theorem

The Central Limit Theorem states that if we draw many samples of n elements each from a population, the distribution of the averages of those samples becomes approximately normal as n grows large, and approaches a normal distribution exactly as n approaches infinity. It is extremely useful because it tells us that, if n is large enough, the distribution of sample means can be treated like a normal (Gaussian) distribution. We may not know much about any given population distribution, but we do know a great deal about Gaussians. That knowledge can be applied whatever the population looks like, provided it has a finite mean and standard deviation: even if the population is not normal, the distribution of the sample means is approximately normal when the sample size n is sufficiently large.

### Distributions

In statistics, distributions can take on many different forms. Especially with scientific data, one cannot expect the distribution of a data set to match a convenient shape. Distributions can be shaped like Gaussian functions, Lorentzian functions, linear functions, parabolic functions, hyperbolic functions, and many other types of functions. The problem posed to many statisticians and scientists is that many of these shapes are difficult to extract useful information from. One type of distribution that is easy to work with is a Gaussian. The Central Limit Theorem is remarkable because, if the sample size is large, the sampling distribution of the mean will be approximately normal whatever the shape of the population distribution. We can therefore treat the distribution of the sample means as a Gaussian distribution. This is the fundamental advantage of the Central Limit Theorem.

### Statements of the Central Limit Theorem

Let $$X$$ represent random data points from a population. Suppose the population has mean $$\mu$$ and standard deviation $$\sigma$$. Let $$\bar{X}$$ represent the sample mean for any random sample of size $$n$$. Then, $$\bar{X}$$ has the following properties:

$\mu_{\bar{X}} = \mu \tag{1}$

$\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}} \tag{2}$

The distribution of $$\bar{X}$$ will be approximately normal when the sample size $$n$$ is sufficiently large. (3)

That is:

1. The mean of the distribution of the sample means is equal to the mean of the sampled population.
2. The standard deviation of the distribution of sample means (often called the standard error of the mean) is equal to the standard deviation of the sampled population, divided by the square root of the sample size.
3. As the sample size approaches infinity, the distribution of the sample means approaches a normal distribution.
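The first two properties can be checked with a small simulation. The sketch below is only an illustration: the exponential population, its mean of 2, the sample size of 40, and the number of repeated samples are all assumptions chosen for the example, not values from the text.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

POP_MEAN = 2.0        # mean of the assumed exponential population
POP_SD = 2.0          # an exponential distribution has sd equal to its mean
N = 40                # size of each sample
NUM_SAMPLES = 20_000  # number of repeated samples


def sample_mean(n):
    """Draw n values from the skewed exponential population and return their mean."""
    return statistics.fmean(random.expovariate(1 / POP_MEAN) for _ in range(n))


means = [sample_mean(N) for _ in range(NUM_SAMPLES)]

mean_of_means = statistics.fmean(means)  # compare with property (1)
sd_of_means = statistics.stdev(means)    # compare with property (2)
predicted_se = POP_SD / math.sqrt(N)     # sigma / sqrt(n)

print(f"mean of sample means: {mean_of_means:.3f} (population mean {POP_MEAN})")
print(f"sd of sample means:   {sd_of_means:.3f} (sigma/sqrt(n) = {predicted_se:.3f})")
```

Even though the assumed population is strongly skewed, the mean of the simulated sample means should land close to the population mean, and their standard deviation close to $$\sigma/\sqrt{n}$$.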

### Analysis

A reasonable question is how large n has to be to provide a "normal enough" distribution of the means. As you may surmise, the answer is subjective, and it depends on the population: the more skewed the population, the larger n must be. If more data are available, the samples should be as large as possible. Various textbook and online sources suggest that when n is greater than or equal to 30, the sampling distribution is normal to a first approximation. However, as with any limit, exact normality is reached only as n approaches infinity. The result is that the larger n can be, the better the approximation will be.
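One way to see the effect of n is to track how lopsided the distribution of sample means is as n grows. The sketch below uses skewness (zero for a perfectly symmetric shape) as a rough yardstick. The choice of skewness, the exponential population with mean 1, and the sample sizes are assumptions made for illustration, and skewness alone does not prove normality.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable


def skewness(values):
    """Third standardized moment: 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - m) / sd) ** 3 for v in values)


def means_of_samples(n, num_samples=5_000):
    """Means of repeated size-n samples from an exponential population with mean 1."""
    return [
        statistics.fmean(random.expovariate(1.0) for _ in range(n))
        for _ in range(num_samples)
    ]


# Skewness of the distribution of sample means for several sample sizes
skew_by_n = {n: skewness(means_of_samples(n)) for n in (1, 5, 30, 50)}

for n, s in skew_by_n.items():
    print(f"n = {n:2d}: skewness of sample means = {s:.2f}")
```

For this assumed population the skewness should shrink toward zero as n increases, which is consistent with the guideline that n of about 30 already gives a fair approximation.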

You’re probably asking why it’s so valuable to have a distribution that is approximately normal. This is, after all, the main advantage of the Central Limit Theorem. Approximating a normal distribution is incredibly useful because normal distributions are described by Gaussian functions, whose properties are well understood and widely tabulated. If the sample size is large enough, 30 or more for instance, the distribution of sample means can be treated like a Gaussian function.

One property of Gaussian functions that makes them particularly easy to work with is that they are symmetric about the mean, which sits at the vertical axis once the curve is centered at zero. This means that the area under the curve on the left is the same as that on the right. Here is how this is useful: suppose we wish to calculate the probability of some region of our approximately normal distribution. The area under the entire Gaussian is 1, so the area to the right of the vertical axis is 0.5. If the region of interest is a tail on the right, we can calculate the area between the axis and the start of that tail and subtract it from 0.5. In general, once we know which region of the Gaussian we want the probability for, we get that probability by determining the area under that portion of the curve.

### Using the Central Limit Theorem

The values of a distribution of $$\bar{X}$$ will be on the horizontal axis of the distribution. If we’re dealing with Gaussian functions, it is useful if the horizontal axis is calibrated in terms of z values rather than units of $$\bar{X}$$. This allows us to standardize any Gaussian function regardless of the $$\bar{X}$$ range which would be vastly different for different sets of data. The transformed Gaussian is centered at zero, and the area under the curve is 1.

We have a way to calculate a z-value for any value of $$\bar{X}$$. If we know the population average $$\mu$$ (which equals $$\mu_{\bar{X}}$$) and the standard deviation of all the sample means $$\sigma_{\bar{X}}$$, we can easily calculate the z value using the equation $$z = \dfrac{\bar{X}-\mu_{\bar{X}}}{\sigma_{\bar{X}}}$$.
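As a minimal sketch of this calculation, the helper below combines the z formula with $$\sigma_{\bar{X}} = \sigma/\sqrt{n}$$. The function name `z_value` and the numbers passed to it are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def z_value(sample_mean, pop_mean, pop_sd, n):
    """Standardize a sample mean: z = (x_bar - mu) / (sigma / sqrt(n))."""
    standard_error = pop_sd / math.sqrt(n)  # sigma_xbar
    return (sample_mean - pop_mean) / standard_error


# Hypothetical numbers chosen only for illustration
z = z_value(sample_mean=52.0, pop_mean=50.0, pop_sd=8.0, n=64)
print(z)  # standard error is 8/8 = 1, so z = 2.0
```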

Tables of z-values can easily be found. The following table of z values was obtained from statsoft.com and can also be found as an outside link:

Area between 0 and z (rows give z to one decimal place; columns give the second decimal place)

| z | 0.00 | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 0.0000 | 0.0040 | 0.0080 | 0.0120 | 0.0160 | 0.0199 | 0.0239 | 0.0279 | 0.0319 | 0.0359 |
| 0.1 | 0.0398 | 0.0438 | 0.0478 | 0.0517 | 0.0557 | 0.0596 | 0.0636 | 0.0675 | 0.0714 | 0.0753 |
| 0.2 | 0.0793 | 0.0832 | 0.0871 | 0.0910 | 0.0948 | 0.0987 | 0.1026 | 0.1064 | 0.1103 | 0.1141 |
| 0.3 | 0.1179 | 0.1217 | 0.1255 | 0.1293 | 0.1331 | 0.1368 | 0.1406 | 0.1443 | 0.1480 | 0.1517 |
| 0.4 | 0.1554 | 0.1591 | 0.1628 | 0.1664 | 0.1700 | 0.1736 | 0.1772 | 0.1808 | 0.1844 | 0.1879 |
| 0.5 | 0.1915 | 0.1950 | 0.1985 | 0.2019 | 0.2054 | 0.2088 | 0.2123 | 0.2157 | 0.2190 | 0.2224 |
| 0.6 | 0.2257 | 0.2291 | 0.2324 | 0.2357 | 0.2389 | 0.2422 | 0.2454 | 0.2486 | 0.2517 | 0.2549 |
| 0.7 | 0.2580 | 0.2611 | 0.2642 | 0.2673 | 0.2704 | 0.2734 | 0.2764 | 0.2794 | 0.2823 | 0.2852 |
| 0.8 | 0.2881 | 0.2910 | 0.2939 | 0.2967 | 0.2995 | 0.3023 | 0.3051 | 0.3078 | 0.3106 | 0.3133 |
| 0.9 | 0.3159 | 0.3186 | 0.3212 | 0.3238 | 0.3264 | 0.3289 | 0.3315 | 0.3340 | 0.3365 | 0.3389 |
| 1.0 | 0.3413 | 0.3438 | 0.3461 | 0.3485 | 0.3508 | 0.3531 | 0.3554 | 0.3577 | 0.3599 | 0.3621 |
| 1.1 | 0.3643 | 0.3665 | 0.3686 | 0.3708 | 0.3729 | 0.3749 | 0.3770 | 0.3790 | 0.3810 | 0.3830 |
| 1.2 | 0.3849 | 0.3869 | 0.3888 | 0.3907 | 0.3925 | 0.3944 | 0.3962 | 0.3980 | 0.3997 | 0.4015 |
| 1.3 | 0.4032 | 0.4049 | 0.4066 | 0.4082 | 0.4099 | 0.4115 | 0.4131 | 0.4147 | 0.4162 | 0.4177 |
| 1.4 | 0.4192 | 0.4207 | 0.4222 | 0.4236 | 0.4251 | 0.4265 | 0.4279 | 0.4292 | 0.4306 | 0.4319 |
| 1.5 | 0.4332 | 0.4345 | 0.4357 | 0.4370 | 0.4382 | 0.4394 | 0.4406 | 0.4418 | 0.4429 | 0.4441 |
| 1.6 | 0.4452 | 0.4463 | 0.4474 | 0.4484 | 0.4495 | 0.4505 | 0.4515 | 0.4525 | 0.4535 | 0.4545 |
| 1.7 | 0.4554 | 0.4564 | 0.4573 | 0.4582 | 0.4591 | 0.4599 | 0.4608 | 0.4616 | 0.4625 | 0.4633 |
| 1.8 | 0.4641 | 0.4649 | 0.4656 | 0.4664 | 0.4671 | 0.4678 | 0.4686 | 0.4693 | 0.4699 | 0.4706 |
| 1.9 | 0.4713 | 0.4719 | 0.4726 | 0.4732 | 0.4738 | 0.4744 | 0.4750 | 0.4756 | 0.4761 | 0.4767 |
| 2.0 | 0.4772 | 0.4778 | 0.4783 | 0.4788 | 0.4793 | 0.4798 | 0.4803 | 0.4808 | 0.4812 | 0.4817 |
| 2.1 | 0.4821 | 0.4826 | 0.4830 | 0.4834 | 0.4838 | 0.4842 | 0.4846 | 0.4850 | 0.4854 | 0.4857 |
| 2.2 | 0.4861 | 0.4864 | 0.4868 | 0.4871 | 0.4875 | 0.4878 | 0.4881 | 0.4884 | 0.4887 | 0.4890 |
| 2.3 | 0.4893 | 0.4896 | 0.4898 | 0.4901 | 0.4904 | 0.4906 | 0.4909 | 0.4911 | 0.4913 | 0.4916 |
| 2.4 | 0.4918 | 0.4920 | 0.4922 | 0.4925 | 0.4927 | 0.4929 | 0.4931 | 0.4932 | 0.4934 | 0.4936 |
| 2.5 | 0.4938 | 0.4940 | 0.4941 | 0.4943 | 0.4945 | 0.4946 | 0.4948 | 0.4949 | 0.4951 | 0.4952 |
| 2.6 | 0.4953 | 0.4955 | 0.4956 | 0.4957 | 0.4959 | 0.4960 | 0.4961 | 0.4962 | 0.4963 | 0.4964 |
| 2.7 | 0.4965 | 0.4966 | 0.4967 | 0.4968 | 0.4969 | 0.4970 | 0.4971 | 0.4972 | 0.4973 | 0.4974 |
| 2.8 | 0.4974 | 0.4975 | 0.4976 | 0.4977 | 0.4977 | 0.4978 | 0.4979 | 0.4979 | 0.4980 | 0.4981 |
| 2.9 | 0.4981 | 0.4982 | 0.4982 | 0.4983 | 0.4984 | 0.4984 | 0.4985 | 0.4985 | 0.4986 | 0.4986 |
| 3.0 | 0.4987 | 0.4987 | 0.4987 | 0.4988 | 0.4988 | 0.4989 | 0.4989 | 0.4989 | 0.4990 | 0.4990 |

The values looked up in the table are the areas under the Gaussian curve from zero to z. For instance, when z = 3.00, the area under the Gaussian curve from 0 to 3.00 is 0.4987. Thus, the probability of z being between 0 and 3 is 49.87%, and so the probability of z being greater than 3 is 50% - 49.87% = 0.13%.
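When a printed table is not at hand, the same areas can be computed directly. The sketch below relies on the standard identity that the area under the standard normal curve between 0 and z equals $$\tfrac{1}{2}\,\mathrm{erf}(z/\sqrt{2})$$. The function names `area_0_to_z` and `upper_tail` are illustrative, not part of the original module.

```python
import math


def area_0_to_z(z):
    """Area under the standard normal curve between 0 and z (for z >= 0)."""
    return 0.5 * math.erf(z / math.sqrt(2))


def upper_tail(z):
    """Area under the standard normal curve to the right of z."""
    return 0.5 - area_0_to_z(z)


for z in (1.00, 1.96, 3.00):
    print(f"z = {z:.2f}: area 0..z = {area_0_to_z(z):.4f}, "
          f"upper tail = {upper_tail(z):.4f}")
```

Rounded to four decimal places, these agree with the tabulated areas: 0.3413 at z = 1.00, 0.4750 at z = 1.96, and 0.4987 at z = 3.00.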

One very good applet of the Central Limit Theorem can be found in the outside links. The applet was generated by Professor R. Todd Ogden of the University of South Carolina. The applet rolls five dice 100 times with every click of the mouse, and shows the sample distribution of the means of each roll. One can see the curve becoming more Gaussian as the sample size gets bigger. Using this applet to get a better understanding of the fundamental idea of the Central Limit Theorem is strongly recommended.
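For readers without access to the applet, the experiment it describes can be imitated in a few lines. This is a rough sketch and not the applet's actual code: the function names, the 50 simulated clicks, and the text histogram are all assumptions made for illustration.

```python
import random
import statistics
from collections import Counter

random.seed(2)  # fixed seed so the run is repeatable


def roll_mean(num_dice=5):
    """Roll num_dice fair six-sided dice and return the mean of the faces."""
    return statistics.fmean(random.randint(1, 6) for _ in range(num_dice))


def click(num_rolls=100, num_dice=5):
    """One simulated mouse click: 100 rolls of five dice, one mean per roll."""
    return [roll_mean(num_dice) for _ in range(num_rolls)]


means = []
for _ in range(50):  # 50 simulated clicks (an arbitrary choice)
    means.extend(click())

# Crude text histogram of the means; each '#' stands for 10 rolls
counts = Counter(round(m, 1) for m in means)
for value in sorted(counts):
    print(f"{value:3.1f} {'#' * (counts[value] // 10)}")

print("mean of means:", round(statistics.fmean(means), 2))
print("sd of means:  ", round(statistics.stdev(means), 2))
```

The histogram should look roughly bell-shaped and centered near 3.5, the mean of a single fair die, with a spread close to the single-die standard deviation divided by $$\sqrt{5}$$.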

Also, for additional examples, please visit the Central Limit Tutorial from Wadsworth, found in the outside links section. This tutorial provides useful insight on how the theorem works, and how it can be applied to data sets.

Overall, the Central Limit Theorem is an excellent way to ease the process of data analysis for researchers. It allows us to examine approximately normal distributions and it provides useful relations for mean and standard deviation. The applications of such a useful theorem are endless. We can apply it to general statistics, to mathematics, or to the sciences. Often in chemistry, we are confronted with large populations. With small sample sizes, distributions of sample means are not normal in general, and hence are difficult to use to gain useful knowledge about the population. With the Central Limit Theorem and a large enough sample size, we can examine approximately normal distributions and easily obtain valuable properties of our population.

### References

1. Miller, J. C. Statistics for Analytical Chemistry, Third Edition; Ellis Horwood: Chichester, England, 1993; 41, 142.
2. Hamilton, L. C. Modern Data Analysis: A First Course in Applied Statistics; Wadsworth: Belmont, CA, 1990; 228-231, 241-242, 313.
3. Wasserman, L. W. All of Statistics: A Concise Course in Statistical Inference; Springer-Verlag: New York, NY, 2004; 77-79.
4. Chase, W.; Bown, F. General Statistics, Third Edition; John Wiley & Sons: New York, NY, 1997; 300-310.
5. McClave, J. T.; Sincich, T. A First Course in Statistics, Fifth Edition; Prentice Hall: Englewood Cliffs, NJ, 1995; 248, 250.

### Problems

1. What is the shape of a normal distribution and why is this type of distribution more useful to researchers than a non-normal distribution?
2. Why is the Central Limit Theorem useful for researchers? Does it really make data analysis that much easier?
3. Suppose we have a population from which we take numerous samples each with size n. Suppose that the population mean and standard deviation are clearly defined. Suppose we create distributions of the sample mean for varying values of n: n = 1, n = 5, n = 30, and n = 50. Regardless of the shape of the distribution initially, describe how the shape of the distribution changes as n increases from 1 to 50.
Note: The following two problems were adapted from examples in Chase and Bown.
4. In an analytical chemistry lab, it is important that any instrument that requires ultraviolet-visible light be shut off when not in use. This is because the UV-VIS bulbs are expensive and only have a limited lifetime.
Suppose Company A has been consistently producing UV-VIS bulbs with a mean lifetime of 750 hours and a standard deviation of 30 hours. Now, Company B comes along and claims that their UV-VIS bulbs have a mean lifetime of 765 hours with the same standard deviation as Company A (30 hours). To calculate this mean, Company B tested a sample of 36 bulbs. If the population mean lifetime for bulbs produced by Company B is actually still 750 hours, what is the probability that the sample mean is as large as or larger than 765 hours?
5. What does the solution to Problem 4 tell us about Company B's reported mean lifetime of 765 hours? Moreover, was a sample of 36 bulbs enough?

### Solutions

1. A normal distribution is described by a Gaussian function and has the shape of a bell curve. Gaussian functions are much easier to analyze than asymmetric or non-normal distributions. One property of Gaussian functions that makes them so easy to analyze is that they are completely symmetric about the mean. With a few calculations of z values, it's easy to calculate probabilities anywhere on the curve.
2. From Problem 1, we see that normal distributions are much easier to work with than non-normal distributions. With the Central Limit Theorem, we can treat the distribution of the means as approximately normal with a large sample size. Hence, it is very useful for analyzing data and it is a genuine tool for researchers.
3. As the sample size increases, by the Central Limit Theorem the distribution of the means will become more normal and more like a Gaussian function.
4. To answer the question, we must assign some variables to known quantities. Let X represent the lifetime of an arbitrary UV-VIS bulb produced by Company B. Let $$\mu$$ be the mean lifetime of all bulbs, and let $$\bar{X}$$ be the sample mean lifetime. A sample of size n = 36 gave $$\bar{X}$$ = 765 hours. Our goal is to find the probability of getting a value of $$\bar{X}$$ greater than or equal to 765 hours.
From the problem, we can assume that the population average $$\mu$$ = 750 hours. The standard deviation for both companies was given to be $$\sigma$$ = 30 hours.
By the Central Limit Theorem, the distribution of $$\bar{X}$$ is approximately normal with $$\mu_{\bar{X}}$$ = $$\mu$$ = 750 hours and $$\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}}$$ = 30/6 = 5 hours. Therefore, we can treat the distribution of $$\bar{X}$$ as a Gaussian.
Next, we must find the z value corresponding to $$\bar{X}$$ = 765 hours. From the main text of the module, we have that
$$z = \dfrac{\bar{X}-\mu_{\bar{X}}}{\sigma_{\bar{X}}}$$. Here, we have that $$\bar{X}$$ = 765 hours, $$\mu_{\bar{X}}$$ = 750 hours, and $$\sigma_{\bar{X}}$$ = 5 hours.
Thus, we have z = (765-750)/(5) = 15/5 = 3. From any table of z-values, we can look up the probability that z lies between 0 and 3. We get 0.4987. Since we know that the area under a Gaussian to the right of the vertical axis is 0.5 and the area under that same curve from 0 to 3 is 0.4987, we can subtract 0.4987 from 0.5 and obtain 0.0013. This is the area under the curve from 3 to infinity, corresponding to the probability that the value of the sample mean is greater than or equal to 765 for Company B.
5. The probability of 0.0013 obtained in Problem 4 is very small. This tells us that if the population mean lifetime for Company B were 750 hours, it would be very unlikely that we would see a sample mean lifetime of 765 hours or more. The sample therefore gives strong evidence that the true mean lifetime of Company B's bulbs is greater than 750 hours, so their reported mean lifetime of 765 hours likely reflects reality. Moreover, their sample of 36 bulbs meets the usual guideline of n greater than or equal to 30, so the normal approximation is reasonable, and it is unlikely that they randomly chose 36 extraordinarily long-lasting bulbs. Consumers can be reasonably confident that Company B's bulbs have a true mean lifetime near 765 hours.
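The arithmetic in Solution 4 can be double-checked numerically. The sketch below uses the values given in Problem 4 and computes the upper-tail area with the complementary error function instead of a table lookup. The variable names are illustrative.

```python
import math

mu = 750.0     # assumed population mean lifetime, hours
sigma = 30.0   # population standard deviation, hours
n = 36         # number of bulbs in Company B's sample
x_bar = 765.0  # Company B's reported sample mean, hours

standard_error = sigma / math.sqrt(n)  # 30 / 6 = 5 hours
z = (x_bar - mu) / standard_error      # (765 - 750) / 5 = 3

# P(Z >= z) for a standard normal variable
upper_tail = 0.5 * math.erfc(z / math.sqrt(2))

print(f"standard error = {standard_error} hours")
print(f"z = {z}")
print(f"P(sample mean >= {x_bar}) = {upper_tail:.4f}")
```

The computed tail probability rounds to 0.0013, matching the value obtained from the table.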

### Contributors

• Henry Wedler (University of California, Davis)