Part III: Ways to Summarize Data

Contributor
Analytical Sciences Digital Library


Investigation 15.

Before we consider ways to summarize our data, we need to draw a distinction between a sample and a population. We collect and analyze samples with the hope that we can deduce something about the properties of the population. Using our data for M&Ms as an example, define the terms sample and population.

For our data, the structure of which is in Table 2, each row is a single sample of plain M&Ms in the form of an individual 1.69-oz package. These samples are drawn from the much larger population of all plain M&Ms (or at least, all plain M&Ms manufactured at the time the samples were packaged).

Investigation 16.

Using the data for yellow M&Ms, calculate the mean and the median for each store and discuss your results. If the mean and the median are equal to each other, what might you reasonably conclude about your data? If the mean is larger than the median, or if the mean is smaller than the median, what might you reasonably conclude about your data? A measure of central tendency is considered robust when it is not changed by one or more results that differ substantially from the remaining results. Which measure of central tendency is more robust? Why?

To help us understand how we arrive at each value, we will use the data in Figure 3 for yellow M&Ms in bags purchased at CVS. To begin, let’s construct a frequency table, which shows the eight unique results, ordered from smallest to largest, and the number of bags with each unique result.

number (\(N\))	5	8	13	15	16	17	19	23
frequency (\(f\))	1	1	2	2	1	1	1	1

To calculate the mean using a frequency table, we multiply each unique result by its frequency, sum up the values, and divide by the number of samples; thus

number (\(N\))	5	8	13	15	16	17	19	23
frequency (\(f\))	1	1	2	2	1	1	1	1
\(N \times f\)	5	8	26	30	16	17	19	23

The sum of the values in the last row is 144, which gives the mean as

\[\bar{x} =\dfrac{144}{10} =\textrm{14.4 yellow M&Ms}\]
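The same arithmetic can be sketched in a few lines of Python. This is only an illustration of the frequency-table calculation above; the variable names are chosen here for convenience and are not part of the case study.

```python
# Frequency table for yellow M&Ms in the 10 CVS bags (values from the table above).
counts = [5, 8, 13, 15, 16, 17, 19, 23]  # unique results
freqs = [1, 1, 2, 2, 1, 1, 1, 1]         # number of bags with each result

n = sum(freqs)                                             # number of samples
weighted_sum = sum(x * f for x, f in zip(counts, freqs))   # sum of the N x f row
mean = weighted_sum / n

print(n, weighted_sum, mean)  # prints: 10 144 14.4
```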

For the 10 samples from CVS, the median is the average of the 5th and the 6th values when ordered by rank. Using the frequency data, the 5th value is 15 and the 6th value is 15, which gives the median as 15 yellow M&Ms. The following table summarizes the means and the medians for yellow M&Ms by store.

store	mean	median
CVS	14.4	15.0
Kroger	14.2	15.0
Target	14.9	14.5

For each store, we see that the mean and the median are similar in value; we also see that the means and the medians for the three stores are similar to each other. Both are reasonable results, as discussed in the response to Investigation 10.

If the mean and the median are equal to each other, then we might reasonably conclude that the distribution of the individual values is approximately symmetrical about the mean and median, although equal values alone do not guarantee perfect symmetry. If the mean is larger than the median, then the data likely is skewed toward the right, and if the mean is smaller than the median, then the data likely is skewed toward the left.

The median is more robust than the mean because the median uses the rank, not the value, of each data point, which makes it relatively insensitive to an unusually large or small result. For example, if the sample from CVS with 19 yellow M&Ms has, instead, 29 yellow M&Ms, then the mean increases from 14.4 to 15.4, but the median remains unchanged.
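The robustness example above can be checked with a short Python sketch that uses the standard library's `statistics` module. The list `altered` is the hypothetical data set from the example, in which the bag with 19 yellow M&Ms has 29 instead.

```python
from statistics import mean, median

# The 10 individual CVS results, expanded from the frequency table.
cvs = [5, 8, 13, 13, 15, 15, 16, 17, 19, 23]

# Hypothetical altered data: the bag with 19 yellow M&Ms has 29 instead.
altered = [29 if x == 19 else x for x in cvs]

print(mean(cvs), median(cvs))          # prints: 14.4 15.0
print(mean(altered), median(altered))  # prints: 15.4 15.0
```

The mean shifts by a full M&M, but the 5th and 6th ranked values are still both 15, so the median does not move.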

Students should, of course, calculate the mean (and other statistics) using a calculator, a spreadsheet, such as Excel, or a statistical program, such as R. There is benefit, however, in seeing how the sample’s data comes together to give the mean, which is the reason for detailing the calculation using a frequency table; the same approach is used in the next investigation.

Although generally it is true that data is skewed to the right when the mean is greater than the median and skewed to the left when the mean is less than the median, this ‘rule’ does not hold true in all cases. In particular, it may not hold for a discrete distribution when the areas to the left and to the right of the median are not equal (because many samples share the median’s value). It also fails with multimodal distributions and in distributions where there is a long tail in the direction of the skew, but a heavy tail in the other direction. See von Hippel, P. T. “Mean, Median, and Skew,” J. Statistics Education, 2005, 13(2) (www.amstat.org/publications/jse/v13n2/vonhippel.html) for additional details.

Investigation 17.

Using the data for yellow M&Ms, calculate the variance, the standard deviation, the range, and the IQR for each store and discuss your results. Is there a relationship between the standard deviation, the range, or the IQR? A result is considered robust when its value is not changed by one or more values that differ substantially from the remaining values. Which measure of spread—the variance, the standard deviation, the range, or the IQR—is the most robust? Why? Which is the least robust? Why?

To help us understand how we arrive at each value, we will use the data in Figure 3 for yellow M&Ms in bags purchased at CVS. To begin, let’s use the same frequency table from Investigation 16, which shows the eight unique results, ordered from smallest to largest, and the number of bags with each unique result.

number (\(N\))	5	8	13	15	16	17	19	23
frequency (\(f\))	1	1	2	2	1	1	1	1

To calculate the variance, we first calculate each unique difference relative to the mean \( (x_i-\bar{x}) \), square these unique differences, multiply each unique squared difference by its frequency, sum up the values, and divide by \(n-1\); thus

number (\(N\))	5	8	13	15	16	17	19	23
frequency (\(f\))	1	1	2	2	1	1	1	1
\( (x_i-\bar{x}) \)	–9.4	–6.4	–1.4	0.6	1.6	2.6	4.6	8.6
\( (x_i-\bar{x})^2 \)	88.36	40.96	1.96	0.36	2.56	6.76	21.16	73.96
\(f \times (x_i-\bar{x})^2\)	88.36	40.96	3.92	0.72	2.56	6.76	21.16	73.96

The sum of the values in the last row is 238.40, which gives the variance as

\[s^2 =\dfrac{238.40}{10-1}=26.49\]

and the standard deviation as 5.15 yellow M&Ms. To find the range, we subtract the smallest value (5 yellow M&Ms) from the largest value (23 yellow M&Ms), which makes the range 18 yellow M&Ms. To find the IQR, we use the median to divide the 10 samples into a lower half with values of 5, 8, 13, 13, and 15 yellow M&Ms, and an upper half of 15, 16, 17, 19, and 23 yellow M&Ms; the median of the upper half is 17 yellow M&Ms and the median of the lower half is 13 yellow M&Ms, which makes the IQR 4 yellow M&Ms. The following table summarizes the variance, the standard deviation, the range, and the IQR for yellow M&Ms by store.

store	variance	standard deviation	range	IQR
CVS	26.49	5.15	18	4
Kroger	21.96	4.69	15	3
Target	15.43	3.93	15	7

These results are consistent with our observations from Investigation 10.
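The CVS values for the variance, the standard deviation, the range, and the IQR can be reproduced with the following Python sketch. It follows the frequency-table steps described above. The IQR uses the median-split method from this response and, as written, assumes an even number of samples; library quantile functions may use other conventions and give somewhat different values.

```python
from math import sqrt
from statistics import median

counts = [5, 8, 13, 15, 16, 17, 19, 23]  # unique results for CVS
freqs = [1, 1, 2, 2, 1, 1, 1, 1]         # number of bags with each result

n = sum(freqs)
mean = sum(x * f for x, f in zip(counts, freqs)) / n

# Sum of f * (x_i - mean)^2, then divide by n - 1 for the sample variance.
sum_sq = sum(f * (x - mean) ** 2 for x, f in zip(counts, freqs))
variance = sum_sq / (n - 1)
std_dev = sqrt(variance)

# Expand the frequency table into the ordered individual results.
data = sorted(x for x, f in zip(counts, freqs) for _ in range(f))
data_range = data[-1] - data[0]

# Median-split IQR (assumes even n): median of upper half minus median of lower half.
half = n // 2
iqr = median(data[half:]) - median(data[:half])

print(round(variance, 2), round(std_dev, 2), data_range, iqr)  # prints: 26.49 5.15 18 4
```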

This is a nice set of data to show that there is no simple general relationship among the standard deviation, the range, and the IQR as measures of spread (the variance, as the square of the standard deviation, always tracks it). For example, the store with the smallest standard deviation (Target) is the store with the largest IQR.

For the reasons outlined in the response to Investigation 16, the IQR is the most robust measure of spread as it uses the rank, not the value, of each data point. The least robust measure of spread is the range. For example, if the sample from CVS with 19 yellow M&Ms has, instead, 29 yellow M&Ms, then the range increases from 18 to 24, but the IQR remains unchanged.
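A brief Python sketch of this comparison follows. The helper `median_split_iqr` is a name introduced here for illustration; it implements the median-split method used in this response and assumes an even number of values.

```python
from statistics import median

def median_split_iqr(values):
    """IQR as the median of the upper half minus the median of the lower half.

    Assumes an even number of values, as with the 10 CVS bags.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[half:]) - median(ordered[:half])

cvs = [5, 8, 13, 13, 15, 15, 16, 17, 19, 23]

# Hypothetical altered data: the bag with 19 yellow M&Ms has 29 instead.
altered = [29 if x == 19 else x for x in cvs]

for label, data in (("original", cvs), ("altered", altered)):
    print(label, max(data) - min(data), median_split_iqr(data))
# prints:
# original 18 4
# altered 24 4
```

The range depends only on the two most extreme values, so a single unusual result changes it directly; the medians of the two halves stay at 13 and 17.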

Students often ask why we divide by \(n-1\) instead of by \(n\). Although a rigorous explanation is beyond the scope of this case study, here is an intuitive way for them to think about this. In the numerator of the equation for variance we sum up the squared differences between the result for each sample, \(x_i\), and the mean of these samples, \(\bar{x}\). Because the sample’s mean is calculated from the individual samples, we reasonably might expect that this sum is smaller than the result if we used the population’s mean (which is unknown to us and which might be quite different from the sample’s mean); dividing by \(n-1\) instead of by \(n\) compensates for this difference. For further details on what is called Bessel’s correction, see https://en.Wikipedia.org/wiki/Bessel's_correction.
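For instructors who want a concrete demonstration, here is a small Python sketch. It is an illustration under stated assumptions, not a proof: the population is a hypothetical one (the six faces of a fair die, not M&M data), and every possible sample of size 2 drawn with replacement is enumerated so that the averages are exact.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population chosen only for illustration: faces of a fair die.
population = [1, 2, 3, 4, 5, 6]
mu = Fraction(sum(population), len(population))
pop_var = sum((x - mu) ** 2 for x in population) / len(population)

n = 2
# Every equally likely sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

def sum_sq(sample, center):
    """Sum of squared differences between each value and a chosen center."""
    return sum((x - center) ** 2 for x in sample)

def sample_mean(sample):
    return Fraction(sum(sample), len(sample))

# Average of the variance estimates over all samples, for each divisor.
avg_div_n = sum(sum_sq(s, sample_mean(s)) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq(s, sample_mean(s)) / (n - 1) for s in samples) / len(samples)

print(pop_var, avg_div_n, avg_div_n_minus_1)  # prints: 35/12 35/24 35/12
```

Averaged over all samples, dividing by \(n\) underestimates the population's variance, while dividing by \(n-1\) recovers it. For each individual sample, the sum of squares about the sample's mean is never larger than the sum of squares about the population's mean, which is the intuition given above.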